The Robotics WEBook

An online textbook about robots and other mechatronic systems

Bayesian probability theory

(LaTeX | PDF | HTML | Figures )

This page gives a high-level overview of the concepts, algorithms and systematic procedures used within the Bayesian probability paradigm of information processing, reasoning and decision making under uncertainty. Each of the topics described on this page are explained in more detail in the other Sections of this text.

A paradigm is the set of models, thought patterns, techniques, practices, beliefs, systematic procedures, terminology, notations, symbols, implicit assumptions and contexts, values, performance criteria, ..., shared by a community of scientists, engineers and users in their modelling, analysis, design and application of systems. So, a paradigm is a subjective, collective, cognitive but often unconscious view shared by a group of humans, about how the world works.

The good news is that practitioners within the same paradigm need very few words to communicate, because they share lots of implicit background knowledge and terminology. The bad news is that practitioners from two different paradigms often find it difficult to understand each other’s reasoning and to appreciate each other’s procedures and results. This situation is still particularly relevant in most areas of artificial intelligence, and the topics presented in this Chapter belong to those areas where heated inter-paradigm debates occur quite often. The reader should be aware of this situation, and read this Chapter’s presentation of the Bayesian paradigm with an open mind.

So, this Chapter explains the Bayesian paradigm, and tries to use as few different concepts as possible in doing so. It speaks most often of Bayesian probability (theory), mainly because its terminology borrows extensively from statistical probability theory. But “theory” is an understatement, in that the Bayesian paradigm contains much more than just theory: it provides a systematic approach to the theoretical and practical aspects of tackling real-world problems in which one is confronted with incomplete knowledge about the world.

A (very short) historical overview

The historical roots of the Bayesian probability theory lie in the late 18th, early 19th century, with Thomas Bayes (1702–1761) and Pierre-Simon de Laplace (1749–1827). The theory was mainly targeted at problems of gambling and insurance. It didn’t make it into a real scientific discipline at that time, because the problem of motivating and constructing prior probabilities was not adequately answered, and it was even “forgotten” for a long time. Bayesian probability theory only began to be re-appreciated in different application domains (first in physics and later in related sciences such as astronomy) during various periods of the 20th century, against a scientific background that was, by then, more or less monopolised by traditional statistics. Physicists like Edwin Thompson Jaynes (1922–1998) and Richard Threlkeld Cox (1889–1991) were instrumental in the reincarnation of Bayesian theory, by defending the “extreme” rigour of the Bayesian approach against a hostile statistics community, and by creating many of the solid theoretical foundations of the Bayesian paradigm as we know it now.

Overview of the Bayesian paradigm

The major goals of a knowledge system (not just of Bayesian knowledge systems!) are:

All the reasoning about the system is done under uncertainty: numerical conclusions about the system are (usually) not given as logical, binary numbers, which are either true or false. On the contrary, each conclusion is accompanied with a measure of its uncertainty. In other words, a measure of how much information has been collected to support the conclusion.

This text prefers to speak about “information” instead of “uncertainty,” the former being a more general and application-neutral term. Other commonly used terms for the same concept are belief and evidence.

It’s always risky to summarize a paradigm in only a couple of paragraphs, so the following should only be interpreted as a “first-order approximation” that gives the reader an intuitive understanding of the essential aspects of the Bayesian paradigm:

The Bayesian paradigm has received a very thorough mathematical foundation, which has led to the following facts:

These facts are given without proof in this text; the interested reader is refered to the literature for more mathematical details. The Bayesian paradigm also has a very complete and thorough geometrical interpretation, which will be used here and there in this text, again without proofs are details.

1.3 Information = Probability Density Function

The first step in creating a Bayesian solution for a particular application is to select an appropriate model for the application domain. “Appropriate” means: detailed enough for the goals of the user, but not more detailed than necessary. Such models correspond to the physical and mathematical “state space” models of the domain that physicists or control engineers tend to make, and adds an information model to it. The latter model is the choice of a parameterized family of probability density functions (“PDF”) over the domain under study, Fig. 1 . The requirements for a function over the domain to be acceptable as a PDF are extremely simple:

The fact that probability values lie between 0 and 1, and that the integral of a PDF over its domain is 1, are arbitrary choices of convenience, without real meaning. Of course, one should make the same choice for all PDFs involved in a Bayesian information processing algorithm.

Once the information, from whatever source, is incorporated into a PDF, the history about how the information was collected is completely forgotten. In the Bayesian paradigm, the history is also not relevant in the interpretation of the information in a PDF. If, for one reason or another, users of the knowledge system are interested in the history, they have to extend the system’s state with variables that encode that history.


One-dimensional probability distributions over a continuous
           (left) and a discrete (right) domain PIC

Figure 1: One-dimensional probability distributions over a continuous (left) and a discrete (right) domain.

1.4 Analogies with thermodynamics/statistical mechanics

Many scientific paradigms use terminology from, and emphasize similarities with, well-known domains from physics, in order to make the paradigm more intuitive to newcomers. For Bayesian probability theory, thermodynamics and statistical mechanics are the physical domains from which most analogies and intuition can be drawn. (However, familiarity with these domains is not a requirement for understanding the Bayesian paradigm, or this document.) The following similarities exist:

Note that information as well as thermodynamics are special with respect to other physical domains (mechanical, electrical, hydraulical, ...), because they have no “kinetic energy” concept, just “potential energy.”

1.5 Hidden and observable parameters

Most knowledge systems of real practical interest have in common that the sources of information that one has about the system cover only a (small) part of the parameters that influence the behaviour of the system. The parameters that one can “measure” directly are often called observable parameters; the ones for which no “sensors” exist are called hidden (or latent) variables. Typically, the hidden variables are the parameters in a mathematical model of the system under study.

Most of the successful (because computationally tractable) applications have a graphical model structure in which the causal relationships from hidden to observable parameters are “easy” to calculate.

1.6 Data association

Many real-world knowledge systems are confronted with the data association problem: a new measurement arrives in the system, and must be used to update the information about the “appropriate” hidden variables in the system. But finding out which hidden variables are appropriate is not an easy task. For example, imagine a robot driving around in a big city, and observing a traffic light. Measuring its distance and orientation with respect to that light does not give the robot information about where in the city it currently is. In order to solve that latter problem, the robot must have a “map”’ of the city, which is by definition a mathematical model with only hidden variables.

Data association occurs in most problems that require some sort of hierachical decomposition of their system model. Other often occuring terminology is: “local” and “global” models; “metric” and “topological” maps; multi-resolution models; etc.

1.7 Bayes’ rule for information processing

The following relationship is the basis of all information processing in the Bayesian paradigm, and is attributed to Thomas Bayes (although he never wrote it down in this precise form) and given his name:

 p(Data|Model,H)- p(Model|Data,H) = p(Data |H) p(Model |H).
(1)

The notation p(A|B) in the formula above stands for: The conditional probability of A given B. In other words: The information (”probability”) about A , conditional on B, or ”All we know about A if all knowledge about B is taken into account.” The textual reading of the Bayes’ rule is as follows:

The new information that one gets about the Model, given new Data and the background Hypotheses is the product of three contributions: (i) the prior information that one has already about the model, (ii) the likelihood of the Data (i.e., the information about the predicted Data given the information in the current model), and (iii) (the inverse of) the information about the Data, irrespective of which Model is used to interpret them.

This particular representation of Bayes’ rule already uses very suggestive names for the variables in the PDFs: “Model,” “Data” and background information (“hypotheses”) “H.” (Most publications and applications “forget” to make the background information explicit…) The reason for this choice of terminology is that many problems in Bayesian probability are basically some form of “model fitting” (“pattern recognition,” “pattern matching”), i.e., one wants to find out which Model fits best to the measured Data. In practice, the “forward” probability p(Data|Model) (i.e., predicting the Data that can be expected given the current knowledge about the Model) is often easier to calculate than the “inverse” probability p(Model|Data,H) (i.e., adapting the knowledge about the Model based on the measured Data) in which the human user is most interested. So, Bayes’ rule presents a formally simple (but computationally often quite expensive) relationship to “invert” the information from the observable to the hidden parameters. (This usage of Bayes’ rule gave rise to the original 18th and 19th century name of Bayesian probability theory: inverse probabilities.) In other words, Bayes’ rule represents model-based learning.

Figure 2 shows Bayes’ rule in action, applied to simple Gaussian probability density functions.

 


Bayes' rule on Gaussian PDFs

Figure 2: Example of Bayes’ rule, applied to two Gaussian PDFs.

 


1.8 Graphical models for information structuring

Real-world knowledge systems are seldom simple enough to keep track of the dependencies between different “sources” and “sinks” of information without computer support. In the Bayesian paradigm, graphical models are the traditional tool to provide this computer support.

The concept of a graph is used in many scientific disciplines as a tool to convey precise meaning about a system in a “user-friendly” way; the edges of the graph denote relationships between the parameter values stored in the graph’s nodes. Bayesian theory is no exception to this custom of using graphs to model the structural relationships in a given information processing system: a node contains the information about a set of domain variables, and the edges denote conditional (in)dependence between the variables in the connected nodes.

Building a graphical model is often the first step in the systematic approach taught in the Bayesian paradigm. Making such a graphical model is always an approximation of the real-world system under study, and deciding to which level of detail and accuracy one must build a model is still more of an art than a science. The Bayesian paradigm uses the following basic types of graphs:

Bayesian Network.
This is the name given to the Directed Acyclic Graphs (DAC), in which each arrow represents information causality: if the graphical model has an arrow pointing from the node A to the node B, the information in node A can be calculated first, before its influence on the information in B is required in subsequent calculations.
Markov Random Field.
This is the name given to undirected graphs that represent mutual influence between parameters in the knowledge system. “Undirected” and “mutual” mean that the information in node A influences the information in node B, but also vice versa.

Later sections will give more details, but Figure 3 illustrates the basic idea of a Bayesian Network: the variables X, Y and Z are connected by a PDF p(X, Y,Z); the network represents the fact that X and Y are independent, given the information about the variable Z. This structure in the dependencies can be exploited to simplify marginalization operations on p(X,Y,Z).


PIC

Figure 3: Smallest element in a Bayesian network: X and Y are conditionally independent, given Z.

 


DACs are computationally more efficient than general graph structures, because all computations to be done on the graph can be serialized on the basis of the loop-free property of the graph: the arrows in the network all point “in the same direction,” and hence they introduce a natural computational causality, i.e, an ordering in the execution of all computations.

The disadvantage of using a DAC graphical model is that knowledge base designers are sometimes tempted to put too much structure on the system model (solely for computational reasons), such that some conditional dependencies are introduced in the model while they do not exist in the real-world system being modelled.

DACs are particularly appropriate for dynamic systems: the information at this time instant “causes” the information at the next time instant, but never the other way around. In other words, time puts a natural linear order on the information processing computations. Examples of dynamic systems are: tracking and steering the motion of rockets, airplanes and robots; analysis and prediction of time series, such as stock exchange ratings in economy, or demographic data in sociology.

The mutual influence between nodes in a MRF results in computationally more costly algorithms, because computations cannot be serialized as easily as in the DAC case. Examples of knowledge systems for which MRFs are a more natural representation than DACs are: face recognition (information about one of the eyes, nose, mouth, hair, etc., influences information about the other components in a face); weather forecasting; matching of different medical scans (X-ray, MR, ultrasound, etc.) on top of each other; etc.

Because computations on graphically modelled systems are common in various science and engineering domains (e.g., physics, multi-body dynamical systems, finite element methods, ...), many of the computational developments in the Bayesian paradigm have turned out to be copies of what has been developed independently in those other domains, and vice versa. Dynamic programming is the generic common computational framework. Further cross-fertilization with these other domains has still not been explored to its full extent.

1.9 Information transformation

The previous Sections introduced modelling, structuring and processing of information, but knowledge systems also require some types of information transformation:

The loss of information that goes together with most transformations of information is often not recognized explicitly by the users or designers of knowledge systems. This can lead to mis-interpretation of the results.

1.10 Decision making—Action selection

Gathering and processing information is a key activity in a Bayesian knowledge system, but most systems have only practical usefulness if they can use the processed information to make decisions about which actions to perform on the system under study. One common approach to Bayesian decision making is to extend the graphical model of the system with appropriate decision making nodes: these nodes “tap” the information provided by the model nodes they are connected to, and use that information in their utility function. Such utility function is a trade-off, between, on the one hand, the cost of an action, and, on the other hand, the gain that can be expected from the action, based on the current information available about the system.

1.11 Model selection/Hypothesis test/Model building

Knowledge systems often have to compare different possible models of the system under study with respect to each other, in order to select the “best” one for a given purpose. The Bayesian paradigm uses the Occam’s razor idea, to trade-off goodness of fit with complexity of the model. The goodness of fit criterion quantifies how well a model can explain the data that have been received about the system; the complexity criterion quantifies the preference for simpler models.

In addition to just comparing existing models, “very” intelligent systems can be expected to generate (“to build”) new models autonomously, if they detect that (i) the available models are not good enough to explain the data, or that (ii) they can be replaced by simpler models with the same expressive power. Generally speaking, such model building consists of two parts, the second of which reduces the model building problem to a special case of model selection:

There is a fundamental difference in complexity between creating a new model by (i) selecting appropriate values in a given (set of) graphical models, and (ii) finding a new graphical model that fits the data best. The latter activity is more “intelligent,” and hence (or rather, because) it is computationally more expensive.

1.12 Examples

The following paragraphs briefly explain a very small set of application examples for which Bayesian probability theory has proven to be appropriate. Two examples are worked out in somewhat more detail, in order to get a flavor for the Bayesian approach to uncertainty.

Inference.
(That is, processing of information, from “sources” to “goals.”)
  • Reasoning in court trials. Judges and juries must make a decision about the guilt of the accused, based on the “evidence” that is presented to them.
  • Weather forecasting. The climate system of the earth is too complex to predict with just deterministic application of the laws of physics. In addition, one does not have full information about temperature, pressure, wind velocity, ..., at all placed in the world. So, the predictions are model-based, but with significant amount of incomplete (model and data) information.
  • Medical diagnostic systems. The problems to be solved, as well as the resource available, are to a large extent very similar to the weather forecasting example: to decide about the health and possible treatments of patienst, based on too complex models with incomplete measurement data.
(Parameter) estimation.
(That is, the collected data is transformed into one single scalar, based on which decisions about the system are made.)
  • Police speed radar. The police decides to give you a fine, on the basis of the speed of your car at a given moment and a given place. However, all speed measurements are indirectly derived from measurements on physical properties (e.g., phase shift, doppler effect, time-of-flight, ...). Their modeled relationships to the car speed depend on many influencing parameters, hence the incomplete data and model information problem occurs here too.
  • Prediction of celestial body motions. Astronomers predict the future motions of planets and spaceships on the basis of Newton’s laws, and of measurements by optical and radio telescopes. Again, the models used are limited in the number of other bodies whose influence on the motion of the body under study is taken into account; and measurements via telescopes always often have quite poor resolution.
Pattern matching/Hypothesis testing.
(That is, finding the model that best fits the data.)
  • Detection of tumor in a scan image. Tumors show up in many forms of medical scan images, but other tissues also produce image features with very similar properties. The decision about whether or not a tumor is involved is based on incomplete information, and requires the comparison of various possible explanations of the same image features.
  • Detection of gene sequences. Similar problem as above.
  • Speech and image recognition. Continuous speech and vision sensors produce huge amounts of measurement data, which is reduced into textual phrases and/or symbolic labels. There is often more than one way to reduce the data, depending on the assumed measurement-model relationships.
Model building.
(That is, deciding which new model is “most appropriate” to explain a particular set of measurement data.)
  • Reverse engineering. Many industrially produced products were first designed by artists, in clay or wood. These models are then “digitized” and transformed into CAD models. This “reverse engineering” transformation can be done in infinitely many ways. Also software “reverse engineering” exists: the workings of a given program are derived from a series of input-output relationships.
  • SLAM (Simultaneous Localization And Map building) for an autonomously navigating robot. The robot drives around in an unknown world, using information from distance sensors and (i) avoids obstacles, (ii) builds a map of the world it has explored, and (iii) decides where it is in in that map (or in a map that it got from elsewhere).

The full reverse engineering and SLAM problems have continuous as well as discrete parameters:

Continuous.
The position of the robot in its environment, or the position on a “CAD surface patch” to which a measurement on the reverse engineered object belongs.
Discrete.
Both reverse engineering and SLAM must solve the data association problem: each new measurement must be assigned to the correct surface patch, or the correct local metric map.

For example, the mobile robot doing SLAM in an environment which is significantly larger than the reach of its sensors, must solve the data association problem in the form of closing loops, i.e., it has to recognize places where it has been before. In this context, a combined metric–topological map is a natural representation: the (local) metric map contains continuous coordinates, the (global) topological map is a discrete model (a graph) of neighbouring local maps. New measurements must then “first” be associated to a local metric map, before the robot’s location in that map can be infered. The previous sentence used quotation marks around the word “first” because the Bayesian algorithm to solve this problem most often use iterations over the discrete and continuous parameter spaces. Hence, the name “Simulataneous” localization and map building.

The reverse engineering problem is conceptually and algorithmically very similar to the SLAM problem.