Any uncertainty reasoning system should be able to give quantitative answers to the following type of questions from its users:
These questions all involve measuring information, i.e., the system must have computational procedures to reduce the available information (stored in the form of possibly high-dimensional PDFs) to one single scalar number: the measure of the information.
Fact 2 (Lossiness of information measures) Calculating the information content in a PDF is an irreversible operation: it loses information in the transformation from the high-dimensional PDF, representing all information about the system, to the scalar information measure.
Consequently, it is impossible to reconstruct the original PDF from knowing only the scalar that quantifies its overall information. Of course, an infinite number of ways exist to reduce a PDF to one single number. But history has provided us with a (small) number of choices that have interesting and “natural” properties. Entropy is one of the most “pure” measures, because, as this Section will illustrate (without proofs), it can be derived from a minimal and very plausible set of desirable information measure properties.
Let’s formally denote by $I(M : E \mid C)$ the information about model $M$ that can be derived from the evidence (or, information) $E$, and given the “background” (“context”) information $C$. Good [20] proposed the following structural properties for $I$ (with $\wedge$ denoting the conjunction of two pieces of information):

1. $I(M : E \wedge F \mid C) = f\big[\, I(M : E \mid C),\; I(M : F \mid E \wedge C) \,\big]$. The information about $M$ derived from both $E$ and $F$ is a function $f$ of the information about $M$ derived from $E$ only, and the extra information given by $F$, interpreted in the context that $E$ is already taken for granted. $f$ simply represents a functional relationship between the different terms, without saying anything about the exact form of this relationship.
2. $I(M : E \wedge M \mid C) = I(M : M \mid C)$. If one already knows everything about $M$, other information cannot add anything anymore.
3. $I(M : E \mid C)$ is a strictly increasing function of its arguments. If the information content of one of the parameters of $I$ increases, the information $I$ increases too.
4. […] if […] and […] are mutually irrelevant pieces of information.
5. […]. Considering the information contained in […] doesn’t increase the total information if this information was already incorporated.
The details of Good’s derivation are skipped in this text, but Good found in a rather straightforward way that these specifications lead to many alternatives for the representation of information. He also proved that, if information is represented by a measurable function of the probabilities involved, then composition of information becomes additive if and only if that function depends on these probabilities through their logarithms. The simplest choice is, of course, the logarithm itself, which is the rationale behind the abundance of logarithms in statistics, for example in the information measures discussed in the following Subsections. Indeed, addition is the natural operator on the space of these logarithms, and is easy to work with.
This Section explains why Shannon introduced one particular logarithm-based function as a very attractive measure of information. Claude Elwood Shannon (1916–2001), [8, 48, 49], presented a scalar “average” (or “expected”) measure to quantify the quality of communication channels, i.e., their capacity to transmit information. However, his measure also qualifies as a fully general information measure, as was first explained by Jaynes [26]. Shannon gave his measure the name of entropy, because it models the similar concept with the same name in thermodynamics: the higher the entropy of a thermodynamic system, the higher our uncertainty about the state of the system, or, in other words, the higher its “disorder.” Note that entropy is a “subjective” feature of the system: it represents the knowledge (or uncertainty) that the observer has of the system, but it is not a physical property of the system itself. However, it is “objective” in the sense that each observer comes to the same conclusion when given the same information.
Shannon’s reasoning went as follows. Assume the parameter $x$ takes one of the values from a set $\{x_1, \dots, x_n\}$, with $p = (p_1, \dots, p_n)$ the corresponding probability distribution. The following three properties are straightforward, plausible desiderata for any information measure $H(p)$ of the probability distribution $p$:
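| Axioms for entropy | |
|---|---|
| I | $H$ is a continuous function of the $p_i$. |
| II | If all $n$ probabilities $p_i$ are equal ($p_i = 1/n$), the entropy $H(1/n, \dots, 1/n)$ is a monotonically increasing function of $n$. |
| III | $H$ obeys an additive composition law: its value does not depend on how the alternatives $x_i$ are grouped. |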
The first and second specifications model our intuition that (i) small changes in probability imply only small changes in entropy, and (ii) our uncertainty about the exact value of a parameter increases when it is a member of a larger group.
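The third desideratum is represented mathematically as follows: the information measure $H$ obeys the following additive composition law:

$$
\begin{aligned}
H(p_1, \dots, p_n) ={}& H(w_1, w_2, \dots) \\
&+ w_1\, H(p_1 \mid w_1, \dots, p_k \mid w_1) \\
&+ w_2\, H(p_{k+1} \mid w_2, \dots, p_{k+m} \mid w_2) + \dots,
\end{aligned}
\tag{9}
$$

where $w_1$ is the probability of the set $\{x_1, \dots, x_k\}$, $w_2$ is the probability of the set $\{x_{k+1}, \dots, x_{k+m}\}$, and so on (Figure 5); $p_i \mid w_j$ is the probability of the alternative $x_i$ if one knows that the parameter $x$ comes from the set that has probability $w_j$, i.e., $p_i \mid w_j = p_i / w_j$.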
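For example, assume that $x$ comes from a set of three members, with the alternatives occurring with probabilities $1/2$, $1/3$ and $1/6$, respectively. If one then groups the second and third alternatives together (i.e., $w_1 = 1/2$, the probability of the set $\{x_1\}$, and $w_2 = 1/2$, the probability of the set $\{x_2, x_3\}$), the composition law gives

$$
H\big(\tfrac12, \tfrac13, \tfrac16\big) = H\big(\tfrac12, \tfrac12\big) + \tfrac12\, H\big(\tfrac23, \tfrac13\big),
$$

since $2/3$ and $1/3$ are the probabilities of $x_2$ and $x_3$ within the set $\{x_2, x_3\}$. (The set $\{x_1\}$ contains a single alternative, so its term $w_1 H(1)$ vanishes.)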
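The composition law of Eq. (9) can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes the logarithmic entropy $-K \sum_i p_i \ln(p_i)$ that the text arrives at further on, takes $K = 1$, and uses the example probabilities $1/2$, $1/3$, $1/6$ as illustrative values. The function name `entropy` is hypothetical.

```python
import math

def entropy(probs, K=1.0):
    """Entropy -K * sum(p * ln p); alternatives with p == 0 contribute nothing."""
    return -K * sum(p * math.log(p) for p in probs if p > 0.0)

# Ungrouped distribution over three alternatives (assumed example values).
p = [1 / 2, 1 / 3, 1 / 6]

# Group alternative 1 on its own, and alternatives 2 and 3 together.
w = [p[0], p[1] + p[2]]                    # probabilities of the two groups
within_1 = [p[0] / w[0]]                   # conditional probabilities inside group 1
within_2 = [p[1] / w[1], p[2] / w[1]]      # conditional probabilities inside group 2

# Left- and right-hand side of the composition law, Eq. (9).
lhs = entropy(p)
rhs = entropy(w) + w[0] * entropy(within_1) + w[1] * entropy(within_2)

print("H(p)            =", lhs)
print("grouped formula =", rhs)
```

Both printed numbers should agree up to rounding, whatever grouping is chosen, which is exactly what the third desideratum asks for.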
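The three above-mentioned axioms suffice to derive an analytical expression for the information measure function $H$. The first axiom implies that it is sufficient to determine $H$ for rational values $p_i = n_i / \sum_j n_j$ (with $n_i$ integer numbers) only; the reason is that the rational numbers are a dense subset of the real numbers. One then uses the composition law to find that $H(p_1, \dots, p_n)$ can be found from the uniform probability distribution over $\sum_i n_i$ alternatives. Indeed, in that uniform distribution one can group the first $n_1$ alternatives, the following $n_2$ alternatives, and so on, which reduces to the original distribution $(p_1, \dots, p_n)$; the composition law then relates the entropy of the uniform distribution to $H(p_1, \dots, p_n)$. For example, let $n = 3$ and $(n_1, n_2, n_3) = (3, 4, 2)$ such that $(p_1, p_2, p_3) = (3/9, 4/9, 2/9)$; denoting $H(1/n, \dots, 1/n)$ by $h(n)$ yields

$$
h(9) = H\big(\tfrac39, \tfrac49, \tfrac29\big) + \tfrac39\, h(3) + \tfrac49\, h(4) + \tfrac29\, h(2).
$$

In general, this could be written as

$$
h\Big(\sum_i n_i\Big) = H(p_1, \dots, p_n) + \sum_i p_i\, h(n_i).
\tag{10}
$$

The special case of all $n_i$ equal to the same integer $m$ gives

$$
h(mn) = h(m) + h(n).
$$

A solution to this equation is given by $h(n) = K \ln(n)$, with $K > 0$ because of the monotonicity rule. All this yields the following expression for the entropy:

$$
\begin{aligned}
H(p_1, \dots, p_n) &= K \ln\Big(\sum_i n_i\Big) - K \sum_i p_i \ln(n_i), && (11) \\
&= -K \sum_i p_i \ln\Big(\frac{n_i}{\sum_j n_j}\Big), && (12) \\
&= -K \sum_i p_i \ln(p_i). && (13)
\end{aligned}
$$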
Note that $H \geq 0$, because $0 \leq p_i \leq 1$ and hence $\ln(p_i) \leq 0$. The minus sign in the entropy (or information) makes the entropy measure positive, and increasing when uncertainty increases; this is the same interpretation as in statistical mechanics, the science that originally defined the concept of entropy. The constant $K$ has no influence: it is nothing but a factor that sets the scale of the entropy. The entropy need not be a monotonically decreasing function of the amount of information received: entropy can increase with new information (“evidence”) coming in, if this new information contradicts the previous assumptions. Note also that the uncertainty in many dynamic systems increases naturally over the time period that no new information is received from the system: the probability distributions “flatten out” and hence the entropy increases.
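The steps of this derivation can be illustrated with a few lines of code. This is only a sketch under stated assumptions: it takes the uniform-distribution entropy to be $K \ln(n)$, writes the probabilities as ratios of integer counts $n_i / \sum_j n_j$, and uses $K = 1$ and the counts `[3, 4, 2]` as arbitrary illustrative values. The function names are hypothetical.

```python
import math

K = 1.0  # arbitrary positive scale constant (assumption: K = 1)

def h(n):
    """Entropy of the uniform distribution over n alternatives, K * ln(n)."""
    return K * math.log(n)

def entropy_from_counts(counts):
    """Count-based form: K ln(sum n_i) - K sum p_i ln(n_i), with p_i = n_i / sum n_j."""
    total = sum(counts)
    return h(total) - sum((n / total) * h(n) for n in counts)

def entropy(probs):
    """Probability-based form: -K sum p_i ln(p_i)."""
    return -K * sum(p * math.log(p) for p in probs if p > 0.0)

counts = [3, 4, 2]                       # illustrative integer counts
probs = [n / sum(counts) for n in counts]

# The functional equation h(m * n) = h(m) + h(n) holds for the logarithm.
print(h(6), h(2) + h(3))

# Both forms of the entropy give the same number.
print(entropy_from_counts(counts), entropy(probs))

# A "flattened" distribution has a higher entropy than a peaked one.
print(entropy([0.7, 0.2, 0.1]), entropy([1 / 3, 1 / 3, 1 / 3]))
```

The last line illustrates the remark about distributions that flatten out: the uniform distribution attains the largest entropy among all distributions over the same number of alternatives.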
Figure 6 gives examples of entropy functions for various simple PDFs. It shows that entropy corresponds, to some extent, to our intuition about information: a PDF with a sharp peak carries more information than one with a broad peak. A PDF with much variation carries less information than one without much variation. A PDF with many alternating peaks and valleys carries the same information as one which has all peaks and valleys assembled together.
Note also that the information measure can never reach “absolute zero,” because this would require $\sum_i p_i \ln(p_i)$ to vanish, which can only occur for trivial PDFs with one single element.
Fact 4 (Comparison of information in PDFs) The number of samples as well as the (arbitrary) scaling constant $K$ make comparisons of the absolute values of the entropies of two PDFs quite useless.
At first sight, extending Eq. (14) from discrete to continuous PDFs seems straightforward:

$$
H(p) = -\int p(x) \ln p(x)\, dx.
\tag{15}
$$
Example. The entropy of the multivariate, $n$-dimensional Gaussian distribution is, [8]:

$$
H = \tfrac12 \ln\big( (2\pi e)^n\, |P| \big),
\tag{16}
$$

with $P$ the covariance matrix of the Gaussian PDF and $|P|$ its determinant. For a one-dimensional Gaussian distribution, this becomes:

$$
H = \ln\big( \sigma \sqrt{2\pi e} \big),
\tag{17}
$$

with $\sigma$ the standard deviation, i.e., the square root of the one-dimensional covariance.
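A small numerical sketch may help to relate the multivariate and one-dimensional formulas. It assumes the standard textbook forms $\tfrac12 \ln((2\pi e)^n |P|)$ and $\ln(\sigma\sqrt{2\pi e})$ with natural logarithms and unit scale constant; the helper names `determinant`, `gaussian_entropy` and `gaussian_entropy_1d` are hypothetical, and the determinant routine is a plain Gaussian elimination chosen only to keep the example free of external libraries.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]       # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0                   # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                   # a row swap flips the sign
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det

def gaussian_entropy(cov):
    """0.5 * ln((2 pi e)^n |P|) for an n x n covariance matrix P."""
    n = len(cov)
    return 0.5 * math.log((2.0 * math.pi * math.e) ** n * determinant(cov))

def gaussian_entropy_1d(sigma):
    """ln(sigma * sqrt(2 pi e)) for standard deviation sigma."""
    return math.log(sigma * math.sqrt(2.0 * math.pi * math.e))

# One-dimensional case: covariance 4.0 corresponds to sigma = 2.0.
print(gaussian_entropy([[4.0]]), gaussian_entropy_1d(2.0))

# Independent components: the entropies simply add.
print(gaussian_entropy([[1.0, 0.0], [0.0, 9.0]]),
      gaussian_entropy_1d(1.0) + gaussian_entropy_1d(3.0))

# A narrow Gaussian has a negative continuous entropy.
print(gaussian_entropy_1d(0.1))
```

The negative value in the last line is a hint that the continuous entropy cannot be read as an absolute amount of information: it changes with the units in which $x$ is expressed.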
The extension in Eq. (15) is well-defined for a number of distributions, but not for all of them. This can be seen from the reasoning that led to that formula, because there the value $K \ln(n)$ of the entropy of a uniform distribution over $n$ alternatives is used. And this uniform distribution is not well defined for all continuous PDFs running over all real values. A second look at Eq. (12) suggests another interpretation: the limit can be taken in each interval of the PDF parameter(s):

$$
H(p, m) = -\int p(x) \ln \frac{p(x)}{m(x)}\, dx,
\tag{18}
$$

where $p(x)$ is the density of the continuous PDF that we obtain from “taking the limit” of the discrete PDF $p_i$, and similarly for $m(x)$, the limiting density of the discrete alternatives over which the uniform reference distribution is defined.
Hence, only relative information measures are possible. This is consistent with the fact
that there is no absolute zero information, to which the “distance” of a PDF
could be taken.
Fact 5 (Mutual information, or Kullback-Leibler divergence) The relative information measure (often called “mutual information”) $H(p, q)$ of two continuous PDFs $p(x)$ and $q(x)$ is defined as:

$$
H(p, q) = \int p(x) \ln \frac{p(x)}{q(x)}\, dx.
\tag{19}
$$

This scalar is also called the Kullback-Leibler divergence (after the duo that first presented it, [35], [36]), or also mutual entropy, or cross entropy, of both probability measures $p$ and $q$, [35, 45].
It is a (coordinate-independent) measure for how much information one needs to add to the probability distribution $q(x)$ in order to obtain the probability distribution $p(x)$. As was the case with Shannon’s entropy, $H(p, q)$ is also a global measure, since all information contributions $\ln\big(p(x)/q(x)\big)$ are weighted by $p(x)$, and then added together.
The Kullback-Leibler divergence of Eq. (19) is a good measure of information (it is positive, and it can be proven to obey, in some cases, the triangle inequality, [1]), but it is not a distance function (or “metric”) on the space of all PDFs, since it is not symmetric in its arguments:
$$
H(p, q) \neq H(q, p).
\tag{20}
$$
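The asymmetry is easy to observe numerically. The sketch below uses the discrete analogue of the Kullback-Leibler divergence, $\sum_i p_i \ln(p_i / q_i)$, as a stand-in for the integral; the two distributions are arbitrary illustrative values and the function name `kl_divergence` is hypothetical.

```python
import math

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence sum_i p_i ln(p_i / q_i).

    Assumes q_i > 0 wherever p_i > 0; terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.8, 0.1, 0.1]           # a peaked distribution (illustrative values)
q = [1 / 3, 1 / 3, 1 / 3]     # the uniform distribution on three alternatives

forward = kl_divergence(p, q)
backward = kl_divergence(q, p)

print("H(p, q) =", forward)
print("H(q, p) =", backward)
print("H(p, p) =", kl_divergence(p, p))
```

Both divergences are positive, they differ from each other, and the divergence of a distribution with respect to itself is zero.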
Rao [44] was the first to come up with a real distance function on the manifold $M_\theta$ of probability distributions $p(x, \theta)$ over the state space $X$ and described by a parameter vector $\theta = (\theta^1, \dots, \theta^n)$.
Fact 6 (PDF manifold) $M_\theta$ is a parameterized space of PDFs, which is not the same space as the state space of the system on which the PDF is defined.
The PDF manifold is smooth (i.e., derivatives of all orders exist), so one can define tangent vectors $v$ to the manifold $M_\theta$ as follows:

$$
v = \sum_i v^i\, \frac{\partial \ln p(x, \theta)}{\partial \theta^i}.
\tag{21}
$$

The $v^i$ are the coordinates of the tangent vector in the basis formed by the tangent vectors of the logarithms of the $\theta$-coordinates. A metric $g$ at the point $p$ (which is a probability distribution) is a bilinear mapping that gives a real number when applied to two (logarithmic) tangent vectors $v$ and $w$ attached to $p$. Rao showed that the covariance of both vectors satisfies all properties of a metric. Hence, the elements $g_{ij}$ of the matrix $g$ representing the metric are found from the covariance of the coordinate tangent vectors:

$$
g_{ij}(\theta) = \int p(x, \theta)\, \frac{\partial \ln p(x, \theta)}{\partial \theta^i}\, \frac{\partial \ln p(x, \theta)}{\partial \theta^j}\, dx.
\tag{22}
$$

The matrix $g$ got the name Fisher information matrix. The covariance “integrates out” the dependency on the state space coordinates $x$, hence the metric is only a function of the statistical coordinates $\theta$. This metric is defined on the tangent space to the manifold $M_\theta$ of the $\theta$-parameterized family of probability distributions over the $x$-parameterized state space $X$. Kullback and Leibler already proved the following relationship between the relative entropy of two “infinitesimally separated” probability distributions $p(x, \theta)$ and $p(x, \theta + d\theta)$ on the one hand, and the Fisher Information matrix $g$ on the other hand:

$$
H\big( p(x, \theta),\, p(x, \theta + d\theta) \big) = \tfrac12 \sum_{i,j} g_{ij}(\theta)\, d\theta^i\, d\theta^j + O\big(|d\theta|^3\big).
\tag{23}
$$
Hence, Fisher Information represents the local behaviour of the relative entropy: it indicates the rate of change in information in a given direction of the probability manifold (not in a given direction of the state space!). In other words, it measures the sensitivity of the variance to small changes in the mean.
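The local relationship between relative entropy and Fisher Information can be checked on the one-dimensional Gaussian family. The sketch below is an illustration under stated assumptions, not the author's derivation: it parameterizes the family by mean and standard deviation $(\mu, \sigma)$, uses the well-known closed-form Kullback-Leibler divergence between two Gaussians, and uses the standard Fisher Information matrix $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$ for these coordinates. The function names are hypothetical.

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL divergence of N(mu1, s1^2) with respect to N(mu2, s2^2)."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * s2 ** 2) - 0.5

def fisher_quadratic(sigma, d_mu, d_sigma):
    """0.5 * sum_ij g_ij dtheta^i dtheta^j with g = diag(1/sigma^2, 2/sigma^2)."""
    return 0.5 * (d_mu ** 2 / sigma ** 2 + 2.0 * d_sigma ** 2 / sigma ** 2)

mu, sigma = 0.0, 1.5   # illustrative point on the PDF manifold

# Shrinking the step in parameter space: the ratio should approach 1.
for step in (1e-1, 1e-2, 1e-3):
    exact = kl_gauss(mu, sigma, mu + step, sigma + step)
    approx = fisher_quadratic(sigma, step, step)
    print(step, exact, approx, exact / approx)
```

The steps are taken in the parameter space $(\mu, \sigma)$, not in the state space $x$, which is the distinction the text emphasizes.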
The Gaussian PDFs are among the most widely used PDFs, because of their mathematical simplicity. They also offer simple information measures, based on the covariance matrix $P$. Equation (4) is repeated here for convenience:

$$
p(x) = \frac{1}{\sqrt{(2\pi)^n\, |P|}} \exp\Big( -\tfrac12 (x - \mu)^T P^{-1} (x - \mu) \Big).
$$

The argument of the exponential function is a real number. Hence,

$$
(x - \mu)^T P^{-1} (x - \mu)
\tag{24}
$$

can be considered as the (squared) magnitude of the state space “vector” $x - \mu$, and so, $P$ (or, rather, its inverse) is a metric on the state space.
This covariance matrix $P$ is a function of the variables on the PDF parameter space, i.e., mean $\mu$ and variance $\sigma^2$.
Fact 7 (Generalized least-squares) Equation (24) generalizes the well-known least-squares criterion of measuring the deviation between a state space vector $x$ and another state space vector $\mu$.
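To make the link with least-squares concrete, the following sketch evaluates a covariance-weighted quadratic form $(x - \mu)^T P^{-1} (x - \mu)$, which is assumed here to be the expression meant by Eq. (24). It is restricted to two-dimensional state vectors so that the matrix inverse can be written out by hand; the function name `quadratic_form` and all numerical values are illustrative.

```python
def quadratic_form(x, mu, cov):
    """(x - mu)^T cov^{-1} (x - mu) for a 2 x 2 covariance matrix cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Explicit inverse of a 2 x 2 matrix.
    inv = [[d / det, -b / det],
           [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    return sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))

mu = [0.0, 0.0]

# With the identity as covariance, the ordinary squared distance is recovered.
print(quadratic_form([3.0, 4.0], mu, [[1.0, 0.0], [0.0, 1.0]]))

# A large variance along the first axis makes deviations along it count less.
print(quadratic_form([2.0, 0.0], mu, [[4.0, 0.0], [0.0, 1.0]]))
print(quadratic_form([0.0, 2.0], mu, [[4.0, 0.0], [0.0, 1.0]]))
```

The same deviation of two units is weighted differently depending on the direction, which is what distinguishes the generalized criterion from the plain sum of squares.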
The covariance matrix itself is a matrix, so in general it contains more than one single scalar value. Therefore, the following scalar measures are often derived from it:
Fact 8 (Arbitrariness of scalar Gaussian measures) None of the above-mentioned measures derived from the covariance matrix of a Gaussian PDF has the status of a natural or absolute information measure.
Fact 9 (Fisher Information of a Gaussian PDF) It can be proven that the Fisher Information of a Gaussian distribution, taken with respect to its mean, gives the inverse of the distribution’s covariance matrix.
Every piece of software, hence also every set of software agents that together implement an “intelligent” knowledge system, must start from a certain initial state. In the context of plausible inference, one is often tempted to describe this initial state as the state in which the knowledge system knows “nothing” yet. This raises the two questions as to (i) what such a total ignorance really means, and (ii) how to represent it. Since many researchers didn’t find satisfactory answers to these two questions, they jumped to the conclusion that the Bayesian paradigm is not a valid framework for reasoning under uncertainty. However, the fact that they didn’t find satisfactory answers says much about their own state of ignorance, since ignorance can be dealt with in a very clean and formal way. (But, it is true, not always in a simple way.) The French scientist Pierre Simon de Laplace (1749–1827) was the first to propose the uniform distribution of a parameter as the state of ignorance about its exact value. This approach of assigning an a priori distribution this way later got the name of Laplace’s principle of indifference. Harold Jeffreys [31] generalized this, and presented the invariant volume form (Sect. 2.3) as the non-informative prior distribution. However, Jeffreys’ suggestion was not accepted by the then active community of statisticians, so his ideas were not widespread. A similar fate befell Edwin Thompson Jaynes (1922–1998), in the 50s, 60s and 70s, although he did add mathematical rigour to the somewhat intuitive ideas of Jeffreys, [26, 46].
Let’s first discuss the question about what total ignorance means. In fact, it doesn’t mean much: one always knows something about the system one is interested in; or, at least, one could come up with some models, even though one would have no idea about the values of the parameters in it. Or, in the words of Jaynes: “merely knowing the physical meaning of our parameters [in a model], already constitutes highly relevant prior information which our intuition is able to use at once” (emphasis is Jaynes’s).
The question about which “ignorance priors” (or “noninformative priors” as they are often also called) to choose has still not been answered completely satisfactorily: Jeffreys’ non-informative prior distribution works only for location parameters, such as the mean value of a parameter. For other properties, such as the standard deviation, other ignorance prior distributions are needed. Jaynes’s approaches to find ignorance priors are [27, 28, 29, 30]:
does
not necessarily lead to a uniform distribution on
.
(Side note: well-known statistical properties such as unbiasedness or the mean square error are not invariant with respect to coordinate transformations.)
A motivation for choosing maximum entropy as a selection criterion to define prior knowledge (“priors”) follows directly from the interpretation of entropy as a measure of information: the prior should introduce as little “new” information as possible, taking into account the constraints that are imposed on the prior (by the physics of the problem, or, more often, by the arbitrary choice of the human in the selection of the family of allowed prior PDFs).
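As a hedged illustration of the maximum entropy idea, consider a discrete parameter with six possible values and a single constraint, a prescribed mean value. It is a standard result that the maximum entropy distribution under a mean constraint has the exponential form $p_i \propto \exp(\lambda x_i)$; the sketch below finds $\lambda$ by bisection. The six values, the target means, and the function names are all assumptions made for the sake of the example.

```python
import math

def entropy(probs):
    """Entropy -sum p_i ln(p_i), with unit scale constant."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def maxent_distribution(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maximum entropy distribution over `values` with a prescribed mean.

    Uses the exponential form p_i ~ exp(lam * x_i) and bisection on lam,
    relying on the fact that the mean increases monotonically with lam.
    """
    def dist(lam):
        shift = max(lam * v for v in values)          # avoids overflow in exp
        weights = [math.exp(lam * v - shift) for v in values]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(lam):
        return sum(v * p for v, p in zip(values, dist(lam)))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist(0.5 * (lo + hi))

values = [1, 2, 3, 4, 5, 6]

# Without a bias in the mean, the uniform distribution is recovered.
print(maxent_distribution(values, 3.5))

# A constraint "mean = 4.5" shifts probability towards the larger values.
prior = maxent_distribution(values, 4.5)
print(prior, entropy(prior))

# Another distribution with the same mean carries more "new" information.
other = [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]
print(entropy(other))
```

Any other distribution that satisfies the same constraint has a lower entropy, i.e., it commits to more than the constraint alone justifies.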
There is still discussion about the appropriateness of these approaches; much of this controversy is caused by the fact that most researchers still do not understand the importance of structure and invariance for a mathematical framework that wants to be consistent.
Sections 3.1 and explained how the ratios of the logarithms of two probability
density functions are invariant measures for information. Revisiting Bayes’ rule in
this context shows that it is a procedure to combine information from two sources
without loss of information: the first source is the prior information already
contained in the current state, and the second source is the new information
added by the current measurement. This relationship is straightforward:
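take Bayes’ rule for two models $M_1$ and $M_2$ that receive the same new data $D$:

$$
p(M_1 \mid D) = \frac{p(D \mid M_1)\, p(M_1)}{p(D)}, \qquad
p(M_2 \mid D) = \frac{p(D \mid M_2)\, p(M_2)}{p(D)}.
$$

Taking the logarithms of the ratio of both relationships yields

$$
\ln \frac{p(M_1 \mid D)}{p(M_2 \mid D)} = \ln \frac{p(D \mid M_1)}{p(D \mid M_2)} + \ln \frac{p(M_1)}{p(M_2)}.
\tag{25}
$$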
The left-hand side is the information measure after the measurement; the right-hand side represents the information measures of the contributions of both sources. Hence:
Fact 10 (Bayes’ rule as optimal information processor) Bayes’ rule has equal information “before” and “after” it has been applied. Hence, it is optimal in the sense that it neither adds nor deletes information, [56].
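The additive bookkeeping of information in Bayes’ rule can be verified with a two-model example. This is a minimal sketch with made-up numbers: the prior probabilities and likelihoods below are purely illustrative, and the function name `posterior` is hypothetical.

```python
import math

def posterior(prior, likelihood):
    """Bayes' rule over a finite set of models: p(M | D) = p(D | M) p(M) / p(D)."""
    joint = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(joint)                 # p(D), the normalizing constant
    return [j / evidence for j in joint]

prior = [0.7, 0.3]        # p(M1), p(M2): illustrative values
likelihood = [0.2, 0.6]   # p(D | M1), p(D | M2): illustrative values

post = posterior(prior, likelihood)

# Log-ratio after the measurement ...
lhs = math.log(post[0] / post[1])
# ... equals the log-ratio contributed by the data plus that of the prior.
rhs = math.log(likelihood[0] / likelihood[1]) + math.log(prior[0] / prior[1])

print(lhs, rhs)
```

The normalizing constant $p(D)$ cancels in the ratio, so the posterior log-ratio is exactly the sum of the two contributions: nothing is added and nothing is lost.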