3 Modelling: measures of information

Any uncertainty reasoning system should be able to give quantitative answers to the following type of questions from its users:

These questions all involve measuring information, i.e., the system must have computational procedures to reduce the available information (stored in the form of possibly high-dimensional PDFs) to one single scalar number: the measure of the information.

Fact 2 (Lossyness of information measures) Calculating the information content in a PDF is an irreversible operation: it looses information in the transformation from the high-dimensional PDF, representing all information about the system, to the scalar information measure.

Consequently, it is impossible to reconstruct the original PDF from knowing only the scalar that quantifies its overall information. Of course, an infinite number of ways exist to reduce a PDF to one single number. But history has provided us with a (small) number of choices that have interesting and “natural” properties. Entropy is one of the most “pure” measures, because, as this Section will illustrate (without proofs), it can be derived from a minimal and very plausible set of desirable information measure properties.

3.1 Good’s information function

Let’s formally denote by I(M :E ∣C)  the information about model M  that can be derived from the evidence (or, information) E  , and given the “background” (“context”) information C  . Good [20] proposed the following structural properties for I(M :E ∣C)  :

  1. I(M :E AND  F ∣C) = f{I(M :E∣C),I(M  :F ∣E AND  C)} .

    The information about M  derived from both E  and F  is a function of the information about M  derived from E  only, and the extra information given by F  , interpreted in the context that E  is already taken for granted.

    f  simply represents a functional relationship between the different terms, without saying anything about the exact form of this relationship.

  2. I(M :E AND  M ∣C) = I(M ∣C)  .

    If one knows already everything about M  , other information cannot add anything anymore.

  3. I(M :E ∣C)  is a strictly increasing function of its arguments.

    If the information content of one of the parameters of I  increases, the information I  increases too.

  4. I(M1 AND  M2:M1 ∣C) = I(M1:M1 ∣C)  if M1  and M2  are mutually irrelevant pieces of information.
  5. I(M1 AND  M2 ∣M1  AND  C) = I(M2 ∣C)  .

    Considering the information contained in M1  doesn’t increase the total information if this information was already incorporated.

The details of the Good’s derivation are skipped in this text, but Good found in a rather straightforward way that these specifications lead to many alternatives for the representation of information. And he also proved that, if information I(M ∣C)  is represented by a measurable function1  p(M ∣C)  , then composition of information becomes additive if and only if I  is any function of log(p)  , or ln(p)  . The simplest choice being, of course, I = ln(p)  , which is the rationale behind the abundance of logarithms in statistics, for example in the information measures discussed in the following Subsections. Indeed, addition is the natural operator on the space of these logarithms, and is easy to work with.

3.2 Shannon entropy

This Section explains why Shannon introduced one particular logarithm-based function as a very attractive measure of information. Claude Elwood Shannon (1916–2001), [84849], presented a scalar “average” (or “expected”) measure to quantify the quality of communication channels, i.e., their capacity to transmit information. However, his measure also qualifies as a fully general information measure, as was first explained by Jaynes [26]. Shannon gave his measure the name of entropy, because it models the similar concept with the same name in thermodynamics: the higher the entropy of a thermodynamic system, the higher our uncertainty about the state of the system, or, in other words, the higher its “disorder.” Note that entropy is a “subjective” feature of the system: it represents the knowledge (or uncertainty) that the observer has of the system, but it is not a physical property of the system itself. However, it is “objective” in the sense that each observer comes to the same conclusion when given the same information.

Shannon’s reasoning went as follows. Assume the parameter x  takes one of the values from a set {x1,...,xn} , with p(x) = {p1,...,pn} the corresponding probability distribution. The following three properties are straightforward, plausible desiderata for any information measure H(p)  of the probability distribution p(x)  :

Axioms for entropy
I

H  is a continuous function of p  .

II

If all n  probabilities pi  are equal (and hence equal to 1∕n  , if we choose them to have to sum to 1  ), the entropy H(1∕n,...,1∕n)  is a monotonically increasing function of n  .

III

H  is an invariant, i.e., the uncertainty should not depend on how one orders or groups the elements xi  .

The first and second specifications model our intuition that (i) small changes in probability imply only small changes in entropy, and (ii) our uncertainty about the exact value of a parameter increases when it is a member of a larger group.

The third desideratum is represented mathematiclly as follows: the information measure H  obeys the following additive composition law:

H(p1,,pn) = H(w1,w2,) (9)
+ w1H(p1w1,,pkw1)
+ w2H(pk+1w2,,pk+mw2) + ,
where w1  is the probability of the set {x1,...,xk} , w2  is the probability of the set {xk+1,...,xk+m} , and so on, Figure 5; pi∣wj  is the probability of the alternative xi  if one knows that the parameter x  comes from the set that has probability wj  .


PIC

Figure 5: Grouping in a discrete probability distribution pi  , in three sub-groups. The dashed lines represent the discrete probability density function wj  of the three groups.


For example, assume that x  comes from a set of three members, with the alternatives occurring with probabilities 1∕2,1∕3  and 1∕6  , respectively. If one then groups the second and third alternatives together (i.e., w1 = p1 = 1∕2  , the probability of the set {x1} , and w2 = p2 +p3 = 1∕3+ 1∕6 = 1∕2  , the probability of the set {x2,x3} ), the composition law gives H(1∕2,1∕3,1∕6) = H(1∕2,1∕2)+ 1H(2 ∕3,1∕3)
                            2  since 2∕3  and 1∕3  are the probabilities of x2  and x3  within the set {x2,x3} .

The three above-mentioned axioms suffice to derive an analytical expression for the information measure function H(p)  . The first axiom implies that it is sufficient to determine H(p)  for rational values pi = ni∕∑n  nj
         j=1  (with nj  integer numbers) only; the reason is that the rational numbers are a dense subset of the real numbers. One then uses the composition law to find that H(p)  can be found from the uniform probability distribution P = (1∕N,...,1∕N )  over N = ∑n   ni
      i=1  alternatives. Indeed, the composition law says that the entropy H(p)  is equal to the entropy H(P )  , because in P  one can group the first n1  alternatives, the following n2  alternatives, and so on, which reduces to the original distribution. For example, let n = 3  and (n1,n2,n3) = (3,4,2)  such that N  = 3+ 4+ 2 = 9  ; denoting H(1∕N, ...,1∕N)  by H(N )  yields

  (      )
   3  4 2    3       4       2
H  9 ,9,9  + 9H(3) + 9H(4) + 9H(2) = H(9).

In general, this could be written as

             ∑              (∑    )
H(p1,...,pn)+    piH(ni) = H     ni .
(10)

The special case of all ni  equal to the same integer m  gives

H(n) + H(m) = H(mn).

A solution to this equation is given by H(n) = K ln(n)  , with K > 0  because of the monotonicity rule. All this yields the following expression for the entropy:

H(p1,,pn) = K ln(∑    )
    ni- K pi ln(ni), (11)
= -K pi ln(     )
  ∑ni--
    ni. (12)
= -K pi ln(pi). (13)

Fact 3 (Entropy of a discrete probability distribution)

|----------------------------|
|H(p ,...,p ) = - K ∑ p ln(p).|
----1-----n-----------i----i-
(14)

Note that ln(pi) < 0  , because     ∑
(ni∕  ni) < 1  . The minus sign in the entropy (or information) makes the entropy measure positive, and increasing when uncertainty increases; this is the same interpretation as in statistical mechanics, the science that originally defined the concept of entropy. The constant K  has no influence: it is nothing but a factor that sets the scale of the entropy. The entropy need not be a monotonically decreasing function of the amount of information received: entropy can increase with new information (“evidence”) coming in, if this new information contradicts the previous assumptions. Note also that the uncertainty in many dynamic systems increases naturally over the time period that no new information is received from the system: the probability distributions “flatten out” and hence the entropy increases.

Figure 6 gives examples of entropy functions for various simple PDFs. It shows that entropy corresponds, to some extent, to our intuition about information: a PDF with a sharp peak has more information than one with a broad peak. A PDF with much variation has a lower information than one without much variation. A PDF with many alternating peaks and valleys has the same information as one which as all peaks and valleys assembled together.

Note also that the information measure can never reach “absolute zero,” because this would require        ∑
piln(ni∕  ni)  to vanish, which can only occur for trivial PDfs with one single element.

Fact 4 (Comparison of information in PDFs) The number of samples as well as the (arbitrary) scaling constant K  make comparisons of the absolute values of the entropies of two PDFs quite useless.


PIC

Figure 6: Examples of entropies H  of discrete PDFs.


3.3 Entropy for continuous PDF

At first sight, extending Eq. (14) from discrete to continuous PDFs seems straigthforward:

          ∫
H(x) = - K  p(x)ln(p(x))dx.
(15)

Example. The entropy of the multivariate, n  -dimensional Gaussian distribution is, [8]:

       1                n
H(p) = 2 log det(cov(x)) + 2 log(2πe),
(16)

with cov(x)  the covariance matrix of the Gaussian PDF. For a one-dimensional Gaussian distribution, this becomes:

           √ ---
H(p) = log(σ  2πe),
(17)

with σ  the standard deviation, i.e., the square root of the one-dimensional covariance. The extension in Eq. (15) is well-defined for a number of distributions, but not for all of them. This can be seen from the reasoning that led to that formula, because there the value H(1 ∕N,...,1∕N )  of the entropy of a uniform distribution is used. And this uniform distribution is not well defined for all continuous PDFs running over all real values. A second look at Eq. (12) suggests another interpretation:

  1. the fraction    ∑
ni∕   ni  can be taken in each interval of the PDF parameter(s).
  2. the ratio of two of these ratios makes sense locally, in terms of the PDFs’ density measures, Sect 2.3:
     ni∕∑ ni    dx
mj-∕∑-mj-→  dy,
    (18)

    where dx  is the density of the continuous PDF that we obtain from “taking the limit” of the discrete PDF ni  , and simularly for dY  and mj  .

Hence, only relative information measures are possible. This is consistent with the fact that there is no absolute zero information, to which the “distance” of a PDF p(x)dx  could be taken.

Fact 5 (Mutual information—Kullback-Leibler divergence) The relative information measure (often called ” mutual information”) H(p,q)  of two continuous PDFs p(x)  and q(x)  is defined as:

|----------∫-------(----)----|
|H(p,q) = -  p(x)ln  p(x)  dx.|
|                   q(x)     |
------------------------------
(19)

This scalar is also called the Kullback-Leibler divergence, (after the duo that first presented it, [35], [36]), or also mutual entropy, or cross entropy, of both probability measures p(x)  and m(x)  , [3545].

It is a (coordinate-independent) measure for how much information one needs to add to the probability distribution m(x)  in order to obtain the probability distribution p(x)  . As was the case with Shannon’s entropy, also  (         )
H p(x):m(x) is a global measure, since all information contributions log(p(x)∕m(x))  are weighted by p(x)dx  , and then added together.

3.4 Fisher Information

The Kullback-Leibler divergence of Eq. (19) is a good measure of information (it is positive, and it can be proven to obey, in some cases, the triangle inequality, [1]), but it is not a distance function (or “metric”) on the space of all PDFs, since it is not symmetric in its arguments:

  (        )    (         )
H  p(x),m(x)  ⁄= H m(x),p(x) .
(20)

Rao [44] was the first to come up with a real distance function on the manifold ℳ
  Σ  of probability distributions p(x,Σ)  over the state space x  and described by a parameter vector Σ = {σ ,...,σ }
      1     n .

Fact 6 (PDF manifold) ℳ Σ   is a parameterized space of PDFs, which is not the same space as the state space of the system on which the PDF is defined.

The PDF manifold is smooth (i.e., infinite derivatives are possible), so one can define tangent vectors      1     n
v = (v ,...,v )  to the manifold ℳ Σ  as follows:

         ∑n
v(x,Σ) =    vi∂l(x,Σ),    with  l(x,Σ) = log(p(x,Σ)).
         i=1    ∂σi
(21)

The vi  are the coordinates of the tangent vector in the basis formed by the tangent vectors of the logarithms of the σ  -coordinates. A metric ℳ at the point p  (which is a probability distribution) is a bilinear mapping that gives a real number when applied to two (logarithmic) tangent vectors v  and w  attached to p  . Rao showed that the covariance of both vectors satisfies all properties of a metric. Hence, the elements gij  of the matrix representing the metric are found from the covariance of the coordinate tangent vectors:

       ∫ (       )(       )
gij(Σ) =   ∂il(x,Σ)  ∂jl(x,Σ) p(x,Σ)dx.
(22)

The matrix gij  got the name Fisher information matrix. The covariance “integrates out” the dependency on the state space coordinates x  , hence the metric is only a function of the statistical coordinates Σ  . This metric is defined on the tangent space to the manifold ℳ  Σ  of the σ  -parameterized family of probability distributions over the x  -parameterized state space 𝒳 . Kullback and Leibler already proved the following relationship between the relative entropy of two “infinitely separated” probability distributions Σ  and Σ + εv  on the one hand, and the Fisher Information matrix gij(Σ)  on the other hand:

             1 ∑
H(Σ:Σ + εv) = 2    gij(Σ)vivj +𝒪( ε2).
               i,j
(23)

Hence, Fisher Information represents the local behaviour of the relative entropy: it indicates the rate of change in information in a given direction of the probability manifold (not in a given direction of the state space!). In other words, it measures the sensitivity of the variance to small changes in the mean.

3.5 Measures for Gaussian PDFs

The Gaussian PDFs are among the most widely used PDFs, because of their mathematical simplicity. They also offer simple information measures, based on the covariance matrix P(μ,σ)  . Equation (4) is repeated here for convenience:

                           {                     }
          ------1-------      1       T  -1
𝒩 (μ,P) = ∘ (2π)n ∣∣P ∣∣1∕2 exp - 2(x - μ) P   (x - μ)  .

The argument of the exponential function is a real number. Hence,

1(x - μ)TP -1(x- μ)
2
(24)

can be considered as the magnitude of the state space “vector” x , and so, P (or, rather, its inverse) is a metric on the state space.

This covariance matrix P  is a function of the variables on the PDF parameter space, i.e., mean μ  and variance σ  .

Fact 7 (Generalized least-squares) Equation (24) generalizes the well-known least-squares criterion of measuring the deviation between a state space vector x  and another state space vector μ  .

The covariance matrix itself is a matrix, so in general it contains more than one single scalar value. Therefore, the following scalar measures are often derived from it:

Fact 8 (Arbitrariness of scalar Gaussian measures) None of the above-mentioned measures derived from the covariance matrix of a Gaussian PDF has the status of a natural or absolute information measure.

Fact 9 (Fisher Information of a Gaussian PDF) It can be proven that the Fisher Information of a Gaussian distribution gives the distribution’s covariance matrix.

3.6 No information—Ignorance

Every piece of software, hence also every set of software agents that together implement an “intelligent” knowledge system, must start from a certain initial state. In the context of plausible inference, one is often tempted to describe this initial state as the state in which the knowledge system knows “nothing” yet. This raises the two questions as to (i) what such a total ignorance really means, and (ii) how to represent it. Since many researchers didn’t find satisfactory answers to these two questions, they jumped to the conclusion that the Bayesian paradigm is not a valid framework for reasoning under uncertainty. However, the fact that they didn’t find satisfactory answers says much about their own state of ignorance, since ignorance can be dealt with in a very clean and formal way. (But, it is true, not always in a simple way.) The French scientist Pierre Simon de Laplace (1749–1827) was the first to propose the uniform distribution of a parameter as the state of ignorance about its exact value. This approach of assigning an a priori distribution this way later got the name of Laplace’s principle of indifference. Harold Jeffreys [31] generalized this, and presented the invariant volume form (Sect. 2.3) as the non-informative prior distribution. However, Jeffreys’ suggestion was not accepted by the then active community of statisticians, so his ideas were not widespread. A similar fate befell Edwin Thompson Jaynes (1922–1998), in the 50s, 60s and 70s, although he did add mathematical rigour to the somewhat intuitive ideas of Jeffreys, [2646].

Let’s first discuss the question about what total ignorance means. In fact, it doesn’t mean much: one always knows something about the system one is interested in; or, at least, one could come up with some models, even though one would have no idea about the values of the parameters in it. Or, in the words of Jaynes: “merely knowing the physical meaning of our parameters [in a model], already constitutes highly relevant prior information which our intuition is able to use at once” (emphasis is Jaynes’s).

The question about which “ignorance priors” (or “noninformative priors” as they are often also called) to choose has still not been answered completely satisfactorily: Jeffreys’ non-informative prior distribution works only for location parameters, such as the mean value of a parameter. For other properties, such as e.g. the standard deviation, other ignorance prior distributions are needed. Jaynes’s approaches to find ignorance priors are [27282930]:

  1. Invariance under transformations. If the only thing one knows about the system is a model or an hypothesis, this ignorance should not change if the mathematical representation of the model is transformed into an equivalent representation. For example, a uniform distribution on x  does not necessarily lead to a uniform distribution on y = x2  .

    (Side note: well-known statistical properties such as unbiasedness or the mean square error are not invariant with respect to coordinate transformations.)

  2. Maximum Entropy (”MaxEnt”) principle. If all one knows about a system is a number of constraints it has to obey, the prior distribution of the parameters describing this system is given my maximizing the entropy of the distribution taking into account the constraints. For example, if one knows the mean and variance of a parameter, the probability distribution that adds no extra a priori information (and hence has the largest entropy) turns out to be the Gaussian with given mean and variance, e.g., [5303738].

    A motivation for choosing maximum entropy as a selection criteria to define prior knowledge (“priors”) follows directly from the interpretation of extropy as a measure of information: the prior should introduce as little ”new” information as possible, taking into account the constraints that are imposed in the prior (by the physics of the problem, or, more often, by the arbitrary choice of the human in the selection of the family of allowed prior PDFs).

  3. Extra ”I don’t know” hypothesis. If the knowledge system has a set of hypotheses for the system under consideration, and it doesn’t know at all which one to prefer, or even whether one of these hypotheses is valid, it can add a new hypothesis that just says “I don’t know what hypothesis is valid.” Total ignorance is then represented by giving all the probability to this last hypothesis.

There still exists discussion about the appropriateness of these approaches; much of this controversy is caused by the fact that most researchers do still not understand the importance of structure and invariance for a mathematical framework that wants to be consistent.

3.7 Optimality of information processing

Sections 3.1 and explained how the ratios of the logarithms of two probability density functions are invariant measures for information. Revisiting Bayes’ rule in this context shows that it is a procedure to combine information from two sources without loss of information: the first source is the prior information already contained in the current state, and the second source is the new information added by the current measurement. This relationship is straightforward: take Bayes’ rule for two models M1  and M2  that receive the same new Data  :

             p(Data∣M1)
p(M1 ∣Data) = --p(Data)---p(M1),

p(M2 ∣Data) = p(Data∣M2)--p(M2).
               p(Data)

Taking the logarithms of the ratio of both relationships yields

   p(M1 ∣Data)      p(Data∣M1)      p(M1)
log p(M2-∣Data)-= logp(Data∣M2)-+ log p(M2).
(25)

The left-hand side is the information measure after the measurement; the right-hand side represents the information measures of the contributions of both sources. Hence:

Fact 10 (Bayes’ rule as optimal information processor) Bayes’ rule has equal information ”before” and ”after” it has been applied. Hence, it is optimal in the sense that it doesn’t add nor delete information, [56].