
Quantifying Knowledge
Fouad Chedid
Department of Computer Science, Notre Dame University, Lebanon


Page 1

Information Content Versus Knowledge

• While the information content of a string x can be measured by its Kolmogorov complexity K(x), it is not clear how to measure the knowledge stored in x.

• We argue that the knowledge contained in a string x is relative to the hypothesis assumed to compute x.

Page 2

• If H is the hypothesis used to explain x, then we suggest measuring the knowledge in x by K(H).

• The absolute knowledge in x is K(H_0), where H_0 is a simplest hypothesis for x.

Page 3

•Using Bayes’ rule and Solomonoff’s universal distribution, we obtain

•K(x) = K(H) + K(x|H) – K(H|x).

•Here one would expect H to be consistent with x and so K(H|x) to be minimal.

•Discarding K(H|x) gives

•K(x) = K(H) + K(x|H).
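As a sketch of where this identity comes from (assuming the coding-theorem correspondence P(·) ≈ 2^(−K(·)) between Solomonoff's universal distribution and Kolmogorov complexity, with all equalities holding up to additive constants):

\begin{align*}
P(H \mid x) &= \frac{P(H)\,P(x \mid H)}{P(x)}
  && \text{(Bayes' rule)} \\
2^{-K(H \mid x)} &\approx \frac{2^{-K(H)}\,2^{-K(x \mid H)}}{2^{-K(x)}}
  && \text{(substitute } P(\cdot) \approx 2^{-K(\cdot)}\text{)} \\
K(x) &= K(H) + K(x \mid H) - K(H \mid x)
  && \text{(take } -\log_2 \text{ and rearrange)}
\end{align*}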

Page 4

•We interpret K(H) as a measure of the knowledge part in x relative to H.

•K(x|H) is a measure of the accidental information (noise) in x relative to H.

Page 5

A Simple Example

• Suppose we record our observations of an ongoing phenomenon and stop gathering data at time t_1, after having obtained the segment

x = 10101010101010

• The information in x (= K(x)) is about the number of bits in a shortest program for x, something like:

for i in range(7):
    print("10", end="")   # emits x = 10101010101010

Page 6

•This program assumes the hypothesis

H = “x contains the repeating element 10”.

It is this H that we call the knowledge in x. The amount K(x|H), which is about log 7 (the bits needed to specify the repetition count given H), measures the amount of noise in x under H.
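A minimal Python sketch of this accounting (the variable names are ours; log 7 is just the cost of specifying the repetition count once H is assumed):

import math

x = "10" * 7                    # the observed segment 10101010101010
count = len(x) // len("10")     # given H, only the repetition count remains
noise_bits = math.log2(count)   # K(x|H) is about log 7 ~ 2.8 bits
print(f"repetitions: {count}, noise under H: about {noise_bits:.1f} bits")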

• Other hypotheses exist that trade off the amount of knowledge against the level of noise that can be tolerated; the right trade-off is application-dependent.

Page 7

• This work is similar in spirit to Kolmogorov's 1974 proposal to found statistical theory on finite combinatorial and computational principles, independent of probabilistic assumptions, by expressing the relation between the individual data and its explanation (model or hypothesis) through Kolmogorov's structure function.

Page 8

Kolmogorov's Approach to Non-probabilistic Statistics

• Kolmogorov expressed the relation between an individual data sample and a specific constrained data model through his structure function Φ_x(·).

• Let the data be finite binary strings and the models be finite sets of binary strings. Consider model classes consisting of models of a given maximal Kolmogorov complexity.

Page 9

Kolmogorov's structure function Φ_x(k) of the given data x expresses the relation between the complexity constraint k on a model S and the least log-cardinality of a model containing the data:

Φ_x(k) = min_S { log |S| : S ∋ x, K(S) ≤ k }.
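A toy Python sketch of this definition. The model class and the description-length proxy for K(S) are illustrative inventions (the true K is uncomputable); the point is only that Φ_x(k) drops as the complexity budget k grows:

import math
from itertools import product

x = "10101010"
n = len(x)

# Candidate models: (description, membership test). We charge
# K(S) ~ len(description) as a crude, hypothetical complexity proxy.
models = [
    ("any",          lambda s: True),                       # all n-bit strings
    ("(10)*",        lambda s: s == "10" * (len(s) // 2)),  # alternating strings
    ("balanced",     lambda s: s.count("0") == s.count("1")),
    ("literal " + x, lambda s: s == x),                     # the singleton {x}
]

def phi(k: int) -> float:
    """Least log2|S| over models S with x in S and K(S) <= k."""
    best = math.inf
    for desc, member in models:
        if len(desc) > k or not member(x):
            continue
        size = sum(member("".join(bits)) for bits in product("01", repeat=n))
        best = min(best, math.log2(size))
    return best

for k in (3, 5, 12):
    print(f"phi_x({k}) = {phi(k):.2f}")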

Page 10

Kolmogorov explains …

To each constructive object x corresponds a function Φ_x(k) of a natural number k: the log of minimal cardinality of x-containing sets that allow definitions of complexity at most k. If the element x itself allows a simple definition, then the function Φ drops to 1 even for small k. Lacking such a definition, the element is random in the negative sense. But it is positively probabilistically random only when the function Φ, having taken the value Φ_0 at a relatively small k = k_0, then changes approximately as

Φ_x(k) = Φ_0 − (k − k_0).

Page 11

• This function Φ_x(k), its variants, and its relation to model selection have been the subject of numerous publications, but in my opinion it has not previously been well understood.

• We view Kolmogorov's structure function as a rewrite of Bayes' rule using Solomonoff's universal distribution, as explained earlier.

Page 12

Understanding Kolmogorov's Structure Function

• Φ_x(k), the log of minimal cardinality of x-containing sets that allow definitions of complexity at most k, is a particular case of K(x|H), where H is a finite set containing x and K(H) ≤ k.

• Thus we interpret Φ_x(k) as a measure of the amount of accidental information (noise) in x when x is bound to a model of Kolmogorov complexity at most k.

• If x is typical of a finite set S, then we expect K(x|S) to be about log |S|.
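This expectation is the standard index argument: given S, the string x can be specified by its index in a canonical enumeration of S, so

\[
K(x \mid S) \le \log |S| + O(1),
\]

and typicality of x in S means the bound is nearly tight: x has no regularity that would allow a shorter description beyond its membership in S.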

Page 13

• The terms Φ_0 and k_0 in Kolmogorov's structure function correspond to a hypothesis H_0 of small Kolmogorov complexity k_0 that explains nothing about x.

• In this case I(H_0 : x) = 0, which leads to K(x|H_0) = K(x) and K(H_0|x) = K(H_0) = k_0. So the approximation

K(x) = K(H) + K(x|H) − K(H|x),

or equivalently

K(x|H) = K(x) − (K(H) − K(H|x)),

degenerates to Kolmogorov's structure function

Φ_x(k) = Φ_0 − (k − k_0).
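A sketch of the substitution (assuming that for k ≥ k_0 there are hypotheses H with K(H) = k and K(H|x) ≈ 0, and writing Φ_0 = Φ_x(k_0) ≈ K(x) − k_0, so that the value at k_0 sits near the sufficiency line; everything holds up to small additive terms):

\begin{align*}
\Phi_x(k) \approx K(x \mid H) &= K(x) - K(H) + K(H \mid x) \\
&\approx K(x) - k \\
&= (K(x) - k_0) - (k - k_0) \\
&= \Phi_0 - (k - k_0).
\end{align*}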

Page 14

• In general, taking H to be a hypothesis for x that attains Φ_x(k) with K(H) ≤ k, we have K(x) = K(x|H) + K(H) − K(H|x) ≤ K(x|H) + K(H) ≤ Φ_x(k) + k. Thus Φ_x(k) ≥ K(x) − k.

• This explains why Kolmogorov drew a picture of Φ_x(k) as a function of k monotonically approaching the diagonal (the sufficiency line, where Φ_x(k) + k = K(x)).

• This diagonal corresponds to the minimum value of Φ_x(k), attained when there exists some H of Kolmogorov complexity at most k such that K(x) = k + Φ_x(k).

Page 15

• Such an H, with k = K(H), is called a sufficient statistic for x, and the expression k + Φ_x(k) is treated as a two-part code separating the meaningful information in x, represented by k, from the meaningless accidental information (noise) in x under the hypothesis H.

Page 16

A Simple Derivation of a Fundamental Result

• Vitányi's Best-Fit Function: the randomness deficiency δ(x|S) of a string x in the set S is defined by δ(x|S) = log |S| − K(x|S) for x ∈ S, and ∞ otherwise.

• The minimal randomness deficiency function is β_x(k) = min_S { δ(x|S) : S ∋ x, K(S) ≤ k }.

• A model S for which x incurs deficiency β_x(k) is a best-fit model. We say S is optimal for x, and then K(S|x) ≈ 0.

Page 17

•Rissanen’s Minimum Description Length Function: Consider the two-part code for x consisting of the constrained model cost K(S) and the length of the index of x in S. The MDL function is

λ_x(k) = min_S { K(S) + log |S| : S ∋ x, K(S) ≤ k }.
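Continuing the toy sketch from page 9 (same hypothetical x, n, and models, with K(S) ~ len(description) as the stand-in complexity), the MDL function simply adds the model cost to the log-cardinality:

import math
from itertools import product

# Reuses `x`, `n`, and `models` from the toy sketch on page 9.

def mdl(k: int) -> float:
    """Least K(S) + log2|S| over models S with x in S and K(S) <= k."""
    best = math.inf
    for desc, member in models:
        if len(desc) > k or not member(x):
            continue
        size = sum(member("".join(bits)) for bits in product("01", repeat=n))
        best = min(best, len(desc) + math.log2(size))
    return best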

Page 18

• The results in [Vereshchagin and Vitányi 2004] are obtained by analyzing the relation between the three structure functions Φ_x(k), β_x(k), and λ_x(k). The most fundamental result there is the equality

β_x(k) = Φ_x(k) + k − K(x) = λ_x(k) − K(x),

which holds within additive terms that are logarithmic in |x|.

• This result improves a previous result by Gács, Tromp, and Vitányi (2001), in which it was proven that

β_x(k) ≤ Φ_x(k) + k − K(x) + O(1),

and where the authors mentioned that it would be nice to have an inequality in the other direction as well.

Page 19

• We understand the structure functions Φ_x(k) and β_x(k) as being equivalent to K(x|S) and K(S|x), respectively, where K(S) ≤ k.

•Using the approximation

K(x) = K(S) + K(x|S) – K(S|x)

or equivalently

K(x|S) + K(S) = K(x) + K(S|x)

gives the equality

Φ_x(k) + k = K(x) + β_x(k) = λ_x(k).
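Spelled out, with Φ_x(k) standing in for K(x|S), β_x(k) for K(S|x), K(S) = k, and λ_x(k) = K(S) + log|S| ≈ k + Φ_x(k):

\begin{align*}
K(x \mid S) + K(S) &= K(x) + K(S \mid x) \\
\Phi_x(k) + k &= K(x) + \beta_x(k) = \lambda_x(k).
\end{align*}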

Page 20

• We mention that the approach used in the previous two references relies on a much more complicated argument, in which a shortest program for a string x is assumed to be divisible into two parts, the model part (K(S)) and the data-to-model part (K(x|S)); such a division is very difficult to carry out.

• Gács and Vitányi credited the Invariance Theorem for such a deep and useful fact. This view led to the equation

K(x) = min_T { K(T) + K(x|T) : T ∈ {T_0, T_1, …} },

which holds up to additive constants, where T_0, T_1, … is the standard enumeration of Turing machines.

Page 21

• The whole theory of algorithmic statistics is based on this interpretation of K(x) as the length of a shortest two-part code for x.

•We argue that the use of the Invariance Theorem to suggest a two-part code of an object is too artificial.

•We prefer to use the three-part code suggested by Bayes’ rule and think of the two-part code as an approximation of the three-part code in which the model is considered to be optimal.

Page 22

Thank You