
ARTICLE Communicated by Jean-Pierre Nadal

Predictability, Complexity, and Learning

William Bialek
NEC Research Institute, Princeton, NJ 08540, U.S.A.

Ilya Nemenman
NEC Research Institute, Princeton, NJ 08540, U.S.A., and Department of Physics, Princeton University, Princeton, NJ 08544, U.S.A.

Naftali Tishby
NEC Research Institute, Princeton, NJ 08540, U.S.A., and School of Computer Science and Engineering and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel

We define predictive information I_pred(T) as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times T: I_pred(T) can remain finite, grow logarithmically, or grow as a fractional power law. If the time series allows us to learn a model with a finite number of parameters, then I_pred(T) grows logarithmically with a coefficient that counts the dimensionality of the model space. In contrast, power-law growth is associated, for example, with the learning of infinite parameter (or nonparametric) models such as continuous functions with smoothness constraints. There are connections between the predictive information and measures of complexity that have been defined both in learning theory and the analysis of physical systems through statistical mechanics and dynamical systems theory. Furthermore, in the same way that entropy provides the unique measure of available information consistent with some simple and plausible conditions, we argue that the divergent part of I_pred(T) provides the unique measure for the complexity of dynamics underlying a time series. Finally, we discuss how these ideas may be useful in problems in physics, statistics, and biology.

1 Introduction

There is an obvious interest in having practical algorithms for predicting the future, and there is a correspondingly large literature on the problem of time-series extrapolation.¹ But prediction is both more and less than extrapolation.

¹ The classic papers are by Kolmogoroff (1939, 1941) and Wiener (1949), who essentially solved all the extrapolation problems that could be solved by linear methods.

Neural Computation 13, 2409-2463 (2001). © 2001 Massachusetts Institute of Technology


We might be able to predict, for example, the chance of rain in the coming week even if we cannot extrapolate the trajectory of temperature fluctuations. In the spirit of its thermodynamic origins, information theory (Shannon, 1948) characterizes the potentialities and limitations of all possible prediction algorithms, as well as unifying the analysis of extrapolation with the more general notion of predictability. Specifically, we can define a quantity, the predictive information, that measures how much our observations of the past can tell us about the future. The predictive information characterizes the world we are observing, and we shall see that this characterization is close to our intuition about the complexity of the underlying dynamics.

Prediction is one of the fundamental problems in neural computation. Much of what we admire in expert human performance is predictive in character: the point guard who passes the basketball to a place where his teammate will arrive in a split second, the chess master who knows how moves made now will influence the end game two hours hence, the investor who buys a stock in anticipation that it will grow in the year to come. More generally, we gather sensory information not for its own sake but in the hope that this information will guide our actions (including our verbal actions). But acting takes time, and sense data can guide us only to the extent that those data inform us about the state of the world at the time of our actions, so the only components of the incoming data that have a chance of being useful are those that are predictive. Put bluntly, nonpredictive information is useless to the organism, and it therefore makes sense to isolate the predictive information. It will turn out that most of the information we collect over a long period of time is nonpredictive, so that isolating the predictive information must go a long way toward separating out those features of the sensory world that are relevant for behavior.

One of the most important examples of prediction is the phenomenon of generalization in learning. Learning is formalized as finding a model that explains or describes a set of observations, but again this is useful only because we expect this model will continue to be valid. In the language of learning theory (see, for example, Vapnik, 1998), an animal can gain selective advantage not from its performance on the training data but only from its performance at generalization. Generalizing, and not "overfitting" the training data, is precisely the problem of isolating those features of the data that have predictive value (see also Bialek and Tishby, in preparation). Furthermore, we know that the success of generalization hinges on controlling the complexity of the models that we are willing to consider as possibilities.

Our understanding of predictability was changed by developments in dynamical systems, which showed that apparently random (chaotic) time series could arise from simple deterministic rules, and this led to vigorous exploration of nonlinear extrapolation algorithms (Abarbanel et al., 1993). For a review comparing different approaches, see the conference proceedings edited by Weigend and Gershenfeld (1994).


Finally, learning a model to describe a data set can be seen as an encoding of those data, as emphasized by Rissanen (1989), and the quality of this encoding can be measured using the ideas of information theory. Thus the exploration of learning problems should provide us with explicit links among the concepts of entropy, predictability, and complexity.

The notion of complexity arises not only in learning theory but also in several other contexts. Some physical systems exhibit more complex dynamics than others (turbulent versus laminar flows in fluids), and some systems evolve toward more complex states than others (spin glasses versus ferromagnets). The problem of characterizing complexity in physical systems has a substantial literature of its own (for an overview, see Bennett, 1990). In this context several authors have considered complexity measures based on entropy or mutual information, although, as far as we know, no clear connections have been drawn among the measures of complexity that arise in learning theory and those that arise in dynamical systems and statistical mechanics.

An essential difficulty in quantifying complexity is to distinguish complexity from randomness. A true random string cannot be compressed and hence requires a long description; it thus is complex in the sense defined by Kolmogorov (1965; Li & Vitányi, 1993; Vitányi & Li, 2000), yet the physical process that generates this string may have a very simple description. Both in statistical mechanics and in learning theory, our intuitive notions of complexity correspond to statements about the complexity of the underlying process and not directly to the description length or Kolmogorov complexity.

Our central result is that the predictive information provides a general measure of complexity, which includes as special cases the relevant concepts from learning theory and dynamical systems. While work on complexity in learning theory rests specifically on the idea that one is trying to infer a model from data, the predictive information is a property of the data (or, more precisely, of an ensemble of data) themselves, without reference to a specific class of underlying models. If the data are generated by a process in a known class but with unknown parameters, then we can calculate the predictive information explicitly and show that this information diverges logarithmically with the size of the data set we have observed; the coefficient of this divergence counts the number of parameters in the model or, more precisely, the effective dimension of the model class, and this provides a link to known results of Rissanen and others. We also can quantify the complexity of processes that fall outside the conventional finite-dimensional models, and we show that these more complex processes are characterized by a power-law, rather than a logarithmic, divergence of the predictive information.

By analogy with the analysis of critical phenomena in statistical physics, the separation of logarithmic from power-law divergences, together with the measurement of coefficients and exponents for these divergences, allows us to define "universality classes" for the complexity of data streams.


The power-law, or nonparametric, class of processes may be crucial in real-world learning tasks, where the effective number of parameters becomes so large that asymptotic results for finitely parameterizable models are inaccessible in practice. There is empirical evidence that simple physical systems can generate dynamics in this complexity class, and there are hints that language also may fall in this class.

Finally, we argue that the divergent components of the predictive information provide a unique measure of complexity that is consistent with certain simple requirements. This argument is in the spirit of Shannon's original derivation of entropy as the unique measure of available information. We believe that this uniqueness argument provides a conclusive answer to the question of how one should quantify the complexity of a process generating a time series.

With the evident cost of lengthening our discussion, we have tried to give a self-contained presentation that develops our point of view, uses simple examples to connect with known results, and then generalizes and goes beyond these results.² Even in cases where at least the qualitative form of our results is known from previous work, we believe that our point of view elucidates some issues that may have been less the focus of earlier studies. Last but not least, we explore the possibilities for connecting our theoretical discussion with the experimental characterization of learning and complexity in neural systems.

2 A Curious Observation

Before starting the systematic analysis of the problem, we want to motivate our discussion further by presenting results of some simple numerical experiments. Since most of the article draws examples from learning, here we consider examples from equilibrium statistical mechanics. Suppose that we have a one-dimensional chain of Ising spins with the Hamiltonian given by

H = -\sum_{i,j} J_{ij} s_i s_j,  (2.1)

where the matrix of interactions J_{ij} is not restricted to nearest neighbors; long-range interactions are also allowed. One may identify spins pointing upward with 1 and downward with 0, and then a spin chain is equivalent to some sequence of binary digits. This sequence consists of (overlapping) words of N digits each, W_k, k = 0, 1, ..., 2^N - 1. There are 2^N such words total, and they appear with very different frequencies n(W_k) in the spin chain (see Figure 1 for details).

² Some of the basic ideas presented here, together with some connections to earlier work, can be found in brief preliminary reports (Bialek, 1995; Bialek & Tishby, 1999). The central results of this work, however, were at best conjectures in these preliminary accounts.


Figure 1: Calculating the entropy of words of length 4 in a chain of 17 spins. For this chain n(W_0) = n(W_1) = n(W_3) = n(W_7) = n(W_12) = n(W_14) = 2, n(W_8) = n(W_9) = 1, and all other frequencies are zero; thus S(4) ≈ 2.95 bits.

If the number of spins is large, then counting these frequencies provides a good empirical estimate of P_N(W_k), the probability distribution of different words of length N. Then one can calculate the entropy S(N) of this probability distribution by the usual formula,

S(N) = -\sum_{k=0}^{2^N - 1} P_N(W_k) \log_2 P_N(W_k) \quad \mathrm{(bits)}.  (2.2)

Note that this is not the entropy of a finite chain with length N; instead, it is the entropy of words or strings with length N drawn from a much longer chain. Nonetheless, since entropy is an extensive property, S(N) is proportional asymptotically to N for any spin chain; that is, S(N) ≈ S_0 · N. The usual goal in statistical mechanics is to understand this "thermodynamic limit" N → ∞, and hence to calculate the entropy density S_0. Different sets of interactions J_{ij} result in different values of S_0, but the qualitative result S(N) ∝ N is true for all reasonable {J_{ij}}.
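To make equation 2.2 concrete, here is a minimal Python sketch (our own illustration, not part of the original article; the function name word_entropy is an arbitrary choice) that estimates S(N) by counting overlapping words of length N in a long binary sequence:

import numpy as np
from collections import Counter

def word_entropy(spins, N):
    # Empirical entropy S(N), in bits, of overlapping words of length N
    # (equation 2.2), from the word frequencies n(W_k) in `spins`,
    # a 1D sequence of 0s and 1s assumed to be much longer than N.
    counts = Counter(tuple(spins[i:i + N]) for i in range(len(spins) - N + 1))
    total = sum(counts.values())
    p = np.array(list(counts.values()), dtype=float) / total
    return -np.sum(p * np.log2(p))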

We investigated three different spin chains of 1 billion spins each. As usual in statistical mechanics, the probability of any configuration of spins {s_i} is given by the Boltzmann distribution,

P[\{s_i\}] \propto \exp(-H / k_B T),  (2.3)

where, to normalize the scale of the J_{ij}, we set k_B T = 1. For the first chain, only J_{i,i+1} = 1 was nonzero, and its value was the same for all i. The second chain was also generated using the nearest-neighbor interactions, but the value of the coupling was reset every 400,000 spins by taking a random number from a gaussian distribution with zero mean and unit variance. In the third case, we again reset the interactions at the same frequency, but now interactions were long-ranged; the variances of the coupling constants decreased with the distance between the spins as ⟨J²_{ij}⟩ = 1/(i - j)². We plot S(N) for all these cases in Figure 2.


Figure 2: Entropy as a function of the word length for spin chains with different interactions (fixed J; variable J, short-range interactions; variable J's, long-range decaying interactions). Notice that all lines start from S(N) = log_2 2 = 1, since at the values of the coupling we investigated, the correlation length is much smaller than the chain length (1 × 10^9 spins).

Of course, the asymptotically linear behavior is evident; the extensive entropy shows no qualitative distinction among the three cases we consider.

The situation changes drastically if we remove the asymptotic linear contribution and focus on the corrections to extensive behavior. Specifically, we write S(N) = S_0 · N + S_1(N) and plot only the sublinear component S_1(N) of the entropy. As we see in Figure 3, the three chains then exhibit qualitatively different features: for the first one, S_1 is constant; for the second one, it is logarithmic; and for the third one, it clearly shows a power-law behavior.
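Continuing the same hypothetical sketch, one rough way to isolate S_1(N) is to estimate the entropy density from the largest word lengths, where the increments S(N+1) - S(N) have nearly converged, and then subtract the linear part; the finite sample size limits how large N can be taken.

Ns = np.arange(1, 21)
S = np.array([word_entropy(spins, n) for n in Ns])
S0 = S[-1] - S[-2]       # crude estimate of the entropy density (bits/spin)
S1 = S - S0 * Ns         # subextensive component, to be compared with Figure 3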

What is the significance of these observations? Of course, the differences in the behavior of S_1(N) must be related to the ways we chose the J's for the simulations. In the first case, J is fixed, and if we see N spins and try to predict the state of the (N+1)st spin, all that really matters is the state of spin N; there is nothing to "learn" from observations on longer segments of the chain. For the second chain, J changes, and the statistics of the spin words are different in different parts of the sequence. By looking at these statistics, one can "learn" the coupling at the current position; this estimate improves the more spins (longer words) we observe. Finally, in the third case there are many coupling constants that can be learned. As N increases, one becomes sensitive to weaker correlations caused by interactions over larger and larger distances.


Figure 3: Subextensive part of the entropy as a function of the word length, for the same three chains as in Figure 2 (fixed J; variable J, short-range interactions; variable J's, long-range decaying interactions), together with fits of the forms S_1 = const_1, S_1 = const_1 + (1/2) log_2 N, and S_1 = const_1 + const_2 · N^{0.5}.

So, intuitively, the qualitatively different behaviors of S_1(N) in the three plotted cases are correlated with differences in the problem of learning the underlying dynamics of the spin chains from observations on samples of the spins themselves. Much of this article can be seen as expanding on and quantifying this intuitive observation.

3 Fundamentals

The problem of prediction comes in various forms, as noted in the introduction. Information theory allows us to treat the different notions of prediction on the same footing. The first step is to recognize that all predictions are probabilistic. Even if we can predict the temperature at noon tomorrow, we should provide error bars or confidence limits on our prediction. The next step is to remember that even before we look at the data, we know that certain futures are more likely than others, and we can summarize this knowledge by a prior probability distribution for the future. Our observations on the past lead us to a new, more tightly concentrated distribution: the distribution of futures conditional on the past data. Different kinds of predictions are different slices through or averages over this conditional distribution, but information theory quantifies the "concentration" of the distribution without making any commitment as to which averages will be most interesting.


Imagine that we observe a stream of data x(t) over a time interval -T < t < 0. Let all of these past data be denoted by the shorthand x_past. We are interested in saying something about the future, so we want to know about the data x(t) that will be observed in the time interval 0 < t < T'; let these future data be called x_future. In the absence of any other knowledge, futures are drawn from the probability distribution P(x_future), and observations of particular past data x_past tell us that futures will be drawn from the conditional distribution P(x_future | x_past). The greater concentration of the conditional distribution can be quantified by the fact that it has smaller entropy than the prior distribution, and this reduction in entropy is Shannon's definition of the information that the past provides about the future. We can write the average of this predictive information as

I_{pred}(T, T') = \left\langle \log_2 \frac{P(x_{future} \mid x_{past})}{P(x_{future})} \right\rangle  (3.1)
= -\langle \log_2 P(x_{future}) \rangle - \langle \log_2 P(x_{past}) \rangle - \left[ -\langle \log_2 P(x_{future}, x_{past}) \rangle \right],  (3.2)

where ⟨· · ·⟩ denotes an average over the joint distribution of the past and the future, P(x_future, x_past).

Each of the terms in equation 3.2 is an entropy. Since we are interested in predictability or generalization, which are associated with some features of the signal persisting forever, we may assume stationarity or invariance under time translations. Then the entropy of the past data depends only on the duration of our observations, so we can write -⟨log_2 P(x_past)⟩ = S(T), and by the same argument -⟨log_2 P(x_future)⟩ = S(T'). Finally, the entropy of the past and the future taken together is the entropy of observations on a window of duration T + T', so that -⟨log_2 P(x_future, x_past)⟩ = S(T + T'). Putting these equations together, we obtain

I_{pred}(T, T') = S(T) + S(T') - S(T + T').  (3.3)
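In discrete time, equation 3.3 can be evaluated directly from empirical block entropies; the sketch below (our own illustration, reusing the word_entropy estimator from section 2) is one way to do it.

def predictive_information(seq, n_past, n_future):
    # Discrete-time analog of equation 3.3:
    # I_pred = S(n_past) + S(n_future) - S(n_past + n_future),
    # with each block entropy estimated empirically from `seq`.
    return (word_entropy(seq, n_past)
            + word_entropy(seq, n_future)
            - word_entropy(seq, n_past + n_future))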

It is important to recall that mutual information is a symmetric quantity. Thus, we can view I_pred(T, T') as either the information that a data segment of duration T provides about the future of length T', or the information that a data segment of duration T' provides about the immediate past of duration T. This is a direct application of the definition of information but seems counterintuitive. Shouldn't it be more difficult to predict than to postdict? One can perhaps recover the correct intuition by thinking about a large ensemble of question-answer pairs. Prediction corresponds to generating the answer to a given question, while postdiction corresponds to generating the question that goes with a given answer. We know that guessing questions given answers is also a hard problem³ and can be just as challenging as the more conventional problem of answering the questions themselves.

³ This is the basis of the popular American television game show Jeopardy, widely viewed as the most "intellectual" of its genre.


Our focus here is on prediction because we want to make connections with the phenomenon of generalization in learning, but it is clear that generating a consistent interpretation of observed data may involve elements of both prediction and postdiction (see, for example, Eagleman & Sejnowski, 2000); it is attractive that the information-theoretic formulation treats these problems symmetrically.

In the same way that the entropy of a gas at fixed density is proportional to the volume, the entropy of a time series (asymptotically) is proportional to its duration, so that lim_{T→∞} S(T)/T = S_0; entropy is an extensive quantity. But from equation 3.3, any extensive component of the entropy cancels in the computation of the predictive information: predictability is a deviation from extensivity. If we write S(T) = S_0 T + S_1(T), then equation 3.3 tells us that the predictive information is related only to the nonextensive term S_1(T). Note that if we are observing a deterministic system, then S_0 = 0, but this is independent of questions about the structure of the subextensive term S_1(T). It is attractive that information theory gives us a unified discussion of prediction in deterministic and probabilistic cases.

We know two general facts about the behavior of S_1(T). First, the corrections to extensive behavior are positive, S_1(T) ≥ 0. Second, the statement that entropy is extensive is the statement that the limit

\lim_{T \to \infty} \frac{S(T)}{T} = S_0  (3.4)

exists, and for this to be true we must also have

\lim_{T \to \infty} \frac{S_1(T)}{T} = 0.  (3.5)

Thus, the nonextensive terms in the entropy must be subextensive; that is, they must grow with T less rapidly than a linear function. Taken together, these facts guarantee that the predictive information is positive and subextensive. Furthermore, if we let the future extend forward for a very long time, T' → ∞, then we can measure the information that our sample provides about the entire future:

I_{pred}(T) = \lim_{T' \to \infty} I_{pred}(T, T') = S_1(T).  (3.6)

Similarly, instead of increasing the duration of the future to infinity, we could have considered the mutual information between a sample of length T and all of the infinite past. Then the postdictive information also is equal to S_1(T), and the symmetry between prediction and postdiction is even more profound: not only is there symmetry between questions and answers, but observations on a given period of time provide the same amount of information about the historical path that led to our observations as about the future that will unfold from them.


In some cases this statement becomes even stronger. For example, if the subextensive entropy of a long discontinuous observation of total length T, with a gap of duration δT ≪ T, is equal to S_1(T) + O(δT/T), then the subextensive entropy of the present is not only its information about the past or the future but also the information about the past and the future.

If we have been observing a time series for a (long) time T, then the total amount of data we have collected is measured by the entropy S(T), and at large T this is given approximately by S_0 T. But the predictive information that we have gathered cannot grow linearly with time, even if we are making predictions about a future that stretches out to infinity. As a result, of the total information we have taken in by observing x_past, only a vanishing fraction is of relevance to the prediction:

\lim_{T \to \infty} \frac{\mathrm{Predictive\ information}}{\mathrm{Total\ information}} = \frac{I_{pred}(T)}{S(T)} \to 0.  (3.7)

In this precise sense, most of what we observe is irrelevant to the problem of predicting the future.⁴

Consider the case where time is measured in discrete steps, so that we have seen N time points x_1, x_2, ..., x_N. How much have we learned about the underlying pattern in these data? The more we know, the more effectively we can predict the next data point x_{N+1}, and hence the fewer bits we will need to describe the deviation of this data point from our prediction. Our accumulated knowledge about the time series is measured by the degree to which we can compress the description of new observations. On average, the length of the code word required to describe the point x_{N+1}, given that we have seen the previous N points, is given by

\ell(N) = -\langle \log_2 P(x_{N+1} \mid x_1, x_2, \ldots, x_N) \rangle \ \mathrm{bits},  (3.8)

where the expectation value is taken over the joint distribution of all the N + 1 points, P(x_1, x_2, ..., x_N, x_{N+1}). It is easy to see that

\ell(N) = S(N+1) - S(N) \approx \frac{\partial S(N)}{\partial N}.  (3.9)

As we observe for longer times, we learn more, and this word length decreases. It is natural to define a learning curve that measures this improvement. Usually we define learning curves by measuring the frequency or costs of errors; here the cost is that our encoding of the point x_{N+1} is longer than it could be if we had perfect knowledge.

⁴ We can think of equation 3.7 as a law of diminishing returns. Although we collect data in proportion to our observation time T, a smaller and smaller fraction of this information is useful in the problem of prediction. These diminishing returns are not due to a limited lifetime, since we calculate the predictive information assuming that we have a future extending forward to infinity. A senior colleague points out that this is an argument for changing fields before becoming too expert.


This ideal encoding has a length that we can find by imagining that we observe the time series for an infinitely long time, \ell_{ideal} = lim_{N→∞} \ell(N), but this is just another way of defining the extensive component of the entropy S_0. Thus we can define a learning curve

\Lambda(N) \equiv \ell(N) - \ell_{ideal}  (3.10)
= S(N+1) - S(N) - S_0
= S_1(N+1) - S_1(N)
\approx \frac{\partial S_1(N)}{\partial N} = \frac{\partial I_{pred}(N)}{\partial N},  (3.11)

and we see once again that the extensive component of the entropy cancels.

It is well known that the problems of prediction and compression are related, and what we have done here is to illustrate one aspect of this connection. Specifically, if we ask how much one segment of a time series can tell us about the future, the answer is contained in the subextensive behavior of the entropy. If we ask how much we are learning about the structure of the time series, then the natural and universally defined learning curve is related again to the subextensive entropy; the learning curve is the derivative of the predictive information.
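Continuing the numerical sketch of section 2 (with the arrays S and S0 estimated there), the code length of equation 3.9 and the learning curve of equation 3.10 follow from the same block entropies; this is only a rough illustration, since it inherits the crude estimate of S_0.

ell = S[1:] - S[:-1]     # code length l(N) = S(N+1) - S(N), equation 3.9
Lam = ell - S0           # learning curve Lambda(N) = l(N) - l_ideal, equation 3.10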

This universal learning curve is connected to the more conventional learning curves in specific contexts. As an example (cf. section 4.1), consider fitting a set of data points {x_n, y_n} with some class of functions y = f(x; α), where α are unknown parameters that need to be learned; we also allow for some gaussian noise in our observation of the y_n. Here the natural learning curve is the evolution of χ² for generalization as a function of the number of examples. Within the approximations discussed below, it is straightforward to show that as N becomes large,

\left\langle \chi^2(N) \right\rangle = \frac{1}{\sigma^2} \left\langle \left[ y - f(x; \alpha) \right]^2 \right\rangle \approx (2 \ln 2)\, \Lambda(N) + 1,  (3.12)

where σ² is the variance of the noise. Thus, a more conventional measure of performance at learning a function is equal to the universal learning curve defined purely by information-theoretic criteria. In other words, if a learning curve is measured in the right units, then its integral represents the amount of useful information accumulated. Then the subextensivity of S_1 guarantees that the learning curve decreases to zero as N → ∞.

Different quantities related to the subextensive entropy have been discussed in several contexts. For example, the code length \ell(N) has been defined as a learning curve in the specific case of neural networks (Opper & Haussler, 1995) and has been termed the "thermodynamic dive" (Crutchfield & Shalizi, 1999) and the "Nth order block entropy" (Grassberger, 1986).


The universal learning curve Λ(N) has been studied as the expected instantaneous information gain by Haussler, Kearns, and Schapire (1994). Mutual information between all of the past and all of the future (both semi-infinite) is known also as the excess entropy, effective measure complexity, stored information, and so on (see Shalizi & Crutchfield, 1999, and references therein, as well as the discussion below). If the data allow a description by a model with a finite (and in some cases also infinite) number of parameters, then mutual information between the data and the parameters is of interest. This is easily shown to be equal to the predictive information about all of the future, and it is also the cumulative information gain (Haussler, Kearns, & Schapire, 1994) or the cumulative relative entropy risk (Haussler & Opper, 1997). Investigation of this problem can be traced back to Renyi (1964) and Ibragimov and Hasminskii (1972), and some particular topics are still being discussed (Haussler & Opper, 1995; Opper & Haussler, 1995; Herschkowitz & Nadal, 1999). In certain limits, decoding a signal from a population of N neurons can be thought of as "learning" a parameter from N observations, with a parallel notion of information transmission (Brunel & Nadal, 1998; Kang & Sompolinsky, 2001). In addition, the subextensive component of the description length (Rissanen, 1978, 1989, 1996; Clarke & Barron, 1990), averaged over a class of allowed models, also is similar to the predictive information. What is important is that the predictive information or subextensive entropy is related to all these quantities, and that it can be defined for any process without a reference to a class of models. It is this universality that we find appealing, and this universality is strongest if we focus on the limit of long observation times. Qualitatively, in this regime (T → ∞) we expect the predictive information to behave in one of three different ways, as illustrated by the Ising models above: it may either stay finite or grow to infinity together with T; in the latter case the rate of growth may be slow (logarithmic) or fast (sublinear power) (see Barron and Cover, 1991, for a similar classification in the framework of the minimal description length [MDL] analysis).

The first possibility, lim_{T→∞} I_pred(T) = constant, means that no matter how long we observe, we gain only a finite amount of information about the future. This situation prevails, for example, when the dynamics are too regular. For a purely periodic system, complete prediction is possible once we know the phase, and if we sample the data at discrete times, this is a finite amount of information; longer period orbits intuitively are more complex and also have larger I_pred, but this does not change the limiting behavior lim_{T→∞} I_pred(T) = constant.

Alternatively, the predictive information can be small when the dynamics are irregular, but the best predictions are controlled only by the immediate past, so that the correlation times of the observable data are finite (see, for example, Crutchfield & Feldman, 1997, and the fixed J case in Figure 3). Imagine, for example, that we observe x(t) at a series of discrete times {t_n}, and that at each time point we find the value x_n. Then we always can write


the joint distribution of the N data points as a product,

P(x_1, x_2, \ldots, x_N) = P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_2, x_1) \cdots.  (3.13)

For Markov processes, what we observe at t_n depends only on events at the previous time step t_{n-1}, so that

P(x_n \mid \{x_{1 \le i \le n-1}\}) = P(x_n \mid x_{n-1}),  (3.14)

and hence the predictive information reduces to

I_{pred} = \left\langle \log_2 \frac{P(x_n \mid x_{n-1})}{P(x_n)} \right\rangle.  (3.15)

The maximum possible predictive information in this case is the entropy of the distribution of states at one time step, which is bounded by the logarithm of the number of accessible states. To approach this bound, the system must maintain memory for a long time, since the predictive information is reduced by the entropy of the transition probabilities. Thus, systems with more states and longer memories have larger values of I_pred.
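As a concrete, hypothetical instance of equation 3.15, the predictive information of a stationary Markov chain can be computed exactly from its transition matrix; in this sketch the function name markov_ipred and the example matrices are our own choices.

import numpy as np

def markov_ipred(P):
    # Equation 3.15 for a stationary Markov chain with transition matrix P,
    # P[i, j] = Prob(next = j | current = i): the mutual information
    # between consecutive states, in bits.
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])   # stationary distribution
    pi /= pi.sum()
    mask = P > 0
    joint = pi[:, None] * P
    return np.sum(joint[mask] * np.log2((P / pi[None, :])[mask]))

print(markov_ipred(np.array([[0.9, 0.1], [0.1, 0.9]])))   # sticky chain, ~0.53 bits
print(markov_ipred(np.array([[0.5, 0.5], [0.5, 0.5]])))   # memoryless chain, 0 bits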

More interesting are those cases in which I_pred(T) diverges at large T. In physical systems, we know that there are critical points where correlation times become infinite, so that optimal predictions will be influenced by events in the arbitrarily distant past. Under these conditions, the predictive information can grow without bound as T becomes large; for many systems the divergence is logarithmic, I_pred(T → ∞) ∝ ln T, as for the variable J_{ij}, short-range Ising model of Figures 2 and 3. Long-range correlations also are important in a time series where we can learn some underlying rules. It will turn out that when the set of possible rules can be described by a finite number of parameters, the predictive information again diverges logarithmically, and the coefficient of this divergence counts the number of parameters. Finally, a faster growth is also possible, so that I_pred(T → ∞) ∝ T^a, as for the variable J_{ij}, long-range Ising model, and we shall see that this behavior emerges from, for example, nonparametric learning problems.

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations or associations persist over long periods of time. In the usual theoretical models, there is some rule underlying the observable data, and this rule is valid forever; examples seen at one time inform us about the rule, and this information can be used to make predictions or generalizations. The predictive information quantifies the average generalization power of examples, and we shall see that there is a direct connection between the predictive information and the complexity of the possible underlying rules.


4.1 A Test Case. Let us begin with a simple example already referred to above. We observe two streams of data x and y, or equivalently a stream of pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N). Assume that we know in advance that the x's are drawn independently and at random from a distribution P(x), while the y's are noisy versions of some function acting on x,

y_n = f(x_n; \alpha) + \eta_n,  (4.1)

where f(x; α) is a class of functions parameterized by α, and η_n is noise, which for simplicity we will assume is gaussian with known standard deviation σ. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

f(x; \alpha) = \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x).  (4.2)

The usual problem is to estimate, from N pairs {x_i, y_i}, the values of the parameters α. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy S(N). We go through the calculation in some detail because it provides a model for the more general case.
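As a toy instance of equations 4.1 and 4.2 (our own illustration; the basis functions, noise level, and number of samples are arbitrary choices), one can simulate the data stream and recover the parameters by ordinary least squares, which is the maximum-likelihood estimate for gaussian noise:

import numpy as np

rng = np.random.default_rng(0)
K, N, sigma = 3, 200, 0.1
basis = [lambda x: np.ones_like(x), np.sin, np.cos]   # phi_mu(x), K = 3
alpha_true = rng.normal(size=K)                       # parameters drawn from P(alpha)
x = rng.uniform(-np.pi, np.pi, size=N)                # x_i drawn from P(x)
Phi = np.column_stack([phi(x) for phi in basis])      # design matrix Phi[i, mu] = phi_mu(x_i)
y = Phi @ alpha_true + sigma * rng.normal(size=N)     # equation 4.1 with gaussian noise
alpha_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares estimate of alpha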

To evaluate the entropy S(N), we first construct the probability distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). The same set of rules applies to the whole data stream, which here means that the same parameters α apply for all pairs {x_i, y_i}, but these parameters are chosen at random from a distribution P(α) at the start of the stream. Thus we write

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \int d^K\alpha\, P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N \mid \alpha)\, P(\alpha),  (4.3)

and now we need to construct the conditional distributions for fixed α. By hypothesis, each x is chosen independently, and once we fix α, each y_i is correlated only with the corresponding x_i, so that we have

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N \mid \alpha) = \prod_{i=1}^{N} \left[ P(x_i)\, P(y_i \mid x_i; \alpha) \right].  (4.4)

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of y_i has the form

P(y_i \mid x_i; \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} \left( y_i - \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x_i) \right)^2 \right].  (4.5)


Putting all these factors together,

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)
= \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^N \int d^K\alpha\, P(\alpha) \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} y_i^2 \right]
\times \exp\left[ -\frac{N}{2} \sum_{\mu,\nu=1}^{K} A_{\mu\nu}(\{x_i\})\, \alpha_\mu \alpha_\nu + N \sum_{\mu=1}^{K} B_\mu(\{x_i, y_i\})\, \alpha_\mu \right],  (4.6)

where

A_{\mu\nu}(\{x_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} \phi_\mu(x_i) \phi_\nu(x_i), \quad \mathrm{and}  (4.7)

B_\mu(\{x_i, y_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} y_i\, \phi_\mu(x_i).  (4.8)

Our placement of the factors of N means that both A_{μν} and B_μ are of order unity as N → ∞. These quantities are empirical averages over the samples {x_i, y_i}, and if the φ_μ are well behaved, we expect that these empirical means converge to expectation values for most realizations of the series {x_i}:

\lim_{N \to \infty} A_{\mu\nu}(\{x_i\}) = A^\infty_{\mu\nu} = \frac{1}{\sigma^2} \int dx\, P(x)\, \phi_\mu(x) \phi_\nu(x),  (4.9)

\lim_{N \to \infty} B_\mu(\{x_i, y_i\}) = B^\infty_\mu = \sum_{\nu=1}^{K} A^\infty_{\mu\nu}\, \bar{\alpha}_\nu,  (4.10)

where ᾱ are the parameters that actually gave rise to the data stream {x_i, y_i}. In fact, we can make the same argument about the terms in Σ y_i²:

\lim_{N \to \infty} \sum_{i=1}^{N} y_i^2 = N \sigma^2 \left[ \sum_{\mu,\nu=1}^{K} \bar{\alpha}_\mu A^\infty_{\mu\nu} \bar{\alpha}_\nu + 1 \right].  (4.11)

Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for, and implications of, this convergence breaking down.

Putting the different factors together, we obtain

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) \cong \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^N \int d^K\alpha\, P(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right],  (4.12)


where the effective "energy" per sample is given by

E_N(\alpha; \{x_i, y_i\}) = \frac{1}{2} + \frac{1}{2} \sum_{\mu,\nu=1}^{K} (\alpha_\mu - \bar{\alpha}_\mu)\, A^\infty_{\mu\nu}\, (\alpha_\nu - \bar{\alpha}_\nu).  (4.13)

Here we use the symbol ≅ to indicate that we not only take the limit of large N but also neglect the fluctuations. Note that in this approximation, the dependence on the sample points themselves is hidden in the definition of ᾱ as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor N in the exponent; the energy E_N is of order unity as N → ∞. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

\int d^K\alpha\, P(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right]
\approx P(\alpha_{cl}) \exp\left[ -N E_N(\alpha_{cl}; \{x_i, y_i\}) - \frac{K}{2} \ln \frac{N}{2\pi} - \frac{1}{2} \ln \det \mathcal{F}_N + \cdots \right],  (4.14)

where α_cl is the "classical" value of α determined by the extremal conditions

\left. \frac{\partial E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu} \right|_{\alpha = \alpha_{cl}} = 0,  (4.15)

the matrix \mathcal{F}_N consists of the second derivatives of E_N,

(\mathcal{F}_N)_{\mu\nu} = \left. \frac{\partial^2 E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu \partial \alpha_\nu} \right|_{\alpha = \alpha_{cl}},  (4.16)

and · · · denotes terms that vanish as N → ∞. If we formulate the problem of estimating the parameters α from the samples {x_i, y_i}, then as N → ∞ the matrix N\mathcal{F}_N is the Fisher information matrix (Cover & Thomas, 1991); the eigenvectors of this matrix give the principal axes for the error ellipsoid in parameter space, and the (inverse) eigenvalues give the variances of parameter estimates along each of these directions. The classical α_cl differs from ᾱ only in terms of order 1/N; we neglect this difference and further simplify the calculation of leading terms as N becomes large. After a little more algebra, we find the probability distribution we have been looking for:

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)
\cong \left[ \prod_{i=1}^{N} P(x_i) \right] \frac{1}{Z_A}\, P(\bar{\alpha}) \exp\left[ -\frac{N}{2} \ln(2\pi e \sigma^2) - \frac{K}{2} \ln N + \cdots \right],  (4.17)


where the normalization constant

Z_A = \sqrt{(2\pi)^K \det A^\infty}.  (4.18)

Again, we note that the sample points {x_i, y_i} are hidden in the value of ᾱ that gave rise to these points.⁵

To evaluate the entropy S(N), we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the x_i, and because these are chosen independently from the distribution P(x), the average again is easy to evaluate. The third term involves ᾱ, and we need to average this over the joint distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). As above, we can evaluate this average in steps: first we choose a value of the parameters ᾱ, then we average over the samples given these parameters, and finally we average over parameters. But because ᾱ is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

S(N) = N \left[ S_x + \frac{1}{2} \log_2 \left( 2\pi e \sigma^2 \right) \right] + \frac{K}{2} \log_2 N + S_\alpha + \left\langle \log_2 Z_A \right\rangle_\alpha + \cdots,  (4.19)

where ⟨· · ·⟩_α means averaging over parameters, S_x is the entropy of the distribution of x,

S_x = -\int dx\, P(x) \log_2 P(x),  (4.20)

and similarly for the entropy of the distribution of parameters,

S_\alpha = -\int d^K\alpha\, P(\alpha) \log_2 P(\alpha).  (4.21)

⁵ We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points {x_i, y_i}. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as N → ∞, so that this approach should yield the asymptotic behavior of the distribution and, hence, the subextensive entropy at large N. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as N → ∞, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.


The different terms in the entropy, equation 4.19, have a straightforward interpretation. First, we see that the extensive term in the entropy,

S_0 = S_x + \frac{1}{2} \log_2 \left( 2\pi e \sigma^2 \right),  (4.22)

reflects contributions from the random choice of x and from the gaussian noise in y; these extensive terms are independent of the variations in parameters α, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, S_α. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term ⟨log_2 Z_A⟩_α compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

S_1(N) \approx \frac{K}{2} \log_2 N \quad \mathrm{(bits)}.  (4.23)

The coefficient of this divergence counts the number of parameters, independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions {φ_μ(x)}. This is a hint that the result in equation 4.23 is more universal than our simple example.
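One way to see this logarithmic growth numerically (our own sketch, not from the article) uses the equality, noted in section 3, between the predictive information and the mutual information carried by the parameters: for the linear-gaussian test case with a unit gaussian prior on α, that mutual information has the standard closed form (1/2) log_2 det(I + Φ^T Φ / σ²), with Φ the design matrix, and it grows as (K/2) log_2 N plus a constant.

import numpy as np

rng = np.random.default_rng(1)
K, sigma = 3, 0.1
design = lambda x: np.column_stack([np.ones_like(x), np.sin(x), np.cos(x)])

for N in [10, 100, 1000, 10000]:
    Phi = design(rng.uniform(-np.pi, np.pi, size=N))
    sign, logdet = np.linalg.slogdet(np.eye(K) + Phi.T @ Phi / sigma**2)
    info = 0.5 * logdet / np.log(2)          # mutual information in bits
    print(N, round(info, 2), round(0.5 * K * np.log2(N), 2))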

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning: we are given examples of how the points x_n map into y_n, and from these examples we are to induce the association or functional relation between x and y. An alternative view is that the pair of points (x, y) should be viewed as a vector \vec{x}, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables {\vec{x}_i} is drawn independently from the same probability distribution Q(\vec{x} | α), and this distribution depends on a (potentially infinite-dimensional) vector of parameters α. As above, the parameters are unknown, and before the series starts they are chosen randomly from a distribution P(α). With no constraints on the densities P(α) or Q(\vec{x} | α), it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters α, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997, and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).


We begin with the definition of entropy,

S(N) \equiv S[\{\vec{x}_i\}] = -\int d\vec{x}_1 \cdots d\vec{x}_N\, P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \log_2 P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N).  (4.24)

By analogy with equation 4.3, we then write

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) = \int d^K\alpha\, P(\alpha) \prod_{i=1}^{N} Q(\vec{x}_i \mid \alpha).  (4.25)

Next, combining equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite S(N) as

S(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \left\{ \int d\vec{x}_1 \cdots d\vec{x}_N \prod_{j=1}^{N} Q(\vec{x}_j \mid \bar{\alpha}) \log_2 P(\{\vec{x}_i\}) \right\}.  (4.26)

Equation 4.26 allows an easy interpretation. There is the "true" set of parameters ᾱ that gave rise to the data sequence \vec{x}_1, ..., \vec{x}_N with the probability \prod_{j=1}^{N} Q(\vec{x}_j | ᾱ). We need to average log_2 P(\vec{x}_1, ..., \vec{x}_N) first over all possible realizations of the data, keeping the true parameters fixed, and then over the parameters ᾱ themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

P(\vec{x}_1, \ldots, \vec{x}_N) = \prod_{j=1}^{N} Q(\vec{x}_j \mid \bar{\alpha}) \int d^K\alpha\, P(\alpha) \prod_{i=1}^{N} \left[ \frac{Q(\vec{x}_i \mid \alpha)}{Q(\vec{x}_i \mid \bar{\alpha})} \right]
= \prod_{j=1}^{N} Q(\vec{x}_j \mid \bar{\alpha}) \int d^K\alpha\, P(\alpha) \exp\left[ -N E_N(\alpha; \{\vec{x}_i\}) \right],  (4.27)

E_N(\alpha; \{\vec{x}_i\}) = -\frac{1}{N} \sum_{i=1}^{N} \ln \left[ \frac{Q(\vec{x}_i \mid \alpha)}{Q(\vec{x}_i \mid \bar{\alpha})} \right].  (4.28)

Since, by our interpretation, ᾱ are the true parameters that gave rise to the particular data {\vec{x}_i}, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

E_N(\alpha; \{\vec{x}_i\}) = -\int d^D x\, Q(\vec{x} \mid \bar{\alpha}) \ln \left[ \frac{Q(\vec{x} \mid \alpha)}{Q(\vec{x} \mid \bar{\alpha})} \right] - \psi(\alpha, \bar{\alpha}; \{\vec{x}_i\}),  (4.29)

where ψ → 0 as N → ∞; here we neglect ψ and return to this term below.


The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, D_KL(ᾱ‖α), between the true distribution characterized by parameters ᾱ and the possible distribution characterized by α. Thus, at large N we have

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \cong \prod_{j=1}^{N} Q(\vec{x}_j \mid \bar{\alpha}) \int d^K\alpha\, P(\alpha) \exp\left[ -N D_{KL}(\bar{\alpha} \| \alpha) \right],  (4.30)

where again the notation ≅ reminds us that we are not only taking the limit of large N but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find

S(N) \approx S_0 \cdot N + S_1^{(a)}(N),  (4.31)

S_0 = \int d^K\alpha\, P(\alpha) \left[ -\int d^D x\, Q(\vec{x} \mid \alpha) \log_2 Q(\vec{x} \mid \alpha) \right], \quad \mathrm{and}  (4.32)

S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \log_2 \left[ \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar{\alpha} \| \alpha)} \right].  (4.33)

Here S_1^{(a)} is an approximation to S_1 that neglects fluctuations ψ. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence \vec{x}_1, ..., \vec{x}_N with the disorder, E_N(α; {\vec{x}_i}) with the energy of the quenched system, and D_KL(ᾱ‖α) with its annealed analog.

The extensive term S_0, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the N dependence of the partition function

Z(\bar{\alpha}; N) = \int d^K\alpha\, P(\alpha) \exp\left[ -N D_{KL}(\bar{\alpha} \| \alpha) \right],  (4.34)

and S_1^{(a)}(N) = -⟨log_2 Z(ᾱ; N)⟩_ᾱ is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are KL divergence D away from the target ᾱ,

\rho(D; \bar{\alpha}) = \int d^K\alpha\, P(\alpha)\, \delta\left[ D - D_{KL}(\bar{\alpha} \| \alpha) \right].  (4.35)


This density could be very different for different targets.⁶ The density of divergences is normalized because the original distribution over parameter space, P(α), is normalized:

\int dD\, \rho(D; \bar{\alpha}) = \int d^K\alpha\, P(\alpha) = 1.  (4.36)

Finally, the partition function takes the simple form

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-N D].  (4.37)

We recall that in statistical mechanics the partition function is given by

Z(\beta) = \int dE\, \rho(E) \exp[-\beta E],  (4.38)

where ρ(E) is the density of states that have energy E, and β is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length N of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of ρ(D; ᾱ) as D → 0.⁷

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small D, and the density term, which favors values of D that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with D → 0. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states ρ(D; ᾱ) at small D is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

D_{KL}(\bar{\alpha} \| \alpha) \approx \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu)\, \mathcal{F}_{\mu\nu}\, (\bar{\alpha}_\nu - \alpha_\nu) + \cdots,  (4.39)

⁶ If parameter space is compact, then a related description of the space of targets, based on metric entropy, also called Kolmogorov's ε-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of size not greater than ε into which the target space can be partitioned.

⁷ It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.


where \mathcal{F} is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low D limit of the density ρ(D; ᾱ):

\rho(D; \bar{\alpha}) = \int d^K\alpha\, P(\alpha)\, \delta\left[ D - D_{KL}(\bar{\alpha} \| \alpha) \right]
\approx \int d^K\alpha\, P(\alpha)\, \delta\left[ D - \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu) \mathcal{F}_{\mu\nu} (\bar{\alpha}_\nu - \alpha_\nu) \right]
= \int d^K\xi\, P(\bar{\alpha} + U \cdot \xi)\, \delta\left[ D - \frac{1}{2} \sum_\mu \Lambda_\mu \xi_\mu^2 \right],  (4.40)

where U is a matrix that diagonalizes \mathcal{F},

(U^T \cdot \mathcal{F} \cdot U)_{\mu\nu} = \Lambda_\mu \delta_{\mu\nu}.  (4.41)

The delta function restricts the components of ξ in equation 4.40 to be of order √D or less, and so, if P(α) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

\rho(D \to 0; \bar{\alpha}) \approx P(\bar{\alpha})\, \frac{(2\pi)^{K/2}}{\Gamma(K/2)}\, \left( \det \mathcal{F} \right)^{-1/2} D^{(K-2)/2}.  (4.42)

Here, as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

Z(\bar{\alpha}; N \to \infty) \approx f(\bar{\alpha}) \cdot \frac{\Gamma(K/2)}{N^{K/2}},  (4.43)

where f(ᾱ) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \log_2 Z(\bar{\alpha}; N)  (4.44)
\to \frac{K}{2} \log_2 N + \cdots \quad \mathrm{(bits)},  (4.45)

where · · · are finite as N → ∞. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section.


To the leading order, this result is independent even of the prior distribution P(α) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
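Both the small-D density of equation 4.42 and the resulting logarithmic growth of the subextensive entropy are easy to check in a hypothetical gaussian toy model (our own sketch; the gaussian prior, the identity Fisher matrix, and placing the target at the prior mean are arbitrary simplifications). Sampling parameters from the prior and evaluating the quadratic form of equation 4.39 gives the density of divergences; averaging exp(-N D) over the same samples gives the partition function of equation 4.34. For this toy model Z(N) = (1 + N)^{-K/2} exactly, so -log_2 Z = (K/2) log_2(1 + N), approaching (K/2) log_2 N as in equation 4.45.

import numpy as np

rng = np.random.default_rng(2)
K = 4
alpha = rng.normal(size=(1_000_000, K))     # samples from the prior; target alpha_bar = 0
D = 0.5 * np.sum(alpha ** 2, axis=1)        # equation 4.39 with F = identity

# Density of divergences near D = 0: should grow as D^{(K-2)/2} = D here.
hist, edges = np.histogram(D, bins=np.linspace(0, 0.1, 21), density=True)

# Partition function, equation 4.34, by Monte Carlo; -log2 Z approaches
# (K/2) log2 N.  (The estimate gets noisy at large N, where few prior
# samples fall within D ~ 1/N of the target.)
for N in [10, 100, 1000]:
    Z = np.mean(np.exp(-N * D))
    print(N, round(-np.log2(Z), 2), round(0.5 * K * np.log2(1 + N), 2))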

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension, d_VC), which measures capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of \mathcal{F} is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < ∞ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small D_KL behavior of the model density,

\rho(D \to 0; \bar{\alpha}) \propto D^{(d-2)/2},  (4.46)

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density ρ(D; ᾱ) accumulates a macroscopic weight at some nonzero value of D, so that the small D behavior is irrelevant for the large N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_pred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N ≫ d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution Q(x⃗₁, ..., x⃗_N | α). Again, α is an unknown parameter vector chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log₂ N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong; for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S[\{\vec{x}_i\}\,|\,\alpha] \equiv -\int d^N\vec{x}\; Q(\{\vec{x}_i\}|\alpha) \log_2 Q(\{\vec{x}_i\}|\alpha) \to N S_0 + S_0^*; \quad S_0^* = O(1),   (4.47)

D_{KL}\left[ Q(\{\vec{x}_i\}|\bar{\alpha}) \,\|\, Q(\{\vec{x}_i\}|\alpha) \right] \to N D_{KL}(\bar{\alpha}\|\alpha) + o(N).   (4.48)

Here the quantities S₀, S*₀, and D_KL(ᾱ‖α) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data; all of the long-range predictability is associated with learning the parameters.⁸ The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and dVC, one arrives at the result:

S(N) = S_0 \cdot N + S_1^{(a)}(N),   (4.49)

S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \ \text{(bits)},   (4.50)

where · · · stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(x⃗₁, ..., x⃗_N | α). Suppose, for example, that S*₀ = (K*/2) log₂ N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \ \text{(bits)}.   (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters there are no long-range correlations (K ≠ 0, K* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.⁹

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S₁ are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and dVC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27 and 4.29 and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar{\alpha}) \right]
  \times \log_2 \left[ \prod_{i=1}^{N} Q(\vec{x}_i|\bar{\alpha}) \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})} \right].   (4.52)

This expression can be decomposed into the terms identified above plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S₁^(f)(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)} = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar{\alpha}) \right]
  \times \log_2 \left[ \frac{1}{Z(\bar{\alpha}; N)} \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})} \right],   (4.54)

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of −S₁^(a) as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^(f) is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S₁(N) is bounded,

S_1(N) \leq N \int d^K\alpha \int d^K\bar{\alpha}\, P(\alpha) P(\bar{\alpha})\, D_{KL}(\bar{\alpha}\|\alpha) \equiv N \langle D_{KL}(\bar{\alpha}\|\alpha) \rangle_{\bar{\alpha}\alpha},   (4.55)

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence, or of similar quantities, is a usual constraint in learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_ᾱα includes S₀ as one of its terms, so that usually S₀ and S₁ are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (dVC) of the set of probability densities {Q(x⃗ | α)} is finite. Then for any reasonable function F(x⃗; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{ \sup_{\beta} \frac{\left| \frac{1}{N}\sum_{j} F(\vec{x}_j;\beta) - \int d\vec{x}\, Q(\vec{x}|\bar{\alpha})\, F(\vec{x};\beta) \right|}{L[F]} > \epsilon \right\} < M(\epsilon, N, d_{VC})\, e^{-cN\epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality, in view of equation 4.55. In addition, M(ε, N, dVC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on dVC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S₁^(f) in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {x⃗ᵢ} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x⃗ᵢ} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus, the worst-case contribution of these atypical sequences is

S_1^{(f)\,\mathrm{atypical}} \sim \int_{\delta}^{\infty} d\epsilon\; N^2 \epsilon^2\, M(\epsilon)\, e^{-cN\epsilon^2} \sim e^{-cN\delta^2} \ll 1 \quad \text{for large } N.   (4.57)


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and, throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f)\,\mathrm{typical}} \sim \int d^K\bar{\alpha}\, P(\bar{\alpha}) \int d^N\vec{x}\, \prod_j Q(\vec{x}_j|\bar{\alpha})\; N\,(B\, A^{-1} B);

B_\mu = \frac{1}{N}\sum_i \frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial\bar{\alpha}_\mu}, \qquad \langle B\rangle_{\vec{x}} = 0;

(A)_{\mu\nu} = \frac{1}{N}\sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar{\alpha})}{\partial\bar{\alpha}_\mu\,\partial\bar{\alpha}_\nu}, \qquad \langle A\rangle_{\vec{x}} = F.   (4.58)

Here ⟨· · ·⟩_x⃗ means an averaging with respect to all the x⃗ᵢ's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A⁻¹ with F⁻¹(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial\bar{\alpha}_\mu}\,(F^{-1})_{\mu\nu}\,\frac{\partial \log Q(\vec{x}_j|\bar{\alpha})}{\partial\bar{\alpha}_\nu}   (4.59)

over all the x⃗ᵢ's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N⁻¹. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.
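This order-one result can be checked directly in the gaussian location model (a sketch of ours; the model choice is an assumption for illustration), where ∂ log Q/∂ᾱ = x − ᾱ and F = 1, so that the quantity N · B F⁻¹ B appearing in equation 4.58 averages to K = 1 for every N:

    import numpy as np

    rng = np.random.default_rng(1)

    def typical_fluctuation(N, trials=1000):
        x = rng.normal(0.0, 1.0, size=(trials, N))   # samples from Q(x | abar = 0)
        B = x.mean(axis=1)                           # score (1/N) sum_i d log Q / d abar
        return (N * B * B).mean()                    # F = 1 for this model

    for N in [10, 100, 1000, 10000]:
        print(N, typical_fluctuation(N))             # stays near K = 1, independent of N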

This concludes the proof of controllability of fluctuations for dVC < ∞. One may notice that we never used the specific form of M(ε, N, dVC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C₁ ⊂ C₂ ⊂ C₃ ⊂ · · · can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ₙ Cₙ. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element Cₙ, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If dVC for learning in any Cₙ, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite dVC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, Cₙ, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x-\alpha)^2 \right], \quad x \in (-\infty; +\infty);   (4.60)

Q(x|\alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \quad x \in [0; 2\pi).   (4.61)

Learning the parameter in the first case is a dVC = 1 problem; in the second case, dVC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < αmax and zero elsewhere, with αmax rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin αx. Adding a new data point would not help until that best, but wrong, parameter estimate is less than αmax.¹⁰ So the fluctuations are large, and the predictive information is small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ dVC, which is never true if dVC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to dVC.

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for αmax → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite dVC fall in essentially different universality classes with respect to the predictive information.
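The contrast between the two examples can be seen in a crude simulation (ours, not from the text; the grid maximum-likelihood fit, the value of αmax, and the grids are all assumptions made for illustration, standing in for the Bayesian analysis above). With a wide prior box, the fitted α in the model of equation 4.61 typically sits at a spurious large value until N becomes large:

    import numpy as np

    rng = np.random.default_rng(2)
    alpha_true = 1.0
    xx = np.linspace(0.0, 2.0 * np.pi, 1024, endpoint=False)     # grid for x on [0, 2*pi)
    p_true = np.exp(-np.sin(alpha_true * xx)); p_true /= p_true.sum()

    def ml_estimate(N, alpha_max=100.0, n_grid=5000):
        x = rng.choice(xx, size=N, p=p_true)                     # samples from Q(x | alpha_true)
        alphas = np.linspace(1e-3, alpha_max, n_grid)            # "prior box" 0 < alpha < alpha_max
        logZ = np.log(np.exp(-np.sin(np.outer(alphas, xx))).mean(axis=1) * 2.0 * np.pi)
        loglik = -np.sin(np.outer(alphas, x)).mean(axis=1) - logZ  # per-sample log likelihood
        return alphas[np.argmax(loglik)]

    for N in [10, 30, 100, 300]:
        # small N: typically a spurious large alpha; larger N: close to alpha_true
        print(N, ml_estimate(N))

For the gaussian model of equation 4.60 the analogous fit lands near the true mean already at small N, whatever the size of the prior box; here the required N grows with αmax, which is the practical face of dVC = ∞.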

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens: eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^(d−2)/2. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^{\mu}} \right], \quad \mu > 0;   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar{\alpha}; N) = \int dD\, \rho(D;\bar{\alpha})\, \exp[-ND]
  \approx A(\bar{\alpha}) \int dD\, \exp\left[ -\frac{B(\bar{\alpha})}{D^{\mu}} - ND \right]
  \approx \tilde{A}(\bar{\alpha}) \exp\left[ -\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N - C(\bar{\alpha})\, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[ \frac{2\pi\,\mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} [B(\bar{\alpha})]^{1/(2\mu+2)},   (4.64)

C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S_1^{(a)}(N) \to \frac{1}{\ln 2}\, \langle C(\bar{\alpha}) \rangle_{\bar{\alpha}}\; N^{\mu/(\mu+1)} \ \ \text{(bits)},   (4.66)

where ⟨· · ·⟩_ᾱ denotes an average over all the target models.
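The saddle-point formulas can be verified numerically. The sketch below (ours; the values of A, B, and μ are arbitrary choices) computes Z(N) = ∫ dD A exp(−B/D^μ − ND) by quadrature and subtracts the predicted leading behavior; the remainder approaches the constant −ln Ã of equation 4.64:

    import numpy as np
    from scipy.integrate import quad

    A, B, mu = 1.0, 1.0, 1.0    # density with an essential singularity, equation (4.62)
    C = B ** (1 / (mu + 1)) * (mu ** (-mu / (mu + 1)) + mu ** (1 / (mu + 1)))  # eq. (4.65)

    def minus_lnZ(N):
        Dstar = (mu * B / N) ** (1 / (mu + 1))       # saddle point of B/D^mu + N*D
        def integrand(D):
            return A * np.exp(-B / D ** mu - N * D)
        Z, _ = quad(integrand, 0.0, 50 * Dstar, points=[Dstar], limit=400)
        return -np.log(Z)

    for N in [10, 100, 1000, 10000]:
        predicted = C * N ** (mu / (mu + 1)) + 0.5 * (mu + 2) / (mu + 1) * np.log(N)
        print(N, minus_lnZ(N) - predicted)           # tends to a constant, -ln(A~) of eq. (4.64)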


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively, and quantitatively, much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S₁(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/2N bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^(ν−1) bits. Summing this up over all samples, we find S₁^(a) ∼ N^ν, and, if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with an increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S₁ are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l₀) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell}{2}\int dx \left(\frac{\partial\phi}{\partial x}\right)^2 \right] \delta\left[ \frac{1}{l_0}\int dx\, e^{-\phi(x)} - 1 \right],   (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x₁, x₂, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D, φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l₀) exp[−φ(x)], we can write the KL divergence as

D_{KL}[\bar{\phi}(x)\,\|\,\phi(x)] = \frac{1}{l_0}\int dx\, \exp[-\bar{\phi}(x)]\,[\phi(x) - \bar{\phi}(x)].   (4.68)

We want to compute the density

\rho(D;\bar{\phi}) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\big(D - D_{KL}[\bar{\phi}(x)\|\phi(x)]\big)   (4.69)

  = M \int [d\phi(x)]\, P[\phi(x)]\, \delta\big(MD - M D_{KL}[\bar{\phi}(x)\|\phi(x)]\big),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D;\bar{\phi}) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{KL})   (4.71)

  = M \int \frac{dz}{2\pi} \exp\left( izMD + \frac{izM}{l_0}\int dx\, \bar{\phi}(x)\, e^{-\bar{\phi}(x)} \right)
    \times \int [d\phi(x)]\, P[\phi(x)] \exp\left( -\frac{izM}{l_0}\int dx\, \phi(x)\, e^{-\bar{\phi}(x)} \right).   (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)] \exp\left( -\frac{izM}{l_0}\int dx\, \phi(x)\, e^{-\bar{\phi}(x)} \right)
  = \exp\left( -\frac{izM}{l_0}\int dx\, \bar{\phi}(x)\, e^{-\bar{\phi}(x)} - S_{\mathrm{eff}}[\bar{\phi}(x); zM] \right),   (4.73)

S_{\mathrm{eff}}[\bar{\phi}; zM] = \frac{\ell}{2}\int dx \left(\frac{\partial\bar{\phi}}{\partial x}\right)^2 + \frac{1}{2}\left(\frac{izM}{\ell\, l_0}\right)^{1/2} \int dx\, \exp[-\bar{\phi}(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^(3/2) → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0;\bar{\phi}) = A[\bar{\phi}(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar{\phi}(x)]}{D} \right),   (4.75)

A[\bar{\phi}(x)] = \frac{1}{\sqrt{16\pi\,\ell\, l_0}} \exp\left[ -\frac{\ell}{2}\int dx\, (\partial_x\bar{\phi})^2 \right] \int dx\, \exp[-\bar{\phi}(x)/2],   (4.76)

B[\bar{\phi}(x)] = \frac{1}{16\,\ell\, l_0} \left( \int dx\, \exp[-\bar{\phi}(x)/2] \right)^2.   (4.77)

Except for the factor of D^(−3/2), this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^(−3/2) prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2\ln 2\,\sqrt{\ell\, l_0}} \left\langle \int dx\, \exp[-\bar{\phi}(x)/2] \right\rangle_{\bar{\phi}} \sqrt{N},   (4.78)

where ⟨· · ·⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\sqrt{N}\,\left(\frac{L}{\ell}\right)^{1/2} \ \text{bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(ℓ/(N Q(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
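A minimal sketch (ours; the densities and parameter values are assumptions) counts these adaptive bins directly by integrating the local bin density implied by the smoothing scale √(ℓ/(N Q(x))): the count is √(N/ℓ) ∫ dx √Q(x), which reduces to √(NL/ℓ) for the uniform distribution, in accord with equation 4.79, while any concentration of Q only lowers it:

    import numpy as np

    L, ell, N = 10.0, 1.0, 10**4
    x = np.linspace(0.0, L, 100001)
    dx = x[1] - x[0]

    def n_bins(Q):
        """Number of adaptive bins: int dx / sqrt(ell/(N Q(x))) = sqrt(N/ell) int dx sqrt(Q)."""
        return np.sqrt(N / ell) * np.sum(np.sqrt(Q)) * dx

    Q_uniform = np.full_like(x, 1.0 / L)
    Q_peaked = np.exp(-0.5 * (x - L / 2) ** 2); Q_peaked /= Q_peaked.sum() * dx

    print(n_bins(Q_uniform), np.sqrt(N * L / ell))   # equal: sqrt(N L / ell)
    print(n_bins(Q_peaked))                          # fewer bins for a concentrated density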

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set Cₙ is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a dVC = n problem. Now, as N grows, the elements with higher capacities, n ∼ √N, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn); but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹² Rissanen (1984) and Clarke and Barron (1990) in full generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x₁, x₂, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x₁, x₂, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x₁, x₂, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log₂ N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or of not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S₀ and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.¹³ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.¹⁴ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;¹⁵ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.
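For gaussian signals this split can be checked in closed form. In the sketch below (ours, not from the text; a two-variable toy model standing in for past and future with correlation c), rescaling the coordinates shifts the differential entropy by log₂ of the scale factor but leaves the mutual information, −(1/2) log₂(1 − c²), unchanged:

    import numpy as np

    def entropy_bits(var):
        return 0.5 * np.log2(2.0 * np.pi * np.e * var)    # differential entropy of a gaussian

    def mi_bits(C):
        return 0.5 * np.log2(C[0, 0] * C[1, 1] / np.linalg.det(C))  # gaussian mutual information

    c = 0.8
    for scale in [1.0, 10.0, 1000.0]:                     # coordinate change x -> scale * x
        C = scale ** 2 * np.array([[1.0, c], [c, 1.0]])   # covariance of ("past", "future")
        print(scale, entropy_bits(C[0, 0]), mi_bits(C))   # entropy shifts; MI stays the same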

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ, the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.¹⁶ So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number 𝒩(N) of walks with N steps, we find that 𝒩(N) ∼ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S₀ = log₂ z, a constant subextensive entropy log₂ A, and a divergent subextensive term S₁(N) → γ log₂ N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\mathrm{cont}} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],   (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\mathrm{rel}} = -\int dx(t)\, P[x(t)] \log_2\left( \frac{P[x(t)]}{Q[x(t)]} \right),   (5.2)

which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

16 Throughout this discussion we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize the usual constraints on information measures in thecontinuum produce a family of allowable complexity measures the relativeentropy to an arbitrary reference distribution If we insist that all observerswho choose reference distributions constructed from local operators arriveat the same measure of complexity or if we follow the rst line of argumentspresented above then this measure must be the divergent subextensive com-ponent of the entropy or equivalently the predictive information We have

Predictability Complexity and Learning 2453

seen that this component is connected to learning in a straightforward wayquantifying the amount that can be learned about dynamics that generatethe signal and to measures of complexity that have arisen in statistics anddynamical systems theory

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
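A minimal example where the equality I(x̂_past; x_future) = I_pred(T) is achieved exactly is a stationary gaussian AR(1) process: because the process is Markovian, the single most recent sample is a sufficient statistic for prediction. The sketch below is our own illustration (the parameters and window lengths are arbitrary); it computes the relevant mutual informations in closed form from covariance determinants and checks that compressing the whole past down to its last point loses no predictive information.

    import numpy as np

    lam, noise_var = 0.8, 1.0        # assumed AR(1): x_t = lam * x_{t-1} + gaussian noise
    v = noise_var / (1 - lam ** 2)   # stationary variance
    T_past, T_future = 20, 20

    def gaussian_mi_bits(cov, idx_a, idx_b):
        """I(A;B) in bits for jointly gaussian blocks A and B."""
        ca = cov[np.ix_(idx_a, idx_a)]
        cb = cov[np.ix_(idx_b, idx_b)]
        cab = cov[np.ix_(idx_a + idx_b, idx_a + idx_b)]
        return 0.5 * (np.linalg.slogdet(ca)[1] + np.linalg.slogdet(cb)[1]
                      - np.linalg.slogdet(cab)[1]) / np.log(2)

    n = T_past + T_future
    t = np.arange(n)
    cov = v * lam ** np.abs(t[:, None] - t[None, :])   # Cov(x_i, x_j) = v * lam^|i-j|

    past = list(range(T_past))
    future = list(range(T_past, n))
    last_point = [T_past - 1]                          # compressed description of the past

    print("I(past; future)       =", gaussian_mi_bits(cov, past, future), "bits")
    print("I(last point; future) =", gaussian_mi_bits(cov, last_point, future), "bits")

The two numbers coincide because, given the last point, the rest of the past carries no additional information about the future; this is the sufficient-statistic situation described above.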


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series, or learning problems, beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ≈ A N^ν and one with K parameters so that I_pred(N) ≈ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

    N_c ≈ [ (K / (2 A ν)) log_2 (K / (2A)) ]^(1/ν).    (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
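To get a feel for the scales involved, the following sketch evaluates equation 6.1 for illustrative (made-up) values of K, A, and ν and compares the two asymptotic forms of I_pred(N) on either side of N_c; since equation 6.1 is only approximate, the crossover it predicts is approximate as well.

    import numpy as np

    K, A, nu = 1.0e4, 1.0, 0.5      # illustrative values only; the text requires nu < 1

    # Equation 6.1: approximate crossover between the two asymptotic forms.
    N_c = ((K / (2 * A * nu)) * np.log2(K / (2 * A))) ** (1.0 / nu)
    print(f"N_c ~ {N_c:.3g}")

    for N in [1e2, 1e4, N_c, 1e12]:
        I_param = 0.5 * K * np.log2(N)     # K-parameter model: (K/2) log2 N
        I_nonparam = A * N ** nu           # nonparametric model: A * N^nu
        print(f"N = {N:10.3g}   (K/2) log2 N = {I_param:12.4g}   A N^nu = {I_nonparam:12.4g}")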

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
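The estimate N ∼ 10^9 follows from simple arithmetic, which is easy to check; the number of waking hours per day below is our assumption, not a figure from the text.

    # Rough check of the "at most N ~ 10^9 fixations" estimate.
    saccades_per_second = 3        # from the text: roughly three eye movements per second
    waking_hours_per_day = 16      # assumed
    years = 10                     # from the text: ~10 years of waking life

    N = saccades_per_second * waking_hours_per_day * 3600 * 365 * years
    print(f"N ~ {N:.1e} fixations")   # ~6e8, i.e., of order 10^9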

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy ℓ(N) (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^(1/2), and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
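As a reminder of how the guessing game turns behavior into bits, the sketch below converts a hypothetical (entirely made-up) histogram of "number of guesses needed" into the entropy of that distribution, which upper-bounds the conditional entropy per letter ℓ(N) at the context length used in the experiment.

    import numpy as np

    # Hypothetical counts: how often the reader's 1st, 2nd, 3rd, ... guess was correct.
    guess_counts = np.array([620, 180, 80, 45, 30, 20, 12, 8, 5], dtype=float)

    p = guess_counts / guess_counts.sum()
    entropy_bits = -np.sum(p * np.log2(p))

    # Given the context and the reader's guessing strategy, the guess number determines
    # the letter, so H(letter | context) <= H(guess number | context) <= H(guess number).
    print(f"upper bound on conditional entropy per letter: {entropy_bits:.2f} bits")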

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/ followed by the e-print reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.

Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.

Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.

Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.

Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.

Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.

Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.

Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.

Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.

Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.

Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.

Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.

Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.

Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.

Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.

Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.

Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.

Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.

Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.

Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.

Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.

Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.

Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.

Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.

Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.

Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.

Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.

Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.

Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.

Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.

Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.

Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.

Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.

Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.

de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.

Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.

Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.

Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.

Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.

Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.

Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.

Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.

Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.

Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.

Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.

Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).

Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.

Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.

Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.

Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.

Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.

Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.

Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.

Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.

Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.

Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.

Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.

MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.

Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.

Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.

Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.

Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.

Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.

Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.

Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.

Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.

Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.

Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.

Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.

Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.

Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.

Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.

Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.

Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.

Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.

Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.

Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.

Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.

Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.

Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.

Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.

Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.

Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.

Solé, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.

Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.

Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.

Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.

Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.

Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.

Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.

Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



olation. We might be able to predict, for example, the chance of rain in the coming week even if we cannot extrapolate the trajectory of temperature fluctuations. In the spirit of its thermodynamic origins, information theory (Shannon, 1948) characterizes the potentialities and limitations of all possible prediction algorithms, as well as unifying the analysis of extrapolation with the more general notion of predictability. Specifically, we can define a quantity, the predictive information, that measures how much our observations of the past can tell us about the future. The predictive information characterizes the world we are observing, and we shall see that this characterization is close to our intuition about the complexity of the underlying dynamics.

Prediction is one of the fundamental problems in neural computation. Much of what we admire in expert human performance is predictive in character: the point guard who passes the basketball to a place where his teammate will arrive in a split second, the chess master who knows how moves made now will influence the end game two hours hence, the investor who buys a stock in anticipation that it will grow in the year to come. More generally, we gather sensory information not for its own sake but in the hope that this information will guide our actions (including our verbal actions). But acting takes time, and sense data can guide us only to the extent that those data inform us about the state of the world at the time of our actions, so the only components of the incoming data that have a chance of being useful are those that are predictive. Put bluntly, nonpredictive information is useless to the organism, and it therefore makes sense to isolate the predictive information. It will turn out that most of the information we collect over a long period of time is nonpredictive, so that isolating the predictive information must go a long way toward separating out those features of the sensory world that are relevant for behavior.

One of the most important examples of prediction is the phenomenon of generalization in learning. Learning is formalized as finding a model that explains or describes a set of observations, but again this is useful only because we expect this model will continue to be valid. In the language of learning theory (see, for example, Vapnik, 1998), an animal can gain selective advantage not from its performance on the training data but only from its performance at generalization. Generalizing, and not "overfitting" the training data, is precisely the problem of isolating those features of the data that have predictive value (see also Bialek and Tishby, in preparation). Furthermore, we know that the success of generalization hinges on controlling the complexity of the models that we are willing to consider as possibilities.

standing of predictability was changed by developments in dynamical systems, which showed that apparently random (chaotic) time series could arise from simple deterministic rules, and this led to vigorous exploration of nonlinear extrapolation algorithms (Abarbanel et al., 1993). For a review comparing different approaches, see the conference proceedings edited by Weigend and Gershenfeld (1994).


Finally, learning a model to describe a data set can be seen as an encoding of those data, as emphasized by Rissanen (1989), and the quality of this encoding can be measured using the ideas of information theory. Thus the exploration of learning problems should provide us with explicit links among the concepts of entropy, predictability, and complexity.

The notion of complexity arises not only in learning theory but also in several other contexts. Some physical systems exhibit more complex dynamics than others (turbulent versus laminar flows in fluids), and some systems evolve toward more complex states than others (spin glasses versus ferromagnets). The problem of characterizing complexity in physical systems has a substantial literature of its own (for an overview, see Bennett, 1990). In this context, several authors have considered complexity measures based on entropy or mutual information, although, as far as we know, no clear connections have been drawn among the measures of complexity that arise in learning theory and those that arise in dynamical systems and statistical mechanics.

An essential difficulty in quantifying complexity is to distinguish complexity from randomness. A true random string cannot be compressed and hence requires a long description; it thus is complex in the sense defined by Kolmogorov (1965; Li & Vitányi, 1993; Vitányi & Li, 2000), yet the physical process that generates this string may have a very simple description. Both in statistical mechanics and in learning theory, our intuitive notions of complexity correspond to statements about the complexity of the underlying process and not directly to the description length or Kolmogorov complexity.

Our central result is that the predictive information provides a general measure of complexity, which includes as special cases the relevant concepts from learning theory and dynamical systems. While work on complexity in learning theory rests specifically on the idea that one is trying to infer a model from data, the predictive information is a property of the data (or, more precisely, of an ensemble of data) themselves, without reference to a specific class of underlying models. If the data are generated by a process in a known class but with unknown parameters, then we can calculate the predictive information explicitly and show that this information diverges logarithmically with the size of the data set we have observed; the coefficient of this divergence counts the number of parameters in the model or, more precisely, the effective dimension of the model class, and this provides a link to known results of Rissanen and others. We also can quantify the complexity of processes that fall outside the conventional finite dimensional models, and we show that these more complex processes are characterized by a power-law rather than a logarithmic divergence of the predictive information.

By analogy with the analysis of critical phenomena in statistical physics, the separation of logarithmic from power-law divergences, together with the measurement of coefficients and exponents for these divergences, allows us to define "universality classes" for the complexity of data streams. The power law or nonparametric class of processes may be crucial in real-world learning tasks, where the effective number of parameters becomes so large that asymptotic results for finitely parameterizable models are inaccessible in practice. There is empirical evidence that simple physical systems can generate dynamics in this complexity class, and there are hints that language also may fall in this class.

Finally, we argue that the divergent components of the predictive information provide a unique measure of complexity that is consistent with certain simple requirements. This argument is in the spirit of Shannon's original derivation of entropy as the unique measure of available information. We believe that this uniqueness argument provides a conclusive answer to the question of how one should quantify the complexity of a process generating a time series.

With the evident cost of lengthening our discussion, we have tried to give a self-contained presentation that develops our point of view, uses simple examples to connect with known results, and then generalizes and goes beyond these results.2 Even in cases where at least the qualitative form of our results is known from previous work, we believe that our point of view elucidates some issues that may have been less the focus of earlier studies. Last but not least, we explore the possibilities for connecting our theoretical discussion with the experimental characterization of learning and complexity in neural systems.

2 Some of the basic ideas presented here, together with some connections to earlier work, can be found in brief preliminary reports (Bialek, 1995; Bialek & Tishby, 1999). The central results of this work, however, were at best conjectures in these preliminary accounts.

2 A Curious Observation

Before starting the systematic analysis of the problem, we want to motivate our discussion further by presenting results of some simple numerical experiments. Since most of the article draws examples from learning, here we consider examples from equilibrium statistical mechanics. Suppose that we have a one-dimensional chain of Ising spins with the Hamiltonian given by

    H = - Σ_{i,j} J_ij s_i s_j,    (2.1)

where the matrix of interactions J_ij is not restricted to nearest neighbors; long-range interactions are also allowed. One may identify spins pointing upward with 1 and downward with 0, and then a spin chain is equivalent to some sequence of binary digits. This sequence consists of (overlapping) words of N digits each, W_k, k = 0, 1, ..., 2^N - 1. There are 2^N such words total, and they appear with very different frequencies n(W_k) in the spin chain (see Figure 1 for details).

[Figure 1: Calculating entropy of words of length 4 in a chain of 17 spins. For this chain, n(W_0) = n(W_1) = n(W_3) = n(W_7) = n(W_12) = n(W_14) = 2, n(W_8) = n(W_9) = 1, and all other frequencies are zero; thus S(4) ≈ 2.95 bits.]

If the number of spins is large, then counting these frequencies provides a good empirical estimate of P_N(W_k), the probability distribution of different words of length N. Then one can calculate the entropy S(N) of this probability distribution by the usual formula:

    S(N) = - Σ_{k=0}^{2^N - 1} P_N(W_k) log_2 P_N(W_k)  (bits).    (2.2)

Note that this is not the entropy of a finite chain with length N; instead, it is the entropy of words or strings with length N, drawn from a much longer chain. Nonetheless, since entropy is an extensive property, S(N) is proportional asymptotically to N for any spin chain, that is, S(N) ≈ S_0 · N. The usual goal in statistical mechanics is to understand this "thermodynamic limit" N → ∞, and hence to calculate the entropy density S_0. Different sets of interactions J_ij result in different values of S_0, but the qualitative result S(N) ∝ N is true for all reasonable {J_ij}.

We investigated three different spin chains of 1 billion spins each. As usual in statistical mechanics, the probability of any configuration of spins {s_i} is given by the Boltzmann distribution,

    P[{s_i}] ∝ exp(-H / k_B T),    (2.3)

where to normalize the scale of the J_ij we set k_B T = 1. For the first chain, only J_{i,i+1} = 1 was nonzero, and its value was the same for all i. The second chain was also generated using the nearest-neighbor interactions, but the value of the coupling was reset every 400,000 spins by taking a random number from a gaussian distribution with zero mean and unit variance. In the third case, we again reset the interactions at the same frequency, but now interactions were long-ranged; the variances of the coupling constants decreased with the distance between the spins as ⟨J_ij^2⟩ = 1/(i - j)^2. We plot S(N) for all these cases in Figure 2, and of course the asymptotically linear behavior is evident: the extensive entropy shows no qualitative distinction among the three cases we consider.

[Figure 2: Entropy as a function of the word length for spin chains with different interactions. Notice that all lines start from S(N) = log_2 2 = 1, since at the values of the coupling we investigated, the correlation length is much smaller than the chain length (1 · 10^9 spins).]
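For readers who want to reproduce the flavor of this experiment, here is a minimal sketch of our own (not the authors' code; it uses a much shorter chain of 4 · 10^6 spins and an arbitrary J). For the first, fixed-J chain, the Boltzmann distribution 2.3 with nearest-neighbor coupling is equivalent to a two-state Markov chain in which neighboring spins agree with probability e^J / (e^J + e^{-J}); we sample it, count overlapping words, and evaluate equation 2.2.

    import numpy as np

    rng = np.random.default_rng(0)

    def ising_bits(n_spins, J):
        """Fixed-J nearest-neighbor Ising chain at k_B T = 1, sampled as a Markov chain."""
        p_same = np.exp(J) / (np.exp(J) + np.exp(-J))   # P(neighboring spins agree)
        flips = (rng.random(n_spins - 1) > p_same).astype(np.int64)
        bits = (rng.integers(2) + np.concatenate(([0], np.cumsum(flips)))) % 2
        return bits.astype(np.int8)

    def word_entropy(bits, N):
        """Empirical entropy S(N), in bits, of overlapping words of length N."""
        words = np.lib.stride_tricks.sliding_window_view(bits, N)
        codes = words @ (1 << np.arange(N - 1, -1, -1))      # each word -> an integer
        _, counts = np.unique(codes, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    bits = ising_bits(4_000_000, J=1.0)     # 4e6 spins instead of 1e9, for speed
    prev = 0.0
    for N in range(1, 13):
        S = word_entropy(bits, N)
        print(f"N = {N:2d}   S(N) = {S:7.4f}   S(N) - S(N-1) = {S - prev:7.4f}")
        prev = S

The increments S(N) - S(N-1) quickly settle down to the extensive entropy per spin, while the leftover S_1(N) stays bounded; the variable-J chains of the text require only an outer loop that redraws the coupling every 400,000 spins.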

The situation changes drastically if we remove the asymptotic linear contribution and focus on the corrections to extensive behavior. Specifically, we write S(N) = S_0 · N + S_1(N) and plot only the sublinear component S_1(N) of the entropy. As we see in Figure 3, the three chains then exhibit qualitatively different features: for the first one, S_1 is constant; for the second one, it is logarithmic; and for the third one, it clearly shows a power-law behavior.

[Figure 3: Subextensive part of the entropy as a function of the word length, for the three chains of Figure 2 (fixed J; variable J, short-range interactions; variable J's, long-range decaying interactions), together with fits of the form S_1 = const, S_1 = const_1 + (1/2) log N, and S_1 = const_1 + const_2 · N^0.5.]

What is the significance of these observations? Of course, the differences in the behavior of S_1(N) must be related to the ways we chose J's for the simulations. In the first case, J is fixed, and if we see N spins and try to predict the state of the N+1st spin, all that really matters is the state of the spin s_N; there is nothing to "learn" from observations on longer segments of the chain. For the second chain, J changes, and the statistics of the spin words are different in different parts of the sequence. By looking at these statistics, one can "learn" the coupling at the current position; this estimate improves the more spins (longer words) we observe. Finally, in the third case there are many coupling constants that can be learned. As N increases, one becomes sensitive to weaker correlations caused by interactions over larger and larger distances. So, intuitively, the qualitatively different behaviors of S_1(N) in the three plotted cases are correlated with differences in the problem of learning the underlying dynamics of the spin chains from observations on samples of the spins themselves. Much of this article can be seen as expanding on and quantifying this intuitive observation.

3 Fundamentals

The problem of prediction comes in various forms, as noted in the Introduction. Information theory allows us to treat the different notions of prediction on the same footing. The first step is to recognize that all predictions are probabilistic. Even if we can predict the temperature at noon tomorrow, we should provide error bars or confidence limits on our prediction. The next step is to remember that, even before we look at the data, we know that certain futures are more likely than others, and we can summarize this knowledge by a prior probability distribution for the future. Our observations on the past lead us to a new, more tightly concentrated distribution, the distribution of futures conditional on the past data. Different kinds of predictions are different slices through or averages over this conditional distribution, but information theory quantifies the "concentration" of the distribution without making any commitment as to which averages will be most interesting.


Imagine that we observe a stream of data x(t) over a time interval -T < t < 0. Let all of these past data be denoted by the shorthand x_past. We are interested in saying something about the future, so we want to know about the data x(t) that will be observed in the time interval 0 < t < T'; let these future data be called x_future. In the absence of any other knowledge, futures are drawn from the probability distribution P(x_future), and observations of particular past data x_past tell us that futures will be drawn from the conditional distribution P(x_future | x_past). The greater concentration of the conditional distribution can be quantified by the fact that it has smaller entropy than the prior distribution, and this reduction in entropy is Shannon's definition of the information that the past provides about the future. We can write the average of this predictive information as

    I_pred(T, T') = ⟨ log_2 [ P(x_future | x_past) / P(x_future) ] ⟩    (3.1)
                  = -⟨log_2 P(x_future)⟩ - ⟨log_2 P(x_past)⟩ - [ -⟨log_2 P(x_future, x_past)⟩ ],    (3.2)

where ⟨· · ·⟩ denotes an average over the joint distribution of the past and the future, P(x_future, x_past).

Each of the terms in equation 3.2 is an entropy. Since we are interested in predictability or generalization, which are associated with some features of the signal persisting forever, we may assume stationarity or invariance under time translations. Then the entropy of the past data depends only on the duration of our observations, so we can write -⟨log_2 P(x_past)⟩ = S(T), and by the same argument -⟨log_2 P(x_future)⟩ = S(T'). Finally, the entropy of the past and the future taken together is the entropy of observations on a window of duration T + T', so that -⟨log_2 P(x_future, x_past)⟩ = S(T + T'). Putting these equations together, we obtain

    I_pred(T, T') = S(T) + S(T') - S(T + T').    (3.3)
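Equation 3.3 already lets us compute the predictive information exactly for simple processes. For the fixed-J nearest-neighbor Ising chain of section 2, which is a two-state Markov chain, S(N) = S(1) + (N - 1) h with h the entropy of a single transition, so I_pred(T, T') = S(1) - h, independent of T and T': the predictive information stays finite. The sketch below is our own check, with an arbitrary value of J.

    import numpy as np

    def h2(p):
        """Binary entropy in bits."""
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    J = 1.0
    p_same = np.exp(J) / (np.exp(J) + np.exp(-J))   # P(neighboring spins agree)

    def S(N):
        """Exact word entropy S(N) for the symmetric two-state Markov chain."""
        return 1.0 + (N - 1) * h2(p_same)           # S(1) = 1 bit; each new symbol adds h2

    for T, Tp in [(1, 1), (5, 5), (10, 40), (100, 100)]:
        I_pred = S(T) + S(Tp) - S(T + Tp)           # equation 3.3
        print(f"T = {T:3d}, T' = {Tp:3d}   I_pred = {I_pred:.4f} bits"
              f"   (1 - h2 = {1 - h2(p_same):.4f})")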

It is important to recall that mutual information is a symmetric quantity. Thus we can view I_pred(T, T') as either the information that a data segment of duration T provides about the future of length T', or the information that a data segment of duration T' provides about the immediate past of duration T. This is a direct application of the definition of information but seems counterintuitive. Shouldn't it be more difficult to predict than to postdict? One can perhaps recover the correct intuition by thinking about a large ensemble of question-answer pairs. Prediction corresponds to generating the answer to a given question, while postdiction corresponds to generating the question that goes with a given answer. We know that guessing questions given answers is also a hard problem,3 and can be just as challenging as the more conventional problem of answering the questions themselves. Our focus here is on prediction because we want to make connections with the phenomenon of generalization in learning, but it is clear that generating a consistent interpretation of observed data may involve elements of both prediction and postdiction (see, for example, Eagleman & Sejnowski, 2000); it is attractive that the information-theoretic formulation treats these problems symmetrically.

3 This is the basis of the popular American television game show Jeopardy, widely viewed as the most "intellectual" of its genre.

In the same way that the entropy of a gas at fixed density is proportional to the volume, the entropy of a time series (asymptotically) is proportional to its duration, so that lim_{T→∞} S(T)/T = S_0; entropy is an extensive quantity. But from equation 3.3, any extensive component of the entropy cancels in the computation of the predictive information: predictability is a deviation from extensivity. If we write S(T) = S_0 T + S_1(T), then equation 3.3 tells us that the predictive information is related only to the nonextensive term S_1(T). Note that if we are observing a deterministic system, then S_0 = 0, but this is independent of questions about the structure of the subextensive term S_1(T). It is attractive that information theory gives us a unified discussion of prediction in deterministic and probabilistic cases.

We know two general facts about the behavior of S_1(T). First, the corrections to extensive behavior are positive, S_1(T) ≥ 0. Second, the statement that entropy is extensive is the statement that the limit

    lim_{T→∞} S(T)/T = S_0    (3.4)

exists, and for this to be true we must also have

    lim_{T→∞} S_1(T)/T = 0.    (3.5)

Thus the nonextensive terms in the entropy must be subextensive, that is, they must grow with T less rapidly than a linear function. Taken together, these facts guarantee that the predictive information is positive and subextensive. Furthermore, if we let the future extend forward for a very long time, T' → ∞, then we can measure the information that our sample provides about the entire future:

    I_pred(T) = lim_{T'→∞} I_pred(T, T') = S_1(T).    (3.6)

Similarly, instead of increasing the duration of the future to infinity, we could have considered the mutual information between a sample of length T and all of the infinite past. Then the postdictive information also is equal to S_1(T), and the symmetry between prediction and postdiction is even more profound: not only is there symmetry between questions and answers, but observations on a given period of time provide the same amount of information about the historical path that led to our observations as about the future that will unfold from them. In some cases this statement becomes even stronger. For example, if the subextensive entropy of a long discontinuous observation of a total length T with a gap of a duration δT ≪ T is equal to S_1(T) + O(δT/T), then the subextensive entropy of the present is not only its information about the past or the future but also the information about the past and the future.

If we have been observing a time series for a (long) time T, then the total amount of data we have collected is measured by the entropy S(T), and at large T this is given approximately by S_0 T. But the predictive information that we have gathered cannot grow linearly with time, even if we are making predictions about a future that stretches out to infinity. As a result, of the total information we have taken in by observing x_past, only a vanishing fraction is of relevance to the prediction:

    lim_{T→∞} [ Predictive information / Total information ] = lim_{T→∞} [ I_pred(T) / S(T) ] = 0.    (3.7)

In this precise sense, most of what we observe is irrelevant to the problem of predicting the future.4

4 We can think of equation 3.7 as a law of diminishing returns. Although we collect data in proportion to our observation time T, a smaller and smaller fraction of this information is useful in the problem of prediction. These diminishing returns are not due to a limited lifetime, since we calculate the predictive information assuming that we have a future extending forward to infinity. A senior colleague points out that this is an argument for changing fields before becoming too expert.

Consider the case where time is measured in discrete steps, so that we have seen N time points x_1, x_2, ..., x_N. How much have we learned about the underlying pattern in these data? The more we know, the more effectively we can predict the next data point x_{N+1}, and hence the fewer bits we will need to describe the deviation of this data point from our prediction. Our accumulated knowledge about the time series is measured by the degree to which we can compress the description of new observations. On average, the length of the code word required to describe the point x_{N+1}, given that we have seen the previous N points, is given by

    ℓ(N) = -⟨log_2 P(x_{N+1} | x_1, x_2, ..., x_N)⟩ bits,    (3.8)

where the expectation value is taken over the joint distribution of all the N + 1 points, P(x_1, x_2, ..., x_N, x_{N+1}). It is easy to see that

    ℓ(N) = S(N + 1) - S(N) ≈ ∂S(N)/∂N.    (3.9)

As we observe for longer times we learn more and this word length de-creases It is natural to dene a learning curve that measures this improve-ment Usually we dene learning curves by measuring the frequency or

4 We can think of equation 37 asa law of diminishing returns Although we collect datain proportion to our observation time T a smaller and smaller fraction of this informationis useful in the problem of prediction These diminishing returns are not due to a limitedlifetime since we calculate the predictive information assuming that we have a futureextending forward to innity A senior colleague points out that this is an argument forchanging elds before becoming too expert

Predictability Complexity and Learning 2419

costs of errors here the cost is that our encoding of the point xNC1 is longerthan it could be if we had perfect knowledge This ideal encoding has alength that we can nd by imagining that we observe the time series for aninnitely long time ideal D limN1 (N) but this is just another way ofdening the extensive component of the entropy S0 Thus we can dene alearning curve

L (N) acute (N) iexcl ideal (310)

D S(N C 1) iexcl S(N) iexcl S0

D S1(N C 1) iexcl S1(N)

frac14 S1(N)

ND

Ipred(N)

N (311)

and we see once again that the extensive component of the entropy cancels.

It is well known that the problems of prediction and compression are related, and what we have done here is to illustrate one aspect of this connection. Specifically, if we ask how much one segment of a time series can tell us about the future, the answer is contained in the subextensive behavior of the entropy. If we ask how much we are learning about the structure of the time series, then the natural and universally defined learning curve is related again to the subextensive entropy; the learning curve is the derivative of the predictive information.
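These relations are easy to check numerically in a toy problem. The following sketch, which is our own minimal example and not part of the original calculation, takes a Bernoulli (coin-flip) source whose unknown bias is drawn from a uniform prior, computes S(N) exactly from the marginal likelihood, and verifies that Λ(N) = S(N+1) − S(N) − S_0 falls off roughly as 1/(2N ln 2) while S_1(N) = S(N) − N S_0 grows as (1/2) log_2 N, as expected for a one-parameter (K = 1) family.

    import numpy as np
    from scipy.special import gammaln

    def S_bernoulli(N):
        # Entropy (bits) of N coin flips whose unknown bias is drawn from a uniform prior.
        # The marginal probability of a sequence depends only on k, the number of heads:
        # P_k = k! (N - k)! / (N + 1)!, and there are C(N, k) such sequences.
        k = np.arange(N + 1)
        logP = gammaln(k + 1) + gammaln(N - k + 1) - gammaln(N + 2)
        logC = gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)
        return -np.sum(np.exp(logC + logP) * logP) / np.log(2)

    S0 = 1.0 / (2.0 * np.log(2))   # extensive entropy per flip (bits) for the uniform prior
    for N in [10, 100, 1000]:
        Lam = S_bernoulli(N + 1) - S_bernoulli(N) - S0
        S1 = S_bernoulli(N) - N * S0
        print(N, Lam, 1.0 / (2 * N * np.log(2)), S1, 0.5 * np.log2(N))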

This universal learning curve is connected to the more conventional learning curves in specific contexts. As an example (cf. section 4.1), consider fitting a set of data points {x_n, y_n} with some class of functions y = f(x; α), where α are unknown parameters that need to be learned; we also allow for some gaussian noise in our observation of the y_n. Here the natural learning curve is the evolution of χ² for generalization as a function of the number of examples. Within the approximations discussed below, it is straightforward to show that as N becomes large,

\left\langle \chi^2(N) \right\rangle = \frac{1}{\sigma^2}\left\langle \left[ y - f(x;\alpha)\right]^2 \right\rangle \to (2\ln 2)\,\Lambda(N) + 1,   (3.12)

where σ² is the variance of the noise. Thus a more conventional measure of performance at learning a function is equal to the universal learning curve defined purely by information-theoretic criteria. In other words, if a learning curve is measured in the right units, then its integral represents the amount of useful information accumulated. Then the subextensivity of S_1 guarantees that the learning curve decreases to zero as N → ∞.
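As a rough numerical illustration of equation 3.12 (our own sketch, with arbitrarily chosen cosine basis functions and a unit gaussian prior on the parameters), one can simulate the gaussian model of section 4.1, form the Bayesian posterior-mean predictor, and check that the generalization χ² approaches 1 + K/N = (2 ln 2) Λ(N) + 1 at large N.

    import numpy as np

    rng = np.random.default_rng(0)
    K, sigma, N, trials = 5, 1.0, 50, 2000

    def design(x):
        # illustrative basis functions: cos(kx), k = 1..K, with x uniform on [0, 2*pi)
        return np.column_stack([np.cos(k * x) for k in range(1, K + 1)])

    chi2 = []
    for _ in range(trials):
        abar = rng.normal(size=K)                  # "true" parameters from a unit gaussian prior
        x = rng.uniform(0, 2 * np.pi, size=N)
        Phi = design(x)
        y = Phi @ abar + sigma * rng.normal(size=N)
        # posterior mean of the parameters (ridge regression with unit prior precision)
        A = Phi.T @ Phi / sigma**2 + np.eye(K)
        ahat = np.linalg.solve(A, Phi.T @ y / sigma**2)
        xt = rng.uniform(0, 2 * np.pi, size=10)    # fresh test points
        Pt = design(xt)
        yt = Pt @ abar + sigma * rng.normal(size=10)
        chi2.extend(((yt - Pt @ ahat) / sigma)**2)
    print(np.mean(chi2), 1 + K / N)                # roughly (2 ln 2) Lambda(N) + 1 = 1 + K/N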

Different quantities related to the subextensive entropy have been discussed in several contexts. For example, the code length \ell(N) has been defined as a learning curve in the specific case of neural networks (Opper & Haussler, 1995) and has been termed the "thermodynamic dive" (Crutchfield & Shalizi, 1999) and the "Nth order block entropy" (Grassberger, 1986).


The universal learning curve Λ(N) has been studied as the expected instantaneous information gain by Haussler, Kearns, & Schapire (1994). Mutual information between all of the past and all of the future (both semi-infinite) is known also as the excess entropy, effective measure complexity, stored information, and so on (see Shalizi & Crutchfield, 1999, and references therein, as well as the discussion below). If the data allow a description by a model with a finite (and in some cases also infinite) number of parameters, then mutual information between the data and the parameters is of interest. This is easily shown to be equal to the predictive information about all of the future, and it is also the cumulative information gain (Haussler, Kearns, & Schapire, 1994) or the cumulative relative entropy risk (Haussler & Opper, 1997). Investigation of this problem can be traced back to Renyi (1964) and Ibragimov and Hasminskii (1972), and some particular topics are still being discussed (Haussler & Opper, 1995; Opper & Haussler, 1995; Herschkowitz & Nadal, 1999). In certain limits, decoding a signal from a population of N neurons can be thought of as "learning" a parameter from N observations, with a parallel notion of information transmission (Brunel & Nadal, 1998; Kang & Sompolinsky, 2001). In addition, the subextensive component of the description length (Rissanen, 1978, 1989, 1996; Clarke & Barron, 1990), averaged over a class of allowed models, also is similar to the predictive information. What is important is that the predictive information or subextensive entropy is related to all these quantities and that it can be defined for any process without a reference to a class of models. It is this universality that we find appealing, and this universality is strongest if we focus on the limit of long observation times. Qualitatively, in this regime (T → ∞), we expect the predictive information to behave in one of three different ways, as illustrated by the Ising models above: it may either stay finite or grow to infinity together with T; in the latter case, the rate of growth may be slow (logarithmic) or fast (sublinear power) (see Barron and Cover, 1991, for a similar classification in the framework of the minimal description length [MDL] analysis).

The first possibility, lim_{T→∞} I_pred(T) = constant, means that no matter how long we observe, we gain only a finite amount of information about the future. This situation prevails, for example, when the dynamics are too regular. For a purely periodic system, complete prediction is possible once we know the phase, and if we sample the data at discrete times, this is a finite amount of information; longer period orbits intuitively are more complex and also have larger I_pred, but this does not change the limiting behavior lim_{T→∞} I_pred(T) = constant.

Alternatively, the predictive information can be small when the dynamics are irregular, but the best predictions are controlled only by the immediate past, so that the correlation times of the observable data are finite (see, for example, Crutchfield & Feldman, 1997, and the fixed J case in Figure 3). Imagine, for example, that we observe x(t) at a series of discrete times {t_n}, and that at each time point we find the value x_n. Then we always can write


the joint distribution of the N data points as a product

P(x_1, x_2, \ldots, x_N) = P(x_1)\, P(x_2|x_1)\, P(x_3|x_2, x_1) \cdots.   (3.13)

For Markov processes, what we observe at t_n depends only on events at the previous time step t_{n-1}, so that

P(x_n \,|\, \{x_{1\le i\le n-1}\}) = P(x_n \,|\, x_{n-1}),   (3.14)

and hence the predictive information reduces to

I_{\rm pred} = \left\langle \log_2\!\left[ \frac{P(x_n | x_{n-1})}{P(x_n)} \right] \right\rangle.   (3.15)

The maximum possible predictive information in this case is the entropy of the distribution of states at one time step, which is bounded by the logarithm of the number of accessible states. To approach this bound, the system must maintain memory for a long time, since the predictive information is reduced by the entropy of the transition probabilities. Thus, systems with more states and longer memories have larger values of I_pred.
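For a stationary Markov chain, equation 3.15 can be evaluated directly from the transition matrix; the sketch below (our own construction, with an arbitrary two-state example) computes I_pred as the entropy of the stationary distribution minus the average entropy of the transition probabilities.

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    T = np.array([[0.9, 0.1],
                  [0.2, 0.8]])          # T[i, j] = P(x_n = j | x_{n-1} = i), illustrative values
    evals, evecs = np.linalg.eig(T.T)
    p = np.real(evecs[:, np.argmax(np.real(evals))])
    p /= p.sum()                         # stationary distribution
    Ipred = entropy(p) - sum(p[i] * entropy(T[i]) for i in range(len(p)))
    print(Ipred)                         # bits; bounded above by entropy(p) <= log2(number of states)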

More interesting are those cases in which I_pred(T) diverges at large T. In physical systems, we know that there are critical points where correlation times become infinite, so that optimal predictions will be influenced by events in the arbitrarily distant past. Under these conditions, the predictive information can grow without bound as T becomes large; for many systems the divergence is logarithmic, I_pred(T → ∞) ∝ ln T, as for the variable J_ij, short-range Ising model of Figures 2 and 3. Long-range correlations also are important in a time series where we can learn some underlying rules. It will turn out that when the set of possible rules can be described by a finite number of parameters, the predictive information again diverges logarithmically, and the coefficient of this divergence counts the number of parameters. Finally, a faster growth is also possible, so that I_pred(T → ∞) ∝ T^a, as for the variable J_ij long-range Ising model, and we shall see that this behavior emerges from, for example, nonparametric learning problems.

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations or associations persist over long periods of time. In the usual theoretical models, there is some rule underlying the observable data, and this rule is valid forever; examples seen at one time inform us about the rule, and this information can be used to make predictions or generalizations. The predictive information quantifies the average generalization power of examples, and we shall see that there is a direct connection between the predictive information and the complexity of the possible underlying rules.


4.1 A Test Case. Let us begin with a simple example, already referred to above. We observe two streams of data x and y, or equivalently a stream of pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N). Assume that we know in advance that the x's are drawn independently and at random from a distribution P(x), while the y's are noisy versions of some function acting on x,

y_n = f(x_n; \alpha) + \eta_n,   (4.1)

where f(x; α) is a class of functions parameterized by α, and η_n is noise, which for simplicity we will assume is gaussian with known standard deviation σ. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

f(x; \alpha) = \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x).   (4.2)

The usual problem is to estimate, from N pairs {x_i, y_i}, the values of the parameters α. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy S(N). We go through the calculation in some detail because it provides a model for the more general case.

To evaluate the entropy S(N), we first construct the probability distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). The same set of rules applies to the whole data stream, which here means that the same parameters α apply for all pairs {x_i, y_i}, but these parameters are chosen at random from a distribution P(α) at the start of the stream. Thus we write

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \int d^K\alpha\, P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N \,|\, \alpha)\, \mathcal{P}(\alpha),   (4.3)

and now we need to construct the conditional distributions for fixed α. By hypothesis, each x is chosen independently, and once we fix α, each y_i is correlated only with the corresponding x_i, so that we have

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N \,|\, \alpha) = \prod_{i=1}^{N} \left[ P(x_i)\, P(y_i | x_i; \alpha) \right].   (4.4)

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of y_i has the form

P(y_i | x_i; \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} \left( y_i - \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x_i) \right)^{\!2} \right].   (4.5)


Putting all these factors together

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)
= \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{\!N} \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} y_i^2 \right]
\times \exp\left[ -\frac{N}{2} \sum_{\mu,\nu=1}^{K} A_{\mu\nu}(\{x_i\})\, \alpha_\mu \alpha_\nu + N \sum_{\mu=1}^{K} B_\mu(\{x_i, y_i\})\, \alpha_\mu \right],   (4.6)

where

A_{\mu\nu}(\{x_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} \phi_\mu(x_i)\, \phi_\nu(x_i), \quad\mbox{and}   (4.7)

B_\mu(\{x_i, y_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} y_i\, \phi_\mu(x_i).   (4.8)

Our placement of the factors of N means that both A_{μν} and B_μ are of order unity as N → ∞. These quantities are empirical averages over the samples {x_i, y_i}, and if the φ_μ are well behaved, we expect that these empirical means converge to expectation values for most realizations of the series {x_i}:

\lim_{N\to\infty} A_{\mu\nu}(\{x_i\}) = A^\infty_{\mu\nu} = \frac{1}{\sigma^2} \int dx\, P(x)\, \phi_\mu(x)\, \phi_\nu(x),   (4.9)

\lim_{N\to\infty} B_\mu(\{x_i, y_i\}) = B^\infty_\mu = \sum_{\nu=1}^{K} A^\infty_{\mu\nu}\, \bar{\alpha}_\nu,   (4.10)

where ᾱ are the parameters that actually gave rise to the data stream {x_i, y_i}. In fact, we can make the same argument about the terms in \sum y_i^2,

\lim_{N\to\infty} \sum_{i=1}^{N} y_i^2 = N \sigma^2 \left[ \sum_{\mu,\nu=1}^{K} \bar{\alpha}_\mu A^\infty_{\mu\nu} \bar{\alpha}_\nu + 1 \right].   (4.11)

Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for and implications of this convergence breaking down.

Putting the different factors together we obtain

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) \cong \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{\!N} \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right],   (4.12)


where the effective "energy" per sample is given by

E_N(\alpha; \{x_i, y_i\}) = \frac{1}{2} + \frac{1}{2} \sum_{\mu,\nu=1}^{K} (\alpha_\mu - \bar{\alpha}_\mu)\, A^\infty_{\mu\nu}\, (\alpha_\nu - \bar{\alpha}_\nu).   (4.13)

Here we use the symbol ≅ to indicate that we not only take the limit of large N but also neglect the fluctuations. Note that in this approximation, the dependence on the sample points themselves is hidden in the definition of ᾱ as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor N in the exponent; the energy E_N is of order unity as N → ∞. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

\int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right]
\approx \mathcal{P}(\alpha_{\rm cl}) \exp\left[ -N E_N(\alpha_{\rm cl}; \{x_i, y_i\}) - \frac{K}{2} \ln\frac{N}{2\pi} - \frac{1}{2} \ln\det F_N + \cdots \right],   (4.14)

where α_cl is the "classical" value of α determined by the extremal conditions

\left.\frac{\partial E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu}\right|_{\alpha = \alpha_{\rm cl}} = 0,   (4.15)

the matrix F_N consists of the second derivatives of E_N,

F_N = \left.\frac{\partial^2 E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu \partial \alpha_\nu}\right|_{\alpha = \alpha_{\rm cl}},   (4.16)

and \cdots denotes terms that vanish as N → ∞. If we formulate the problem of estimating the parameters α from the samples {x_i, y_i}, then as N → ∞ the matrix N F_N is the Fisher information matrix (Cover & Thomas, 1991); the eigenvectors of this matrix give the principal axes for the error ellipsoid in parameter space, and the (inverse) eigenvalues give the variances of parameter estimates along each of these directions. The classical α_cl differs from ᾱ only in terms of order 1/N; we neglect this difference and further simplify the calculation of leading terms as N becomes large. After a little more algebra, we find the probability distribution we have been looking for:

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)
\cong \left[ \prod_{i=1}^{N} P(x_i) \right] \frac{1}{Z_A}\, \mathcal{P}(\bar{\alpha}) \exp\left[ -\frac{N}{2} \ln(2\pi e \sigma^2) - \frac{K}{2} \ln N + \cdots \right],   (4.17)


where the normalization constant

Z_A = \sqrt{(2\pi)^K \det A^\infty}.   (4.18)

Again we note that the sample points {x_i, y_i} are hidden in the value of ᾱ that gave rise to these points.5
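A minimal numerical check of the saddle-point formula, equation 4.14, is straightforward in the one-parameter (K = 1) case; the sketch below is our own illustration, with arbitrarily chosen values for ᾱ and A^∞ and a unit gaussian prior.

    import numpy as np
    from scipy.integrate import quad

    # One-parameter version of equations 4.13-4.14; abar, A_inf, and the unit gaussian
    # prior are illustrative choices of ours, not values taken from the paper.
    A_inf, abar, N = 2.0, 0.7, 200
    prior = lambda a: np.exp(-a**2 / 2) / np.sqrt(2 * np.pi)
    E = lambda a: 0.5 + 0.5 * A_inf * (a - abar)**2

    exact, _ = quad(lambda a: prior(a) * np.exp(-N * E(a)), -10, 10, points=[abar])
    # Saddle point (equation 4.14): alpha_cl = abar up to O(1/N) shifts, F_N = A_inf
    laplace = prior(abar) * np.exp(-N * E(abar)) * np.sqrt(2 * np.pi / (N * A_inf))
    print(exact, laplace)   # the two agree up to O(1/N) corrections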

To evaluate the entropy S(N), we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the x_i, and because these are chosen independently from the distribution P(x), the average again is easy to evaluate. The third term involves ᾱ, and we need to average this over the joint distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). As above, we can evaluate this average in steps: first, we choose a value of the parameters ᾱ, then we average over the samples given these parameters, and finally we average over parameters. But because ᾱ is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

S(N) = N\left[ S_x + \frac{1}{2} \log_2\!\left( 2\pi e \sigma^2 \right) \right] + \frac{K}{2} \log_2 N + S_\alpha + \left\langle \log_2 Z_A \right\rangle_\alpha + \cdots,   (4.19)

where ⟨···⟩_α means averaging over parameters, S_x is the entropy of the distribution of x,

S_x = -\int dx\, P(x) \log_2 P(x),   (4.20)

and similarly for the entropy of the distribution of parameters

S_\alpha = -\int d^K\alpha\, \mathcal{P}(\alpha) \log_2 \mathcal{P}(\alpha).   (4.21)

5 We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points {x_i, y_i}. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as N → ∞, so that this approach should yield the asymptotic behavior of the distribution and hence the subextensive entropy at large N. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as N → ∞, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.


The different terms in the entropy, equation 4.19, have a straightforward interpretation. First, we see that the extensive term in the entropy,

S_0 = S_x + \frac{1}{2} \log_2(2\pi e \sigma^2),   (4.22)

reflects contributions from the random choice of x and from the gaussian noise in y; these extensive terms are independent of the variations in parameters α, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, S_α. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term ⟨log_2 Z_A⟩_α compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

S_1(N) \to \frac{K}{2} \log_2 N \;\;\mbox{(bits)}.   (4.23)

The coefficient of this divergence counts the number of parameters independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions {φ_μ(x)}. This is a hint that the result in equation 4.23 is more universal than our simple example.
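For the gaussian linear model, the mutual information between the N observations and the parameters can also be written in closed form, which makes the (K/2) log_2 N behavior easy to check numerically. The sketch below is our own construction (unit gaussian prior on the K parameters, arbitrary cosine basis functions, and the design points held fixed); the difference between the two printed columns approaches a constant as N grows.

    import numpy as np

    # I(data; parameters | design) = (1/2) log2 det(1 + Phi^T Phi / sigma^2) for a
    # gaussian linear model with unit gaussian prior; it should grow as (K/2) log2 N.
    rng = np.random.default_rng(1)
    K, sigma = 4, 0.5
    for N in [100, 1000, 10000]:
        x = rng.uniform(0, 2 * np.pi, size=N)
        Phi = np.column_stack([np.cos(k * x) for k in range(1, K + 1)])   # basis functions
        M = np.eye(K) + Phi.T @ Phi / sigma**2
        I = 0.5 * np.linalg.slogdet(M)[1] / np.log(2)
        print(N, I, 0.5 * K * np.log2(N))    # difference approaches a constant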

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning: we are given examples of how the points x_n map into y_n, and from these examples we are to induce the association or functional relation between x and y. An alternative view is that the pair of points (x, y) should be viewed as a vector \vec{x}, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables {\vec{x}_i} is drawn independently from the same probability distribution Q(\vec{x}|α), and this distribution depends on a (potentially infinite dimensional) vector of parameters α. As above, the parameters are unknown, and before the series starts, they are chosen randomly from a distribution P(α). With no constraints on the densities P(α) or Q(\vec{x}|α), it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters α, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997, and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).


We begin with the definition of entropy:

S(N) \equiv S[\{\vec{x}_i\}] = -\int d\vec{x}_1 \cdots d\vec{x}_N\, P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \log_2 P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N).   (4.24)

By analogy with equation 4.3, we then write

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) = \int d^K\alpha\, \mathcal{P}(\alpha) \prod_{i=1}^{N} Q(\vec{x}_i | \alpha).   (4.25)

Next, combining equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite S(N) as

S(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \left\{ \int d\vec{x}_1 \cdots d\vec{x}_N \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \log_2 P(\{\vec{x}_i\}) \right\}.   (4.26)

Equation 4.26 allows an easy interpretation. There is the "true" set of parameters ᾱ that gave rise to the data sequence \vec{x}_1, ..., \vec{x}_N with the probability \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}). We need to average log_2 P(\vec{x}_1, ..., \vec{x}_N) first over all possible realizations of the data keeping the true parameters fixed, and then over the parameters ᾱ themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

P(\vec{x}_1, \ldots, \vec{x}_N) = \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \prod_{i=1}^{N} \left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right]
= \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{\vec{x}_i\}) \right],   (4.27)

E_N(\alpha; \{\vec{x}_i\}) = -\frac{1}{N} \sum_{i=1}^{N} \ln\left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right].   (4.28)

Since by our interpretation ᾱ are the true parameters that gave rise to the particular data {\vec{x}_i}, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

E_N(\alpha; \{\vec{x}_i\}) = -\int d^D x\, Q(\vec{x} | \bar{\alpha}) \ln\left[ \frac{Q(\vec{x}|\alpha)}{Q(\vec{x}|\bar{\alpha})} \right] - \psi(\alpha, \bar{\alpha}; \{\vec{x}_i\}),   (4.29)

where ψ → 0 as N → ∞; here we neglect ψ and return to this term below.


The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, D_KL(ᾱ‖α), between the true distribution characterized by parameters ᾱ and the possible distribution characterized by α. Thus at large N we have

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \cong \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha} \| \alpha) \right],   (4.30)

where again the notation ≅ reminds us that we are not only taking the limit of large N but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find

S(N) \approx S_0 \cdot N + S_1^{(a)}(N),   (4.31)

S_0 = \int d^K\alpha\, \mathcal{P}(\alpha) \left[ -\int d^D x\, Q(\vec{x}|\alpha) \log_2 Q(\vec{x}|\alpha) \right], \quad\mbox{and}   (4.32)

S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \log_2\left[ \int d^K\alpha\, \mathcal{P}(\alpha)\, e^{-N D_{\rm KL}(\bar{\alpha}\|\alpha)} \right].   (4.33)

Here S_1^{(a)} is an approximation to S_1 that neglects fluctuations ψ. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence \vec{x}_1, ..., \vec{x}_N with the disorder, E_N(α; {\vec{x}_i}) with the energy of the quenched system, and D_KL(ᾱ‖α) with its annealed analog.

The extensive term S_0, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the N dependence of the partition function

Z(\bar{\alpha}; N) = \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha}\|\alpha) \right],   (4.34)

and S_1^{(a)}(N) = -⟨log_2 Z(ᾱ; N)⟩_ᾱ is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are KL divergence D away from the target ᾱ,

\rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\!\left[ D - D_{\rm KL}(\bar{\alpha}\|\alpha) \right].   (4.35)


This density could be very different for different targets.6 The density of divergences is normalized because the original distribution over parameter space, P(α), is normalized:

\int dD\, \rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha) = 1.   (4.36)

Finally the partition function takes the simple form

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-N D].   (4.37)

We recall that in statistical mechanics the partition function is given by

Z(\beta) = \int dE\, \rho(E) \exp[-\beta E],   (4.38)

where ρ(E) is the density of states that have energy E and β is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length N of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of ρ(D; ᾱ) as D → 0.7
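The statistical-mechanics analogy can be made concrete in a one-parameter example. The sketch below is our own construction, not the paper's: a gaussian family Q(x|a) with unknown mean and a unit gaussian prior, for which D_KL(ā‖a) = (a − ā)²/2 nats. It evaluates the partition function of equation 4.34 numerically and shows that the annealed subextensive entropy of equation 4.33 grows as (1/2) log_2 N, up to an order-one constant.

    import numpy as np
    from scipy.integrate import quad

    rng = np.random.default_rng(2)
    prior = lambda a: np.exp(-a**2 / 2) / np.sqrt(2 * np.pi)   # P(a), a unit gaussian

    def logZ(abar, N):
        # partition function of equation 4.34 for Q(x|a) = Normal(a, 1)
        val, _ = quad(lambda a: prior(a) * np.exp(-N * (a - abar)**2 / 2),
                      -20, 20, points=[abar])
        return np.log(val)

    for N in [10, 100, 1000]:
        abars = rng.normal(size=200)                      # average over target models
        S1a = -np.mean([logZ(ab, N) for ab in abars]) / np.log(2)
        print(N, S1a, 0.5 * np.log2(N))                   # agreement up to an O(1) constant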

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small D, and the density term, which favors values of D that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with D → 0. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states, ρ(D; ᾱ), at small D is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

D_{\rm KL}(\bar{\alpha}\|\alpha) \approx \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu)\, F_{\mu\nu}\, (\bar{\alpha}_\nu - \alpha_\nu) + \cdots,   (4.39)

6 If parameter space is compact, then a related description of the space of targets based on metric entropy, also called Kolmogorov's ε-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of the size not greater than ε into which the target space can be partitioned.

7 It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.


where F is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low D limit of the density ρ(D; ᾱ):

\rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\!\left[ D - D_{\rm KL}(\bar{\alpha}\|\alpha) \right]
\approx \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\!\left[ D - \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu) F_{\mu\nu} (\bar{\alpha}_\nu - \alpha_\nu) \right]
= \int d^K\xi\, \mathcal{P}(\bar{\alpha} + U\cdot\xi)\, \delta\!\left[ D - \frac{1}{2} \sum_{\mu} \Lambda_\mu \xi_\mu^2 \right],   (4.40)

where U is a matrix that diagonalizes F

(U^{\rm T} \cdot F \cdot U)_{\mu\nu} = \Lambda_\mu \delta_{\mu\nu}.   (4.41)

The delta function restricts the components of ξ in equation 4.40 to be of order √D or less, and so, if P(α) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

\rho(D \to 0; \bar{\alpha}) \approx \mathcal{P}(\bar{\alpha})\, \frac{(2\pi)^{K/2}}{\Gamma(K/2)}\, (\det F)^{-1/2}\, D^{(K-2)/2}.   (4.42)

Here, as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

Z(\bar{\alpha}; N \to \infty) \approx f(\bar{\alpha}) \cdot \frac{\Gamma(K/2)}{N^{K/2}},   (4.43)

where f(ᾱ) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \log_2 Z(\bar{\alpha}; N)   (4.44)
\to \frac{K}{2} \log_2 N + \cdots \;\;\mbox{(bits)},   (4.45)

where \cdots are finite as N → ∞. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even


of the prior distribution P(α) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension d_VC), which measures capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of F is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < ∞ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small D_KL behavior of the model density,

\rho(D \to 0; \bar{\alpha}) \propto D^{(d-2)/2},   (4.46)

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density ρ(D; ᾱ) accumulates a macroscopic weight at some nonzero value of D, so that the small D behavior is irrelevant for the large N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_pred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N ≫ d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.
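The small-D behavior of equation 4.42 is easy to see in a Monte Carlo experiment; the following sketch is our own illustration, with F taken to be the identity and a unit gaussian prior, so that the quadratic form for D_KL is exact. It estimates ρ(D) by sampling from the prior and recovers an exponent close to (K − 2)/2.

    import numpy as np

    rng = np.random.default_rng(3)
    K, samples = 4, 2_000_000
    a = rng.normal(size=(samples, K))                 # draws from the prior P(alpha)
    D = 0.5 * np.sum(a**2, axis=1)                    # D_KL(abar || alpha) with abar = 0, F = identity
    edges = np.logspace(-2, -1, 15)
    hist, _ = np.histogram(D, bins=edges)
    centers = np.sqrt(edges[1:] * edges[:-1])
    density = hist / np.diff(edges) / samples         # estimate of rho(D) at small D
    mask = hist > 0
    slope = np.polyfit(np.log(centers[mask]), np.log(density[mask]), 1)[0]
    print(slope, (K - 2) / 2)                         # exponent roughly (K - 2)/2, cf. eqs. 4.42 and 4.46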

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution,


Q(\vec{x}_1, ..., \vec{x}_N | α). Again, α is an unknown parameter vector, chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log_2 N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong; for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S[\{\vec{x}_i\} \,|\, \alpha] \equiv -\int d^N\vec{x}\, Q(\{\vec{x}_i\}|\alpha) \log_2 Q(\{\vec{x}_i\}|\alpha) \to N S_0 + S_0^*, \qquad S_0^* = O(1);   (4.47)

D_{\rm KL}\left[ Q(\{\vec{x}_i\}|\bar{\alpha}) \,\|\, Q(\{\vec{x}_i\}|\alpha) \right] \to N D_{\rm KL}(\bar{\alpha}\|\alpha) + o(N).   (4.48)

Here the quantities S_0, S_0^*, and D_KL(ᾱ‖α) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data; all of the long-range predictability is associated with learning the parameters.8 The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result

S(N) = S_0 \cdot N + S_1^{(a)}(N),   (4.49)

S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \;\;\mbox{(bits)},   (4.50)

where \cdots stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(\vec{x}_1, ..., \vec{x}_N | α). Suppose, for example, that S_0^* = (K^*/2) log_2 N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \;\;\mbox{(bits)}.   (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K ≠ 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27 and 4.29 and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar{\alpha}) \right]
\times \log_2\left\{ \prod_{i=1}^{N} Q(\vec{x}_i|\bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha)\, e^{-N D_{\rm KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})} \right\}.   (4.52)

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)} = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar{\alpha}) \right]
\times \log_2\left[ \int d^K\alpha\, \frac{\mathcal{P}(\alpha)}{Z(\bar{\alpha}; N)}\, e^{-N D_{\rm KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})} \right],   (4.54)

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded:

S_1(N) \le N \int d^K\alpha \int d^K\bar{\alpha}\, \mathcal{P}(\alpha)\, \mathcal{P}(\bar{\alpha})\, D_{\rm KL}(\bar{\alpha}\|\alpha) \equiv \langle N D_{\rm KL}(\bar{\alpha}\|\alpha) \rangle_{\bar{\alpha},\alpha},   (4.55)

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average


KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱ,α} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(\vec{x}|α)} is finite. Then for any reasonable function F(\vec{x}; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{ \sup_\beta \left| \frac{ \frac{1}{N} \sum_j F(\vec{x}_j;\beta) - \int d\vec{x}\, Q(\vec{x}|\bar{\alpha})\, F(\vec{x};\beta) }{ L[F] } \right| > \epsilon \right\} < M(\epsilon, N, d_{\rm VC})\, e^{-c N \epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {\vec{x}_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {\vec{x}_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of N ε). Thus the worst-case contribution of these atypical sequences is

S_1^{(f),\,{\rm atypical}} \sim \int_\delta^\infty d\epsilon\; N^2 \epsilon^2\, M(\epsilon)\, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \quad\mbox{for large } N.   (4.57)


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\,{\rm typical}} \sim \int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \int d^N\vec{x} \prod_j Q(\vec{x}_j|\bar{\alpha})\; N\, (B\, A^{-1}\, B);

B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0;

(A)_{\mu\nu} = \frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu \partial \bar{\alpha}_\nu}, \qquad \langle A \rangle_{\vec{x}} = F.   (4.58)

Here ⟨···⟩_{\vec{x}} means an averaging with respect to all \vec{x}_i's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu}\, (F^{-1})_{\mu\nu}\, \frac{\partial \log Q(\vec{x}_j|\bar{\alpha})}{\partial \bar{\alpha}_\nu}   (4.59)

over all \vec{x}_i's. Only the terms with i = j survive the averaging. There are K^2 N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite


structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ ··· can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|a) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x - a)^2 \right], \qquad x \in (-\infty; +\infty);   (4.60)

Q(x|a) = \frac{\exp(-\sin ax)}{\int_0^{2\pi} dx\, \exp(-\sin ax)}, \qquad x \in [0; 2\pi).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < a < a_max and zero elsewhere, with a_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin ax. Adding a new data point would not help until that best but wrong parameter estimate is less than a_max.10 So the fluctuations are large, and the predictive information is

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for a_max → ∞ the weight in ρ(D; ā) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of a would swiftly approach the actual value. At this point, the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, & Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
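The pathology of the second example can be seen in a small simulation. The sketch below is our own illustration (true parameter a = 1, a broad uniform prior 0 < a < 100 discretized on a grid, rejection sampling from equation 4.61): for small N, the maximum-likelihood frequency is typically far from the true value, and only with more data does it settle near a = 1.

    import numpy as np

    rng = np.random.default_rng(4)
    a_true, a_max = 1.0, 100.0
    a_grid = np.arange(0.05, a_max, 0.01)
    xg = np.linspace(0, 2 * np.pi, 2001)
    # normalization log Z(a) of equation 4.61, by numerical quadrature for each a
    logZ = np.array([np.log(np.trapz(np.exp(-np.sin(a * xg)), xg)) for a in a_grid])

    def sample(a, n):
        # rejection sampling from Q(x|a) on [0, 2*pi)
        xs = np.empty(0)
        while xs.size < n:
            x = rng.uniform(0, 2 * np.pi, 10 * n)
            accept = rng.uniform(0, np.e, 10 * n) < np.exp(-np.sin(a * x))
            xs = np.concatenate([xs, x[accept]])
        return xs[:n]

    for N in [5, 50, 500]:
        x = sample(a_true, N)
        loglike = -np.sin(np.outer(a_grid, x)).sum(axis=1) - N * logZ
        print(N, a_grid[np.argmax(loglike)])   # small N: the best a is often far from a_true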

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario in which the number of examples eventually overwhelms the number of parameters or dimensions is too weak to deal with many real-world problems. Certainly, in the present context, there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that, beyond the class of finitely parameterizable problems, there is a class of regularized infinite dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right], \qquad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-N D]
\approx A(\bar{\alpha}) \int dD \exp\left[ -\frac{B(\bar{\alpha})}{D^\mu} - N D \right]
\approx \tilde{A}(\bar{\alpha}) \exp\left[ -\frac{1}{2}\,\frac{\mu+2}{\mu+1} \ln N - C(\bar{\alpha})\, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} \cdot [B(\bar{\alpha})]^{1/(2\mu+2)},   (4.64)

C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S_1^{(a)}(N) \to \frac{1}{\ln 2}\, \langle C(\bar{\alpha}) \rangle_{\bar{\alpha}}\, N^{\mu/(\mu+1)} \;\;\mbox{(bits)},   (4.66)

where ⟨···⟩_ᾱ denotes an average over all the target models.
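Equations 4.63 through 4.66 can be checked numerically by integrating the partition function directly for a density with an essential singularity; the sketch below is our own check, with arbitrary illustrative values of A, B, and μ.

    import numpy as np
    from scipy.integrate import quad

    # For rho(D) = A exp(-B / D^mu), the partition function of equation 4.37 gives a
    # subextensive entropy growing as C N^(mu/(mu+1)) / ln 2, up to subleading log terms.
    A, B, mu = 1.0, 1.0, 1.0
    C = B**(1 / (mu + 1)) * (mu**(-mu / (mu + 1)) + mu**(1 / (mu + 1)))   # equation 4.65
    for N in [10**2, 10**3, 10**4, 10**5]:
        Dstar = (mu * B / N)**(1 / (mu + 1))                              # saddle point
        Z, _ = quad(lambda D: A * np.exp(-B / D**mu - N * D), 1e-12, 1.0, points=[Dstar])
        print(N, -np.log2(Z), C * N**(mu / (mu + 1)) / np.log(2))         # leading behavior matches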


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively, and quantitatively, much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^{ν−1} bits. Summing this up over all samples, we find S_1^{(a)} ∼ N^ν, and, if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/\ell_0) \exp[-\varphi(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

\mathcal{P}[\varphi(x)] = \frac{1}{\mathcal{Z}} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial\varphi}{\partial x} \right)^{\!2} \right] \delta\!\left[ \frac{1}{\ell_0} \int dx\, e^{-\varphi(x)} - 1 \right],   (4.67)

where \mathcal{Z} is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the


opment of singular peaks at the location of observed data points (Bialek etal 1996) Further developments of the theory including alternative choicesof P[w (x)] have been given by Periwal (1997 1998) Holy (1997) Aida (1999)and Schmidt (2000) for a detailed numerical investigation of this problemsee Nemenman and Bialek (2001) Our goal here is to be illustrative ratherthan exhaustive11

From the discussion above we know that the predictive information isrelated to the density of KL divergences and that the power-law behavior weare looking for comes from an essential singularity in this density functionThus we calculate r (D Nw ) in the model dened by equation 467

With $Q(x) = (1/\ell_0)\exp[-\phi(x)]$, we can write the KL divergence as

\[
D_{\rm KL}[\bar\phi(x)\|\phi(x)] = \frac{1}{\ell_0}\int dx\, \exp[-\bar\phi(x)]\,\bigl[\phi(x) - \bar\phi(x)\bigr].
\tag{4.68}
\]
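This expression can be checked in one line from the definition of the KL divergence between the corresponding normalized distributions (our rewriting, with natural logarithms, consistent with the factors of ln 2 that appear below):
\[
D_{\rm KL}[\bar\phi\|\phi] = \int dx\, \bar Q(x)\,\ln\frac{\bar Q(x)}{Q(x)}
= \frac{1}{\ell_0}\int dx\, e^{-\bar\phi(x)}\bigl[\phi(x)-\bar\phi(x)\bigr],
\]
since $\bar Q = (1/\ell_0)e^{-\bar\phi}$, $Q = (1/\ell_0)e^{-\phi}$, and $\ln(\bar Q/Q) = \phi - \bar\phi$.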

We want to compute the density

\begin{align}
\rho(D; \bar\phi) &= \int [d\phi(x)]\, P[\phi(x)]\, \delta\bigl(D - D_{\rm KL}[\bar\phi(x)\|\phi(x)]\bigr) \tag{4.69}\\
&= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\bigl(MD - M\, D_{\rm KL}[\bar\phi(x)\|\phi(x)]\bigr), \tag{4.70}
\end{align}

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit $D \to 0$. To compute this integral over all functions $\phi(x)$, we introduce a Fourier representation for the delta function and then rearrange the terms:

\begin{align}
\rho(D; \bar\phi) &= M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{\rm KL}) \tag{4.71}\\
&= M \int \frac{dz}{2\pi}\, \exp\!\left(izMD + \frac{izM}{\ell_0}\int dx\, \bar\phi(x) \exp[-\bar\phi(x)]\right) \nonumber\\
&\qquad\times \int [d\phi(x)]\, P[\phi(x)]\, \exp\!\left(-\frac{izM}{\ell_0}\int dx\, \phi(x) \exp[-\bar\phi(x)]\right). \tag{4.72}
\end{align}

The inner integral over the functions $\phi(x)$ is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

\begin{align}
&\int [d\phi(x)]\, P[\phi(x)]\, \exp\!\left(-\frac{izM}{\ell_0}\int dx\, \phi(x) \exp[-\bar\phi(x)]\right) \nonumber\\
&\qquad = \exp\!\left(-\frac{izM}{\ell_0}\int dx\, \bar\phi(x) \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM]\right), \tag{4.73}\\[4pt]
&S_{\rm eff}[\bar\phi; zM] = \frac{\ell}{2}\int dx \left(\frac{\partial\bar\phi}{\partial x}\right)^{2}
+ \frac{1}{2}\left(\frac{izM}{\ell\ell_0}\right)^{1/2}\int dx\, \exp[-\bar\phi(x)/2]. \tag{4.74}
\end{align}

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that $D \to 0$ and $MD^{3/2} \gg 1$; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for $\rho(D \to 0; \bar\phi)$. This results in

\[
\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\!\left(-\frac{B[\bar\phi(x)]}{D}\right),
\tag{4.75}
\]

\begin{align}
A[\bar\phi(x)] &= \frac{1}{\sqrt{16\pi\,\ell\ell_0}}
\exp\!\left[-\frac{\ell}{2}\int dx\,\bigl(\partial_x\bar\phi\bigr)^{2}\right]
\times \int dx\, \exp[-\bar\phi(x)/2], \tag{4.76}\\
B[\bar\phi(x)] &= \frac{1}{16\,\ell\ell_0}\left(\int dx\, \exp[-\bar\phi(x)/2]\right)^{2}. \tag{4.77}
\end{align}
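As a consistency check (our own rewriting of the saddle-point step, not an additional result), write $u = izM$ and $I = \int dx\, \exp[-\bar\phi(x)/2]$; the z-dependent part of the exponent that follows from equations 4.72 through 4.74 is then
\[
f(u) = uD - \frac{1}{2}\sqrt{\frac{u}{\ell\ell_0}}\, I, \qquad
f'(u_*) = 0 \;\Rightarrow\; u_* = \frac{I^{2}}{16\,\ell\ell_0 D^{2}}, \qquad
f(u_*) = -\frac{I^{2}}{16\,\ell\ell_0 D},
\]
so that $\rho(D\to 0;\bar\phi) \propto \exp(-B/D)$ with $B = I^{2}/(16\ell\ell_0)$, in agreement with equation 4.77; the Gaussian fluctuations around $u_*$, with $f''(u_*) \propto u_*^{-3/2}$, supply the $D^{-3/2}$ prefactor.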

Except for the factor of $D^{-3/2}$, this is exactly the sort of essential singularity that we considered in the previous section, with $\mu = 1$. The $D^{-3/2}$ prefactor does not change the leading large-N behavior of the predictive information, and we find that

\[
S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\sqrt{\frac{1}{\ell\ell_0}}\,
\left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} \sqrt{N},
\tag{4.78}
\]

where $\langle \cdots \rangle_{\bar\phi}$ denotes an average over the target distributions $\bar\phi(x)$ weighted once again by $P[\bar\phi(x)]$. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

\[
S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\sqrt{N}\left(\frac{L}{\ell}\right)^{1/2}\ \text{bits}.
\tag{4.79}
\]

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size $\sqrt{\ell/(NQ(x))}$ (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
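The proportionality can be checked directly (a back-of-the-envelope count in our notation, not an additional result): with bins of local size $\sqrt{\ell/(NQ(x))}$, the number of bins that fit in the range of x is
\[
N_{\rm bins} \sim \int dx\, \sqrt{\frac{N Q(x)}{\ell}}
= \sqrt{\frac{N}{\ell\ell_0}} \int dx\, e^{-\bar\phi(x)/2},
\]
which has the same $\sqrt{N}$ growth and the same dependence on $\bar\phi$ as equation 4.78, and which reduces to $\sqrt{NL/\ell}$ for the uniform distribution on $0 < x < L$, matching equation 4.79 up to the overall factor $1/(2\ln 2)$.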

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step $\sim\sqrt{N}$ bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure $\mathcal{C}$ on the set of all possible densities, so that the set $\mathcal{C}_n$ is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a $d_{\rm VC} = n$ problem. Now, as N grows, the elements with higher capacities, $n \sim \sqrt{N}$, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter $\ell$ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of $\ell$ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing $\ell$ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying $\sqrt{N}$ in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). Real calculations for these cases, however, remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.^{12} Rissanen (1984) and Clarke and Barron (1990), in full generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

^{12} Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream $x_1, x_2, \ldots, x_N$ will require an amount of space equal to the entropy of the joint distribution $P(x_1, x_2, \ldots, x_N)$. The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution $P(x_1, x_2, \ldots, x_N)$.

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to $(K/2)\log_2 N$ bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) $I_{\rm pred}(T \to \infty) = {\rm constant}$.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.^{13} There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.^{14} Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

^{13} The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

^{14} This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.
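For concreteness, these three conditions can be written out in the standard form (our restatement of Shannon, 1948, in an obvious notation): for a candidate measure C defined on discrete distributions,
\[
C\!\left(\tfrac{1}{N},\ldots,\tfrac{1}{N}\right)\ \text{monotonically increasing in}\ N, \qquad
C(P_X P_Y) = C(P_X) + C(P_Y)\ \text{for independent}\ X, Y,
\]
\[
C(p_1,p_2,p_3,\ldots,p_n) = C(p_1\!+\!p_2,\, p_3,\ldots,p_n) + (p_1\!+\!p_2)\,
C\!\left(\frac{p_1}{p_1+p_2},\frac{p_2}{p_1+p_2}\right),
\]
and together with continuity these force $C(p_1,\ldots,p_n) = -K\sum_i p_i\log p_i$ for some constant $K>0$, that is, the entropy.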

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;^{15} it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

^{15} Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size $\tau$, then for $T \gg \tau$ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.^{16} So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

^{16} Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number $\mathcal{N}(N)$ of walks with N steps, we find that $\mathcal{N}(N) \sim A N^{\gamma} z^{N}$ (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy $S_0 = \log_2 z$, a constant subextensive entropy $\log_2 A$, and a divergent subextensive term $S_1(N) \to \gamma \log_2 N$. Of these three terms, only the divergent subextensive term (related to the critical exponent $\gamma$) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

\[
S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],
\tag{5.1}
\]

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

\[
S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2\!\left(\frac{P[x(t)]}{Q[x(t)]}\right),
\tag{5.2}
\]

which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, $S_{\rm rel}$ measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related almost by definition to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify $x = x_{\rm past}$ as the past data and $y = x_{\rm future}$ as the future. When we compress $x_{\rm past}$ into some reduced description $\hat{x}_{\rm past}$, we keep a certain amount of information about the past, $I(\hat{x}_{\rm past}; x_{\rm past})$, and we also preserve a certain amount of information about the future, $I(\hat{x}_{\rm past}; x_{\rm future})$. There is no single correct compression $x_{\rm past} \to \hat{x}_{\rm past}$; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, $I(\hat{x}_{\rm past}; x_{\rm future})$ versus $I(\hat{x}_{\rm past}; x_{\rm past})$.
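In the formulation of Tishby, Pereira, and Bialek (1999), this family is generated by a variational principle; schematically (our condensed summary of that reference, with $\beta$ the parameter that traces out the curve),
\[
\min_{P(\hat x_{\rm past}\mid x_{\rm past})}
\Bigl[\, I(\hat X_{\rm past}; X_{\rm past}) \;-\; \beta\, I(\hat X_{\rm past}; X_{\rm future}) \,\Bigr],
\]
whose formal solution has the Boltzmann-like form
$P(\hat x_{\rm past}\mid x_{\rm past}) \propto P(\hat x_{\rm past})\exp\bigl(-\beta\, D_{\rm KL}[P(x_{\rm future}\mid x_{\rm past})\,\|\,P(x_{\rm future}\mid \hat x_{\rm past})]\bigr)$;
small $\beta$ favors maximal compression, and large $\beta$ favors keeping the predictive bits.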

The predictive information preserved by compression must be less than the total, so that $I(\hat{x}_{\rm past}; x_{\rm future}) \le I_{\rm pred}(T)$. Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of $\hat{x}_{\rm past}$ such that $I(\hat{x}_{\rm past}; x_{\rm future}) = I_{\rm pred}(T)$. The entropy of this minimal $\hat{x}_{\rm past}$ is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet, one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.^{17} It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

^{17} If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,^{18} and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve $\Lambda(N)$.

^{18} As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.

The learning curve $\Lambda(N)$ exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing $\Lambda(N)$ to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with $I_{\rm pred}(N) \sim A N^{\nu}$ and one with K parameters so that $I_{\rm pred}(N) \sim (K/2)\log_2 N$, the infinite parameter model has less predictive information for all N smaller than some critical value,

\[
N_c \sim \left[\frac{K}{2A\nu}\,\log_2\!\left(\frac{K}{2A}\right)\right]^{1/\nu}.
\tag{6.1}
\]
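Equation 6.1 is the leading-order solution of the crossover condition (the intermediate steps are ours): setting the two asymptotics equal,
\[
A N_c^{\nu} = \frac{K}{2}\log_2 N_c ,
\]
and substituting the zeroth-order estimate $N_c \sim (K/2A)^{1/\nu}$ into the slowly varying logarithm gives $\log_2 N_c \approx \nu^{-1}\log_2(K/2A)$, hence $N_c^{\nu} \approx (K/2A\nu)\log_2(K/2A)$, which is equation 6.1 after raising both sides to the power $1/\nu$.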

In the regime $N \ll N_c$, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have $N \ll N_c$, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime $N \ll N_c$ is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is $K \sim 5\times 10^{9}$. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most $N \sim 10^{9}$ examples. Remembering that we must have $\nu < 1$, even with large values of A, equation 6.1 suggests that we operate with $N < N_c$. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime $N \ll N_c$ is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
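As an illustration of this arithmetic, the sketch below evaluates equation 6.1 for the numbers quoted in the text; the values of A and ν are arbitrary guesses of ours (the argument only requires ν < 1), so the point is simply that N_c comfortably exceeds N ~ 10^9 over a wide range of assumptions.

```python
import math

def n_crossover(K, A, nu):
    """Leading-order N_c from equation 6.1: the sample size below which a
    K-parameter model carries more predictive information than a power-law
    (I_pred ~ A * N**nu) alternative."""
    return ((K / (2.0 * A * nu)) * math.log2(K / (2.0 * A))) ** (1.0 / nu)

K = 5e9          # synapses in ~10 mm^2 of inferotemporal cortex (from the text)
N = 1e9          # rough number of fixations in 10 years of waking life (from the text)
for nu in (0.5, 0.75, 0.9):        # power-law exponents; must satisfy nu < 1
    for A in (1.0, 10.0, 100.0):   # illustrative prefactors only
        Nc = n_crossover(K, A, nu)
        print(f"nu={nu:4.2f}  A={A:6.1f}  N_c ~ {Nc:9.2e}  (N/N_c ~ {N / Nc:.1e})")
```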

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component $S_1(N) \propto N^{1/2}$, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.^{19}

^{19} Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.
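A minimal sketch of the Shannon guessing-game bound is given below, with an order-2 character model standing in for the human reader (an assumption of ours; Shannon's subjects used their full knowledge of English, and training and testing on the same text, done here only for brevity, overstates predictability). Function and variable names are ours, not from the references.

```python
import math
from collections import Counter, defaultdict

def guessing_entropy_bound(text, order=2):
    """Replay the text letter by letter, let an order-`order` character model
    'guess' candidates in order of decreasing probability, record the rank of
    the true letter, and return the entropy (bits/letter) of the rank
    distribution, which in Shannon's construction upper-bounds the
    conditional entropy per letter."""
    counts = defaultdict(Counter)           # context -> counts of the next character
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    alphabet = sorted(set(text))
    ranks = []
    for i in range(order, len(text)):
        ctx, truth = text[i - order:i], text[i]
        ordering = sorted(alphabet, key=lambda c: (-counts[ctx][c], c))
        ranks.append(ordering.index(truth) + 1)   # number of guesses needed
    n = len(ranks)
    freq = Counter(ranks)
    return -sum((m / n) * math.log2(m / n) for m in freq.values())

if __name__ == "__main__":
    sample = ("predictive information measures how much the past of a series "
              "tells us about its future ") * 40
    print(f"guess-number entropy: {guessing_entropy_bound(sample):.3f} bits per letter")
```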

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademiai Kiado.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.

Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.

Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.

Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.

Opper M (1994) Learning and generalization in a two-layer neural networkThe role of the VapnikndashChervonenkis dimension Phys Rev Lett 72 2113ndash2116

Opper M amp Haussler D (1995) Bounds for predictive errors in the statisticalmechanics of supervised learning Phys Rev Lett 75 3772ndash3775

Periwal V (1997)Reparametrization invariant statistical inference and gravityPhys Rev Lett 78 4671ndash4674 See also hep-th9703135

Periwal V (1998) Geometrical statistical inference Nucl Phys B 554[FS] 719ndash730 See also adap-org9801001

Poschel T Ebeling W amp Ros Acirce H (1995) Guessing probability distributionsfrom small samples J Stat Phys 80 1443ndash1452


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.

Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.

Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.

Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.

Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.

Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.

Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.

Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.

Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.

Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.

Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.

Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.

Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.

Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.

Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.

Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.

Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.

Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.

Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.

Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.

Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.

Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.

Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.

Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.

Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.


Finally, learning a model to describe a data set can be seen as an encoding of those data, as emphasized by Rissanen (1989), and the quality of this encoding can be measured using the ideas of information theory. Thus the exploration of learning problems should provide us with explicit links among the concepts of entropy, predictability, and complexity.

The notion of complexity arises not only in learning theory but also in several other contexts. Some physical systems exhibit more complex dynamics than others (turbulent versus laminar flows in fluids), and some systems evolve toward more complex states than others (spin glasses versus ferromagnets). The problem of characterizing complexity in physical systems has a substantial literature of its own (for an overview, see Bennett, 1990). In this context, several authors have considered complexity measures based on entropy or mutual information, although, as far as we know, no clear connections have been drawn among the measures of complexity that arise in learning theory and those that arise in dynamical systems and statistical mechanics.

An essential difficulty in quantifying complexity is to distinguish complexity from randomness. A true random string cannot be compressed and hence requires a long description; it thus is complex in the sense defined by Kolmogorov (1965; Li & Vitányi, 1993; Vitányi & Li, 2000), yet the physical process that generates this string may have a very simple description. Both in statistical mechanics and in learning theory, our intuitive notions of complexity correspond to statements about the complexity of the underlying process and not directly to the description length or Kolmogorov complexity.

Our central result is that the predictive information provides a general measure of complexity, which includes as special cases the relevant concepts from learning theory and dynamical systems. While work on complexity in learning theory rests specifically on the idea that one is trying to infer a model from data, the predictive information is a property of the data (or, more precisely, of an ensemble of data) themselves, without reference to a specific class of underlying models. If the data are generated by a process in a known class but with unknown parameters, then we can calculate the predictive information explicitly and show that this information diverges logarithmically with the size of the data set we have observed; the coefficient of this divergence counts the number of parameters in the model or, more precisely, the effective dimension of the model class, and this provides a link to known results of Rissanen and others. We also can quantify the complexity of processes that fall outside the conventional finite dimensional models, and we show that these more complex processes are characterized by a power-law rather than a logarithmic divergence of the predictive information.

By analogy with the analysis of critical phenomena in statistical physics, the separation of logarithmic from power-law divergences, together with the measurement of coefficients and exponents for these divergences, allows us to define "universality classes" for the complexity of data streams. The power-law, or nonparametric, class of processes may be crucial in real-world learning tasks, where the effective number of parameters becomes so large that asymptotic results for finitely parameterizable models are inaccessible in practice. There is empirical evidence that simple physical systems can generate dynamics in this complexity class, and there are hints that language also may fall in this class.

Finally, we argue that the divergent components of the predictive information provide a unique measure of complexity that is consistent with certain simple requirements. This argument is in the spirit of Shannon's original derivation of entropy as the unique measure of available information. We believe that this uniqueness argument provides a conclusive answer to the question of how one should quantify the complexity of a process generating a time series.

With the evident cost of lengthening our discussion, we have tried to give a self-contained presentation that develops our point of view, uses simple examples to connect with known results, and then generalizes and goes beyond these results.2 Even in cases where at least the qualitative form of our results is known from previous work, we believe that our point of view elucidates some issues that may have been less the focus of earlier studies. Last but not least, we explore the possibilities for connecting our theoretical discussion with the experimental characterization of learning and complexity in neural systems.

2 A Curious Observation

Before starting the systematic analysis of the problem, we want to motivate our discussion further by presenting results of some simple numerical experiments. Since most of the article draws examples from learning, here we consider examples from equilibrium statistical mechanics. Suppose that we have a one-dimensional chain of Ising spins with the Hamiltonian given by

H = -\sum_{i,j} J_{ij} s_i s_j,   (2.1)

where the matrix of interactions J_ij is not restricted to nearest neighbors; long-range interactions are also allowed. One may identify spins pointing upward with 1 and downward with 0, and then a spin chain is equivalent to some sequence of binary digits. This sequence consists of (overlapping) words of N digits each, W_k, k = 0, 1, ..., 2^N - 1. There are 2^N such words total, and they appear with very different frequencies n(W_k) in the spin chain (see Figure 1 for details).

2 Some of the basic ideas presented here, together with some connections to earlier work, can be found in brief preliminary reports (Bialek, 1995; Bialek & Tishby, 1999). The central results of this work, however, were at best conjectures in these preliminary accounts.

[Figure 1 appears here: a 17-spin binary sequence with its overlapping four-digit words W_0 = 0000 through W_15 = 1111 marked; graphic omitted.]

Figure 1: Calculating entropy of words of length 4 in a chain of 17 spins. For this chain, n(W_0) = n(W_1) = n(W_3) = n(W_7) = n(W_{12}) = n(W_{14}) = 2, n(W_8) = n(W_9) = 1, and all other frequencies are zero. Thus S(4) ≈ 2.95 bits.
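The number quoted in the caption is easy to check numerically. The short sketch below (ours, not from the paper) simply plugs the listed word frequencies into the entropy formula of equation 2.2 and should print a value near 2.95 bits.

# A minimal check of the S(4) value in the Figure 1 caption: plug the listed
# word frequencies into the usual entropy formula. This snippet is illustrative
# only; it assumes nothing beyond the counts given in the caption.
import numpy as np

counts = np.array([2, 2, 2, 2, 2, 2, 1, 1], dtype=float)  # the eight nonzero n(W_k)
p = counts / counts.sum()                                  # empirical P_4(W_k)
S4 = -np.sum(p * np.log2(p))
print(S4)                                                  # ~2.95 bits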

If the number of spins is large, then counting these frequencies provides a good empirical estimate of P_N(W_k), the probability distribution of different words of length N. Then one can calculate the entropy S(N) of this probability distribution by the usual formula:

S(N) = -\sum_{k=0}^{2^N - 1} P_N(W_k) \log_2 P_N(W_k) \ \text{(bits)}.   (2.2)

Note that this is not the entropy of a finite chain with length N; instead, it is the entropy of words or strings with length N drawn from a much longer chain. Nonetheless, since entropy is an extensive property, S(N) is proportional asymptotically to N for any spin chain, that is, S(N) ≈ S_0 · N. The usual goal in statistical mechanics is to understand this "thermodynamic limit" N → ∞, and hence to calculate the entropy density S_0. Different sets of interactions J_ij result in different values of S_0, but the qualitative result S(N) ∝ N is true for all reasonable {J_ij}.

We investigated three different spin chains of 1 billion spins each. As usual in statistical mechanics, the probability of any configuration of spins {s_i} is given by the Boltzmann distribution,

P[\{s_i\}] \propto \exp(-H / k_B T),   (2.3)

where, to normalize the scale of the J_ij, we set k_B T = 1. For the first chain, only J_{i,i+1} = 1 was nonzero, and its value was the same for all i. The second chain was also generated using the nearest-neighbor interactions, but the value of the coupling was reset every 400,000 spins by taking a random number from a gaussian distribution with zero mean and unit variance. In the third case, we again reset the interactions at the same frequency, but now interactions were long-ranged; the variances of coupling constants decreased with the distance between the spins as \langle J^2_{ij} \rangle = 1/(i-j)^2. We plot S(N) for all these cases in Figure 2, and of course the asymptotically linear behavior is evident: the extensive entropy shows no qualitative distinction among the three cases we consider.
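A scaled-down version of the first of these experiments is easy to reproduce. The sketch below (ours, not the authors' code) generates a nearest-neighbor chain with fixed J at k_B T = 1 by sequential Markov sampling (for this chain, P(s_{i+1} = s_i) = e^J/(e^J + e^{-J})), uses far fewer than 10^9 spins, and estimates the word entropy S(N) by the plug-in formula of equation 2.2; the chain length, coupling, and word lengths are illustrative choices.

# A minimal sketch of the fixed-J numerical experiment described above,
# scaled down from 10^9 spins. Not the authors' code; all sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def ising_chain(n_spins, J):
    # Nearest-neighbor chain at k_B T = 1: successive spins form a Markov chain
    # with P(s_{i+1} = s_i) = e^J / (e^J + e^-J).
    p_same = np.exp(J) / (np.exp(J) + np.exp(-J))
    steps = np.where(rng.random(n_spins - 1) < p_same, 1, -1)
    spins = np.cumprod(np.concatenate(([1], steps)))
    return (spins + 1) // 2            # map spins -1, +1 to binary digits 0, 1

def word_entropy(bits, N):
    # Plug-in entropy (in bits) of the empirical distribution of overlapping
    # words of length N, as in equation 2.2.
    powers = 2 ** np.arange(N)
    codes = np.convolve(bits, powers, mode="valid")   # one integer label per word
    counts = np.bincount(codes)
    p = counts[counts > 0] / codes.size
    return -np.sum(p * np.log2(p))

bits = ising_chain(1_000_000, J=1.0)
for N in range(1, 16):
    print(N, word_entropy(bits, N))    # S(N) grows ~linearly; S(N) - S0*N saturates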

[Figure 2 appears here: S(N) versus N for the three chains (fixed J; variable J, short-range interactions; variable J's, long-range decaying interactions); plot omitted.]

Figure 2: Entropy as a function of the word length for spin chains with different interactions. Notice that all lines start from S(N) = log_2 2 = 1, since at the values of the coupling we investigated, the correlation length is much smaller than the chain length (1 · 10^9 spins).

The situation changes drastically if we remove the asymptotic linear contribution and focus on the corrections to extensive behavior. Specifically, we write S(N) = S_0 · N + S_1(N) and plot only the sublinear component S_1(N) of the entropy. As we see in Figure 3, the three chains then exhibit qualitatively different features: for the first one, S_1 is constant; for the second one, it is logarithmic; and for the third one, it clearly shows a power-law behavior.

What is the significance of these observations? Of course, the differences in the behavior of S_1(N) must be related to the ways we chose J's for the simulations. In the first case, J is fixed, and if we see N spins and try to predict the state of the N+1st spin, all that really matters is the state of the spin N; there is nothing to "learn" from observations on longer segments of the chain. For the second chain, J changes, and the statistics of the spin words are different in different parts of the sequence. By looking at these statistics, one can "learn" the coupling at the current position; this estimate improves the more spins (longer words) we observe. Finally, in the third case there are many coupling constants that can be learned. As N increases, one becomes sensitive to weaker correlations caused by interactions over larger and larger distances.

[Figure 3 appears here: S_1(N) versus N for the three chains (fixed J; variable J, short-range interactions; variable J's, long-range decaying interactions), together with fits S_1 = const.; S_1 = const_1 + (1/2) log N; S_1 = const_1 + const_2 N^{0.5}; plot omitted.]

Figure 3: Subextensive part of the entropy as a function of the word length.

So, intuitively, the qualitatively different behaviors of S_1(N) in the three plotted cases are correlated with differences in the problem of learning the underlying dynamics of the spin chains from observations on samples of the spins themselves. Much of this article can be seen as expanding on and quantifying this intuitive observation.

3 Fundamentals

The problem of prediction comes in various forms, as noted in the Introduction. Information theory allows us to treat the different notions of prediction on the same footing. The first step is to recognize that all predictions are probabilistic. Even if we can predict the temperature at noon tomorrow, we should provide error bars or confidence limits on our prediction. The next step is to remember that even before we look at the data, we know that certain futures are more likely than others, and we can summarize this knowledge by a prior probability distribution for the future. Our observations on the past lead us to a new, more tightly concentrated distribution, the distribution of futures conditional on the past data. Different kinds of predictions are different slices through or averages over this conditional distribution, but information theory quantifies the "concentration" of the distribution without making any commitment as to which averages will be most interesting.


Imagine that we observe a stream of data x(t) over a time interval -T < t < 0. Let all of these past data be denoted by the shorthand x_past. We are interested in saying something about the future, so we want to know about the data x(t) that will be observed in the time interval 0 < t < T'; let these future data be called x_future. In the absence of any other knowledge, futures are drawn from the probability distribution P(x_future), and observations of particular past data x_past tell us that futures will be drawn from the conditional distribution P(x_future | x_past). The greater concentration of the conditional distribution can be quantified by the fact that it has smaller entropy than the prior distribution, and this reduction in entropy is Shannon's definition of the information that the past provides about the future. We can write the average of this predictive information as

I_pred(T, T') = \left\langle \log_2 \frac{P(x_future | x_past)}{P(x_future)} \right\rangle   (3.1)
             = -\langle \log_2 P(x_future) \rangle - \langle \log_2 P(x_past) \rangle + \langle \log_2 P(x_future, x_past) \rangle,   (3.2)

where ⟨· · ·⟩ denotes an average over the joint distribution of the past and the future, P(x_future, x_past).

Each of the terms in equation 3.2 is an entropy. Since we are interested in predictability or generalization, which are associated with some features of the signal persisting forever, we may assume stationarity or invariance under time translations. Then the entropy of the past data depends only on the duration of our observations, so we can write -⟨log_2 P(x_past)⟩ = S(T), and by the same argument -⟨log_2 P(x_future)⟩ = S(T'). Finally, the entropy of the past and the future taken together is the entropy of observations on a window of duration T + T', so that -⟨log_2 P(x_future, x_past)⟩ = S(T + T'). Putting these equations together, we obtain

I_pred(T, T') = S(T) + S(T') - S(T + T').   (3.3)

It is important to recall that mutual information is a symmetric quantity. Thus, we can view I_pred(T, T') as either the information that a data segment of duration T provides about the future of length T' or the information that a data segment of duration T' provides about the immediate past of duration T. This is a direct application of the definition of information but seems counterintuitive. Shouldn't it be more difficult to predict than to postdict? One can perhaps recover the correct intuition by thinking about a large ensemble of question-answer pairs. Prediction corresponds to generating the answer to a given question, while postdiction corresponds to generating the question that goes with a given answer. We know that guessing questions given answers is also a hard problem,3 and can be just as challenging as the more conventional problem of answering the questions themselves. Our focus here is on prediction because we want to make connections with the phenomenon of generalization in learning, but it is clear that generating a consistent interpretation of observed data may involve elements of both prediction and postdiction (see, for example, Eagleman & Sejnowski, 2000); it is attractive that the information-theoretic formulation treats these problems symmetrically.

3 This is the basis of the popular American television game show Jeopardy, widely viewed as the most "intellectual" of its genre.

In the same way that the entropy of a gas at fixed density is proportional to the volume, the entropy of a time series (asymptotically) is proportional to its duration, so that lim_{T→∞} S(T)/T = S_0; entropy is an extensive quantity. But from equation 3.3, any extensive component of the entropy cancels in the computation of the predictive information: predictability is a deviation from extensivity. If we write S(T) = S_0 T + S_1(T), then equation 3.3 tells us that the predictive information is related only to the nonextensive term S_1(T). Note that if we are observing a deterministic system, then S_0 = 0, but this is independent of questions about the structure of the subextensive term S_1(T). It is attractive that information theory gives us a unified discussion of prediction in deterministic and probabilistic cases.

We know two general facts about the behavior of S_1(T). First, the corrections to extensive behavior are positive, S_1(T) ≥ 0. Second, the statement that entropy is extensive is the statement that the limit

\lim_{T \to \infty} \frac{S(T)}{T} = S_0   (3.4)

exists, and for this to be true we must also have

\lim_{T \to \infty} \frac{S_1(T)}{T} = 0.   (3.5)

Thus the nonextensive terms in the entropy must be subextensive, that is, they must grow with T less rapidly than a linear function. Taken together, these facts guarantee that the predictive information is positive and subextensive. Furthermore, if we let the future extend forward for a very long time, T' → ∞, then we can measure the information that our sample provides about the entire future:

I_pred(T) = \lim_{T' \to \infty} I_pred(T, T') = S_1(T).   (3.6)

Similarly, instead of increasing the duration of the future to infinity, we could have considered the mutual information between a sample of length T and all of the infinite past. Then the postdictive information also is equal to S_1(T), and the symmetry between prediction and postdiction is even more profound: not only is there symmetry between questions and answers, but observations on a given period of time provide the same amount of information about the historical path that led to our observations as about the future that will unfold from them. In some cases this statement becomes even stronger. For example, if the subextensive entropy of a long discontinuous observation of a total length T, with a gap of a duration δT ≪ T, is equal to S_1(T) + O(δT/T), then the subextensive entropy of the present is not only its information about the past or the future but also the information about the past and the future.

If we have been observing a time series for a (long) time T, then the total amount of data we have collected is measured by the entropy S(T), and at large T this is given approximately by S_0 T. But the predictive information that we have gathered cannot grow linearly with time, even if we are making predictions about a future that stretches out to infinity. As a result, of the total information we have taken in by observing x_past, only a vanishing fraction is of relevance to the prediction:

\lim_{T \to \infty} \frac{\text{Predictive information}}{\text{Total information}} = \frac{I_pred(T)}{S(T)} \to 0.   (3.7)

In this precise sense, most of what we observe is irrelevant to the problem of predicting the future.4

4 We can think of equation 3.7 as a law of diminishing returns. Although we collect data in proportion to our observation time T, a smaller and smaller fraction of this information is useful in the problem of prediction. These diminishing returns are not due to a limited lifetime, since we calculate the predictive information assuming that we have a future extending forward to infinity. A senior colleague points out that this is an argument for changing fields before becoming too expert.

Consider the case where time is measured in discrete steps, so that we have seen N time points x_1, x_2, ..., x_N. How much have we learned about the underlying pattern in these data? The more we know, the more effectively we can predict the next data point x_{N+1}, and hence the fewer bits we will need to describe the deviation of this data point from our prediction. Our accumulated knowledge about the time series is measured by the degree to which we can compress the description of new observations. On average, the length of the code word required to describe the point x_{N+1}, given that we have seen the previous N points, is given by

\ell(N) = -\langle \log_2 P(x_{N+1} | x_1, x_2, \ldots, x_N) \rangle \ \text{bits},   (3.8)

where the expectation value is taken over the joint distribution of all the N + 1 points, P(x_1, x_2, ..., x_N, x_{N+1}). It is easy to see that

\ell(N) = S(N+1) - S(N) \approx \frac{\partial S(N)}{\partial N}.   (3.9)

As we observe for longer times, we learn more, and this word length decreases. It is natural to define a learning curve that measures this improvement. Usually we define learning curves by measuring the frequency or costs of errors; here the cost is that our encoding of the point x_{N+1} is longer than it could be if we had perfect knowledge. This ideal encoding has a length that we can find by imagining that we observe the time series for an infinitely long time, \ell_{ideal} = \lim_{N \to \infty} \ell(N), but this is just another way of defining the extensive component of the entropy S_0. Thus we can define a learning curve

\Lambda(N) \equiv \ell(N) - \ell_{ideal}   (3.10)
          = S(N+1) - S(N) - S_0
          = S_1(N+1) - S_1(N)
          \approx \frac{\partial S_1(N)}{\partial N} = \frac{\partial I_pred(N)}{\partial N},   (3.11)

and we see once again that the extensive component of the entropy cancels.

It is well known that the problems of prediction and compression are related, and what we have done here is to illustrate one aspect of this connection. Specifically, if we ask how much one segment of a time series can tell us about the future, the answer is contained in the subextensive behavior of the entropy. If we ask how much we are learning about the structure of the time series, then the natural and universally defined learning curve is related again to the subextensive entropy; the learning curve is the derivative of the predictive information.
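Continuing the spin-chain sketch from section 2 (the helper word_entropy and the array bits are assumed to be defined there; the crude S_0 estimate below is an illustrative shortcut, not a careful estimator), the learning curve of equation 3.11 can be read off from the empirical block entropies as follows.

# A rough sketch of the universal learning curve of equation 3.11, reusing the
# word_entropy estimator and the bits array from the earlier spin-chain snippet.
import numpy as np

Ns = np.arange(1, 16)
S = np.array([word_entropy(bits, N) for N in Ns])
S0 = S[-1] - S[-2]             # crude estimate of the entropy density S_0
S1 = S - S0 * Ns               # subextensive part, S_1(N) = S(N) - S_0 * N
Lam = S1[1:] - S1[:-1]         # learning curve  Lambda(N) ~ S_1(N+1) - S_1(N)
print(np.c_[Ns[:-1], Lam])     # should decrease toward zero as N grows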

This universal learning curve is connected to the more conventional learning curves in specific contexts. As an example (cf. section 4.1), consider fitting a set of data points {x_n, y_n} with some class of functions y = f(x; α), where α are unknown parameters that need to be learned; we also allow for some gaussian noise in our observation of the y_n. Here the natural learning curve is the evolution of χ² for generalization as a function of the number of examples. Within the approximations discussed below, it is straightforward to show that as N becomes large,

\langle \chi^2(N) \rangle = \frac{1}{\sigma^2} \left\langle [y - f(x; \alpha)]^2 \right\rangle \to (2 \ln 2)\, \Lambda(N) + 1,   (3.12)

where σ² is the variance of the noise. Thus, a more conventional measure of performance at learning a function is equal to the universal learning curve defined purely by information-theoretic criteria. In other words, if a learning curve is measured in the right units, then its integral represents the amount of useful information accumulated. The subextensivity of S_1 then guarantees that the learning curve decreases to zero as N → ∞.

Different quantities related to the subextensive entropy have been discussed in several contexts. For example, the code length ℓ(N) has been defined as a learning curve in the specific case of neural networks (Opper & Haussler, 1995) and has been termed the "thermodynamic dive" (Crutchfield & Shalizi, 1999) and "Nth order block entropy" (Grassberger, 1986).


The universal learning curve Λ(N) has been studied as the expected instantaneous information gain by Haussler, Kearns, and Schapire (1994). Mutual information between all of the past and all of the future (both semi-infinite) is known also as the excess entropy, effective measure complexity, stored information, and so on (see Shalizi & Crutchfield, 1999, and references therein, as well as the discussion below). If the data allow a description by a model with a finite (and in some cases also infinite) number of parameters, then mutual information between the data and the parameters is of interest. This is easily shown to be equal to the predictive information about all of the future, and it is also the cumulative information gain (Haussler, Kearns, & Schapire, 1994) or the cumulative relative entropy risk (Haussler & Opper, 1997). Investigation of this problem can be traced back to Renyi (1964) and Ibragimov and Hasminskii (1972), and some particular topics are still being discussed (Haussler & Opper, 1995; Opper & Haussler, 1995; Herschkowitz & Nadal, 1999). In certain limits, decoding a signal from a population of N neurons can be thought of as "learning" a parameter from N observations, with a parallel notion of information transmission (Brunel & Nadal, 1998; Kang & Sompolinsky, 2001). In addition, the subextensive component of the description length (Rissanen, 1978, 1989, 1996; Clarke & Barron, 1990), averaged over a class of allowed models, also is similar to the predictive information. What is important is that the predictive information or subextensive entropy is related to all these quantities and that it can be defined for any process without a reference to a class of models. It is this universality that we find appealing, and this universality is strongest if we focus on the limit of long observation times. Qualitatively, in this regime (T → ∞) we expect the predictive information to behave in one of three different ways, as illustrated by the Ising models above: it may either stay finite or grow to infinity together with T; in the latter case, the rate of growth may be slow (logarithmic) or fast (sublinear power) (see Barron and Cover, 1991, for a similar classification in the framework of the minimal description length [MDL] analysis).

The first possibility, lim_{T→∞} I_pred(T) = constant, means that no matter how long we observe, we gain only a finite amount of information about the future. This situation prevails, for example, when the dynamics are too regular. For a purely periodic system, complete prediction is possible once we know the phase, and if we sample the data at discrete times, this is a finite amount of information; longer period orbits intuitively are more complex and also have larger I_pred, but this does not change the limiting behavior lim_{T→∞} I_pred(T) = constant.

Alternatively, the predictive information can be small when the dynamics are irregular, but the best predictions are controlled only by the immediate past, so that the correlation times of the observable data are finite (see, for example, Crutchfield & Feldman, 1997, and the fixed J case in Figure 3). Imagine, for example, that we observe x(t) at a series of discrete times {t_n}, and that at each time point we find the value x_n. Then we always can write the joint distribution of the N data points as a product,

P(x_1, x_2, \ldots, x_N) = P(x_1) P(x_2 | x_1) P(x_3 | x_2, x_1) \cdots.   (3.13)

For Markov processes, what we observe at t_n depends only on events at the previous time step t_{n-1}, so that

P(x_n | \{x_{1 \le i \le n-1}\}) = P(x_n | x_{n-1}),   (3.14)

and hence the predictive information reduces to

I_pred = \left\langle \log_2 \frac{P(x_n | x_{n-1})}{P(x_n)} \right\rangle.   (3.15)

The maximum possible predictive information in this case is the entropy of the distribution of states at one time step, which is bounded by the logarithm of the number of accessible states. To approach this bound, the system must maintain memory for a long time, since the predictive information is reduced by the entropy of the transition probabilities. Thus, systems with more states and longer memories have larger values of I_pred.
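Equation 3.15 is easy to verify in a toy case. The sketch below (ours; the transition matrix is an arbitrary illustrative choice) computes, for a two-state stationary Markov chain, the block entropies S(N) by exact enumeration and checks that S(N) + S(N') - S(N + N') equals the one-step mutual information of equation 3.15, independent of N and N'.

# An exact check of equation 3.15 for a two-state Markov chain (illustrative only).
import itertools
import numpy as np

P = np.array([[0.9, 0.1],      # P[a, b] = Prob(next symbol = b | current = a)
              [0.3, 0.7]])
evals, evecs = np.linalg.eig(P.T)                      # stationary distribution pi
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi /= pi.sum()

def block_entropy(N):
    # Exact entropy (bits) of words of length N under the stationary chain.
    S = 0.0
    for word in itertools.product([0, 1], repeat=N):
        p = pi[word[0]]
        for a, b in zip(word, word[1:]):
            p *= P[a, b]
        S -= p * np.log2(p)
    return S

I_direct = sum(pi[a] * P[a, b] * np.log2(P[a, b] / pi[b])
               for a in (0, 1) for b in (0, 1))        # equation 3.15
for N, Np in [(1, 1), (3, 2), (6, 5)]:
    I_block = block_entropy(N) + block_entropy(Np) - block_entropy(N + Np)
    print(N, Np, I_block, I_direct)                    # the two numbers agree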

More interesting are those cases in which I_pred(T) diverges at large T. In physical systems, we know that there are critical points where correlation times become infinite, so that optimal predictions will be influenced by events in the arbitrarily distant past. Under these conditions, the predictive information can grow without bound as T becomes large; for many systems the divergence is logarithmic, I_pred(T → ∞) ∝ ln T, as for the variable J_ij, short-range Ising model of Figures 2 and 3. Long-range correlations also are important in a time series where we can learn some underlying rules. It will turn out that when the set of possible rules can be described by a finite number of parameters, the predictive information again diverges logarithmically, and the coefficient of this divergence counts the number of parameters. Finally, a faster growth is also possible, so that I_pred(T → ∞) ∝ T^α, as for the variable J_ij, long-range Ising model, and we shall see that this behavior emerges from, for example, nonparametric learning problems.

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations or associations persist over long periods of time. In the usual theoretical models, there is some rule underlying the observable data, and this rule is valid forever; examples seen at one time inform us about the rule, and this information can be used to make predictions or generalizations. The predictive information quantifies the average generalization power of examples, and we shall see that there is a direct connection between the predictive information and the complexity of the possible underlying rules.


4.1 A Test Case. Let us begin with a simple example, already referred to above. We observe two streams of data x and y, or equivalently a stream of pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N). Assume that we know in advance that the x's are drawn independently and at random from a distribution P(x), while the y's are noisy versions of some function acting on x,

y_n = f(x_n; \alpha) + \eta_n,   (4.1)

where f(x; α) is a class of functions parameterized by α, and η_n is noise, which for simplicity we will assume is gaussian with known standard deviation σ. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

f(x; \alpha) = \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x).   (4.2)

The usual problem is to estimate, from N pairs {x_i, y_i}, the values of the parameters α. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy S(N). We go through the calculation in some detail because it provides a model for the more general case.
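For concreteness, the sketch below (ours; the polynomial basis, noise level, and prior width are illustrative choices, not taken from the paper) generates data from the model of equations 4.1 and 4.2 and recovers the parameters by least squares, the "effective regression formula" of the favorable case.

# A small sketch of the test case of equations 4.1-4.2 (illustrative choices only).
import numpy as np

rng = np.random.default_rng(1)
K, sigma = 4, 0.5
phi = lambda x: np.vander(x, K, increasing=True)   # basis functions 1, x, x^2, x^3
alpha_true = rng.normal(size=K)                    # the "true" parameters

for N in [10, 100, 1000, 10000]:
    x = rng.uniform(-1, 1, size=N)                 # x drawn i.i.d. from P(x)
    y = phi(x) @ alpha_true + sigma * rng.normal(size=N)
    alpha_hat, *_ = np.linalg.lstsq(phi(x), y, rcond=None)
    print(N, np.linalg.norm(alpha_hat - alpha_true))   # error shrinks roughly as 1/sqrt(N)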

To evaluate the entropy S(N), we first construct the probability distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). The same set of rules applies to the whole data stream, which here means that the same parameters α apply for all pairs {x_i, y_i}, but these parameters are chosen at random from a distribution P(α) at the start of the stream. Thus we write

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \int d^K\alpha\, P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N | \alpha)\, \mathcal{P}(\alpha),   (4.3)

and now we need to construct the conditional distributions for fixed α. By hypothesis, each x is chosen independently, and once we fix α, each y_i is correlated only with the corresponding x_i, so that we have

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N | \alpha) = \prod_{i=1}^{N} \left[ P(x_i)\, P(y_i | x_i; \alpha) \right].   (4.4)

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of y_i has the form

P(y_i | x_i; \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} \left( y_i - \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x_i) \right)^2 \right].   (4.5)


Putting all these factors together,

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^N \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} y_i^2 \right]
  \times \exp\left[ -\frac{N}{2} \sum_{\mu,\nu=1}^{K} A_{\mu\nu}(\{x_i\}) \alpha_\mu \alpha_\nu + N \sum_{\mu=1}^{K} B_\mu(\{x_i, y_i\}) \alpha_\mu \right],   (4.6)

where

A_{\mu\nu}(\{x_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} \phi_\mu(x_i) \phi_\nu(x_i), \quad \text{and}   (4.7)

B_\mu(\{x_i, y_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} y_i \phi_\mu(x_i).   (4.8)

Our placement of the factors of N means that both A_{\mu\nu} and B_\mu are of order unity as N → ∞. These quantities are empirical averages over the samples {x_i, y_i}, and if the φ_μ are well behaved, we expect that these empirical means converge to expectation values for most realizations of the series {x_i}:

\lim_{N \to \infty} A_{\mu\nu}(\{x_i\}) = A^{\infty}_{\mu\nu} = \frac{1}{\sigma^2} \int dx\, P(x)\, \phi_\mu(x) \phi_\nu(x), \quad \text{and}   (4.9)

\lim_{N \to \infty} B_\mu(\{x_i, y_i\}) = B^{\infty}_\mu = \sum_{\nu=1}^{K} A^{\infty}_{\mu\nu} \bar{\alpha}_\nu,   (4.10)

where \bar{\alpha} are the parameters that actually gave rise to the data stream {x_i, y_i}. In fact, we can make the same argument about the terms in \sum_i y_i^2:

\lim_{N \to \infty} \sum_{i=1}^{N} y_i^2 = N \sigma^2 \left[ \sum_{\mu,\nu=1}^{K} \bar{\alpha}_\mu A^{\infty}_{\mu\nu} \bar{\alpha}_\nu + 1 \right].   (4.11)

Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for and implications of this convergence breaking down.

Putting the different factors together, we obtain

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) \simeq \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^N \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right],   (4.12)


where the effective "energy" per sample is given by

E_N(\alpha; \{x_i, y_i\}) = \frac{1}{2} + \frac{1}{2} \sum_{\mu,\nu=1}^{K} (\alpha_\mu - \bar{\alpha}_\mu) A^{\infty}_{\mu\nu} (\alpha_\nu - \bar{\alpha}_\nu).   (4.13)

Here we use the symbol ≃ to indicate that we not only take the limit of large N but also neglect the fluctuations. Note that in this approximation, the dependence on the sample points themselves is hidden in the definition of \bar{\alpha} as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor N in the exponent; the energy E_N is of order unity as N → ∞. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

\int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right]
  \approx \mathcal{P}(\alpha_{cl}) \exp\left[ -N E_N(\alpha_{cl}; \{x_i, y_i\}) - \frac{K}{2} \ln \frac{N}{2\pi} - \frac{1}{2} \ln \det \mathcal{F}_N + \cdots \right],   (4.14)

where α_cl is the "classical" value of α determined by the extremal conditions

\left. \frac{\partial E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu} \right|_{\alpha = \alpha_{cl}} = 0,   (4.15)

the matrix \mathcal{F}_N consists of the second derivatives of E_N,

\mathcal{F}_N = \left. \frac{\partial^2 E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu \partial \alpha_\nu} \right|_{\alpha = \alpha_{cl}},   (4.16)

and · · · denotes terms that vanish as N → ∞. If we formulate the problem of estimating the parameters α from the samples {x_i, y_i}, then as N → ∞ the matrix N\mathcal{F}_N is the Fisher information matrix (Cover & Thomas, 1991); the eigenvectors of this matrix give the principal axes for the error ellipsoid in parameter space, and the (inverse) eigenvalues give the variances of parameter estimates along each of these directions. The classical α_cl differs from \bar{\alpha} only in terms of order 1/N; we neglect this difference and further simplify the calculation of leading terms as N becomes large. After a little more algebra, we find the probability distribution we have been looking for:

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)
  \simeq \left[ \prod_{i=1}^{N} P(x_i) \right] \frac{1}{Z_A}\, \mathcal{P}(\bar{\alpha}) \exp\left[ -\frac{N}{2} \ln(2\pi e \sigma^2) - \frac{K}{2} \ln N + \cdots \right],   (4.17)


where the normalization constant

Z_A = \sqrt{(2\pi)^K \det A^{\infty}}.   (4.18)

Again we note that the sample points {x_i, y_i} are hidden in the value of \bar{\alpha} that gave rise to these points.5

To evaluate the entropy S(N), we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the x_i, and because these are chosen independently from the distribution P(x), the average again is easy to evaluate. The third term involves \bar{\alpha}, and we need to average this over the joint distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). As above, we can evaluate this average in steps: first, we choose a value of the parameters \bar{\alpha}; then we average over the samples given these parameters; and finally we average over parameters. But because \bar{\alpha} is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

S(N) = N \left[ S_x + \frac{1}{2} \log_2\left( 2\pi e \sigma^2 \right) \right] + \frac{K}{2} \log_2 N + S_\alpha + \left\langle \log_2 Z_A \right\rangle_{\alpha} + \cdots,   (4.19)

where ⟨· · ·⟩_α means averaging over parameters, S_x is the entropy of the distribution of x,

S_x = -\int dx\, P(x) \log_2 P(x),   (4.20)

and similarly for the entropy of the distribution of parameters,

S_\alpha = -\int d^K\alpha\, \mathcal{P}(\alpha) \log_2 \mathcal{P}(\alpha).   (4.21)

5 We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points {x_i, y_i}. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as N → ∞, so that this approach should yield the asymptotic behavior of the distribution and hence the subextensive entropy at large N. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as N → ∞, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.


The different terms in the entropy, equation 4.19, have a straightforward interpretation. First, we see that the extensive term in the entropy,

S_0 = S_x + \frac{1}{2} \log_2(2\pi e \sigma^2),   (4.22)

reflects contributions from the random choice of x and from the gaussian noise in y; these extensive terms are independent of the variations in parameters α, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, S_α. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term ⟨log_2 Z_A⟩_α compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

S_1(N) \to \frac{K}{2} \log_2 N \ \text{(bits)}.   (4.23)

The coefficient of this divergence counts the number of parameters, independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions {φ_μ(x)}. This is a hint that the result in equation 4.23 is more universal than our simple example.
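The (K/2) log_2 N behavior can also be checked without any saddle-point machinery, because for the gaussian model of this section the entropy of y_1, ..., y_N conditional on the x's is available in closed form: the y's are jointly gaussian with covariance σ²I + σ_a²ΦΦ^T when the prior on the coefficients is gaussian with variance σ_a², so the subextensive part is (1/2) log_2 det(I_K + (σ_a/σ)² Φ^T Φ). The sketch below (ours; the gaussian prior width, noise level, and basis are illustrative choices) compares that exact expression with (K/2) log_2 N; the two should differ only by an N-independent constant.

# A numerical check that the subextensive entropy of the gaussian linear model
# grows as (K/2) log2 N, as in equation 4.23. Illustrative choices throughout.
import numpy as np

rng = np.random.default_rng(2)
K, sigma, sigma_a = 4, 0.5, 1.0

def subextensive_entropy(N):
    x = rng.uniform(-1, 1, size=N)
    Phi = np.vander(x, K, increasing=True)             # basis functions 1, x, x^2, x^3
    M = np.eye(K) + (sigma_a / sigma) ** 2 * (Phi.T @ Phi)
    return 0.5 * np.linalg.slogdet(M)[1] / np.log(2)   # (1/2) log2 det M, in bits

for N in [100, 1000, 10000, 100000]:
    # the two columns differ by a constant offset as N grows
    print(N, subextensive_entropy(N), 0.5 * K * np.log2(N))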

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning: we are given examples of how the points x_n map into y_n, and from these examples we are to induce the association or functional relation between x and y. An alternative view is that the pair of points (x, y) should be viewed as a vector \vec{x}, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables {\vec{x}_i} is drawn independently from the same probability distribution Q(\vec{x}|α), and this distribution depends on a (potentially infinite dimensional) vector of parameters α. As above, the parameters are unknown, and before the series starts, they are chosen randomly from a distribution P(α). With no constraints on the densities P(α) or Q(\vec{x}|α), it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters α, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997, and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).


We begin with the definition of entropy:

S(N) \equiv S[\{\vec{x}_i\}] = -\int d\vec{x}_1 \cdots d\vec{x}_N\, P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \log_2 P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N).   (4.24)

By analogy with equation 4.3, we then write

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) = \int d^K\alpha\, \mathcal{P}(\alpha) \prod_{i=1}^{N} Q(\vec{x}_i | \alpha).   (4.25)

Next, combining equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite S(N) as

S(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \left\{ \int d\vec{x}_1 \cdots d\vec{x}_N \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \log_2 P(\{\vec{x}_i\}) \right\}.   (4.26)

Equation 4.26 allows an easy interpretation. There is the "true" set of parameters \bar{\alpha} that gave rise to the data sequence \vec{x}_1, ..., \vec{x}_N with the probability \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}). We need to average \log_2 P(\vec{x}_1, \ldots, \vec{x}_N) first over all possible realizations of the data keeping the true parameters fixed, and then over the parameters \bar{\alpha} themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

P(\vec{x}_1, \ldots, \vec{x}_N) = \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \prod_{i=1}^{N} \left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right]
  = \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{\vec{x}_i\}) \right],   (4.27)

E_N(\alpha; \{\vec{x}_i\}) = -\frac{1}{N} \sum_{i=1}^{N} \ln \left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right].   (4.28)

Since by our interpretation \bar{\alpha} are the true parameters that gave rise to the particular data {\vec{x}_i}, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

E_N(\alpha; \{\vec{x}_i\}) = -\int d^D x\, Q(\vec{x} | \bar{\alpha}) \ln \left[ \frac{Q(\vec{x} | \alpha)}{Q(\vec{x} | \bar{\alpha})} \right] - \psi(\alpha, \bar{\alpha}; \{\vec{x}_i\}),   (4.29)

where ψ → 0 as N → ∞; here we neglect ψ and return to this term below.


The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, D_KL(\bar{\alpha} \| \alpha), between the true distribution characterized by parameters \bar{\alpha} and the possible distribution characterized by α. Thus, at large N we have

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \simeq \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N D_{KL}(\bar{\alpha} \| \alpha) \right],   (4.30)

where again the notation ≃ reminds us that we are not only taking the limit of large N but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find

S(N) \approx S_0 \cdot N + S_1^{(a)}(N),   (4.31)

S_0 = \int d^K\alpha\, \mathcal{P}(\alpha) \left[ -\int d^D x\, Q(\vec{x} | \alpha) \log_2 Q(\vec{x} | \alpha) \right], \quad \text{and}   (4.32)

S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \log_2 \left[ \int d^K\alpha\, \mathcal{P}(\alpha)\, e^{-N D_{KL}(\bar{\alpha} \| \alpha)} \right].   (4.33)

Here S_1^{(a)} is an approximation to S_1 that neglects fluctuations ψ. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence \vec{x}_1, ..., \vec{x}_N with the disorder, E_N(\alpha; \{\vec{x}_i\}) with the energy of the quenched system, and D_KL(\bar{\alpha} \| \alpha) with its annealed analog.

The extensive term S_0, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the N dependence of the partition function

Z(\bar{\alpha}; N) = \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N D_{KL}(\bar{\alpha} \| \alpha) \right],   (4.34)

and S_1^{(a)}(N) = -\langle \log_2 Z(\bar{\alpha}; N) \rangle_{\bar{\alpha}} is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are KL divergence D away from the target \bar{\alpha},

\rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\left[ D - D_{KL}(\bar{\alpha} \| \alpha) \right].   (4.35)


This density could be very different for different targets.6 The density of divergences is normalized because the original distribution over parameter space, P(α), is normalized:

\int dD\, \rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha) = 1.   (4.36)

Finally, the partition function takes the simple form

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha})\, e^{-N D}.   (4.37)

We recall that in statistical mechanics, the partition function is given by

Z(\beta) = \int dE\, \rho(E)\, e^{-\beta E},   (4.38)

where ρ(E) is the density of states that have energy E and β is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length N of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of ρ(D; \bar{\alpha}) as D → 0.7
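This statistical-mechanics picture can be probed directly by Monte Carlo. The sketch below is ours and uses an illustrative model family not taken from the paper: K independent unit-variance gaussians with unknown means and a standard normal prior over those means, for which D_KL(ᾱ‖α) = |α - ᾱ|²/2 in nats. Averaging e^{-N D} over prior draws estimates Z(ᾱ; N) of equation 4.34, and the "free energy" -log_2 Z should grow as (K/2) log_2 N, up to an N-independent constant.

# Monte Carlo sketch of the partition function of equation 4.34 for an
# illustrative gaussian-location family (not from the paper).
import numpy as np

rng = np.random.default_rng(3)
K, n_samples = 2, 1_000_000
abar = rng.normal(size=K)                    # target parameters, drawn from the prior
a = rng.normal(size=(n_samples, K))          # parameters drawn from the prior P(alpha)
D = 0.5 * np.sum((a - abar) ** 2, axis=1)    # D_KL(abar || a) for this family, in nats

for N in [10, 100, 1000, 10000]:
    Z = np.mean(np.exp(-N * D))              # Monte Carlo estimate of Z(abar; N)
    # -log2 Z tracks (K/2) log2 N plus a constant; the estimate gets noisier at large N
    print(N, -np.log2(Z), 0.5 * K * np.log2(N))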

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small D, and the density term, which favors values of D that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with D → 0. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states ρ(D; \bar{\alpha}) at small D is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

D_{KL}(\bar{\alpha} \| \alpha) \approx \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu)\, \mathcal{F}_{\mu\nu}\, (\bar{\alpha}_\nu - \alpha_\nu) + \cdots,   (4.39)

6 If parameter space is compact, then a related description of the space of targets, based on metric entropy, also called Kolmogorov's ε-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of size not greater than ε into which the target space can be partitioned.

7 It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.


where \mathcal{F} is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low D limit of the density ρ(D; \bar{\alpha}):

\rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\left[ D - D_{KL}(\bar{\alpha} \| \alpha) \right]
  \approx \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\left[ D - \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu)\, \mathcal{F}_{\mu\nu}\, (\bar{\alpha}_\nu - \alpha_\nu) \right]
  = \int d^K\xi\, \mathcal{P}(\bar{\alpha} + U \cdot \xi)\, \delta\left[ D - \frac{1}{2} \sum_{\mu} \Lambda_\mu \xi_\mu^2 \right],   (4.40)

where U is a matrix that diagonalizes \mathcal{F},

(U^T \cdot \mathcal{F} \cdot U)_{\mu\nu} = \Lambda_\mu \delta_{\mu\nu}.   (4.41)

The delta function restricts the components of ξ in equation 4.40 to be of order \sqrt{D} or less, and so, if \mathcal{P}(\alpha) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

\rho(D \to 0; \bar{\alpha}) \approx \mathcal{P}(\bar{\alpha})\, \frac{(2\pi)^{K/2}}{\Gamma(K/2)}\, (\det \mathcal{F})^{-1/2}\, D^{(K-2)/2}.   (4.42)

Here, as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

Z(\bar{\alpha}; N \to \infty) \approx f(\bar{\alpha}) \cdot \frac{\Gamma(K/2)}{N^{K/2}},   (4.43)

where f(\bar{\alpha}) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \log_2 Z(\bar{\alpha}; N)   (4.44)
  \to \frac{K}{2} \log_2 N + \cdots \ \text{(bits)},   (4.45)

where · · · are finite as N → ∞. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even of the prior distribution P(α) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
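For readers who want the intermediate step from equations 4.37 and 4.42 to equations 4.43 and 4.45 spelled out, here is one way to do it (this worked step is ours, not reproduced from the paper); extending the upper limit of the D integral to infinity introduces only errors that are exponentially small in N:

% Substituting the small-D density (4.42) into the partition function (4.37):
\begin{align}
Z(\bar{\alpha}; N)
 &\approx \frac{\mathcal{P}(\bar{\alpha})\,(2\pi)^{K/2}}{\Gamma(K/2)\,\sqrt{\det\mathcal{F}}}
    \int_0^\infty dD\, D^{(K-2)/2}\, e^{-N D} \nonumber \\
 &= \frac{\mathcal{P}(\bar{\alpha})\,(2\pi)^{K/2}}{\sqrt{\det\mathcal{F}}}\; N^{-K/2},
\qquad \text{using } \int_0^\infty dD\, D^{(K-2)/2} e^{-ND} = \Gamma(K/2)\, N^{-K/2},
\end{align}
% so that  -\log_2 Z(\bar{\alpha}; N) = (K/2)\log_2 N + O(1),  which gives equation 4.45.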

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension, d_VC), which measures the capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of \mathcal{F} is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < ∞ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small D_KL behavior of the model density,

\rho(D \to 0; \bar{\alpha}) \propto D^{(d-2)/2},   (4.46)

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density ρ(D; \bar{\alpha}) accumulates a macroscopic weight at some nonzero value of D, so that the small D behavior is irrelevant for the large N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_pred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N ≫ d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution Q(\vec{x}_1, ..., \vec{x}_N | α). Again, α is an unknown parameter vector chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log_2 N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S[\{\vec{x}_i\} \,|\, \alpha] \equiv -\int d^N\vec{x}\, Q(\{\vec{x}_i\} | \alpha) \log_2 Q(\{\vec{x}_i\} | \alpha) \to N S_0 + S^*_0; \quad S^*_0 = O(1),   (4.47)

D_{KL}\left[ Q(\{\vec{x}_i\} | \bar{\alpha}) \,\|\, Q(\{\vec{x}_i\} | \alpha) \right] \to N D_{KL}(\bar{\alpha} \| \alpha) + o(N).   (4.48)

Here the quantities S_0, S^*_0, and D_KL(\bar{\alpha} \| \alpha) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data; all of the long-range predictability is associated with learning the parameters.8 The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result:

S(N) = S_0 \cdot N + S_1^{(a)}(N), \qquad (4.49)

S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \ \text{(bits)}, \qquad (4.50)

where · · · stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.
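The logarithmic growth in equations 4.45 and 4.50 can be checked exactly in the simplest setting, a check we add here for illustration: coin flips with an unknown bias and a uniform prior. Every sequence with n heads out of N then has marginal probability 1/[(N + 1) C(N, n)], so both S(N) and the extensive rate are available in closed form, and their difference should grow as (1/2) log_2 N since K = 1. A minimal sketch:

```python
import numpy as np
from scipy.special import gammaln

def joint_entropy_bits(N):
    # S(N) for N flips of a coin with unknown bias and a uniform (Beta(1,1)) prior:
    # each sequence with n heads has probability 1 / ((N + 1) * C(N, n)).
    n = np.arange(N + 1)
    log_binom = gammaln(N + 1) - gammaln(n + 1) - gammaln(N - n + 1)
    log_p = -np.log(N + 1) - log_binom                  # ln P(one such sequence)
    return -np.sum(np.exp(log_binom + log_p) * log_p) / np.log(2)

# Extensive rate: average binary entropy of the bias over the uniform prior,
# which is exactly 1 / (2 ln 2) bits per sample.
S0 = 1.0 / (2.0 * np.log(2))

for N in [10, 100, 1000, 10000]:
    S1 = joint_entropy_bits(N) - N * S0
    print(N, S1, 0.5 * np.log2(N))   # S1 tracks (K/2) log2 N with K = 1, up to O(1)
```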


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(x_1, ..., x_N | α). Suppose, for example, that S_0^* = (K^*/2) log_2 N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \ \text{(bits)}. \qquad (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K ≠ 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.⁹

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar{\alpha}\; P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j \mid \bar{\alpha}) \right] \log_2 \left[ \prod_{i=1}^{N} Q(\vec{x}_i \mid \bar{\alpha}) \int d^K\alpha\; P(\alpha)\, e^{-N D_{KL}(\bar{\alpha} \| \alpha) + N\psi(\alpha, \bar{\alpha}; \{\vec{x}_i\})} \right]. \qquad (4.52)

This expression can be decomposed into the terms identified above plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N), \qquad (4.53)

S_1^{(f)} = -\int d^K\bar{\alpha}\; P(\bar{\alpha}) \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j \mid \bar{\alpha}) \right] \log_2 \left[ \frac{1}{Z(\bar{\alpha}; N)} \int d^K\alpha\; P(\alpha)\, e^{-N D_{KL}(\bar{\alpha} \| \alpha) + N\psi(\alpha, \bar{\alpha}; \{\vec{x}_i\})} \right], \qquad (4.54)

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of −S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S_1^{(f)} is negative. This last statement corresponds to the fact that, for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) \le N \int d^K\alpha \int d^K\bar{\alpha}\; P(\alpha)\, P(\bar{\alpha})\, D_{KL}(\bar{\alpha} \| \alpha) \equiv \langle N D_{KL}(\bar{\alpha} \| \alpha) \rangle_{\bar{\alpha}, \alpha}, \qquad (4.55)

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence, or of similar quantities, is a usual constraint in learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱ,α} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(x | α)} is finite. Then for any reasonable function F(x; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P \left\{ \sup_{\beta} \frac{ \left| \frac{1}{N} \sum_j F(\vec{x}_j; \beta) - \int d\vec{x}\, Q(\vec{x} \mid \bar{\alpha})\, F(\vec{x}; \beta) \right| }{ L[F] } > \epsilon \right\} < M(\epsilon, N, d_{VC})\, e^{-c N \epsilon^2}, \qquad (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality, in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {x_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus the worst-case contribution of these atypical sequences is

S_1^{(f),\,\text{atypical}} \sim \int_{\delta}^{\infty} d\epsilon\; N^2 \epsilon^2\, M(\epsilon)\, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \quad \text{for large } N. \qquad (4.57)


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\,\text{typical}} \sim \int d^K\bar{\alpha}\; P(\bar{\alpha})\, d^N\vec{x} \prod_j Q(\vec{x}_j \mid \bar{\alpha})\; N \left( B\, A^{-1} B \right);

B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i \mid \bar{\alpha})}{\partial \bar{\alpha}_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0;

(A)_{\mu\nu} = \frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i \mid \bar{\alpha})}{\partial \bar{\alpha}_\mu\, \partial \bar{\alpha}_\nu}, \qquad \langle A \rangle_{\vec{x}} = F. \qquad (4.58)

Here ⟨· · ·⟩_x means an averaging with respect to all the x_i's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A⁻¹ with F⁻¹(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i \mid \bar{\alpha})}{\partial \bar{\alpha}_\mu}\; (F^{-1})_{\mu\nu}\; \frac{\partial \log Q(\vec{x}_j \mid \bar{\alpha})}{\partial \bar{\alpha}_\nu} \qquad (4.59)

over all the x_i's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N⁻¹. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ · · · can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x \mid \alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x - \alpha)^2 \right], \qquad x \in (-\infty, +\infty); \qquad (4.60)

Q(x \mid \alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \qquad x \in [0, 2\pi). \qquad (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < α_max and zero elsewhere, with α_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value for which almost all data points are at the crests of −sin αx. Adding a new data point would not help until that best, but wrong, parameter estimate is less than α_max.¹⁰ So the fluctuations are large, and the predictive information is

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for α_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
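This failure mode is easy to see numerically; the following sketch is our own illustration, with the box size α_max = 200, the grid resolution, and the sample sizes chosen arbitrarily. It draws samples from the model of equation 4.61 at α = 1 and scans the likelihood over the box: with only a handful of samples the maximum typically sits at a large, spurious value of α whose crests happen to pass through the data, and only with many more samples does it settle near the true parameter.

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

rng = np.random.default_rng(2)
alpha_true, alpha_max = 1.0, 200.0

def sample(N):
    # Rejection sampling from Q(x|alpha_true) ~ exp(-sin(alpha_true * x)) on [0, 2*pi).
    xs = np.empty(0)
    while xs.size < N:
        x = rng.uniform(0.0, 2.0 * np.pi, size=8 * N)
        keep = rng.uniform(size=x.size) < np.exp(-np.sin(alpha_true * x)) / np.e
        xs = np.concatenate([xs, x[keep]])
    return xs[:N]

# Z(alpha) = (1/alpha) * integral_0^{2 pi alpha} exp(-sin u) du, tabulated once.
u = np.arange(0.0, 2.0 * np.pi * alpha_max + 1e-3, 1e-3)
G = cumulative_trapezoid(np.exp(-np.sin(u)), u, initial=0.0)
grid = np.arange(0.01, alpha_max, 0.01)
logZ = np.log(np.interp(2.0 * np.pi * grid, u, G) / grid)

for N in [3, 300]:
    x = sample(N)
    loglike = -np.sin(np.outer(grid, x)).mean(axis=1) - logZ
    print(N, grid[np.argmax(loglike)])   # N = 3: usually far from 1; N = 300: close to 1
```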

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which


the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^{\mu}} \right], \qquad \mu > 0, \qquad (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar{\alpha}; N) = \int dD\; \rho(D; \bar{\alpha}) \exp[-ND]

\approx A(\bar{\alpha}) \int dD \exp\left[ -\frac{B(\bar{\alpha})}{D^{\mu}} - ND \right]

\approx \tilde{A}(\bar{\alpha}) \exp\left[ -\frac{1}{2}\, \frac{\mu + 2}{\mu + 1} \ln N - C(\bar{\alpha})\, N^{\mu/(\mu+1)} \right], \qquad (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu + 1} \right]^{1/2} \cdot [B(\bar{\alpha})]^{1/(2\mu+2)}, \qquad (4.64)

C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right]. \qquad (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N:

S_1^{(a)}(N) \approx \frac{1}{\ln 2}\, \langle C(\bar{\alpha}) \rangle_{\bar{\alpha}}\, N^{\mu/(\mu+1)} \ \text{(bits)}, \qquad (4.66)

where ⟨· · ·⟩_ᾱ denotes an average over all the target models.
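The N^{μ/(μ+1)} growth, and the value of the coefficient in equation 4.65, can be checked by direct numerical integration of the partition function for a density with an essential singularity. A minimal sketch, with A = B = 1 and μ = 1 chosen arbitrarily for illustration:

```python
import numpy as np
from scipy.integrate import quad

mu, B = 1.0, 1.0
C = B**(1.0 / (mu + 1)) * (mu**(-mu / (mu + 1)) + mu**(1.0 / (mu + 1)))   # equation 4.65

def minus_lnZ(N):
    # Z(N) = int dD exp(-B / D**mu - N * D), with the peak value factored out
    # so the quadrature stays well conditioned at large N.
    Dstar = (mu * B / N)**(1.0 / (mu + 1))            # saddle point of the exponent
    f = lambda D: B / D**mu + N * D
    val, _ = quad(lambda D: np.exp(-(f(D) - f(Dstar))), Dstar / 100, 100 * Dstar,
                  points=[Dstar], limit=200)
    return f(Dstar) - np.log(val)

for N in [10**2, 10**3, 10**4]:
    print(N, minus_lnZ(N), C * N**(mu / (mu + 1)))   # ratio tends slowly to 1
```

The slowly vanishing relative difference is mostly the (1/2)(μ + 2)/(μ + 1) ln N correction retained in equation 4.63.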


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/2N bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^{ν−1} bits. Summing this up over all samples, we find S_1^{(a)} ∼ N^ν, and if we let ν = μ/(μ + 1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ + 1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with an increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
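The summation just described is easy to reproduce directly. The sketch below (our illustration, with arbitrary exponents ν) accumulates the per-sample contributions ∼ K(n)/(2n ln 2) bits with K(n) = n^ν and confirms that the total scales as N^ν rather than N^ν log N:

```python
import numpy as np

def subextensive_sum(N, nu):
    # Sum of per-sample contributions K(n) / (2 n ln 2) bits with K(n) = n**nu.
    n = np.arange(1, N + 1)
    return np.sum(n**nu / (2.0 * n * np.log(2)))

for nu in [0.3, 0.5, 0.7]:
    ratios = [subextensive_sum(N, nu) / N**nu for N in (10**3, 10**4, 10**5, 10**6)]
    print(nu, np.round(ratios, 3))   # the ratio settles to a constant: S1 ~ N**nu
```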

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar


to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/ℓ_0) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial \phi}{\partial x} \right)^2 \right] \delta\left[ \frac{1}{\ell_0} \int dx\, e^{-\phi(x)} - 1 \right], \qquad (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the locations of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/ℓ_0) exp[−φ(x)], we can write the KL divergence as

D_{KL}[\bar{\phi}(x) \| \phi(x)] = \frac{1}{\ell_0} \int dx\, \exp[-\bar{\phi}(x)]\, [\phi(x) - \bar{\phi}(x)]. \qquad (4.68)

We want to compute the density

\rho(D; \bar{\phi}) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left( D - D_{KL}[\bar{\phi}(x) \| \phi(x)] \right) \qquad (4.69)

= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left( MD - M D_{KL}[\bar{\phi}(x) \| \phi(x)] \right), \qquad (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar{\phi}) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)] \exp(-izM D_{KL}) \qquad (4.71)

= M \int \frac{dz}{2\pi} \exp\!\left( izMD + \frac{izM}{\ell_0} \int dx\, \bar{\phi}(x) \exp[-\bar{\phi}(x)] \right)

\times \int [d\phi(x)]\, P[\phi(x)] \exp\!\left( -\frac{izM}{\ell_0} \int dx\, \phi(x) \exp[-\bar{\phi}(x)] \right). \qquad (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)] \exp\!\left( -\frac{izM}{\ell_0} \int dx\, \phi(x) \exp[-\bar{\phi}(x)] \right)

= \exp\!\left( -\frac{izM}{\ell_0} \int dx\, \bar{\phi}(x) \exp[-\bar{\phi}(x)] - S_{\rm eff}[\bar{\phi}(x); zM] \right), \qquad (4.73)

S_{\rm eff}[\bar{\phi}; zM] = \frac{\ell}{2} \int dx \left( \frac{\partial \bar{\phi}}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{\ell \ell_0} \right)^{1/2} \int dx\, \exp[-\bar{\phi}(x)/2]. \qquad (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} ≫ 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar{\phi}) = A[\bar{\phi}(x)]\, D^{-3/2} \exp\!\left( -\frac{B[\bar{\phi}(x)]}{D} \right), \qquad (4.75)

A[\bar{\phi}(x)] = \frac{1}{\sqrt{16\pi\, \ell \ell_0}} \exp\!\left[ -\frac{\ell}{2} \int dx\, (\partial_x \bar{\phi})^2 \right] \int dx\, \exp[-\bar{\phi}(x)/2], \qquad (4.76)

B[\bar{\phi}(x)] = \frac{1}{16\, \ell \ell_0} \left( \int dx\, \exp[-\bar{\phi}(x)/2] \right)^2. \qquad (4.77)

Except for the factor of D^{−3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{−3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \frac{1}{\sqrt{\ell \ell_0}} \left\langle \int dx\, \exp[-\bar{\phi}(x)/2] \right\rangle_{\bar{\phi}} N^{1/2}, \qquad (4.78)

where ⟨· · ·⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \sqrt{N} \left( \frac{L}{\ell} \right)^{1/2} \ \text{bits}. \qquad (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(ℓ/(N Q(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
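The bin-counting interpretation can be made concrete with a few lines of code (our illustration; the values of L and ℓ are arbitrary). Counting bins of local size √(ℓ/(N Q(x))) amounts to integrating √(N Q(x)/ℓ) over the range of x, which for the uniform distribution reproduces the √(NL/ℓ) scaling of equation 4.79 up to the prefactor:

```python
import numpy as np
from scipy.integrate import trapezoid

def n_bins(Q, L, N, ell, n_grid=10**5):
    # Number of adaptive bins of local size sqrt(ell / (N * Q(x))) on [0, L].
    x = np.linspace(0.0, L, n_grid)
    return trapezoid(np.sqrt(N * Q(x) / ell), x)

L, ell = 1.0, 0.01
Q_uniform = lambda x: np.ones_like(x) / L
for N in [10**2, 10**4, 10**6]:
    print(N, n_bins(Q_uniform, L, N, ell), np.sqrt(N * L / ell))   # the two agree
```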

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities, n ∼ √N, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn); but real calculations for these cases remain challenging.

5 I_pred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹² Rissanen (1984) and Clarke and Barron (1990), in full

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our


approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on


a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure the complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.¹³ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.¹⁴ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;¹⁵ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ, the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.¹⁶ So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number 𝒩(N) of walks with N steps, we find that 𝒩(N) ∼ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) → γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\; P[x(t)] \log_2 P[x(t)], \qquad (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\; P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right), \qquad (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T), plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change the constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures: the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have

Predictability Complexity and Learning 2453

seen that this component is connected to learning in a straightforward wayquantifying the amount that can be learned about dynamics that generatethe signal and to measures of complexity that have arisen in statistics anddynamical systems theory

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).
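As a minimal numerical illustration of this plane (not part of the original article; the joint distribution and the restriction to deterministic clusterings are assumptions made only for this sketch), one can enumerate all hard clusterings x̂_past of a small past alphabet and compute the two mutual informations for each; the upper envelope of the resulting points lies on or below the optimal curve, which in general requires soft assignments.

import itertools
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits for a joint distribution given as a 2D array."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

# A made-up joint distribution P(x_past, x_future) over 4 past and 3 future symbols.
rng = np.random.default_rng(0)
p_joint = rng.dirichlet(np.ones(12)).reshape(4, 3)

def partitions(items):
    """Enumerate all set partitions (deterministic clusterings) of a small list."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

points = []
for part in partitions(list(range(4))):
    # Map each past symbol to its cluster, then build P(x_hat, x_future).
    label = {x: k for k, cluster in enumerate(part) for x in cluster}
    p_hat = np.zeros((len(part), 3))
    for x in range(4):
        p_hat[label[x]] += p_joint[x]
    # For a deterministic clustering, I(x_hat; x_past) = H(x_hat).
    q = p_hat.sum(axis=1)
    i_past = float(-(q[q > 0] * np.log2(q[q > 0])).sum())
    i_future = mutual_information(p_hat)
    points.append((i_past, i_future))

for i_past, i_future in sorted(points):
    print(f"I(xhat;past) = {i_past:.3f} bits   I(xhat;future) = {i_future:.3f} bits")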

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much


It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment?

remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997), and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ∼ A N^ν and one with K parameters, so that I_pred(N) ∼ (K/2) log₂ N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \sim \left[ \frac{K}{2 A \nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}.   (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
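To get a feel for these numbers, here is a small calculation of the crossover scale of equation 6.1 (our illustration only; K and N are the values quoted in the text, while the prefactors A and exponents ν are arbitrary assumptions):

import numpy as np

def n_crossover(K, A, nu):
    """Crossover sample size of equation 6.1: Nc ~ [K/(2*A*nu) * log2(K/(2*A))]**(1/nu)."""
    return (K / (2.0 * A * nu) * np.log2(K / (2.0 * A))) ** (1.0 / nu)

K = 5e9   # synapses in ~10 mm^2 of inferotemporal cortex (value from the text)
N = 1e9   # rough number of fixations in ~10 years of waking life (value from the text)

for A in (1.0, 10.0, 100.0):        # assumed power-law prefactors
    for nu in (0.25, 0.5, 0.9):     # assumed exponents, nu < 1
        Nc = n_crossover(K, A, nu)
        regime = "N < Nc" if N < Nc else "N > Nc"
        print(f"A={A:7.1f}  nu={nu:.2f}  Nc={Nc:.2e}  ({regime})")

For every combination tried here the crossover N_c comes out far above 10⁹, consistent with the statement that the biologically relevant regime is N < N_c.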

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy ℓ(N) (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S₀. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S₁(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
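The conversion from guesses to an entropy bound is easy to sketch (a minimal illustration with made-up guess counts, not Shannon's data): each letter is replaced by the number of guesses the reader needed, and the entropy of the distribution of these numbers bounds ℓ(N) from above.

from collections import Counter
import math

def guess_entropy_bound(guess_counts):
    """Upper bound on the conditional entropy (bits/letter) from a list of
    'number of guesses needed' values, one value per letter shown to the reader."""
    freq = Counter(guess_counts)
    total = len(guess_counts)
    return -sum((n / total) * math.log2(n / total) for n in freq.values())

# Hypothetical guess counts for letters following N-letter contexts:
# most letters are guessed on the first try, a few need many guesses.
guesses = [1] * 70 + [2] * 15 + [3] * 8 + [5] * 4 + [10] * 3
print(f"ell(N) <= {guess_entropy_bound(guesses):.2f} bits per letter")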

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints this is a primary reference; for published papers there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.

Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.

Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy, and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.

Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeny, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.

Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.

Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



power law or nonparametric class of processes may be crucial in real-world learning tasks, where the effective number of parameters becomes so large that asymptotic results for finitely parameterizable models are inaccessible in practice. There is empirical evidence that simple physical systems can generate dynamics in this complexity class, and there are hints that language also may fall in this class.

Finally, we argue that the divergent components of the predictive information provide a unique measure of complexity that is consistent with certain simple requirements. This argument is in the spirit of Shannon's original derivation of entropy as the unique measure of available information. We believe that this uniqueness argument provides a conclusive answer to the question of how one should quantify the complexity of a process generating a time series.

With the evident cost of lengthening our discussion, we have tried to give a self-contained presentation that develops our point of view, uses simple examples to connect with known results, and then generalizes and goes beyond these results.2 Even in cases where at least the qualitative form of our results is known from previous work, we believe that our point of view elucidates some issues that may have been less the focus of earlier studies. Last but not least, we explore the possibilities for connecting our theoretical discussion with the experimental characterization of learning and complexity in neural systems.

2 A Curious Observation

Before starting the systematic analysis of the problem, we want to motivate our discussion further by presenting results of some simple numerical experiments. Since most of the article draws examples from learning, here we consider examples from equilibrium statistical mechanics. Suppose that we have a one-dimensional chain of Ising spins with the Hamiltonian given by

H = -\sum_{i,j} J_{ij}\, s_i s_j,   (2.1)

where the matrix of interactions J_ij is not restricted to nearest neighbors; long-range interactions are also allowed. One may identify spins pointing upward with 1 and downward with 0, and then a spin chain is equivalent to some sequence of binary digits. This sequence consists of (overlapping) words of N digits each, W_k, k = 0, 1, ..., 2^N - 1. There are 2^N such words total, and they appear with very different frequencies n(W_k) in the spin chain (see Figure 1 for details).

2 Some of the basic ideas presented here, together with some connections to earlier work, can be found in brief preliminary reports (Bialek, 1995; Bialek & Tishby, 1999). The central results of this work, however, were at best conjectures in these preliminary accounts.

Figure 1: Calculating entropy of words of length 4 in a chain of 17 spins. For this chain, n(W_0) = n(W_1) = n(W_3) = n(W_7) = n(W_{12}) = n(W_{14}) = 2, n(W_8) = n(W_9) = 1, and all other frequencies are zero; thus S(4) ≈ 2.95 bits.

If the number of spins is large, then counting these frequencies provides a good empirical estimate of P_N(W_k), the probability distribution of different words of length N. Then one can calculate the entropy S(N) of this probability distribution by the usual formula:

S(N) = -\sum_{k=0}^{2^N - 1} P_N(W_k)\, \log_2 P_N(W_k) \quad \text{(bits)}.   (2.2)

Note that this is not the entropy of a finite chain with length N; instead, it is the entropy of words or strings with length N drawn from a much longer chain. Nonetheless, since entropy is an extensive property, S(N) is proportional asymptotically to N for any spin chain, that is, S(N) ≈ S₀ · N. The usual goal in statistical mechanics is to understand this "thermodynamic limit" N → ∞, and hence to calculate the entropy density S₀. Different sets of interactions J_ij result in different values of S₀, but the qualitative result S(N) ∝ N is true for all reasonable {J_ij}.
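The counting procedure behind equation 2.2 and Figure 1 can be sketched in a few lines (our code, not the authors'); it estimates S(N) by counting overlapping N-letter words in a binary sequence. For the 17-spin chain of Figure 1 this procedure gives S(4) ≈ 2.95 bits; the example below uses an uncorrelated random chain, for which S(N) ≈ N.

import numpy as np

def word_entropy(spins, N):
    """Estimate S(N) of equation 2.2: the entropy (in bits) of overlapping
    length-N words in a 1D array of 0/1 spins."""
    powers = 2 ** np.arange(N)
    # Each overlapping word is encoded as a unique integer in [0, 2**N).
    words = np.convolve(spins, powers, mode="valid")
    _, counts = np.unique(words, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
chain = rng.integers(0, 2, size=100_000)   # uncorrelated spins, for illustration only
print([round(word_entropy(chain, N), 2) for N in range(1, 8)])   # close to N bits each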

We investigated three different spin chains of 1 billion spins each. As usual in statistical mechanics, the probability of any configuration of spins {s_i} is given by the Boltzmann distribution,

P[\{s_i\}] \propto \exp(-H / k_B T),   (2.3)

where, to normalize the scale of the J_ij, we set k_B T = 1. For the first chain, only J_{i,i+1} = 1 was nonzero, and its value was the same for all i. The second chain was also generated using the nearest-neighbor interactions, but the value of the coupling was reset every 400,000 spins by taking a random number from a gaussian distribution with zero mean and unit variance. In the third case, we again reset the interactions at the same frequency, but now interactions were long-ranged; the variances of coupling constants decreased with the distance between the spins as ⟨J²_{ij}⟩ = 1/(i - j)².

We plot S(N) for all these cases in Figure 2, and of course the asymptotically linear behavior is evident: the extensive entropy shows no qualitative distinction among the three cases we consider.

Figure 2: Entropy as a function of the word length for spin chains with different interactions. Notice that all lines start from S(N) = log₂ 2 = 1, since at the values of the coupling we investigated, the correlation length is much smaller than the chain length (1 · 10⁹ spins).

The situation changes drastically if we remove the asymptotic linear contribution and focus on the corrections to extensive behavior. Specifically, we write S(N) = S₀ · N + S₁(N) and plot only the sublinear component S₁(N) of the entropy. As we see in Figure 3, the three chains then exhibit qualitatively different features: for the first one, S₁ is constant; for the second one, it is logarithmic; and for the third one, it clearly shows a power-law behavior.
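A scaled-down version of this numerical experiment can be sketched as follows (our reconstruction, with far fewer than 10⁹ spins and a crude estimate of S₀): for the nearest-neighbor chain at k_B T = 1, spins can be drawn sequentially because P(s_{i+1} | s_i) ∝ exp(J s_i s_{i+1}); the extensive part is then estimated from the large-N increments of S(N) and subtracted. The sketch reuses the word_entropy function given after equation 2.2.

import numpy as np

def sample_nn_ising(n_spins, J, rng, reset_every=None):
    """Draw a nearest-neighbor Ising chain at k_B T = 1 by sampling spins
    sequentially; if reset_every is set, J is redrawn from a unit gaussian
    at that interval (the 'variable J' chain described in the text)."""
    s = np.empty(n_spins, dtype=int)
    s[0] = 1
    j = J
    for i in range(1, n_spins):
        if reset_every and i % reset_every == 0:
            j = rng.normal()
        # P(s_i = +1 | s_{i-1}) from the Boltzmann weight exp(J * s_{i-1} * s_i)
        p_up = 1.0 / (1.0 + np.exp(-2.0 * j * s[i - 1]))
        s[i] = 1 if rng.random() < p_up else -1
    return (s + 1) // 2   # map from +/-1 to 1/0 for word counting

rng = np.random.default_rng(2)
chain = sample_nn_ising(500_000, J=1.0, rng=rng)
Ns = np.arange(1, 16)
S = np.array([word_entropy(chain, int(N)) for N in Ns])   # word_entropy: see equation 2.2 sketch
S0 = S[-1] - S[-2]        # rough estimate of the entropy density from the largest N
S1 = S - S0 * Ns          # subextensive part; approximately constant for fixed J
print(np.round(S1, 3))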

What is the significance of these observations? Of course, the differences in the behavior of S₁(N) must be related to the ways we chose J's for the simulations. In the first case, J is fixed, and if we see N spins and try to predict the state of the N + 1st spin, all that really matters is the state of the spin N; there is nothing to "learn" from observations on longer segments of the chain. For the second chain, J changes, and the statistics of the spin words are different in different parts of the sequence. By looking at these statistics, one can "learn" the coupling at the current position; this estimate improves the more spins (longer words) we observe. Finally, in the third case, there are many coupling constants that can be learned. As N increases, one becomes sensitive to weaker correlations caused by interactions over larger and larger distances. So, intuitively, the qualitatively different behaviors of S₁(N) in the three plotted cases are correlated with differences in the problem of learning the underlying dynamics of the spin chains from observations on samples of the spins themselves. Much of this article can be seen as expanding on and quantifying this intuitive observation.

Figure 3: Subextensive part of the entropy as a function of the word length, for the same three chains as in Figure 2, together with fits: S₁ = const for fixed J; S₁ = const₁ + (1/2) log N for variable J with short-range interactions; S₁ = const₁ + const₂ · N^{0.5} for variable J's with long-range decaying interactions.

3 Fundamentals

The problem of prediction comes in various forms, as noted in the introduction. Information theory allows us to treat the different notions of prediction on the same footing. The first step is to recognize that all predictions are probabilistic. Even if we can predict the temperature at noon tomorrow, we should provide error bars or confidence limits on our prediction. The next step is to remember that even before we look at the data, we know that certain futures are more likely than others, and we can summarize this knowledge by a prior probability distribution for the future. Our observations on the past lead us to a new, more tightly concentrated distribution, the distribution of futures conditional on the past data. Different kinds of predictions are different slices through or averages over this conditional distribution, but information theory quantifies the "concentration" of the distribution without making any commitment as to which averages will be most interesting.


Imagine that we observe a stream of data x(t) over a time interval -T < t < 0. Let all of these past data be denoted by the shorthand x_past. We are interested in saying something about the future, so we want to know about the data x(t) that will be observed in the time interval 0 < t < T'; let these future data be called x_future. In the absence of any other knowledge, futures are drawn from the probability distribution P(x_future), and observations of particular past data x_past tell us that futures will be drawn from the conditional distribution P(x_future | x_past). The greater concentration of the conditional distribution can be quantified by the fact that it has smaller entropy than the prior distribution, and this reduction in entropy is Shannon's definition of the information that the past provides about the future. We can write the average of this predictive information as

I_{\rm pred}(T, T') = \left\langle \log_2 \left[ \frac{P(x_{\rm future} \mid x_{\rm past})}{P(x_{\rm future})} \right] \right\rangle   (3.1)
= -\langle \log_2 P(x_{\rm future}) \rangle - \langle \log_2 P(x_{\rm past}) \rangle - \left[ -\langle \log_2 P(x_{\rm future}, x_{\rm past}) \rangle \right],   (3.2)

where ⟨· · ·⟩ denotes an average over the joint distribution of the past and the future, P(x_future, x_past).

Each of the terms in equation 3.2 is an entropy. Since we are interested in predictability or generalization, which are associated with some features of the signal persisting forever, we may assume stationarity or invariance under time translations. Then the entropy of the past data depends only on the duration of our observations, so we can write -⟨log₂ P(x_past)⟩ = S(T), and by the same argument -⟨log₂ P(x_future)⟩ = S(T'). Finally, the entropy of the past and the future taken together is the entropy of observations on a window of duration T + T', so that -⟨log₂ P(x_future, x_past)⟩ = S(T + T'). Putting these equations together, we obtain

I_{\rm pred}(T, T') = S(T) + S(T') - S(T + T').   (3.3)

It is important to recall that mutual information is a symmetric quantity. Thus, we can view I_pred(T, T') as either the information that a data segment of duration T provides about the future of length T' or the information that a data segment of duration T' provides about the immediate past of duration T. This is a direct application of the definition of information but seems counterintuitive. Shouldn't it be more difficult to predict than to postdict? One can perhaps recover the correct intuition by thinking about a large ensemble of question-answer pairs. Prediction corresponds to generating the answer to a given question, while postdiction corresponds to generating the question that goes with a given answer. We know that guessing questions given answers is also a hard problem,3

3 This is the basis of the popular American television game show Jeopardy, widely viewed as the most "intellectual" of its genre.


and can be just as challenging as the more conventional problem of answering the questions themselves. Our focus here is on prediction because we want to make connections with the phenomenon of generalization in learning, but it is clear that generating a consistent interpretation of observed data may involve elements of both prediction and postdiction (see, for example, Eagleman & Sejnowski, 2000); it is attractive that the information-theoretic formulation treats these problems symmetrically.

In the same way that the entropy of a gas at fixed density is proportional to the volume, the entropy of a time series (asymptotically) is proportional to its duration, so that lim_{T→∞} S(T)/T = S₀; entropy is an extensive quantity. But from equation 3.3, any extensive component of the entropy cancels in the computation of the predictive information: predictability is a deviation from extensivity. If we write S(T) = S₀ T + S₁(T), then equation 3.3 tells us that the predictive information is related only to the nonextensive term S₁(T). Note that if we are observing a deterministic system, then S₀ = 0, but this is independent of questions about the structure of the subextensive term S₁(T). It is attractive that information theory gives us a unified discussion of prediction in deterministic and probabilistic cases.

We know two general facts about the behavior of S₁(T). First, the corrections to extensive behavior are positive, S₁(T) ≥ 0. Second, the statement that entropy is extensive is the statement that the limit

\lim_{T \to \infty} \frac{S(T)}{T} = S_0   (3.4)

exists, and for this to be true we must also have

\lim_{T \to \infty} \frac{S_1(T)}{T} = 0.   (3.5)

Thus, the nonextensive terms in the entropy must be subextensive, that is, they must grow with T less rapidly than a linear function. Taken together, these facts guarantee that the predictive information is positive and subextensive. Furthermore, if we let the future extend forward for a very long time, T' → ∞, then we can measure the information that our sample provides about the entire future,

I_{\rm pred}(T) = \lim_{T' \to \infty} I_{\rm pred}(T, T') = S_1(T).   (3.6)

Similarly, instead of increasing the duration of the future to infinity, we could have considered the mutual information between a sample of length T and all of the infinite past. Then the postdictive information also is equal to S₁(T), and the symmetry between prediction and postdiction is even more profound: not only is there symmetry between questions and answers, but observations on a given period of time provide the same amount of information about the historical path that led to our observations as about the future that will unfold from them.


In some cases, this statement becomes even stronger. For example, if the subextensive entropy of a long, discontinuous observation of a total length T with a gap of a duration δT ≪ T is equal to S₁(T) + O(δT/T),

then the subextensive entropy of the present is not only its information about the past or the future, but also the information about the past and the future.

If we have been observing a time series for a (long) time T, then the total amount of data we have collected is measured by the entropy S(T), and at large T this is given approximately by S₀ T. But the predictive information that we have gathered cannot grow linearly with time, even if we are making predictions about a future that stretches out to infinity. As a result, of the total information we have taken in by observing x_past, only a vanishing fraction is of relevance to the prediction:

\lim_{T \to \infty} \frac{\text{Predictive information}}{\text{Total information}} = \frac{I_{\rm pred}(T)}{S(T)} \to 0.   (3.7)

In this precise sense, most of what we observe is irrelevant to the problem of predicting the future.4

Consider the case where time is measured in discrete steps, so that we have seen N time points x₁, x₂, ..., x_N. How much have we learned about the underlying pattern in these data? The more we know, the more effectively we can predict the next data point x_{N+1}, and hence the fewer bits we will need to describe the deviation of this data point from our prediction. Our accumulated knowledge about the time series is measured by the degree to which we can compress the description of new observations. On average, the length of the code word required to describe the point x_{N+1}, given that we have seen the previous N points, is given by

\ell(N) = -\langle \log_2 P(x_{N+1} \mid x_1, x_2, \ldots, x_N) \rangle \ \ \text{bits},   (3.8)

where the expectation value is taken over the joint distribution of all the N + 1 points, P(x₁, x₂, ..., x_N, x_{N+1}). It is easy to see that

\ell(N) = S(N + 1) - S(N) \approx \frac{\partial S(N)}{\partial N}.   (3.9)

As we observe for longer times, we learn more, and this word length decreases. It is natural to define a learning curve that measures this improvement.

4 We can think of equation 3.7 as a law of diminishing returns. Although we collect data in proportion to our observation time T, a smaller and smaller fraction of this information is useful in the problem of prediction. These diminishing returns are not due to a limited lifetime, since we calculate the predictive information assuming that we have a future extending forward to infinity. A senior colleague points out that this is an argument for changing fields before becoming too expert.


Usually we define learning curves by measuring the frequency or costs of errors; here the cost is that our encoding of the point x_{N+1} is longer than it could be if we had perfect knowledge. This ideal encoding has a length that we can find by imagining that we observe the time series for an infinitely long time, ℓ_ideal = lim_{N→∞} ℓ(N), but this is just another way of defining the extensive component of the entropy S₀. Thus we can define a learning curve

\Lambda(N) \equiv \ell(N) - \ell_{\rm ideal}   (3.10)
= S(N + 1) - S(N) - S_0
= S_1(N + 1) - S_1(N)
\approx \frac{\partial S_1(N)}{\partial N} = \frac{\partial I_{\rm pred}(N)}{\partial N},   (3.11)

and we see once again that the extensive component of the entropy cancels.

It is well known that the problems of prediction and compression are related, and what we have done here is to illustrate one aspect of this connection. Specifically, if we ask how much one segment of a time series can tell us about the future, the answer is contained in the subextensive behavior of the entropy. If we ask how much we are learning about the structure of the time series, then the natural and universally defined learning curve is related again to the subextensive entropy; the learning curve is the derivative of the predictive information.

This universal learning curve is connected to the more conventional learning curves in specific contexts. As an example (cf. section 4.1), consider fitting a set of data points {x_n, y_n} with some class of functions y = f(x; α), where α are unknown parameters that need to be learned; we also allow for some gaussian noise in our observation of the y_n.

Here the natural learning curve is the evolution of χ² for generalization as a function of the number of examples. Within the approximations discussed below, it is straightforward to show that as N becomes large,

\left\langle \chi^2(N) \right\rangle = \frac{1}{\sigma^2} \left\langle \left[ y - f(x; \alpha) \right]^2 \right\rangle \to (2 \ln 2)\, \Lambda(N) + 1,   (3.12)

where σ² is the variance of the noise.

Thus, a more conventional measure of performance at learning a function is equal to the universal learning curve defined purely by information-theoretic criteria. In other words, if a learning curve is measured in the right units, then its integral represents the amount of the useful information accumulated. Then the subextensivity of S₁ guarantees that the learning curve decreases to zero as N → ∞.

Different quantities related to the subextensive entropy have been discussed in several contexts. For example, the code length ℓ(N) has been defined as a learning curve in the specific case of neural networks (Opper & Haussler, 1995) and has been termed the "thermodynamic dive" (Crutchfield & Shalizi, 1999) and "Nth order block entropy" (Grassberger, 1986).


The universal learning curve Λ(N) has been studied as the expected instantaneous information gain by Haussler, Kearns, and Schapire (1994). Mutual information between all of the past and all of the future (both semi-infinite) is known also as the excess entropy, effective measure complexity, stored information, and so on (see Shalizi & Crutchfield, 1999, and references therein, as well as the discussion below). If the data allow a description by a model with a finite (and in some cases also infinite) number of parameters, then mutual information between the data and the parameters is of interest. This is easily shown to be equal to the predictive information about all of the future, and it is also the cumulative information gain (Haussler, Kearns, & Schapire, 1994) or the cumulative relative entropy risk (Haussler & Opper, 1997). Investigation of this problem can be traced back to Renyi (1964) and Ibragimov and Hasminskii (1972), and some particular topics are still being discussed (Haussler & Opper, 1995; Opper & Haussler, 1995; Herschkowitz & Nadal, 1999). In certain limits, decoding a signal from a population of N neurons can be thought of as "learning" a parameter from N observations, with a parallel notion of information transmission (Brunel & Nadal, 1998; Kang & Sompolinsky, 2001). In addition, the subextensive component of the description length (Rissanen, 1978, 1989, 1996; Clarke & Barron, 1990), averaged over a class of allowed models, also is similar to the predictive information. What is important is that the predictive information or subextensive entropy is related to all these quantities and that it can be defined for any process without a reference to a class of models. It is this universality that we find appealing, and this universality is strongest if we focus on the limit of long observation times. Qualitatively, in this regime (T → ∞) we expect the predictive information to behave in one of three different ways, as illustrated by the Ising models above: it may either stay finite or grow to infinity together with T; in the latter case, the rate of growth may be slow (logarithmic) or fast (sublinear power) (see Barron and Cover, 1991, for a similar classification in the framework of the minimal description length [MDL] analysis).

The first possibility, lim_{T→∞} I_pred(T) = constant, means that no matter how long we observe, we gain only a finite amount of information about the future. This situation prevails, for example, when the dynamics are too regular. For a purely periodic system, complete prediction is possible once we know the phase, and if we sample the data at discrete times, this is a finite amount of information; longer period orbits intuitively are more complex and also have larger I_pred, but this does not change the limiting behavior lim_{T→∞} I_pred(T) = constant.

Alternatively, the predictive information can be small when the dynamics are irregular, but the best predictions are controlled only by the immediate past, so that the correlation times of the observable data are finite (see, for example, Crutchfield & Feldman, 1997, and the fixed J case in Figure 3). Imagine, for example, that we observe x(t) at a series of discrete times {t_n}, and that at each time point we find the value x_n. Then we always can write


the joint distribution of the N data points as a product,

P(x_1, x_2, \ldots, x_N) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_2, x_1) \cdots.   (3.13)

For Markov processes, what we observe at t_n depends only on events at the previous time step t_{n-1}, so that

P(x_n \mid \{x_{1 \le i \le n-1}\}) = P(x_n \mid x_{n-1}),   (3.14)

and hence the predictive information reduces to

I_{\rm pred} = \left\langle \log_2 \left[ \frac{P(x_n \mid x_{n-1})}{P(x_n)} \right] \right\rangle.   (3.15)

The maximum possible predictive information in this case is the entropy of the distribution of states at one time step, which is bounded by the logarithm of the number of accessible states. To approach this bound, the system must maintain memory for a long time, since the predictive information is reduced by the entropy of the transition probabilities. Thus, systems with more states and longer memories have larger values of I_pred.
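For a concrete check (a two-state example of ours, not from the article), the predictive information of a first-order Markov chain can be computed both from equation 3.15 and from the block entropies of equation 3.3; the two numbers agree and are independent of T and T'.

import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# An arbitrary two-state Markov chain: rows of P(x_n | x_{n-1}) sum to 1.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Stationary distribution: left eigenvector of P with eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()

H1 = entropy_bits(pi)                                             # H(X_n)
Hc = float(sum(pi[i] * entropy_bits(P[i]) for i in range(2)))     # H(X_n | X_{n-1})

# Equation 3.15: I_pred = <log2 P(x_n|x_{n-1}) / P(x_n)> = H(X_n) - H(X_n|X_{n-1}).
I_markov = H1 - Hc

# Equation 3.3, using the block entropies S(N) = H1 + (N-1)*Hc of a first-order chain.
S = lambda N: H1 + (N - 1) * Hc
T, Tp = 10, 7
I_blocks = S(T) + S(Tp) - S(T + Tp)

print(I_markov, I_blocks)   # the two numbers agree and do not grow with T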

More interesting are those cases in which I_pred(T) diverges at large T. In physical systems, we know that there are critical points where correlation times become infinite, so that optimal predictions will be influenced by events in the arbitrarily distant past. Under these conditions, the predictive information can grow without bound as T becomes large; for many systems the divergence is logarithmic, I_pred(T → ∞) ∝ ln T, as for the variable J_ij, short-range Ising model of Figures 2 and 3. Long-range correlations also are important in a time series where we can learn some underlying rules. It will turn out that when the set of possible rules can be described by a finite number of parameters, the predictive information again diverges logarithmically, and the coefficient of this divergence counts the number of parameters. Finally, a faster growth is also possible, so that I_pred(T → ∞) ∝ T^a, as for the variable J_ij, long-range Ising model, and we shall see that this behavior emerges from, for example, nonparametric learning problems.

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations or associations persist over long periods of time. In the usual theoretical models, there is some rule underlying the observable data, and this rule is valid forever; examples seen at one time inform us about the rule, and this information can be used to make predictions or generalizations. The predictive information quantifies the average generalization power of examples, and we shall see that there is a direct connection between the predictive information and the complexity of the possible underlying rules.


4.1 A Test Case. Let us begin with a simple example, already referred to above. We observe two streams of data x and y, or equivalently a stream of pairs (x₁, y₁), (x₂, y₂), ..., (x_N, y_N). Assume that we know in advance that the x's are drawn independently and at random from a distribution P(x), while the y's are noisy versions of some function acting on x,

y_n = f(x_n; \alpha) + \eta_n,   (4.1)

where f(x; α) is a class of functions parameterized by α, and η_n is noise, which for simplicity we will assume is gaussian with known standard deviation σ. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

f(x; \alpha) = \sum_{\mu=1}^{K} \alpha_\mu\, \phi_\mu(x).   (4.2)

The usual problem is to estimate, from $N$ pairs $\{x_i, y_i\}$, the values of the parameters $\alpha$. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy $S(N)$. We go through the calculation in some detail because it provides a model for the more general case.

To evaluate the entropy $S(N)$, we first construct the probability distribution $P(x_1, y_1, \ldots, x_N, y_N)$. The same set of rules applies to the whole data stream, which here means that the same parameters $\alpha$ apply for all pairs $\{x_i, y_i\}$, but these parameters are chosen at random from a distribution $\mathcal{P}(\alpha)$ at the start of the stream. Thus we write

$$P(x_1, y_1, \ldots, x_N, y_N) = \int d^K\alpha\, P(x_1, y_1, \ldots, x_N, y_N|\alpha)\, \mathcal{P}(\alpha), \qquad (4.3)$$

and now we need to construct the conditional distributions for fixed $\alpha$. By hypothesis, each $x$ is chosen independently, and once we fix $\alpha$, each $y_i$ is correlated only with the corresponding $x_i$, so that we have

$$P(x_1, y_1, \ldots, x_N, y_N|\alpha) = \prod_{i=1}^{N} \left[ P(x_i)\, P(y_i|x_i; \alpha) \right]. \qquad (4.4)$$

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of $y_i$ has the form

$$P(y_i|x_i; \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} \left( y_i - \sum_{\mu=1}^{K} \alpha_\mu\, \phi_\mu(x_i) \right)^2 \right]. \qquad (4.5)$$


Putting all these factors together,

$$P(x_1, y_1, \ldots, x_N, y_N) = \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^N \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} y_i^2 \right]$$
$$\times \exp\left[ -\frac{N}{2} \sum_{\mu,\nu=1}^{K} A_{\mu\nu}(\{x_i\})\, \alpha_\mu \alpha_\nu + N \sum_{\mu=1}^{K} B_\mu(\{x_i, y_i\})\, \alpha_\mu \right], \qquad (4.6)$$

where

$$A_{\mu\nu}(\{x_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} \phi_\mu(x_i)\, \phi_\nu(x_i), \quad \text{and} \qquad (4.7)$$

$$B_\mu(\{x_i, y_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} y_i\, \phi_\mu(x_i). \qquad (4.8)$$

Our placement of the factors of $N$ means that both $A_{\mu\nu}$ and $B_\mu$ are of order unity as $N \to \infty$. These quantities are empirical averages over the samples $\{x_i, y_i\}$, and if the $\phi_\mu$ are well behaved, we expect that these empirical means converge to expectation values for most realizations of the series $\{x_i\}$:

$$\lim_{N\to\infty} A_{\mu\nu}(\{x_i\}) = A^\infty_{\mu\nu} = \frac{1}{\sigma^2} \int dx\, P(x)\, \phi_\mu(x)\, \phi_\nu(x), \qquad (4.9)$$

$$\lim_{N\to\infty} B_\mu(\{x_i, y_i\}) = B^\infty_\mu = \sum_{\nu=1}^{K} A^\infty_{\mu\nu}\, \bar{\alpha}_\nu, \qquad (4.10)$$

where $\bar{\alpha}$ are the parameters that actually gave rise to the data stream $\{x_i, y_i\}$. In fact, we can make the same argument about the terms in $\sum y_i^2$,

$$\lim_{N\to\infty} \sum_{i=1}^{N} y_i^2 = N\sigma^2 \left[ \sum_{\mu,\nu=1}^{K} \bar{\alpha}_\mu\, A^\infty_{\mu\nu}\, \bar{\alpha}_\nu + 1 \right]. \qquad (4.11)$$

Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for and implications of this convergence breaking down.
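A minimal sketch (not from the paper) of the convergence assumed in equation 4.9: the empirical matrix $A_{\mu\nu}(\{x_i\})$ approaches $A^\infty_{\mu\nu}$ as $N$ grows. The basis functions, the distribution $P(x)$, and $\sigma$ below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma = 0.2
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]   # K = 3

def A_empirical(x):
    # Equation 4.7: A_{mu nu} = (1 / sigma^2 N) sum_i phi_mu(x_i) phi_nu(x_i)
    phi = np.stack([f(x) for f in basis])          # shape (K, N)
    return phi @ phi.T / (sigma**2 * len(x))

# Equation 4.9 with x uniform on [0, 1]: A^inf = (1/sigma^2) E[phi_mu phi_nu]
xg = np.linspace(0, 1, 100_001)
phi_g = np.stack([f(xg) for f in basis])
A_inf = np.trapz(phi_g[:, None, :] * phi_g[None, :, :], xg, axis=2) / sigma**2

for N in [10, 100, 1000, 10000]:
    x = rng.uniform(0, 1, N)
    err = np.max(np.abs(A_empirical(x) - A_inf))
    print(f"N = {N:6d}   max |A - A_inf| = {err:.4f}")
```

The error decreases roughly as $1/\sqrt{N}$, which is the scale of the fluctuations discussed in section 4.4.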

Putting the different factors together, we obtain

$$P(x_1, y_1, \ldots, x_N, y_N) \rightarrow \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^N \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right], \qquad (4.12)$$

where the effective "energy" per sample is given by

$$E_N(\alpha; \{x_i, y_i\}) = \frac{1}{2} + \frac{1}{2} \sum_{\mu,\nu=1}^{K} (\alpha_\mu - \bar{\alpha}_\mu)\, A^\infty_{\mu\nu}\, (\alpha_\nu - \bar{\alpha}_\nu). \qquad (4.13)$$

Here we use the symbol $\rightarrow$ to indicate that we not only take the limit of large $N$ but also neglect the fluctuations. Note that in this approximation the dependence on the sample points themselves is hidden in the definition of $\bar{\alpha}$ as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor $N$ in the exponent; the energy $E_N$ is of order unity as $N \to \infty$. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

$$\int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right]$$
$$\approx \mathcal{P}(\alpha_{\rm cl}) \exp\left[ -N E_N(\alpha_{\rm cl}; \{x_i, y_i\}) - \frac{K}{2} \ln \frac{N}{2\pi} - \frac{1}{2} \ln \det \mathcal{F}_N + \cdots \right], \qquad (4.14)$$

where $\alpha_{\rm cl}$ is the "classical" value of $\alpha$ determined by the extremal conditions

$$\left. \frac{\partial E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu} \right|_{\alpha = \alpha_{\rm cl}} = 0, \qquad (4.15)$$

the matrix $\mathcal{F}_N$ consists of the second derivatives of $E_N$,

$$\mathcal{F}_N = \left. \frac{\partial^2 E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu\, \partial \alpha_\nu} \right|_{\alpha = \alpha_{\rm cl}}, \qquad (4.16)$$

and $\cdots$ denotes terms that vanish as $N \to \infty$. If we formulate the problem of estimating the parameters $\alpha$ from the samples $\{x_i, y_i\}$, then as $N \to \infty$ the matrix $N\mathcal{F}_N$ is the Fisher information matrix (Cover & Thomas, 1991); the eigenvectors of this matrix give the principal axes for the error ellipsoid in parameter space, and the (inverse) eigenvalues give the variances of parameter estimates along each of these directions. The classical $\alpha_{\rm cl}$ differs from $\bar{\alpha}$ only in terms of order $1/N$; we neglect this difference and further simplify the calculation of leading terms as $N$ becomes large. After a little more algebra, we find the probability distribution we have been looking for:

$$P(x_1, y_1, \ldots, x_N, y_N) \rightarrow \left[ \prod_{i=1}^{N} P(x_i) \right] \frac{1}{Z_A}\, \mathcal{P}(\bar{\alpha}) \exp\left[ -\frac{N}{2} \ln(2\pi e \sigma^2) - \frac{K}{2} \ln N + \cdots \right], \qquad (4.17)$$

where the normalization constant

$$Z_A = \sqrt{(2\pi)^K \det A^\infty}. \qquad (4.18)$$

Again we note that the sample points $\{x_i, y_i\}$ are hidden in the value of $\bar{\alpha}$ that gave rise to these points.^5

To evaluate the entropy $S(N)$, we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the $x_i$, and because these are chosen independently from the distribution $P(x)$, the average again is easy to evaluate. The third term involves $\bar{\alpha}$, and we need to average this over the joint distribution $P(x_1, y_1, \ldots, x_N, y_N)$. As above, we can evaluate this average in steps: first, we choose a value of the parameters $\bar{\alpha}$, then we average over the samples given these parameters, and finally we average over parameters. But because $\bar{\alpha}$ is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

$$S(N) = N \left[ S_x + \frac{1}{2} \log_2\!\left( 2\pi e \sigma^2 \right) \right] + \frac{K}{2} \log_2 N + S_\alpha + \left\langle \log_2 Z_A \right\rangle_\alpha + \cdots, \qquad (4.19)$$

where $\langle \cdots \rangle_\alpha$ means averaging over parameters, $S_x$ is the entropy of the distribution of $x$,

$$S_x = -\int dx\, P(x) \log_2 P(x), \qquad (4.20)$$

and similarly for the entropy of the distribution of parameters,

$$S_\alpha = -\int d^K\alpha\, \mathcal{P}(\alpha) \log_2 \mathcal{P}(\alpha). \qquad (4.21)$$

5 We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points $\{x_i, y_i\}$. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as $N \to \infty$, so that this approach should yield the asymptotic behavior of the distribution and, hence, the subextensive entropy at large $N$. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as $N \to \infty$, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.


The different terms in the entropy, equation 4.19, have a straightforward interpretation. First, we see that the extensive term in the entropy,

$$S_0 = S_x + \frac{1}{2} \log_2(2\pi e \sigma^2), \qquad (4.22)$$

reflects contributions from the random choice of $x$ and from the gaussian noise in $y$; these extensive terms are independent of the variations in parameters $\alpha$, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, $S_\alpha$. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term $\langle \log_2 Z_A \rangle_\alpha$ compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

$$S_1(N) \rightarrow \frac{K}{2} \log_2 N \ \ \text{(bits)}. \qquad (4.23)$$

The coefficient of this divergence counts the number of parameters independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions $\{\phi_\mu(x)\}$. This is a hint that the result in equation 4.23 is more universal than our simple example.
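For this gaussian test case, the logarithmic growth can be checked directly, since the integral in equation 4.14 is gaussian and can be done exactly. The sketch below (not from the paper) assumes a standard gaussian prior $\mathcal{P}(\alpha)$ and a randomly chosen positive-definite $A^\infty$; its output shows $-\log_2 Z$ tracking $(K/2)\log_2 N$ up to an $N$-independent constant, as in equation 4.23.

```python
import numpy as np

rng = np.random.default_rng(3)

K = 4
M = rng.normal(size=(K, K))
A_inf = M @ M.T / K + 0.5 * np.eye(K)        # an assumed positive-definite A^infinity
abar = rng.normal(size=K)                    # assumed target parameters

def neg_log2_Z(N):
    # Z = int d^K alpha  N(alpha; 0, I) exp[-(N/2)(alpha - abar)^T A_inf (alpha - abar)],
    # the gaussian version of the integral in equation 4.14.
    C = np.eye(K) + N * A_inf
    quad_term = N * abar @ A_inf @ np.linalg.solve(C, abar)
    return 0.5 * (np.linalg.slogdet(C)[1] + quad_term) / np.log(2)

for N in [10, 100, 1000, 10000]:
    print(f"N = {N:6d}   -log2 Z = {neg_log2_Z(N):8.2f}   "
          f"(K/2) log2 N = {0.5 * K * np.log2(N):8.2f}")
```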

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning: we are given examples of how the points $x_n$ map into $y_n$, and from these examples we are to induce the association or functional relation between $x$ and $y$. An alternative view is that the pair of points $(x, y)$ should be viewed as a vector $\vec{x}$, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables $\{\vec{x}_i\}$ is drawn independently from the same probability distribution $Q(\vec{x}|\alpha)$, and this distribution depends on a (potentially infinite dimensional) vector of parameters $\alpha$. As above, the parameters are unknown, and before the series starts, they are chosen randomly from a distribution $\mathcal{P}(\alpha)$. With no constraints on the densities $\mathcal{P}(\alpha)$ or $Q(\vec{x}|\alpha)$, it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters $\alpha$, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997; and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).


We begin with the definition of entropy,

$$S(N) \equiv S[\{\vec{x}_i\}] = -\int d\vec{x}_1 \cdots d\vec{x}_N\, P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \log_2 P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N). \qquad (4.24)$$

By analogy with equation 4.3, we then write

$$P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) = \int d^K\alpha\, \mathcal{P}(\alpha) \prod_{i=1}^{N} Q(\vec{x}_i|\alpha). \qquad (4.25)$$

Next, combining equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite $S(N)$ as

$$S(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \left\{ \int d\vec{x}_1 \cdots d\vec{x}_N \prod_{j=1}^{N} Q(\vec{x}_j|\bar{\alpha}) \log_2 P(\{\vec{x}_i\}) \right\}. \qquad (4.26)$$

Equation 4.26 allows an easy interpretation. There is the "true" set of parameters $\bar{\alpha}$ that gave rise to the data sequence $\vec{x}_1, \ldots, \vec{x}_N$ with the probability $\prod_{j=1}^{N} Q(\vec{x}_j|\bar{\alpha})$. We need to average $\log_2 P(\vec{x}_1, \ldots, \vec{x}_N)$ first over all possible realizations of the data keeping the true parameters fixed, and then over the parameters $\bar{\alpha}$ themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

$$P(\vec{x}_1, \ldots, \vec{x}_N) = \prod_{j=1}^{N} Q(\vec{x}_j|\bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \prod_{i=1}^{N} \left[ \frac{Q(\vec{x}_i|\alpha)}{Q(\vec{x}_i|\bar{\alpha})} \right]$$
$$= \prod_{j=1}^{N} Q(\vec{x}_j|\bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{\vec{x}_i\}) \right], \qquad (4.27)$$

$$E_N(\alpha; \{\vec{x}_i\}) = -\frac{1}{N} \sum_{i=1}^{N} \ln\left[ \frac{Q(\vec{x}_i|\alpha)}{Q(\vec{x}_i|\bar{\alpha})} \right]. \qquad (4.28)$$

Since by our interpretation, $\bar{\alpha}$ are the true parameters that gave rise to the particular data $\{\vec{x}_i\}$, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

$$E_N(\alpha; \{\vec{x}_i\}) = -\int d^D x\, Q(\vec{x}|\bar{\alpha}) \ln\left[ \frac{Q(\vec{x}|\alpha)}{Q(\vec{x}|\bar{\alpha})} \right] - \psi(\alpha, \bar{\alpha}; \{\vec{x}_i\}), \qquad (4.29)$$

where $\psi \to 0$ as $N \to \infty$; here we neglect $\psi$ and return to this term below.


The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, $D_{\rm KL}(\bar{\alpha}\|\alpha)$, between the true distribution characterized by parameters $\bar{\alpha}$ and the possible distribution characterized by $\alpha$. Thus, at large $N$ we have

$$P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \rightarrow \prod_{j=1}^{N} Q(\vec{x}_j|\bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha}\|\alpha) \right], \qquad (4.30)$$

where again the notation $\rightarrow$ reminds us that we are not only taking the limit of large $N$ but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find

$$S(N) \approx S_0 \cdot N + S_1^{(a)}(N), \qquad (4.31)$$

$$S_0 = \int d^K\alpha\, \mathcal{P}(\alpha) \left[ -\int d^D x\, Q(\vec{x}|\alpha) \log_2 Q(\vec{x}|\alpha) \right], \quad \text{and} \qquad (4.32)$$

$$S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \log_2\left[ \int d^K\alpha\, \mathcal{P}(\alpha)\, e^{-N D_{\rm KL}(\bar{\alpha}\|\alpha)} \right]. \qquad (4.33)$$

Here $S_1^{(a)}$ is an approximation to $S_1$ that neglects fluctuations $\psi$. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence $\vec{x}_1, \ldots, \vec{x}_N$ with the disorder, $E_N(\alpha; \{\vec{x}_i\})$ with the energy of the quenched system, and $D_{\rm KL}(\bar{\alpha}\|\alpha)$ with its annealed analog.

The extensive term $S_0$, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the $N$ dependence of the partition function

$$Z(\bar{\alpha}; N) = \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha}\|\alpha) \right], \qquad (4.34)$$

and $S_1^{(a)}(N) = -\langle \log_2 Z(\bar{\alpha}; N) \rangle_{\bar{\alpha}}$ is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are KL divergence $D$ away from the target $\bar{\alpha}$,

$$\rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\left[ D - D_{\rm KL}(\bar{\alpha}\|\alpha) \right]. \qquad (4.35)$$


This density could be very different for different targets.^6 The density of divergences is normalized because the original distribution over parameter space, $\mathcal{P}(\alpha)$, is normalized,

$$\int dD\, \rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha) = 1. \qquad (4.36)$$

Finally, the partition function takes the simple form

$$Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-ND]. \qquad (4.37)$$

We recall that in statistical mechanics, the partition function is given by

$$Z(\beta) = \int dE\, \rho(E) \exp[-\beta E], \qquad (4.38)$$

where $\rho(E)$ is the density of states that have energy $E$ and $\beta$ is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length $N$ of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of $\rho(D; \bar{\alpha})$ as $D \to 0$.^7

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small $D$, and the density term, which favors values of $D$ that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with $D \to 0$. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states, $\rho(D; \bar{\alpha})$, at small $D$ is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

$$D_{\rm KL}(\bar{\alpha}\|\alpha) \approx \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu)\, \mathcal{F}_{\mu\nu}\, (\bar{\alpha}_\nu - \alpha_\nu) + \cdots, \qquad (4.39)$$

6 If parameter space is compact, then a related description of the space of targets based on metric entropy, also called Kolmogorov's $\epsilon$-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of size not greater than $\epsilon$ into which the target space can be partitioned.

7 It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.


where $N\mathcal{F}$ is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low $D$ limit of the density $\rho(D; \bar{\alpha})$:

$$\rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\left[ D - D_{\rm KL}(\bar{\alpha}\|\alpha) \right]$$
$$\approx \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\left[ D - \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu)\, \mathcal{F}_{\mu\nu}\, (\bar{\alpha}_\nu - \alpha_\nu) \right]$$
$$= \int d^K\xi\, \mathcal{P}(\bar{\alpha} + \mathcal{U} \cdot \xi)\, \delta\left[ D - \frac{1}{2} \sum_\mu \Lambda_\mu \xi_\mu^2 \right], \qquad (4.40)$$

where $\mathcal{U}$ is a matrix that diagonalizes $\mathcal{F}$,

$$(\mathcal{U}^T \cdot \mathcal{F} \cdot \mathcal{U})_{\mu\nu} = \Lambda_\mu\, \delta_{\mu\nu}. \qquad (4.41)$$

The delta function restricts the components of $\xi$ in equation 4.40 to be of order $\sqrt{D}$ or less, and so if $\mathcal{P}(\alpha)$ is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

$$\rho(D \to 0; \bar{\alpha}) \approx \mathcal{P}(\bar{\alpha})\, \frac{2\pi^{K/2}}{\Gamma(K/2)}\, (\det \mathcal{F})^{-1/2}\, D^{(K-2)/2}. \qquad (4.42)$$

Here, as before, $K$ is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

$$Z(\bar{\alpha}; N \to \infty) \approx f(\bar{\alpha}) \cdot \frac{\Gamma(K/2)}{N^{K/2}}, \qquad (4.43)$$

where $f(\bar{\alpha})$ is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

$$S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \log_2 Z(\bar{\alpha}; N) \qquad (4.44)$$
$$\rightarrow \frac{K}{2} \log_2 N + \cdots \ \ \text{(bits)}, \qquad (4.45)$$

where $\cdots$ are finite as $N \to \infty$. Thus, general $K$-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even of the prior distribution $\mathcal{P}(\alpha)$ on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
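A small numerical check (not from the paper) of this prior independence, using a family that is neither gaussian nor symmetric: $Q(x|\alpha)$ an exponential distribution with rate $\alpha$, for which $D_{\rm KL}(\bar{\alpha}\|\alpha) = \ln(\bar{\alpha}/\alpha) + \alpha/\bar{\alpha} - 1$, with a uniform prior on $[0.1, 10]$. The family, the prior, and $\bar{\alpha}$ are assumptions chosen for illustration; the quantity $-\log_2 Z(\bar{\alpha}; N)$ grows as $(1/2)\log_2 N$ plus a constant.

```python
import numpy as np

abar = 2.0
a = np.linspace(0.1, 10.0, 2_000_001)
prior = np.ones_like(a) / (10.0 - 0.1)                 # uniform prior density
d_kl = np.log(abar / a) + a / abar - 1.0               # KL between exponentials

for N in [10, 100, 1000, 10_000, 100_000]:
    Z = np.trapz(prior * np.exp(-N * d_kl), a)         # equation 4.34
    print(f"N = {N:7d}   -log2 Z = {-np.log2(Z):7.3f}   "
          f"(K/2) log2 N = {0.5 * np.log2(N):7.3f}")
```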

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space $K$ should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension $d_{\rm VC}$), which measures capacity of the model class, and with the phase-space dimension $d$, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that $d_{\rm VC}$ is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of $\mathcal{F}$ is zero, and hence both $d_{\rm VC}$ and $d$ are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction but not all of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with $d < d_{\rm VC} < \infty$ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small $D_{\rm KL}$ behavior of the model density,

$$\rho(D \to 0; \bar{\alpha}) \propto D^{(d-2)/2}, \qquad (4.46)$$

and then $d$ appears in place of $K$ as the coefficient of the log divergence in $S_1(N)$ (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density $\rho(D; \bar{\alpha})$ accumulates a macroscopic weight at some nonzero value of $D$, so that the small $D$ behavior is irrelevant for the large $N$ asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of $d_{\rm VC}$ (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small $D$ behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of $I_{\rm pred}$ measures the phase-space or the scaling dimension $d$ and nothing else. This asymptote is valid, however, only for $N \gg d_{\rm VC}$. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution $Q(\vec{x}_1, \ldots, \vec{x}_N|\alpha)$. Again, $\alpha$ is an unknown parameter vector chosen randomly at the beginning of the series. If $\alpha$ is a $K$-dimensional vector, then one still tries to learn just $K$ numbers, and there are still $N$ examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by $(K/2) \log_2 N$, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution $Q$ is a finite-order Markov model. In fact, all we really need are the following two conditions:

$$S[\{\vec{x}_i\}\,|\,\alpha] \equiv -\int d^N\vec{x}\, Q(\{\vec{x}_i\}|\alpha) \log_2 Q(\{\vec{x}_i\}|\alpha) \rightarrow N S_0 + S_0^*; \quad S_0^* = O(1), \qquad (4.47)$$

$$D_{\rm KL}\left[ Q(\{\vec{x}_i\}|\bar{\alpha}) \,\|\, Q(\{\vec{x}_i\}|\alpha) \right] \rightarrow N D_{\rm KL}(\bar{\alpha}\|\alpha) + o(N). \qquad (4.48)$$

Here the quantities $S_0$, $S_0^*$, and $D_{\rm KL}(\bar{\alpha}\|\alpha)$ are defined by taking limits $N \to \infty$ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if $\alpha$ is known, there are no long-range correlations in the data; all of the long-range predictability is associated with learning the parameters.^8 The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between $K$, $d$, and $d_{\rm VC}$, one arrives at the result

$$S(N) = S_0 \cdot N + S_1^{(a)}(N), \qquad (4.49)$$

$$S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \ \ \text{(bits)}, \qquad (4.50)$$

where $\cdots$ stands for terms of order one as $N \to \infty$. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes $1/f$ noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution $Q(\vec{x}_1, \ldots, \vec{x}_N|\alpha)$. Suppose, for example, that $S_0^* = (K^*/2) \log_2 N$. Then the subextensive entropy becomes

$$S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \ \ \text{(bits)}. \qquad (4.51)$$

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same $K + K^*$ are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to $\alpha$, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions ($K^* \neq 0$, but $K = 0$) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations ($K \neq 0$, $K^* = 0$). The two descriptions are equivalent, and this is captured by the subextensive entropy.^9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy $S_1$ are worthless unless we prove that the fluctuations $\psi$ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between $d$ and $d_{\rm VC}$. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy $S(N)$ exactly as

$$S(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar{\alpha}) \right]$$
$$\times \log_2\left\{ \prod_{i=1}^{N} Q(\vec{x}_i|\bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha)\, e^{-N D_{\rm KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha, \bar{\alpha}; \{\vec{x}_i\})} \right\}. \qquad (4.52)$$

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, $S_1^{(f)}(N)$:

$$S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N), \qquad (4.53)$$

$$S_1^{(f)} = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar{\alpha}) \right]$$
$$\times \log_2\left[ \int \frac{d^K\alpha\, \mathcal{P}(\alpha)}{Z(\bar{\alpha}; N)}\, e^{-N D_{\rm KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha, \bar{\alpha}; \{\vec{x}_i\})} \right], \qquad (4.54)$$

where $\psi$ is defined as in equation 4.29, and $Z$ as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of $-S_1^{(a)}$ as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that $S^{(f)}$ is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total $S_1(N)$ is bounded,

$$S_1(N) \le N \int d^K\alpha \int d^K\bar{\alpha}\, \mathcal{P}(\alpha)\, \mathcal{P}(\bar{\alpha})\, D_{\rm KL}(\bar{\alpha}\|\alpha) \equiv \left\langle N D_{\rm KL}(\bar{\alpha}\|\alpha) \right\rangle_{\bar{\alpha}\alpha}, \qquad (4.55)$$

so that if we have a class of models (and a prior $\mathcal{P}(\alpha)$) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that $\langle D_{\rm KL}(\bar{\alpha}\|\alpha) \rangle_{\bar{\alpha}\alpha}$ includes $S_0$ as one of its terms, so that usually $S_0$ and $S_1$ are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if $\psi$ were zero, and $\psi$ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension ($d_{\rm VC}$) of the set of probability densities $\{Q(\vec{x}|\alpha)\}$ is finite. Then for any reasonable function $F(\vec{x}; \beta)$, deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

$$P\left\{ \sup_\beta \frac{\left| \frac{1}{N} \sum_j F(\vec{x}_j; \beta) - \int d\vec{x}\, Q(\vec{x}|\bar{\alpha})\, F(\vec{x}; \beta) \right|}{L[F]} > \epsilon \right\} < M(\epsilon, N, d_{\rm VC})\, e^{-c N \epsilon^2}, \qquad (4.56)$$

where $c$ and $L[F]$ depend on the details of the particular bound used. Typically, $c$ is a constant of order one, and $L[F]$ either is some moment of $F$ or the range of its variation. In our case, $F$ is the log ratio of two densities, so that $L[F]$ may be assumed bounded for almost all $\beta$ without loss of generality in view of equation 4.55. In addition, $M(\epsilon, N, d_{\rm VC})$ is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on $d_{\rm VC}$. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review, see Vapnik, 1998, and the references therein).
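The $1/\sqrt{N}$ concentration that such bounds formalize is easy to see empirically. The sketch below (not from the paper; the gaussian family and parameter values are assumptions) draws many data sets and measures the spread of the fluctuation term $\psi$ of equation 4.29.

```python
import numpy as np

rng = np.random.default_rng(4)

abar, a = 0.0, 1.0                    # unit-variance gaussians with these means
d_kl = 0.5 * (a - abar)**2            # D_KL(abar || a) for this family

for N in [10, 100, 1000, 10000]:
    x = rng.normal(abar, 1.0, size=(1000, N))          # 1000 repeated data sets
    log_ratio = -0.5 * ((x - a)**2 - (x - abar)**2)    # log Q(x|a) / Q(x|abar)
    psi = log_ratio.mean(axis=1) + d_kl                # fluctuation around -D_KL
    print(f"N = {N:6d}   std of psi = {psi.std():.4f}   1/sqrt(N) = {1/np.sqrt(N):.4f}")
```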

To start the proof of finiteness of $S_1^{(f)}$ in this case, we first show that only the region $\alpha \approx \bar{\alpha}$ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of $\alpha - \bar{\alpha}$, the KL divergence almost always dominates the fluctuation term, that is, the contribution of sequences of $\{\vec{x}_i\}$ with atypically large fluctuations is negligible (atypicality is defined as $\psi \ge \delta$, where $\delta$ is some small constant independent of $N$). Since the fluctuations decrease as $1/\sqrt{N}$ (see equation 4.56) and $D_{\rm KL}$ is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by $N$ times the supremum value of $\psi$. Then we realize that the averaging over $\bar{\alpha}$ and $\{\vec{x}_i\}$ is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to $\epsilon$ (this brings down an extra factor of $N\epsilon$). Thus the worst-case contribution of these atypical sequences is

$$S_1^{(f),\,{\rm atypical}} \sim \int_\delta^\infty d\epsilon\, N^2 \epsilon^2\, M(\epsilon)\, e^{-cN\epsilon^2} \sim e^{-cN\delta^2} \ll 1 \quad \text{for large } N. \qquad (4.57)$$


This bound lets us focus our attention on the region $\alpha \approx \bar{\alpha}$. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

$$S_1^{(f),\,{\rm typical}} \sim \int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha})\, d^N\vec{x} \prod_j Q(\vec{x}_j|\bar{\alpha})\; N\, (B \cdot A^{-1} \cdot B);$$
$$B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu}, \quad \langle B \rangle_{\vec{x}} = 0;$$
$$(A)_{\mu\nu} = -\frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu\, \partial \bar{\alpha}_\nu}, \quad \langle A \rangle_{\vec{x}} = \mathcal{F}. \qquad (4.58)$$

Here $\langle \cdots \rangle_{\vec{x}}$ means an averaging with respect to all $\vec{x}_i$'s keeping $\bar{\alpha}$ constant. One immediately recognizes that $B$ and $A$ are, respectively, the first and the second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of $A$ from $\mathcal{F}$ are not allowed, and we may bound equation 4.58 by replacing $A^{-1}$ with $\mathcal{F}^{-1}(1 + \delta)$, where $\delta$ again is independent of $N$. Now we have to average a bunch of products like


$$\frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu}\, (\mathcal{F}^{-1})_{\mu\nu}\, \frac{\partial \log Q(\vec{x}_j|\bar{\alpha})}{\partial \bar{\alpha}_\nu} \qquad (4.59)$$

over all $\vec{x}_i$'s. Only the terms with $i = j$ survive the averaging. There are $K^2 N$ such terms, each contributing of order $N^{-1}$. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with $N$.

This concludes the proof of controllability of fluctuations for $d_{\rm VC} < \infty$. One may notice that we never used the specific form of $M(\epsilon, N, d_{\rm VC})$, which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of $\alpha$ and $\bar{\alpha}$ that lead to violation of the bound. That is, if the prior $\mathcal{P}(\alpha)$ has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure $\mathcal{C}$ of nested subsets $C_1 \subset C_2 \subset C_3 \subset \cdots$ can be imposed onto the set $C$ of all admissible solutions of a learning problem, such that $C = \bigcup C_n$. The idea is that, having a finite number of observations $N$, one is confined to the choices within some particular structure element $C_n$, $n = n(N)$, when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, $n$ grows. If $d_{\rm VC}$ for learning in any $C_n$, $n < \infty$, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set $C$ is not described by a finite $d_{\rm VC}$.

In the context of SRM, the role of the prior $\mathcal{P}(\alpha)$ is to induce a structure on the set of all admissible densities, and the fight between the number of samples $N$ and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, $C_n$, $n = n(N)$, grows with $N$. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable $x$ is distributed according to the following probability density functions:

$$Q(x|\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x - \alpha)^2 \right], \quad x \in (-\infty; +\infty); \qquad (4.60)$$

$$Q(x|\alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \quad x \in [0; 2\pi). \qquad (4.61)$$

Learning the parameter in the first case is a $d_{\rm VC} = 1$ problem; in the second case, $d_{\rm VC} = \infty$. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior $\mathcal{P}(\alpha)$. The second one does not allow this. Suppose that the prior is uniform in a box $0 < \alpha < \alpha_{\rm max}$ and zero elsewhere, with $\alpha_{\rm max}$ rather large. Then for not too many sample points $N$, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of $-\sin \alpha x$. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than $\alpha_{\rm max}$.^10 So the fluctuations are large, and the predictive information is

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for $\alpha_{\rm max} \to \infty$ the weight in $\rho(D; \bar{\alpha})$ at small $D_{\rm KL}$ vanishes, and a finite weight accumulates at some nonzero value of $D$. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of $\alpha$ would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of $(1/2) \log N$. On the other hand, there is no uniform bound on the value of $N$ for which this convergence will occur; it is guaranteed only for $N \gg d_{\rm VC}$, which is never true if $d_{\rm VC} = \infty$. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to $d_{\rm VC}$.
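The overfitting mechanism in this $d_{\rm VC} = \infty$ example can be seen in a small simulation (not from the paper; the grids, the random seed, and $\alpha_{\rm max}$ are assumptions chosen for illustration). For very few samples and a wide prior box, the maximum-likelihood estimate typically latches onto a large, spurious frequency; only with many samples does it settle near the true parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

a_true, a_max = 1.0, 200.0
a_grid = np.arange(0.05, a_max, 0.05)                  # candidate parameters
xg = np.linspace(0, 2 * np.pi, 2048, endpoint=False)

# log of the normalization integral in equation 4.61 for every candidate a
log_norm = np.log(np.trapz(np.exp(-np.sin(np.outer(a_grid, xg))), xg, axis=1))

def sample(n):
    """Rejection-sample n points from Q(x|a_true)."""
    out = []
    while len(out) < n:
        x = rng.uniform(0, 2 * np.pi)
        if rng.uniform(0, np.e) < np.exp(-np.sin(a_true * x)):
            out.append(x)
    return np.array(out)

for N in [5, 20, 100, 1000]:
    x = sample(N)
    loglike = -np.sin(np.outer(a_grid, x)).sum(axis=1) - N * log_norm
    a_hat = a_grid[np.argmax(loglike)]
    print(f"N = {N:5d}   ML estimate a = {a_hat:7.2f}   (true a = {a_true})")
```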

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario in which the number of examples eventually overwhelms the number of parameters or dimensions is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite $d_{\rm VC}$ fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with $N$ as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with $N$ more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as $\rho \sim D^{(d-2)/2}$. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density $\rho(D \to 0)$ vanishes more rapidly than any power of $D$. As an example, we could have

$$\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right], \quad \mu > 0, \qquad (4.62)$$

that is, an essential singularity at $D = 0$. For simplicity, we assume that the constants $A$ and $B$ can depend on the target model, but that the nature of the essential singularity ($\mu$) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

$$Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-ND]$$
$$\approx A(\bar{\alpha}) \int dD \exp\left[ -\frac{B(\bar{\alpha})}{D^\mu} - ND \right]$$
$$\approx \tilde{A}(\bar{\alpha}) \exp\left[ -\frac{1}{2}\, \frac{\mu + 2}{\mu + 1} \ln N - C(\bar{\alpha})\, N^{\mu/(\mu+1)} \right], \qquad (4.63)$$

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large $N$, and the coefficients are

$$\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu + 1} \right]^{1/2} \cdot [B(\bar{\alpha})]^{1/(2\mu+2)}, \qquad (4.64)$$

$$C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right]. \qquad (4.65)$$

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large $N$,

$$S_1^{(a)}(N) \rightarrow \frac{1}{\ln 2}\, \langle C(\bar{\alpha}) \rangle_{\bar{\alpha}}\, N^{\mu/(\mu+1)} \ \ \text{(bits)}, \qquad (4.66)$$

where $\langle \cdots \rangle_{\bar{\alpha}}$ denotes an average over all the target models.
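The saddle-point result can be checked numerically with a one-line model of the density (a sketch; the values of $\mu$ and $B$ below are assumptions): the integral $\int dD\, e^{-B/D^\mu - ND}$ indeed decays as $\exp[-C N^{\mu/(\mu+1)}]$ with $C$ given by equation 4.65, up to the slowly growing logarithmic correction in equation 4.63.

```python
import numpy as np

mu, B = 1.0, 0.5
C = B**(1 / (mu + 1)) * (mu**(-mu / (mu + 1)) + mu**(1 / (mu + 1)))   # equation 4.65

D = np.linspace(1e-6, 1.0, 2_000_000)          # fine grid covering the saddle point
for N in [10**2, 10**3, 10**4, 10**5]:
    log_integrand = -B / D**mu - N * D
    m = log_integrand.max()                    # log-sum-exp for numerical stability
    lnZ = m + np.log(np.trapz(np.exp(log_integrand - m), D))
    print(f"N = {N:7d}   -ln Z = {-lnZ:8.2f}   C N^(mu/(mu+1)) = {C * N**(mu/(mu+1)):8.2f}")
```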


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with $K$ bins, and we let $K$ grow with the quantity of available samples as $K \sim N^\nu$. A straightforward generalization of equation 4.45 seems to imply then that $S_1(N) \sim N^\nu \log N$ (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large $N$ than about those that have existed since the very first measurements. Indeed, in a $K$-parameter system, the $N$th sample point contributes $\sim K/2N$ bits to the subextensive entropy (cf. equation 4.45). If $K$ changes as mentioned, the $N$th example then carries $\sim N^{\nu-1}$ bits. Summing this up over all samples, we find $S_1^{(a)} \sim N^\nu$, and, if we let $\nu = \mu/(\mu + 1)$, we obtain equation 4.66; note that the growth of the number of parameters is slower than $N$ ($\nu = \mu/(\mu+1) < 1$), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular $N$ (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As $\mu$ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large $N$, the predictive information increases with $\mu$. However, if $\mu \to \infty$, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, $S_1$ are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution $Q(x)$ for a continuous variable $x$, but rather than writing a parametric form of $Q(x)$, we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write $Q(x) = (1/l_0) \exp[-\phi(x)]$, so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function $\phi(x)$ is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

$$\mathcal{P}[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{l}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^2 \right] \delta\left[ \frac{1}{l_0} \int dx\, e^{-\phi(x)} - 1 \right], \qquad (4.67)$$

where $Z$ is the normalization constant for $\mathcal{P}[\phi]$, the delta function ensures that each distribution $Q(x)$ is normalized, and $l$ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of $Q(x)$ from a set of samples $x_1, x_2, \ldots, x_N$ converges to the correct answer, and the distribution at finite $N$ is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of $\mathcal{P}[\phi(x)]$, have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.^11
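To get a feeling for what this prior means, one can draw sample densities from it on a grid. The sketch below (not from the paper) uses the fact that the gradient-squared action makes $\phi(x)$ a gaussian field with Brownian increments of variance $dx/l$, and enforces the normalization of $Q$ explicitly in place of the delta function; the box size, grid, and $l$ are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

L, l, n_grid = 1.0, 0.05, 1000
x = np.linspace(0, L, n_grid)
dx = x[1] - x[0]

def sample_density():
    steps = rng.normal(0.0, np.sqrt(dx / l), size=n_grid)
    phi = np.cumsum(steps)            # random walk: smoothness scale set by l
    Q = np.exp(-phi)
    return Q / np.trapz(Q, x)         # normalization of Q(x), as in equation 4.67

for k in range(3):
    Q = sample_density()
    print(f"sample {k}: int Q dx = {np.trapz(Q, x):.3f}, "
          f"max Q = {Q.max():.2f}, min Q = {Q.min():.3f}")
```

Smaller $l$ produces rougher, more sharply peaked densities, which is the sense in which $l$ sets the scale of smoothness.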

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate $\rho(D; \bar{\phi})$ in the model defined by equation 4.67.

With $Q(x) = (1/l_0) \exp[-\phi(x)]$, we can write the KL divergence as

$$D_{\rm KL}[\bar{\phi}(x)\|\phi(x)] = \frac{1}{l_0} \int dx\, \exp[-\bar{\phi}(x)]\, [\phi(x) - \bar{\phi}(x)]. \qquad (4.68)$$

We want to compute the density,

$$\rho(D; \bar{\phi}) = \int [d\phi(x)]\, \mathcal{P}[\phi(x)]\, \delta\left( D - D_{\rm KL}[\bar{\phi}(x)\|\phi(x)] \right) \qquad (4.69)$$
$$= M \int [d\phi(x)]\, \mathcal{P}[\phi(x)]\, \delta\left( MD - M D_{\rm KL}[\bar{\phi}(x)\|\phi(x)] \right), \qquad (4.70)$$

where we introduce a factor $M$, which we will allow to become large so that we can focus our attention on the interesting limit $D \to 0$. To compute this integral over all functions $\phi(x)$, we introduce a Fourier representation for the delta function and then rearrange the terms:

$$\rho(D; \bar{\phi}) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, \mathcal{P}[\phi(x)] \exp(-izM D_{\rm KL}) \qquad (4.71)$$
$$= M \int \frac{dz}{2\pi} \exp\left( izMD + \frac{izM}{l_0} \int dx\, \bar{\phi}(x)\, e^{-\bar{\phi}(x)} \right)$$
$$\times \int [d\phi(x)]\, \mathcal{P}[\phi(x)] \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x)\, e^{-\bar{\phi}(x)} \right). \qquad (4.72)$$

The inner integral over the functions $\phi(x)$ is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that $zM$ is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

$$\int [d\phi(x)]\, \mathcal{P}[\phi(x)] \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x)\, e^{-\bar{\phi}(x)} \right)$$
$$= \exp\left( -\frac{izM}{l_0} \int dx\, \bar{\phi}(x)\, e^{-\bar{\phi}(x)} - S_{\rm eff}[\bar{\phi}(x); zM] \right), \qquad (4.73)$$

$$S_{\rm eff}[\bar{\phi}; zM] = \frac{l}{2} \int dx \left( \frac{\partial\bar{\phi}}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{l\, l_0} \right)^{1/2} \int dx\, \exp[-\bar{\phi}(x)/2]. \qquad (4.74)$$

Now we can do the integral over $z$, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that $D \to 0$ and $MD^{3/2} \to \infty$; we are interested precisely in the first limit, and we are free to set $M$ as we wish, so this gives us a good approximation for $\rho(D \to 0; \bar{\phi})$. This results in

$$\rho(D \to 0; \bar{\phi}) = A[\bar{\phi}(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar{\phi}(x)]}{D} \right), \qquad (4.75)$$

$$A[\bar{\phi}(x)] = \frac{1}{\sqrt{16\pi\, l\, l_0}} \exp\left[ -\frac{l}{2} \int dx\, \left( \partial_x\bar{\phi} \right)^2 \right] \int dx\, \exp[-\bar{\phi}(x)/2], \qquad (4.76)$$

$$B[\bar{\phi}(x)] = \frac{1}{16\, l\, l_0} \left( \int dx\, \exp[-\bar{\phi}(x)/2] \right)^2. \qquad (4.77)$$

Except for the factor of $D^{-3/2}$, this is exactly the sort of essential singularity that we considered in the previous section, with $\mu = 1$. The $D^{-3/2}$ prefactor does not change the leading large $N$ behavior of the predictive information, and we find that

$$S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \frac{1}{\sqrt{l\, l_0}} \left\langle \int dx\, \exp[-\bar{\phi}(x)/2] \right\rangle_{\bar{\phi}} N^{1/2}, \qquad (4.78)$$

where $\langle \cdots \rangle_{\bar{\phi}}$ denotes an average over the target distributions $\bar{\phi}(x)$ weighted once again by $\mathcal{P}[\bar{\phi}(x)]$. Notice that if $x$ is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable $x$ range from $0$ to $L$, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

$$S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \sqrt{N} \left( \frac{L}{l} \right)^{1/2} \ \ \text{bits}. \qquad (4.79)$$

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of $x$ into bins of local size $\sqrt{l/(N Q(x))}$ (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of $x$. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
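As a quick consistency check on this heuristic (a sketch, using the bin-size rule just quoted and a uniform target $Q(x) = 1/L$), the number of bins is

$$N_{\rm bins} \sim \frac{L}{\sqrt{l/(N Q)}} = \frac{L}{\sqrt{l L / N}} = \left( \frac{N L}{l} \right)^{1/2},$$

which reproduces the $N$ and $L$ dependence of equation 4.79.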

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step $\sim\sqrt{N}$ bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure $\mathcal{C}$ on the set of all possible densities, so that the set $C_n$ is formed of all continuous piecewise linear functions that have not more than $n$ kinks. Learning such functions for finite $n$ is a $d_{\rm VC} = n$ problem. Now, as $N$ grows, the elements with higher capacities, $n \sim \sqrt{N}$, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter $l$ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of $l$ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing $l$ to vary with $x$. In this case, the whole theory has no length scale, and there also is no need to confine the variable $x$ to a box (here of size $L$). We expect that this coordinate invariant approach will lead to a universal coefficient multiplying $\sqrt{N}$ in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). But real calculations for these cases remain challenging.

5 $I_{\rm pred}$ as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.12 Rissanen (1984) and Clarke and Barron (1990) in full generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimation with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions—an extensive entropy—and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log₂ N bits to the leading order.
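As a concrete numerical check of this leading-order behavior (our own illustration, not material from the original text): for data drawn from a parametric family with a prior over its parameters, H(x_1, ..., x_N) = ⟨H(x_1, ..., x_N | θ)⟩ + I(θ; x_1, ..., x_N), and the first term is extensive, so the subextensive entropy is exactly the mutual information between the parameter and the data. The Python sketch below evaluates this mutual information for the simplest case, K = 1 (coin flips with an unknown bias and a uniform prior), and shows that it grows as (1/2) log₂ N plus a constant.

import numpy as np
from scipy.stats import binom

def subextensive_entropy_bits(N, grid=1001):
    # I(theta; x_1..x_N) for coin flips with unknown bias theta and a uniform prior.
    # The number of heads is a sufficient statistic, so we can work with the binomial.
    theta = np.linspace(0.0, 1.0, grid + 2)[1:-1]        # open interval (0, 1)
    n = np.arange(N + 1)
    pmf = binom.pmf(n[None, :], N, theta[:, None])        # P(n | theta) on the grid
    plogp = pmf * np.log2(np.clip(pmf, 1e-300, None))
    H_cond = (-plogp.sum(axis=1)).mean()                  # <H(n | theta)> over the prior
    H_marg = np.log2(N + 1)       # with a uniform prior, P(n) = 1/(N + 1) exactly
    return H_marg - H_cond        # = I(theta; data) = subextensive entropy S1(N)

for N in [10, 100, 1000, 3000]:
    I = subextensive_entropy_bits(N)
    print(N, round(I, 3), round(I - 0.5 * np.log2(N), 3))   # the gap approaches a constant

The second printed column grows like (1/2) log₂ N while the third settles toward a constant, which is the K = 1 instance of the (K/2) log₂ N behavior discussed above.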

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S0 and thus are functions only of disorder—not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger power-law divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986) in particular has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number of walks with N steps, \mathcal{N}(N), we find that \mathcal{N}(N) ≈ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S0 = log₂ z, a constant subextensive entropy log₂ A, and a divergent subextensive term S1(N) → γ log₂ N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
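To make the identification of the three contributions explicit, the counting result above can be written out as follows (this is only a restatement of the sentence above, not additional material):

S(N) = \log_2 \mathcal{N}(N)
     \approx \underbrace{N \log_2 z}_{\text{extensive, } S_0 N}
     \;+\; \underbrace{\gamma \log_2 N}_{\text{divergent subextensive}}
     \;+\; \underbrace{\log_2 A}_{\text{constant subextensive}} .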

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)], \qquad (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right), \qquad (5.2)

which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions—certain letters tend to follow one another, letters form words, and so on—but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related almost by definition to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ Ipred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = Ipred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
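A minimal worked example of such a lossless compression (our own illustration; the Gaussian AR(1) process and the code below are not taken from the paper): for a stationary Gaussian Markov process, the single most recent sample is a sufficient statistic for prediction, so compressing the whole past down to that one number preserves all of the predictive information.

import numpy as np

def gaussian_mi_bits(cov, idx_a, idx_b):
    # Mutual information (in bits) between two blocks of a zero-mean Gaussian vector:
    # I(A; B) = (1/2) log [ det C_AA det C_BB / det C_{A union B} ]
    logdet = lambda idx: np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]
    return 0.5 * (logdet(idx_a) + logdet(idx_b) - logdet(idx_a + idx_b)) / np.log(2)

r, T_past, T_future = 0.8, 20, 20                 # AR(1): cov(x_i, x_j) = r**|i - j|
n = T_past + T_future
cov = r ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

past = list(range(T_past))                        # the full past window
future = list(range(T_past, n))                   # the future window
compressed = [T_past - 1]                         # keep only the most recent sample

print(gaussian_mi_bits(cov, past, future))        # I(past; future) = Ipred
print(gaussian_mi_bits(cov, compressed, future))  # information kept after compression
print(-0.5 * np.log2(1 - r ** 2))                 # analytic value; all three agree

For this Markov example the inequality I(x̂_past; x_future) ≤ Ipred(T) is saturated; for more general processes the compressed description would fall strictly below the bound.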


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet, one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps rather than providing an efficient representation of the current state of the world—as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992)—the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ≈ A N^ν and one with K parameters so that Ipred(N) ≈ (K/2) log₂ N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}. \qquad (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ≈ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ≈ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
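The following arithmetic is only a rough consistency check of equation 6.1 against the estimates quoted above (K = 5 × 10⁹ synapses, at most N ≈ 10⁹ samples); the values of A and ν are placeholders, since the text does not fix them.

import numpy as np

def N_c(K, A, nu):
    # Crossover sample size from equation 6.1.
    return (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1.0 / nu)

K = 5e9                            # synapses in ~10 mm^2 of inferotemporal cortex
N = 3 * 3600 * 16 * 365 * 10       # ~3 fixations/s, ~16 waking hours/day, 10 years
print(N)                           # ~6e8, i.e., at most N ~ 1e9 examples

for A in [1.0, 10.0, 100.0]:       # placeholder amplitudes
    for nu in [0.5, 0.9]:          # placeholder exponents, nu < 1
        print(A, nu, N_c(K, A, nu), N < N_c(K, A, nu))

For these (admittedly arbitrary) choices, N_c exceeds N in every case, consistent with the suggestion that biological learning operates with N < N_c.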

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the preceding N letters (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
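The sketch below is a crude stand-in for Shannon's procedure (our own construction; the n-gram "reader," the corpus path, and the in-sample training are all placeholder choices, and a human subject or a held-out model would be needed for a real bound): each letter is replaced by the number of guesses a simple frequency-ranked predictor needs, and the entropy of that guess-number distribution bounds the conditional entropy from above in Shannon's argument.

import numpy as np
from collections import Counter, defaultdict

def guess_rank_entropy(text, context_len):
    # Rank candidate next letters by conditional frequency given the previous
    # context_len letters; record how many guesses the true letter needs.
    counts = defaultdict(Counter)
    for i in range(context_len, len(text)):
        counts[text[i - context_len:i]][text[i]] += 1
    alphabet = sorted(set(text))
    ranks = []
    for i in range(context_len, len(text)):
        order = [c for c, _ in counts[text[i - context_len:i]].most_common()]
        order += [c for c in alphabet if c not in order]
        ranks.append(order.index(text[i]) + 1)
    p = np.bincount(ranks)[1:].astype(float)
    p = p[p > 0] / len(ranks)
    return -np.sum(p * np.log2(p))   # bits per letter (predictor trained in-sample)

text = open('corpus.txt').read().lower()[:200000]   # placeholder corpus path
for n in [0, 1, 2, 3]:
    print(n, guess_rank_entropy(text, n))           # how slowly the bound falls with context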

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/..., where ... is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.

Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.

Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.

Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.

Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.

de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.

Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.

Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.

Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.

Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.

Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.

Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.


Figure 1: Calculating entropy of words of length 4 in a chain of 17 spins. The figure marks the sliding 4-spin words W_k along the chain. For this chain n(W_0) = n(W_1) = n(W_3) = n(W_7) = n(W_12) = n(W_14) = 2, n(W_8) = n(W_9) = 1, and all other frequencies are zero; thus S(4) ≈ 2.95 bits.

Figure 1 for details.) If the number of spins is large, then counting these frequencies provides a good empirical estimate of P_N(W_k), the probability distribution of different words of length N. Then one can calculate the entropy S(N) of this probability distribution by the usual formula

S(N) = -\sum_{k=0}^{2^N - 1} P_N(W_k) \log_2 P_N(W_k) \quad \mbox{(bits)}. \qquad (2.2)

Note that this is not the entropy of a finite chain with length N; instead, it is the entropy of words or strings with length N drawn from a much longer chain. Nonetheless, since entropy is an extensive property, S(N) is proportional asymptotically to N for any spin chain, that is, S(N) ≈ S0 · N. The usual goal in statistical mechanics is to understand this "thermodynamic limit" N → ∞, and hence to calculate the entropy density S0. Different sets of interactions J_ij result in different values of S0, but the qualitative result S(N) ∝ N is true for all reasonable {J_ij}.
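As a quick numerical check of equation 2.2 (our own snippet, not part of the original text), the code below computes S(N) from a table of word counts; applied to the counts quoted in the caption of Figure 1, it reproduces S(4) ≈ 2.95 bits, and word_counts shows how such a table would be accumulated from a long spin sequence.

import numpy as np
from collections import Counter

def word_counts(spins, N):
    # Count the words of length N seen in a long binary sequence (sliding window).
    return Counter(tuple(spins[i:i + N]) for i in range(len(spins) - N + 1))

def block_entropy(counts):
    # Equation 2.2 with P_N(W_k) estimated as n(W_k) / sum_k n(W_k).
    n = np.array(list(counts.values()), dtype=float)
    p = n / n.sum()
    return -np.sum(p * np.log2(p))

# Word frequencies quoted in the caption of Figure 1 (words of length 4, 17 spins):
fig1 = {'W0': 2, 'W1': 2, 'W3': 2, 'W7': 2, 'W12': 2, 'W14': 2, 'W8': 1, 'W9': 1}
print(block_entropy(fig1))    # ~2.95 bits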

We investigated three different spin chains of 1 billion spins each Asusual in statistical mechanics the probability of any conguration of spinsfsig is given by the Boltzmann distribution

P[fsig] exp(iexclH kBT) (23)

where to normalize the scale of the Jij we set kBT D 1 For the rst chainonly JiiC1 D 1 was nonzero and its value was the same for all i The secondchain was also generated using the nearest-neighbor interactions but thevalue of the coupling was reset every 400000 spins by taking a randomnumber from a gaussian distribution with zero mean and unit varianceIn the third case we again reset the interactions at the same frequencybut now interactions were long-ranged the variances of coupling constantsdecreased with the distance between the spins as hJ2

iji D 1(i iexcl j)2 We plotS(N) for all these cases in Figure 2 and of course the asymptotically linear

2414 W Bialek I Nemenman and N Tishby

5 10 15 20 250

5

10

15

20

25

N

Sfixed J variable J short range interactions variable Jrsquos long range decaying interactions

Figure 2 Entropy as a function of the word length for spin chains with differentinteractions Notice that all lines start from S(N) D log2 2 D 1 since at the valuesof the coupling we investigated the correlation length is much smaller than thechain length (1 cent 109 spins)

We plot S(N) for all these cases in Figure 2, and of course the asymptotically linear behavior is evident: the extensive entropy shows no qualitative distinction among the three cases we consider.

The situation changes drastically if we remove the asymptotic linear contribution and focus on the corrections to extensive behavior. Specifically, we write S(N) = S_0 · N + S_1(N) and plot only the sublinear component S_1(N) of the entropy. As we see in Figure 3, the three chains then exhibit qualitatively different features: for the first one, S_1 is constant; for the second one, it is logarithmic; and for the third one, it clearly shows a power-law behavior.

What is the significance of these observations? Of course, the differences in the behavior of S_1(N) must be related to the ways we chose the J's for the simulations. In the first case, J is fixed, and if we see N spins and try to predict the state of the (N + 1)st spin, all that really matters is the state of the spin s_N; there is nothing to "learn" from observations on longer segments of the chain. For the second chain, J changes, and the statistics of the spin words are different in different parts of the sequence. By looking at these statistics, one can "learn" the coupling at the current position; this estimate improves the more spins (longer words) we observe. Finally, in the third case there are many coupling constants that can be learned. As N increases, one becomes sensitive to weaker correlations caused by interactions over larger and larger distances.


Figure 3: Subextensive part of the entropy as a function of the word length, for the three chains of Figure 2 (fixed J; variable J, short-range interactions; variable J's with long-range, decaying interactions), together with fits of the form S_1 = const., S_1 = const._1 + (1/2) log_2 N, and S_1 = const._1 + const._2 N^{0.5}.

So, intuitively, the qualitatively different behaviors of S_1(N) in the three plotted cases are correlated with differences in the problem of learning the underlying dynamics of the spin chains from observations on samples of the spins themselves. Much of this article can be seen as expanding on and quantifying this intuitive observation.

3 Fundamentals

The problem of prediction comes in various forms, as noted in the Introduction. Information theory allows us to treat the different notions of prediction on the same footing. The first step is to recognize that all predictions are probabilistic. Even if we can predict the temperature at noon tomorrow, we should provide error bars or confidence limits on our prediction. The next step is to remember that even before we look at the data, we know that certain futures are more likely than others, and we can summarize this knowledge by a prior probability distribution for the future. Our observations on the past lead us to a new, more tightly concentrated distribution: the distribution of futures conditional on the past data. Different kinds of predictions are different slices through or averages over this conditional distribution, but information theory quantifies the "concentration" of the distribution without making any commitment as to which averages will be most interesting.


Imagine that we observe a stream of data x(t) over a time interval −T < t < 0. Let all of these past data be denoted by the shorthand x_past. We are interested in saying something about the future, so we want to know about the data x(t) that will be observed in the time interval 0 < t < T'; let these future data be called x_future. In the absence of any other knowledge, futures are drawn from the probability distribution P(x_future), and observations of particular past data x_past tell us that futures will be drawn from the conditional distribution P(x_future|x_past). The greater concentration of the conditional distribution can be quantified by the fact that it has smaller entropy than the prior distribution, and this reduction in entropy is Shannon's definition of the information that the past provides about the future. We can write the average of this predictive information as

I_pred(T, T') = ⟨ log_2 [ P(x_future | x_past) / P(x_future) ] ⟩   (3.1)
             = −⟨log_2 P(x_future)⟩ − ⟨log_2 P(x_past)⟩ − [ −⟨log_2 P(x_future, x_past)⟩ ],   (3.2)

where ⟨· · ·⟩ denotes an average over the joint distribution of the past and the future, P(x_future, x_past).

Each of the terms in equation 3.2 is an entropy. Since we are interested in predictability or generalization, which are associated with some features of the signal persisting forever, we may assume stationarity or invariance under time translations. Then the entropy of the past data depends only on the duration of our observations, so we can write −⟨log_2 P(x_past)⟩ = S(T), and by the same argument −⟨log_2 P(x_future)⟩ = S(T'). Finally, the entropy of the past and the future taken together is the entropy of observations on a window of duration T + T', so that −⟨log_2 P(x_future, x_past)⟩ = S(T + T'). Putting these equations together, we obtain

I_pred(T, T') = S(T) + S(T') − S(T + T').   (3.3)

It is important to recall that mutual information is a symmetric quantity. Thus we can view I_pred(T, T') as either the information that a data segment of duration T provides about the future of length T', or the information that a data segment of duration T' provides about the immediate past of duration T. This is a direct application of the definition of information, but it seems counterintuitive. Shouldn't it be more difficult to predict than to postdict? One can perhaps recover the correct intuition by thinking about a large ensemble of question-answer pairs. Prediction corresponds to generating the answer to a given question, while postdiction corresponds to generating the question that goes with a given answer. We know that guessing questions given answers is also a hard problem,3 and it can be just as challenging as the more conventional problem of answering the questions themselves. Our focus here is on prediction because we want to make connections with the phenomenon of generalization in learning, but it is clear that generating a consistent interpretation of observed data may involve elements of both prediction and postdiction (see, for example, Eagleman & Sejnowski, 2000); it is attractive that the information-theoretic formulation treats these problems symmetrically.

3 This is the basis of the popular American television game show Jeopardy!, widely viewed as the most "intellectual" of its genre.

In the same way that the entropy of a gas at fixed density is proportional to the volume, the entropy of a time series (asymptotically) is proportional to its duration, so that lim_{T→∞} S(T)/T = S_0; entropy is an extensive quantity. But from equation 3.3, any extensive component of the entropy cancels in the computation of the predictive information: predictability is a deviation from extensivity. If we write S(T) = S_0 T + S_1(T), then equation 3.3 tells us that the predictive information is related only to the nonextensive term S_1(T). Note that if we are observing a deterministic system, then S_0 = 0, but this is independent of questions about the structure of the subextensive term S_1(T). It is attractive that information theory gives us a unified discussion of prediction in deterministic and probabilistic cases.
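Equation 3.3 also suggests a direct numerical estimate of the predictive information from word entropies. The sketch below is our own construction: the entropy estimator S, the helper I_pred, and the binary chain with one-step memory used as test data are illustrative choices, not part of the original analysis.

import numpy as np

def S(x, N):
    """Entropy (bits) of length-N words in the binary sequence x."""
    w = np.lib.stride_tricks.sliding_window_view(x, N) @ (2 ** np.arange(N))
    p = np.bincount(w) / (len(x) - N + 1)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def I_pred(x, N_past, N_future):
    """Equation 3.3: information that the past N_past symbols carry about
    the next N_future symbols, from empirical word entropies."""
    return S(x, N_past) + S(x, N_future) - S(x, N_past + N_future)

# test on a binary chain with one-step memory (flip probability 0.1);
# here the estimate saturates quickly as the windows grow
rng = np.random.default_rng(1)
flips = rng.random(10**6) < 0.1
x = (np.cumsum(flips) % 2).astype(np.int64)
print([round(I_pred(x, n, n), 3) for n in (1, 2, 4, 8)])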

We know two general facts about the behavior of S_1(T). First, the corrections to extensive behavior are positive, S_1(T) ≥ 0. Second, the statement that entropy is extensive is the statement that the limit

lim_{T→∞} S(T)/T = S_0   (3.4)

exists, and for this to be true we must also have

lim_{T→∞} S_1(T)/T = 0.   (3.5)

Thus the nonextensive terms in the entropy must be subextensive, that is, they must grow with T less rapidly than a linear function. Taken together, these facts guarantee that the predictive information is positive and subextensive. Furthermore, if we let the future extend forward for a very long time, T' → ∞, then we can measure the information that our sample provides about the entire future:

I_pred(T) = lim_{T'→∞} I_pred(T, T') = S_1(T).   (3.6)

Similarly, instead of increasing the duration of the future to infinity, we could have considered the mutual information between a sample of length T and all of the infinite past. Then the postdictive information also is equal to S_1(T), and the symmetry between prediction and postdiction is even more profound: not only is there symmetry between questions and answers, but observations on a given period of time provide the same amount of information about the historical path that led to our observations as about the


future that will unfold from them. In some cases, this statement becomes even stronger. For example, if the subextensive entropy of a long discontinuous observation of total length T, with a gap of duration δT ≪ T, is equal to S_1(T) + O(δT/T), then the subextensive entropy of the present is not only its information about the past or the future but also the information about the past and the future.

If we have been observing a time series for a (long) time T, then the total amount of data we have collected is measured by the entropy S(T), and at large T this is given approximately by S_0 T. But the predictive information that we have gathered cannot grow linearly with time, even if we are making predictions about a future that stretches out to infinity. As a result, of the total information we have taken in by observing x_past, only a vanishing fraction is of relevance to the prediction:

lim_{T→∞} (Predictive information)/(Total information) = lim_{T→∞} I_pred(T)/S(T) → 0.   (3.7)

In this precise sense, most of what we observe is irrelevant to the problem of predicting the future.4

4 We can think of equation 3.7 as a law of diminishing returns. Although we collect data in proportion to our observation time T, a smaller and smaller fraction of this information is useful in the problem of prediction. These diminishing returns are not due to a limited lifetime, since we calculate the predictive information assuming that we have a future extending forward to infinity. A senior colleague points out that this is an argument for changing fields before becoming too expert.

Consider the case where time is measured in discrete steps, so that we have seen N time points x_1, x_2, …, x_N. How much have we learned about the underlying pattern in these data? The more we know, the more effectively we can predict the next data point x_{N+1}, and hence the fewer bits we will need to describe the deviation of this data point from our prediction. Our accumulated knowledge about the time series is measured by the degree to which we can compress the description of new observations. On average, the length of the code word required to describe the point x_{N+1}, given that we have seen the previous N points, is given by

ℓ(N) = −⟨log_2 P(x_{N+1} | x_1, x_2, …, x_N)⟩ bits,   (3.8)

where the expectation value is taken over the joint distribution of all the N + 1 points, P(x_1, x_2, …, x_N, x_{N+1}). It is easy to see that

ℓ(N) = S(N + 1) − S(N) ≈ ∂S(N)/∂N.   (3.9)

As we observe for longer times, we learn more, and this word length decreases. It is natural to define a learning curve that measures this improvement. Usually we define learning curves by measuring the frequency or costs of errors; here the cost is that our encoding of the point x_{N+1} is longer than it could be if we had perfect knowledge. This ideal encoding has a length that we can find by imagining that we observe the time series for an infinitely long time, ℓ_ideal = lim_{N→∞} ℓ(N), but this is just another way of defining the extensive component of the entropy S_0. Thus we can define a learning curve

Λ(N) ≡ ℓ(N) − ℓ_ideal   (3.10)
     = S(N + 1) − S(N) − S_0
     = S_1(N + 1) − S_1(N)
     ≈ ∂S_1(N)/∂N = ∂I_pred(N)/∂N,   (3.11)

and we see once again that the extensive component of the entropy cancels.

It is well known that the problems of prediction and compression are related, and what we have done here is to illustrate one aspect of this connection. Specifically, if we ask how much one segment of a time series can tell us about the future, the answer is contained in the subextensive behavior of the entropy. If we ask how much we are learning about the structure of the time series, then the natural and universally defined learning curve is related again to the subextensive entropy: the learning curve is the derivative of the predictive information.

This universal learning curve is connected to the more conventional learning curves in specific contexts. As an example (cf. section 4.1), consider fitting a set of data points {x_n, y_n} with some class of functions y = f(x; α), where α are unknown parameters that need to be learned; we also allow for some gaussian noise in our observation of the y_n. Here the natural learning curve is the evolution of χ² for generalization as a function of the number of examples. Within the approximations discussed below, it is straightforward to show that as N becomes large,

⟨χ²(N)⟩ = (1/σ²) ⟨ [y − f(x; α)]² ⟩ → (2 ln 2) Λ(N) + 1,   (3.12)

where σ² is the variance of the noise. Thus a more conventional measure of performance at learning a function is equal to the universal learning curve defined purely by information-theoretic criteria. In other words, if a learning curve is measured in the right units, then its integral represents the amount of useful information accumulated. Then the subextensivity of S_1 guarantees that the learning curve decreases to zero as N → ∞.

Different quantities related to the subextensive entropy have been discussed in several contexts. For example, the code length ℓ(N) has been defined as a learning curve in the specific case of neural networks (Opper & Haussler, 1995) and has been termed the "thermodynamic dive" (Crutchfield & Shalizi, 1999) and the "Nth-order block entropy" (Grassberger, 1986).


The universal learning curve Λ(N) has been studied as the expected instantaneous information gain by Haussler, Kearns, & Schapire (1994). Mutual information between all of the past and all of the future (both semi-infinite) is known also as the excess entropy, effective measure complexity, stored information, and so on (see Shalizi & Crutchfield, 1999, and references therein, as well as the discussion below). If the data allow a description by a model with a finite (and in some cases also infinite) number of parameters, then mutual information between the data and the parameters is of interest. This is easily shown to be equal to the predictive information about all of the future, and it is also the cumulative information gain (Haussler, Kearns, & Schapire, 1994) or the cumulative relative entropy risk (Haussler & Opper, 1997). Investigation of this problem can be traced back to Rényi (1964) and Ibragimov and Hasminskii (1972), and some particular topics are still being discussed (Haussler & Opper, 1995; Opper & Haussler, 1995; Herschkowitz & Nadal, 1999). In certain limits, decoding a signal from a population of N neurons can be thought of as "learning" a parameter from N observations, with a parallel notion of information transmission (Brunel & Nadal, 1998; Kang & Sompolinsky, 2001). In addition, the subextensive component of the description length (Rissanen, 1978, 1989, 1996; Clarke & Barron, 1990), averaged over a class of allowed models, also is similar to the predictive information. What is important is that the predictive information or subextensive entropy is related to all these quantities, and that it can be defined for any process without a reference to a class of models. It is this universality that we find appealing, and this universality is strongest if we focus on the limit of long observation times. Qualitatively, in this regime (T → ∞), we expect the predictive information to behave in one of three different ways, as illustrated by the Ising models above: it may either stay finite or grow to infinity together with T; in the latter case, the rate of growth may be slow (logarithmic) or fast (sublinear power) (see Barron and Cover, 1991, for a similar classification in the framework of the minimal description length [MDL] analysis).

The first possibility, lim_{T→∞} I_pred(T) = constant, means that no matter how long we observe, we gain only a finite amount of information about the future. This situation prevails, for example, when the dynamics are too regular. For a purely periodic system, complete prediction is possible once we know the phase, and if we sample the data at discrete times, this is a finite amount of information; longer period orbits intuitively are more complex and also have larger I_pred, but this does not change the limiting behavior lim_{T→∞} I_pred(T) = constant.

Alternatively, the predictive information can be small when the dynamics are irregular, but the best predictions are controlled only by the immediate past, so that the correlation times of the observable data are finite (see, for example, Crutchfield & Feldman, 1997, and the fixed-J case in Figure 3). Imagine, for example, that we observe x(t) at a series of discrete times {t_n}, and that at each time point we find the value x_n. Then we always can write


the joint distribution of the N data points as a product

P(x_1, x_2, …, x_N) = P(x_1) P(x_2 | x_1) P(x_3 | x_2, x_1) · · · .   (3.13)

For Markov processes, what we observe at t_n depends only on events at the previous time step, t_{n−1}, so that

P(x_n | {x_{1≤i≤n−1}}) = P(x_n | x_{n−1}),   (3.14)

and hence the predictive information reduces to

I_pred = ⟨ log_2 [ P(x_n | x_{n−1}) / P(x_n) ] ⟩.   (3.15)

The maximum possible predictive information in this case is the entropy of the distribution of states at one time step, which is bounded by the logarithm of the number of accessible states. To approach this bound, the system must maintain memory for a long time, since the predictive information is reduced by the entropy of the transition probabilities. Thus, systems with more states and longer memories have larger values of I_pred.
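For a Markov chain specified by its transition matrix, equation 3.15 can be evaluated in closed form as the mutual information between successive states. A small sketch of this calculation (our own illustration; the function name markov_Ipred and the example matrix are arbitrary):

import numpy as np

def markov_Ipred(T):
    """I_pred of equation 3.15 for a Markov chain with transition matrix
    T[i, j] = P(x_n = j | x_{n-1} = i): the mutual information (bits)
    between consecutive states."""
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = np.abs(pi) / np.abs(pi).sum()            # stationary distribution
    joint = pi[:, None] * T                       # P(x_{n-1} = i, x_n = j)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log2(joint / (pi[:, None] * pi[None, :]))
    return float(np.nansum(terms))

T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(markov_Ipred(T))   # bounded above by the entropy of the stationary distribution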

More interesting are those cases in which I_pred(T) diverges at large T. In physical systems, we know that there are critical points where correlation times become infinite, so that optimal predictions will be influenced by events in the arbitrarily distant past. Under these conditions, the predictive information can grow without bound as T becomes large; for many systems, the divergence is logarithmic, I_pred(T → ∞) ∝ ln T, as for the variable-J_ij, short-range Ising model of Figures 2 and 3. Long-range correlations also are important in a time series where we can learn some underlying rules. It will turn out that when the set of possible rules can be described by a finite number of parameters, the predictive information again diverges logarithmically, and the coefficient of this divergence counts the number of parameters. Finally, a faster growth is also possible, so that I_pred(T → ∞) ∝ T^a, as for the variable-J_ij, long-range Ising model, and we shall see that this behavior emerges from, for example, nonparametric learning problems.

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations or associations persist over long periods of time. In the usual theoretical models, there is some rule underlying the observable data, and this rule is valid forever; examples seen at one time inform us about the rule, and this information can be used to make predictions or generalizations. The predictive information quantifies the average generalization power of examples, and we shall see that there is a direct connection between the predictive information and the complexity of the possible underlying rules.


4.1 A Test Case. Let us begin with a simple example already referred to above. We observe two streams of data x and y, or equivalently a stream of pairs (x_1, y_1), (x_2, y_2), …, (x_N, y_N). Assume that we know in advance that the x's are drawn independently and at random from a distribution P(x), while the y's are noisy versions of some function acting on x,

y_n = f(x_n; α) + η_n,   (4.1)

where f(x; α) is a class of functions parameterized by α, and η_n is noise, which for simplicity we will assume is gaussian with known standard deviation σ. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

f(x; α) = Σ_{µ=1}^{K} α_µ φ_µ(x).   (4.2)

The usual problem is to estimate, from N pairs {x_i, y_i}, the values of the parameters α. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy S(N). We go through the calculation in some detail because it provides a model for the more general case.

To evaluate the entropy S(N), we first construct the probability distribution P(x_1, y_1, x_2, y_2, …, x_N, y_N). The same set of rules applies to the whole data stream, which here means that the same parameters α apply for all pairs {x_i, y_i}, but these parameters are chosen at random from a distribution 𝒫(α) at the start of the stream. Thus we write

P(x_1, y_1, x_2, y_2, …, x_N, y_N) = ∫ d^K α P(x_1, y_1, x_2, y_2, …, x_N, y_N | α) 𝒫(α),   (4.3)

and now we need to construct the conditional distributions for fixed α. By hypothesis, each x is chosen independently, and once we fix α, each y_i is correlated only with the corresponding x_i, so that we have

P(x_1, y_1, x_2, y_2, …, x_N, y_N | α) = Π_{i=1}^{N} [ P(x_i) P(y_i | x_i; α) ].   (4.4)

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of y_i has the form

P(y_i | x_i; α) = (1/√(2πσ²)) exp[ −(1/2σ²) ( y_i − Σ_{µ=1}^K α_µ φ_µ(x_i) )² ].   (4.5)


Putting all these factors together

P(x_1, y_1, x_2, y_2, …, x_N, y_N)
  = [ Π_{i=1}^N P(x_i) ] (1/√(2πσ²))^N ∫ d^K α 𝒫(α) exp[ −(1/2σ²) Σ_{i=1}^N y_i² ]
    × exp[ −(N/2) Σ_{µ,ν=1}^K A_{µν}({x_i}) α_µ α_ν + N Σ_{µ=1}^K B_µ({x_i, y_i}) α_µ ],   (4.6)

where

A_{µν}({x_i}) = (1/σ²N) Σ_{i=1}^N φ_µ(x_i) φ_ν(x_i), and   (4.7)

B_µ({x_i, y_i}) = (1/σ²N) Σ_{i=1}^N y_i φ_µ(x_i).   (4.8)

Our placement of the factors of N means that both A_{µν} and B_µ are of order unity as N → ∞. These quantities are empirical averages over the samples {x_i, y_i}, and if the φ_µ are well behaved, we expect these empirical means to converge to expectation values for most realizations of the series {x_i}:

lim_{N→∞} A_{µν}({x_i}) = A^∞_{µν} = (1/σ²) ∫ dx P(x) φ_µ(x) φ_ν(x),   (4.9)

lim_{N→∞} B_µ({x_i, y_i}) = B^∞_µ = Σ_{ν=1}^K A^∞_{µν} ᾱ_ν,   (4.10)

where ᾱ are the parameters that actually gave rise to the data stream {x_i, y_i}. In fact, we can make the same argument about the terms in Σ_i y_i²:

lim_{N→∞} Σ_{i=1}^N y_i² = N σ² [ Σ_{µ,ν=1}^K ᾱ_µ A^∞_{µν} ᾱ_ν + 1 ].   (4.11)

Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for, and implications of, this convergence breaking down.

Putting the different factors together we obtain

P(x_1, y_1, x_2, y_2, …, x_N, y_N) ≅ [ Π_{i=1}^N P(x_i) ] (1/√(2πσ²))^N ∫ d^K α 𝒫(α) exp[ −N E_N(α; {x_i, y_i}) ],   (4.12)


where the effective "energy" per sample is given by

E_N(α; {x_i, y_i}) = 1/2 + (1/2) Σ_{µ,ν=1}^K ( α_µ − ᾱ_µ ) A^∞_{µν} ( α_ν − ᾱ_ν ).   (4.13)

Here we use the symbol ≅ to indicate that we not only take the limit of large N but also neglect the fluctuations. Note that in this approximation, the dependence on the sample points themselves is hidden in the definition of ᾱ as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor N in the exponent; the energy E_N is of order unity as N → ∞. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

∫ d^K α 𝒫(α) exp[ −N E_N(α; {x_i, y_i}) ]
  ≈ 𝒫(α_cl) exp[ −N E_N(α_cl; {x_i, y_i}) − (K/2) ln(N/2π) − (1/2) ln det F_N + · · · ],   (4.14)

where α_cl is the "classical" value of α determined by the extremal conditions

∂E_N(α; {x_i, y_i}) / ∂α_µ |_{α = α_cl} = 0;   (4.15)

the matrix F_N consists of the second derivatives of E_N,

(F_N)_{µν} = ∂²E_N(α; {x_i, y_i}) / ∂α_µ ∂α_ν |_{α = α_cl};   (4.16)

and · · · denotes terms that vanish as N → ∞. If we formulate the problem of estimating the parameters α from the samples {x_i, y_i}, then as N → ∞ the matrix N F_N is the Fisher information matrix (Cover & Thomas, 1991); the eigenvectors of this matrix give the principal axes for the error ellipsoid in parameter space, and the (inverse) eigenvalues give the variances of parameter estimates along each of these directions. The classical α_cl differs from ᾱ only in terms of order 1/N; we neglect this difference and further simplify the calculation of leading terms as N becomes large. After a little more algebra, we find the probability distribution we have been looking for:

P(x_1, y_1, x_2, y_2, …, x_N, y_N)
  ≅ [ Π_{i=1}^N P(x_i) ] (1/Z_A) 𝒫(ᾱ) exp[ −(N/2) ln(2πeσ²) − (K/2) ln N + · · · ],   (4.17)


where the normalization constant

Z_A = √( (2π)^K det A^∞ ).   (4.18)

Again we note that the sample points {x_i, y_i} are hidden in the value of ᾱ that gave rise to these points.5

To evaluate the entropy S(N), we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the x_i, and because these are chosen independently from the distribution P(x), the average again is easy to evaluate. The third term involves ᾱ, and we need to average this over the joint distribution P(x_1, y_1, x_2, y_2, …, x_N, y_N). As above, we can evaluate this average in steps: first, we choose a value of the parameters ᾱ; then we average over the samples given these parameters; and finally we average over parameters. But because ᾱ is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

S(N) = N [ S_x + (1/2) log_2(2πeσ²) ] + (K/2) log_2 N + S_α + ⟨log_2 Z_A⟩_α + · · · ,   (4.19)

where ⟨· · ·⟩_α means averaging over parameters, S_x is the entropy of the distribution of x,

S_x = −∫ dx P(x) log_2 P(x),   (4.20)

and similarly for the entropy of the distribution of parameters,

S_α = −∫ d^K α 𝒫(α) log_2 𝒫(α).   (4.21)

5 We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points {x_i, y_i}. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as N → ∞, so that this approach should yield the asymptotic behavior of the distribution and hence the subextensive entropy at large N. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as N → ∞, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.


The different terms in the entropy, equation 4.19, have a straightforward interpretation. First, we see that the extensive term in the entropy,

S_0 = S_x + (1/2) log_2(2πeσ²),   (4.22)

reflects contributions from the random choice of x and from the gaussian noise in y; these extensive terms are independent of the variations in parameters α, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, S_α. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term ⟨log_2 Z_A⟩_α compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

S_1(N) → (K/2) log_2 N (bits).   (4.23)

The coefficient of this divergence counts the number of parameters, independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions {φ_µ(x)}. This is a hint that the result in equation 4.23 is more universal than our simple example.
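The logarithmic growth in equation 4.23 is easy to check numerically. For the gaussian model of equations 4.1, 4.2, and 4.5 with a unit-variance gaussian prior on the coefficients, the information that N examples carry about the parameters has the closed form (1/2) log_2 det(I + Φ^T Φ/σ²) bits, where Φ is the design matrix. The sketch below is our own check; the monomial basis, σ = 0.1, and uniformly drawn x are arbitrary choices not taken from the paper.

import numpy as np

def info_about_parameters(N, K, sigma=0.1, rng=None):
    """Bits of information that N noisy samples of a K-term basis expansion
    (unit-variance gaussian prior on the coefficients, noise variance sigma**2)
    provide about those coefficients: (1/2) * log2 det(I + Phi^T Phi / sigma^2)."""
    rng = rng or np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=N)
    Phi = np.stack([x**m for m in range(K)], axis=1)   # monomial basis, an arbitrary choice
    M = np.eye(K) + Phi.T @ Phi / sigma**2
    return 0.5 * np.linalg.slogdet(M)[1] / np.log(2.0)

K = 3
for N in (100, 1000, 10000, 100000):
    print(N, round(info_about_parameters(N, K), 2))
# successive differences approach (K/2) * log2(10), about 5 bits per decade of N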

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning: we are given examples of how the points x_n map into y_n, and from these examples we are to induce the association or functional relation between x and y. An alternative view is that the pair of points (x, y) should be viewed as a vector x⃗, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables {x⃗_i} is drawn independently from the same probability distribution Q(x⃗|α), and this distribution depends on a (potentially infinite-dimensional) vector of parameters α. As above, the parameters are unknown, and before the series starts, they are chosen randomly from a distribution 𝒫(α). With no constraints on the densities 𝒫(α) or Q(x⃗|α), it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters α, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997, and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).


We begin with the definition of entropy:

S(N) ≡ S[{x⃗_i}] = −∫ dx⃗_1 · · · dx⃗_N P(x⃗_1, x⃗_2, …, x⃗_N) log_2 P(x⃗_1, x⃗_2, …, x⃗_N).   (4.24)

By analogy with equation 4.3, we then write

P(x⃗_1, x⃗_2, …, x⃗_N) = ∫ d^K α 𝒫(α) Π_{i=1}^N Q(x⃗_i | α).   (4.25)

Next, combining equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite S(N) as

S(N) = −∫ d^K ᾱ 𝒫(ᾱ) { ∫ dx⃗_1 · · · dx⃗_N Π_{j=1}^N Q(x⃗_j | ᾱ) log_2 P({x⃗_i}) }.   (4.26)

Equation 4.26 allows an easy interpretation. There is the "true" set of parameters ᾱ that gave rise to the data sequence x⃗_1, …, x⃗_N with the probability Π_{j=1}^N Q(x⃗_j | ᾱ). We need to average log_2 P(x⃗_1, …, x⃗_N) first over all possible realizations of the data keeping the true parameters fixed, and then over the parameters ᾱ themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

P(x⃗_1, …, x⃗_N) = Π_{j=1}^N Q(x⃗_j | ᾱ) ∫ d^K α 𝒫(α) Π_{i=1}^N [ Q(x⃗_i | α) / Q(x⃗_i | ᾱ) ]
               = Π_{j=1}^N Q(x⃗_j | ᾱ) ∫ d^K α 𝒫(α) exp[ −N E_N(α; {x⃗_i}) ],   (4.27)

E_N(α; {x⃗_i}) = −(1/N) Σ_{i=1}^N ln [ Q(x⃗_i | α) / Q(x⃗_i | ᾱ) ].   (4.28)

Since by our interpretation ᾱ are the true parameters that gave rise to the particular data {x⃗_i}, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

E_N(α; {x⃗_i}) = −∫ d^D x Q(x⃗ | ᾱ) ln [ Q(x⃗ | α) / Q(x⃗ | ᾱ) ] − ψ(α, ᾱ; {x⃗_i}),   (4.29)

where ψ → 0 as N → ∞; here we neglect ψ, and return to this term below.


The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, D_KL(ᾱ‖α), between the true distribution characterized by parameters ᾱ and the possible distribution characterized by α. Thus, at large N we have

P(x⃗_1, x⃗_2, …, x⃗_N) ≅ Π_{j=1}^N Q(x⃗_j | ᾱ) ∫ d^K α 𝒫(α) exp[ −N D_KL(ᾱ‖α) ],   (4.30)

where again the notation ≅ reminds us that we are not only taking the limit of large N but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find


S(N) ≈ S_0 · N + S_1^{(a)}(N),   (4.31)

S_0 = ∫ d^K α 𝒫(α) [ −∫ d^D x Q(x⃗ | α) log_2 Q(x⃗ | α) ], and   (4.32)

S_1^{(a)}(N) = −∫ d^K ᾱ 𝒫(ᾱ) log_2 [ ∫ d^K α 𝒫(α) e^{−N D_KL(ᾱ‖α)} ].   (4.33)

Here S_1^{(a)} is an approximation to S_1 that neglects the fluctuations ψ. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence x⃗_1, …, x⃗_N with the disorder, E_N(α; {x⃗_i}) with the energy of the quenched system, and D_KL(ᾱ‖α) with its annealed analog.

The extensive term S_0, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the N dependence of the partition function,

Z(ᾱ; N) = ∫ d^K α 𝒫(α) exp[ −N D_KL(ᾱ‖α) ],   (4.34)

and S_1^{(a)}(N) = −⟨log_2 Z(ᾱ; N)⟩_ᾱ is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that lie at a KL divergence D away from the target ᾱ,

ρ(D; ᾱ) = ∫ d^K α 𝒫(α) δ[ D − D_KL(ᾱ‖α) ].   (4.35)


This density could be very different for different targets.6 The density of divergences is normalized because the original distribution over parameter space, 𝒫(α), is normalized:

∫ dD ρ(D; ᾱ) = ∫ d^K α 𝒫(α) = 1.   (4.36)

Finally the partition function takes the simple form

Z(ᾱ; N) = ∫ dD ρ(D; ᾱ) exp[−N D].   (4.37)

We recall that in statistical mechanics the partition function is given by

Z(β) = ∫ dE ρ(E) exp[−βE],   (4.38)

where ρ(E) is the density of states that have energy E, and β is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length N of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, by the behavior of ρ(D; ᾱ) as D → 0.7

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small D, and the density term, which favors values of D that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with D → 0. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states ρ(D; ᾱ) at small D is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

D_KL(ᾱ‖α) ≈ (1/2) Σ_{µν} ( ᾱ_µ − α_µ ) F_{µν} ( ᾱ_ν − α_ν ) + · · · ,   (4.39)

6 If parameter space is compact, then a related description of the space of targets, based on metric entropy, also called Kolmogorov's ε-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of size not greater than ε into which the target space can be partitioned.

7 It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions, rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.


where F is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low-D limit of the density ρ(D; ᾱ):

ρ(D; ᾱ) = ∫ d^K α 𝒫(α) δ[ D − D_KL(ᾱ‖α) ]
        ≈ ∫ d^K α 𝒫(α) δ[ D − (1/2) Σ_{µν} ( ᾱ_µ − α_µ ) F_{µν} ( ᾱ_ν − α_ν ) ]
        = ∫ d^K ξ 𝒫(ᾱ + U · ξ) δ[ D − (1/2) Σ_µ Λ_µ ξ_µ² ],   (4.40)

where U is a matrix that diagonalizes F,

( U^T · F · U )_{µν} = Λ_µ δ_{µν}.   (4.41)

The delta function restricts the components of ξ in equation 4.40 to be of order √D or less, and so, if 𝒫(α) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

ρ(D → 0; ᾱ) ≈ 𝒫(ᾱ) · (2π)^{K/2} / Γ(K/2) · (det F)^{−1/2} D^{(K−2)/2}.   (4.42)

Here, as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

Z(ᾱ; N → ∞) ≈ f(ᾱ) · Γ(K/2) / N^{K/2},   (4.43)

where f(ᾱ) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

S_1^{(a)}(N) = −∫ d^K ᾱ 𝒫(ᾱ) log_2 Z(ᾱ; N)   (4.44)
            → (K/2) log_2 N + · · · (bits),   (4.45)

where · · · are finite as N → ∞. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even


of the prior distribution 𝒫(α) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
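The N^{−K/2} behavior of the partition function, which is what produces equation 4.45, can also be checked by Monte Carlo for a family in which D_KL is known exactly. The sketch below is our own toy example, not from the paper: a K-dimensional gaussian location family with unit covariance, a standard gaussian prior, and target ᾱ = 0, for which Z(ᾱ; N) = (1 + N)^{−K/2} exactly.

import numpy as np

def Z_mc(N, K, n_samples=200000, rng=None):
    """Monte Carlo estimate of Z(alpha_bar; N) = <exp(-N * D_KL)>_prior for a
    K-dimensional gaussian location family with unit covariance, standard
    gaussian prior, and target alpha_bar = 0, where D_KL = |alpha|^2 / 2."""
    rng = rng or np.random.default_rng(0)
    alpha = rng.standard_normal((n_samples, K))
    D = 0.5 * np.sum(alpha**2, axis=1)
    return float(np.mean(np.exp(-N * D)))

K = 2
for N in (10, 100, 1000):
    print(N, Z_mc(N, K), (1.0 + N) ** (-K / 2))   # exact value for this toy family
# so -log2 Z grows as (K/2) log2 N, the subextensive entropy of equation 4.45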

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension, d_VC), which measures the capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of F is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < ∞ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small-D_KL behavior of the model density,

ρ(D → 0; ᾱ) ∝ D^{(d−2)/2},   (4.46)

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density ρ(D; ᾱ) accumulates a macroscopic weight at some nonzero value of D, so that the small-D behavior is irrelevant for the large-N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small-D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_pred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N ≫ d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution


Q(x⃗_1, …, x⃗_N | α). Again, α is an unknown parameter vector, chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log_2 N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S[{x⃗_i} | α] ≡ −∫ d^N x⃗ Q({x⃗_i} | α) log_2 Q({x⃗_i} | α) → N S_0 + S*_0,   S*_0 = O(1);   (4.47)

D_KL [ Q({x⃗_i} | ᾱ) ‖ Q({x⃗_i} | α) ] → N D_KL(ᾱ‖α) + o(N).   (4.48)

Here the quantities S_0, S*_0, and D_KL(ᾱ‖α) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data. All of the long-range predictability is associated with learning the parameters.8 The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result

S(N) = S_0 · N + S_1^{(a)}(N),   (4.49)

S_1^{(a)}(N) = (K/2) log_2 N + · · · (bits),   (4.50)

where · · · stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(x⃗_1, …, x⃗_N | α). Suppose, for example, that S*_0 = (K*/2) log_2 N. Then the subextensive entropy becomes

S_1^{(a)}(N) = [(K + K*)/2] log_2 N + · · · (bits).   (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α; there are just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K* ≠ 0 but K = 0) for one in which there are unknown parameters describing the system, but given these parameters there are no long-range correlations (K ≠ 0, K* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27 and 4.29 and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = −∫ d^K ᾱ 𝒫(ᾱ) ∫ Π_{j=1}^N [ dx⃗_j Q(x⃗_j | ᾱ) ]
       × log_2 { Π_{i=1}^N Q(x⃗_i | ᾱ) ∫ d^K α 𝒫(α) e^{ −N D_KL(ᾱ‖α) + N ψ(α, ᾱ; {x⃗_i}) } }.   (4.52)

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 · N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)} = −∫ d^K ᾱ 𝒫(ᾱ) ∫ Π_{j=1}^N [ dx⃗_j Q(x⃗_j | ᾱ) ]
          × log_2 [ (1/Z(ᾱ; N)) ∫ d^K α 𝒫(α) e^{ −N D_KL(ᾱ‖α) + N ψ(α, ᾱ; {x⃗_i}) } ],   (4.54)

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of −S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that, for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) ≤ N ∫ d^K α ∫ d^K ᾱ 𝒫(α) 𝒫(ᾱ) D_KL(ᾱ‖α) ≡ ⟨ N D_KL(ᾱ‖α) ⟩_{ᾱ,α},   (4.55)

so that if we have a class of models (and a prior 𝒫(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average


KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱ,α} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(x⃗ | α)} is finite. Then for any reasonable function F(x⃗; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P { sup_β | [ (1/N) Σ_j F(x⃗_j; β) − ∫ dx⃗ Q(x⃗ | ᾱ) F(x⃗; β) ] / L[F] | > ε } < M(ε, N, d_VC) e^{−c N ε²},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality, in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that, at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {x⃗_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x⃗_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus the worst-case contribution of these atypical sequences is


S_1^{(f), atypical} ∼ ∫_δ^∞ dε N²ε² M(ε) e^{−cNε²} ∼ e^{−cNδ²} ≪ 1 for large N.   (4.57)


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and, throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f), typical} ∼ ∫ d^K ᾱ 𝒫(ᾱ) ∫ d^N x⃗ Π_j Q(x⃗_j | ᾱ) N ( B · A^{−1} · B );

B_µ = (1/N) Σ_i ∂ log Q(x⃗_i | ᾱ) / ∂ᾱ_µ,   ⟨B⟩_x⃗ = 0;

(A)_{µν} = −(1/N) Σ_i ∂² log Q(x⃗_i | ᾱ) / ∂ᾱ_µ ∂ᾱ_ν,   ⟨A⟩_x⃗ = F.   (4.58)

Here ⟨· · ·⟩_x⃗ means an averaging with respect to all the x⃗_i's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and the second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{−1} with F^{−1}(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

[ ∂ log Q(x⃗_i | ᾱ)/∂ᾱ_µ ] (F^{−1})_{µν} [ ∂ log Q(x⃗_j | ᾱ)/∂ᾱ_ν ]   (4.59)

over all the x⃗_i's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N^{−1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior 𝒫(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ · · · can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior 𝒫(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|a) = (1/√(2π)) exp[ −(1/2)(x − a)² ],   x ∈ (−∞, +∞);   (4.60)

Q(x|a) = exp(−sin ax) / ∫_0^{2π} dx exp(−sin ax),   x ∈ [0, 2π).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior 𝒫(a). The second one does not allow this. Suppose that the prior is uniform in a box 0 < a < a_max and zero elsewhere, with a_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin ax. Adding a new data point would not help until that best, but wrong, parameter estimate is less than a_max.10 So the fluctuations are large, and the predictive information is small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of a would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, & Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for a_max → ∞ the weight in ρ(D; ā) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.
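The pathology of the second example is easy to see numerically. The sketch below is our own illustration (with arbitrary choices of ā = 1, a_max = 50, and the grid resolution): it draws N samples from Q(x|ā) of equation 4.61 and scans the log-likelihood over the prior box; for small N the maximum typically sits at a spuriously large a, and only at larger N does it settle near ā.

import numpy as np

x_grid = np.linspace(0.0, 2.0 * np.pi, 20000, endpoint=False)   # grid for the normalizer Z(a)

def logZ(a):
    """log of the normalizer of Q(x|a) = exp(-sin(a*x)) / Z(a) on [0, 2*pi)."""
    return np.log(np.mean(np.exp(-np.sin(a * x_grid))) * 2.0 * np.pi)

def sample(a, n, rng):
    """Rejection sampling from Q(x|a); exp(-sin(a*x)) is bounded above by e."""
    out = np.empty(0)
    while out.size < n:
        x = rng.uniform(0.0, 2.0 * np.pi, size=4 * n)
        keep = rng.random(4 * n) < np.exp(-np.sin(a * x)) / np.e
        out = np.concatenate([out, x[keep]])
    return out[:n]

a_true = 1.0
a_grid = np.linspace(0.05, 50.0, 2000)               # the "prior box" 0 < a < a_max
logZ_grid = np.array([logZ(a) for a in a_grid])
rng = np.random.default_rng(2)
for N in (5, 20, 200, 2000):
    xs = sample(a_true, N, rng)
    ll = np.array([-np.sin(a * xs).sum() for a in a_grid]) - N * logZ_grid
    print(N, round(float(a_grid[ll.argmax()]), 2))   # typically far from a_true until N is large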

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements and reaching it in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens: eventually the regularization becomes insufficient and the predictive information vanishes. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ~ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\!\left[-\frac{B}{D^{\mu}}\right], \qquad \mu > 0, \qquad (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar\alpha; N) = \int dD\, \rho(D; \bar\alpha)\, \exp[-ND]

\approx A(\bar\alpha) \int dD\, \exp\!\left[-\frac{B(\bar\alpha)}{D^{\mu}} - ND\right]

\approx \tilde{A}(\bar\alpha)\, \exp\!\left[-\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)}\right], \qquad (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar\alpha) = A(\bar\alpha)\left[\frac{2\pi\,\mu^{1/(\mu+1)}}{\mu+1}\right]^{1/2} \cdot [B(\bar\alpha)]^{1/(2\mu+2)}, \qquad (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)}\left[\frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)}\right]. \qquad (4.65)
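For completeness, the saddle point behind equations 4.63 through 4.65 can be written out explicitly; this short derivation is our addition and uses only the integrand of equation 4.63. The exponent f(D) = B(\bar\alpha)/D^{\mu} + ND is minimized at

D_* = \left(\frac{\mu B(\bar\alpha)}{N}\right)^{1/(\mu+1)}, \qquad
f(D_*) = [B(\bar\alpha)]^{1/(\mu+1)}\left[\mu^{-\mu/(\mu+1)} + \mu^{1/(\mu+1)}\right] N^{\mu/(\mu+1)} = C(\bar\alpha)\, N^{\mu/(\mu+1)},

while the Gaussian integration over fluctuations around D_* contributes a factor \sqrt{2\pi/f''(D_*)} \propto N^{-(\mu+2)/[2(\mu+1)]}, which is the origin of the -\tfrac{1}{2}\,\tfrac{\mu+2}{\mu+1}\ln N term and of the coefficients in equations 4.64 and 4.65.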

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S_1^{(a)}(N) \approx \frac{1}{\ln 2}\,\langle C(\bar\alpha)\rangle_{\bar\alpha}\; N^{\mu/(\mu+1)} \quad \text{(bits)}, \qquad (4.66)

where \langle\cdots\rangle_{\bar\alpha} denotes an average over all the target models.


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ~ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ~ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario in which the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ~ K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ~ N^{ν−1} bits. Summing this up over all samples, we find S_1^{(a)} ~ N^ν, and if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with an increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
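The counting argument in this paragraph can be made explicit in one line (our addition): if the nth sample contributes ~ K(n)/(2n) ~ n^{ν−1} bits, then

S_1^{(a)}(N) \sim \sum_{n=1}^{N} n^{\nu-1} \approx \int^{N} dn\; n^{\nu-1} = \frac{N^{\nu}}{\nu},

which grows as N^{ν} rather than as N^{ν} log N, consistent with equation 4.66 when ν = μ/(μ+1).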

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy: for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x; but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l_0) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z}\,\exp\!\left[-\frac{\ell}{2}\int dx\,\left(\frac{\partial\phi}{\partial x}\right)^2\right]\,\delta\!\left[\frac{1}{l_0}\int dx\, e^{-\phi(x)} - 1\right], \qquad (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the locations of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹
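To make the prior of equation 4.67 concrete, one can draw sample densities from it numerically. The sketch below is our illustration: the grid, the box size L, the values of ℓ, and the spectral normalization are assumptions, and the delta-function constraint is imposed simply by rescaling Q rather than by an exact construction. It generates φ(x) as a periodic Gaussian field whose action is (ℓ/2) times the integral of (dφ/dx)^2 and exponentiates it to obtain smooth, strictly positive candidate distributions.

import numpy as np

def sample_Q(ell, L=1.0, n_grid=1024, rng=np.random.default_rng(2)):
    # One draw of Q(x) on [0, L): phi is a periodic Gaussian field with spectrum
    # proportional to 1/(ell * k^2) (zero mode dropped); normalization is enforced
    # by dividing by the integral instead of by the delta function of eq. 4.67.
    dx = L / n_grid
    k = 2.0 * np.pi * np.fft.rfftfreq(n_grid, d=dx)      # wavenumbers, k[0] = 0
    amp = np.zeros_like(k)
    amp[1:] = np.sqrt(1.0 / (ell * k[1:] ** 2 * L))      # mode amplitude for the quadratic action
    modes = amp * (rng.normal(size=k.size) + 1j * rng.normal(size=k.size)) / np.sqrt(2.0)
    phi = np.fft.irfft(modes, n=n_grid) * n_grid         # real-space field phi(x)
    Q = np.exp(-phi)
    return Q / (Q.sum() * dx)                            # normalize so the integral of Q is 1

for ell in (0.001, 0.1):
    Q = sample_Q(ell)
    print(ell, Q.min(), Q.max())  # smaller ell permits rougher, more sharply varying densities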

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D, φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l_0) exp[−φ(x)], we can write the KL divergence as

D_{\rm KL}[\bar\phi(x)\,\|\,\phi(x)] = \frac{1}{l_0}\int dx\, \exp[-\bar\phi(x)]\,[\phi(x) - \bar\phi(x)]. \qquad (4.68)

We want to compute the density

\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\,\delta\big(D - D_{\rm KL}[\bar\phi(x)\|\phi(x)]\big) \qquad (4.69)

= M \int [d\phi(x)]\, P[\phi(x)]\,\delta\big(MD - M D_{\rm KL}[\bar\phi(x)\|\phi(x)]\big), \qquad (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar\phi) = M \int \frac{dz}{2\pi}\,\exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\,\exp(-izM D_{\rm KL}) \qquad (4.71)

= M \int \frac{dz}{2\pi}\,\exp\!\left(izMD + \frac{izM}{l_0}\int dx\,\bar\phi(x)\,\exp[-\bar\phi(x)]\right)
\times \int [d\phi(x)]\, P[\phi(x)]\,\exp\!\left(-\frac{izM}{l_0}\int dx\,\phi(x)\,\exp[-\bar\phi(x)]\right). \qquad (4.72)

¹¹ We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996): in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)]\,\exp\!\left(-\frac{izM}{l_0}\int dx\,\phi(x)\,\exp[-\bar\phi(x)]\right)
= \exp\!\left(-\frac{izM}{l_0}\int dx\,\bar\phi(x)\,\exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM]\right), \qquad (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{\ell}{2}\int dx\,\left(\frac{\partial\bar\phi}{\partial x}\right)^2 + \frac{1}{2}\left(\frac{izM}{\ell\, l_0}\right)^{1/2}\int dx\,\exp[-\bar\phi(x)/2]. \qquad (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} ≫ 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2}\,\exp\!\left(-\frac{B[\bar\phi(x)]}{D}\right), \qquad (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\,\ell\, l_0}}\,\exp\!\left[-\frac{\ell}{2}\int dx\,(\partial_x\bar\phi)^2\right] \times \int dx\,\exp[-\bar\phi(x)/2], \qquad (4.76)

B[\bar\phi(x)] = \frac{1}{16\,\ell\, l_0}\left(\int dx\,\exp[-\bar\phi(x)/2]\right)^2. \qquad (4.77)
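The origin of equations 4.75 through 4.77 can be seen in a few lines; this is a sketch we add, and it reproduces only the exponent, not the D^{-3/2} prefactor. Writing \tilde{Q} \equiv \int dx\,\exp[-\bar\phi(x)/2] and u \equiv izM, the z integral obtained by combining equations 4.72 through 4.74 has (up to φ̄-only terms) the exponent

g(u) = uD - \frac{1}{2}\sqrt{\frac{u}{\ell l_0}}\,\tilde{Q}, \qquad
g'(u_*) = 0 \;\Rightarrow\; u_* = \frac{\tilde{Q}^2}{16\,\ell l_0 D^2} = \frac{B[\bar\phi]}{D^2}, \qquad
g(u_*) = -\frac{B[\bar\phi]}{D},

with B[\bar\phi] = \tilde{Q}^2/(16\,\ell l_0) exactly as in equation 4.77; the smooth prefactor A[\bar\phi]\,D^{-3/2} collects the φ̄-dependent terms in S_eff and the Gaussian fluctuations around u_*.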

Except for the factor of D^{−3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{−3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\sqrt{\frac{1}{\ell\, l_0}}\,\left\langle \int dx\,\exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi}\, N^{1/2}, \qquad (4.78)

where ⟨···⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\sqrt{N}\,\left(\frac{L}{\ell}\right)^{1/2} \ \text{bits}. \qquad (4.79)
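Equations 4.78 and 4.79 follow by direct substitution; we spell out the step here as an addition. For μ = 1, equation 4.65 gives C = 2\sqrt{B}, so equation 4.66 together with equation 4.77 yields

S_1^{(a)}(N) \approx \frac{2}{\ln 2}\,\big\langle \sqrt{B[\bar\phi]} \big\rangle_{\bar\phi}\,\sqrt{N}
= \frac{1}{2\ln 2}\,\sqrt{\frac{1}{\ell l_0}}\,\left\langle \int dx\, e^{-\bar\phi(x)/2} \right\rangle_{\bar\phi}\,\sqrt{N},

and for the uniform distribution on [0, L] one has e^{-\bar\phi} = l_0/L, so \int dx\, e^{-\bar\phi/2} = \sqrt{l_0 L}, which gives equation 4.79.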

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(ℓ/(N Q(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises when the number of parameters in the problem depends on the number of samples.
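The bin counting can be checked directly; the short sketch below is ours, and the particular densities and the value of ℓ are arbitrary. The number of adaptive bins is the integral of √(N Q(x)/ℓ) over the range of x, which reproduces √(N L/ℓ) for a uniform density and shows the same √N growth for other smooth densities.

import numpy as np

def n_bins(Q, L, N, ell, n_grid=10000):
    # number of adaptive bins of local size sqrt(ell/(N Q(x))),
    # i.e. the integral of sqrt(N Q(x)/ell) over [0, L]
    x = np.linspace(0.0, L, n_grid, endpoint=False) + 0.5 * L / n_grid
    return np.sum(np.sqrt(N * Q(x) / ell)) * (L / n_grid)

L, ell = 1.0, 1e-4
uniform = lambda x: np.ones_like(x) / L
ramp = lambda x: 2.0 * x / L**2          # another normalized density on [0, L]
for N in (10**2, 10**4, 10**6):
    print(N, n_bins(uniform, L, N, ell), np.sqrt(N * L / ell), n_bins(ramp, L, N, ell))
# the uniform case matches sqrt(N L / ell); both densities grow as sqrt(N)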

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ~√N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities, n ~ √N, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). But real calculations for these cases remain challenging.

5 I_pred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise the impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹² Rissanen (1984) and Clarke and Barron (1990), in full generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimation with a "universal" prior (Li & Vitányi, 1993).

¹² Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy), since the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or of not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, which is not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity, because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure the complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.¹³ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986) in particular has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.¹⁴ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles, in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

¹³ The problem of finding this model, or of reconstructing the underlying dynamics, may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

¹⁴ This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;¹⁵ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

¹⁵ Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.¹⁶ So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number \mathcal{N}(N) of walks with N steps, we find that \mathcal{N}(N) ~ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) → γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
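The decomposition quoted here is just the logarithm of the number of walks; we write it out for clarity:

S(N) = \log_2 \mathcal{N}(N) \approx \log_2\!\left(A\, N^{\gamma} z^{N}\right)
= N \log_2 z \;+\; \gamma \log_2 N \;+\; \log_2 A,

so the extensive term is S_0 = \log_2 z, the constant subextensive piece is \log_2 A, and the universal divergent piece is S_1(N) → \gamma \log_2 N.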

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)]\,\log_2 P[x(t)], \qquad (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)]\,\log_2\!\left(\frac{P[x(t)]}{Q[x(t)]}\right), \qquad (5.2)

which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

¹⁶ Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediately neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters; again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change the constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures: the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about the dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically, no compression can preserve all of the predictive information, so the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data; using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and of the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not intuitively obvious; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.¹⁷ It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron captures only a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

¹⁷ If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,¹⁸ and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

¹⁸ As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning, in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning is a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ~ A N^ν and one with K parameters so that I_pred(N) ~ (K/2) log_2 N, the infinite-parameter model has less predictive information for all N smaller than some critical value,

N_c \sim \left[\frac{K}{2A\nu}\,\log_2\!\left(\frac{K}{2A}\right)\right]^{1/\nu}. \qquad (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
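As a numerical illustration of this crossover, the sketch below compares the two asymptotic forms of I_pred and locates N_c from equation 6.1; the values of K, A, and ν are arbitrary choices for the sake of the example, not taken from the text, and equation 6.1 is a leading-order estimate, so the two ways of locating the crossover agree only to within a factor of order one.

import numpy as np

def crossover(K, A, nu):
    # N_c from equation 6.1, plus a brute-force solution of (K/2) log2 N = A N^nu
    Nc_formula = (K / (2.0 * A * nu) * np.log2(K / (2.0 * A))) ** (1.0 / nu)
    N = np.logspace(1, 60, 20000)
    param = 0.5 * K * np.log2(N)      # finite-dimensional model: (K/2) log2 N
    nonparam = A * N ** nu            # nonparametric model: A N^nu
    Nc_numeric = N[np.argmin(np.abs(param - nonparam))]
    return Nc_formula, Nc_numeric

# Hypothetical values: a 10^6-parameter model against a nu = 1/2 nonparametric one.
print(crossover(K=1e6, A=1.0, nu=0.5))
# below N_c the nonparametric (power-law) model carries less predictive information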

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ~ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ~ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.¹⁹
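The bookkeeping in Shannon's guessing experiment is simple enough to state as code; this is our sketch, and the guess counts below are fabricated placeholders rather than data. Each letter is replaced by the number of guesses a reader needed, and the entropy of the empirical distribution of these numbers upper-bounds the conditional entropy of a letter given the preceding N.

import numpy as np
from collections import Counter

def guess_entropy_bound(guess_counts):
    # upper bound (in bits per letter) from the empirical distribution of guess numbers
    counts = Counter(guess_counts)
    total = sum(counts.values())
    p = np.array([c / total for c in counts.values()])
    return -np.sum(p * np.log2(p))

# Placeholder data: most letters guessed on the first try, a few requiring many guesses.
demo = [1] * 70 + [2] * 12 + [3] * 6 + [4] * 4 + [5] * 3 + [8] * 2 + [13] * 2 + [20] * 1
print(guess_entropy_bound(demo), "bits per letter (illustrative)")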

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the U.S.-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.

Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.

Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.

Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.

Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.

Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.

Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.

Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.

Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.

Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.

Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.

Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.

Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.

Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.

Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.

Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.

Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.

Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.

Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.

Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.

Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.

Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.

Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.

Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.

Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.

Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.

Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.

Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.

Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.

Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.

Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.

Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.

Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.

Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.

de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.

Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.

Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.

Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.

Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.

Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.

Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.

Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.

Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.

Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.

Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.

Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).

Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.

Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.

Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.

Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.

Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.

Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.

Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.

Li, M., & Vitányi, P. (1993). An Introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.

Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.

Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.

Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.

MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.

Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.

Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.

Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.

Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.

Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.

Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.

Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.

Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.

Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.

Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.

Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.

Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.

Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.

Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.

Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.

Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.

Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.

Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.

Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.

Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.

Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.

Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.

Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.

Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.

Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.

Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.

Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.

Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.

Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.

Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.

Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.

Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.

Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



[Figure 2 plots S(N) versus the word length N for three spin chains: fixed J; variable J, short-range interactions; and variable J's, long-range decaying interactions.]

Figure 2: Entropy as a function of the word length for spin chains with different interactions. Notice that all lines start from S(N) = log₂ 2 = 1, since at the values of the coupling we investigated, the correlation length is much smaller than the chain length (1 · 10⁹ spins).

behavior is evident—the extensive entropy shows no qualitative distinction among the three cases we consider.

The situation changes drastically if we remove the asymptotic linear contribution and focus on the corrections to extensive behavior. Specifically, we write S(N) = S₀·N + S₁(N) and plot only the sublinear component S₁(N) of the entropy. As we see in Figure 3, the three chains then exhibit qualitatively different features: for the first one, S₁ is constant; for the second one, it is logarithmic; and for the third one, it clearly shows a power-law behavior.

What is the significance of these observations? Of course, the differences in the behavior of S₁(N) must be related to the ways we chose J's for the simulations. In the first case, J is fixed, and if we see N spins and try to predict the state of the N+1st spin, all that really matters is the state of the spin s_N; there is nothing to "learn" from observations on longer segments of the chain. For the second chain, J changes, and the statistics of the spin words are different in different parts of the sequence. By looking at these statistics, one can "learn" the coupling at the current position; this estimate improves the more spins (longer words) we observe. Finally, in the third case, there are many coupling constants that can be learned. As N increases, one becomes sensitive to weaker correlations caused by interactions over larger and larger distances.


[Figure 3 plots the subextensive entropy S₁ against the word length N for the three chains (fixed J; variable J, short-range interactions; variable J's, long-range decaying interactions), together with fits of the form S₁ = const, S₁ = const₁ + (1/2) log N, and S₁ = const₁ + const₂ · N^{0.5}.]

Figure 3: Subextensive part of the entropy as a function of the word length.

So, intuitively, the qualitatively different behaviors of S₁(N) in the three plotted cases are correlated with differences in the problem of learning the underlying dynamics of the spin chains from observations on samples of the spins themselves. Much of this article can be seen as expanding on and quantifying this intuitive observation.
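The fixed-J case is simple enough to check numerically. The sketch below is only an illustration of that one case (the variable-J chains would require drawing couplings at random along the chain); the chain length, coupling value, and plug-in entropy estimator are all choices made here, not taken from the text.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def ising_chain(n, J):
    """Sample an open 1D Ising chain with a fixed coupling J: neighboring
    spins agree with probability p = 1 / (1 + exp(-2J))."""
    p_agree = 1.0 / (1.0 + np.exp(-2.0 * J))
    s = np.empty(n, dtype=int)
    s[0] = rng.choice([-1, 1])
    flip = rng.random(n - 1) > p_agree
    for i in range(1, n):
        s[i] = -s[i - 1] if flip[i - 1] else s[i - 1]
    return s

def word_entropy(s, N):
    """Plug-in estimate of S(N), the entropy (in bits) of N-spin words."""
    words = Counter(tuple(s[i:i + N]) for i in range(len(s) - N + 1))
    p = np.array(list(words.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

s = ising_chain(500_000, J=0.5)
S0 = word_entropy(s, 2) - word_entropy(s, 1)      # estimate of the entropy per spin
for N in (2, 4, 8, 12):
    print(N, round(word_entropy(s, N) - N * S0, 3))   # S_1(N) is roughly constant for fixed J
```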

3 Fundamentals

The problem of prediction comes in various forms, as noted in the Introduction. Information theory allows us to treat the different notions of prediction on the same footing. The first step is to recognize that all predictions are probabilistic. Even if we can predict the temperature at noon tomorrow, we should provide error bars or confidence limits on our prediction. The next step is to remember that even before we look at the data, we know that certain futures are more likely than others, and we can summarize this knowledge by a prior probability distribution for the future. Our observations on the past lead us to a new, more tightly concentrated distribution, the distribution of futures conditional on the past data. Different kinds of predictions are different slices through or averages over this conditional distribution, but information theory quantifies the "concentration" of the distribution without making any commitment as to which averages will be most interesting.

Imagine that we observe a stream of data x(t) over a time interval −T < t < 0. Let all of these past data be denoted by the shorthand x_past. We are interested in saying something about the future, so we want to know about the data x(t) that will be observed in the time interval 0 < t < T′; let these future data be called x_future. In the absence of any other knowledge, futures are drawn from the probability distribution P(x_future), and observations of particular past data x_past tell us that futures will be drawn from the conditional distribution P(x_future|x_past). The greater concentration of the conditional distribution can be quantified by the fact that it has smaller entropy than the prior distribution, and this reduction in entropy is Shannon's definition of the information that the past provides about the future. We can write the average of this predictive information as

$$I_{\rm pred}(T, T') = \left\langle \log_2\!\left[\frac{P(x_{\rm future}\,|\,x_{\rm past})}{P(x_{\rm future})}\right]\right\rangle \tag{3.1}$$

$$= -\langle \log_2 P(x_{\rm future})\rangle - \langle\log_2 P(x_{\rm past})\rangle - \big[-\langle\log_2 P(x_{\rm future}, x_{\rm past})\rangle\big], \tag{3.2}$$

where ⟨· · ·⟩ denotes an average over the joint distribution of the past and the future, P(x_future, x_past).

Each of the terms in equation 3.2 is an entropy. Since we are interested in predictability or generalization, which are associated with some features of the signal persisting forever, we may assume stationarity or invariance under time translations. Then the entropy of the past data depends only on the duration of our observations, so we can write −⟨log₂ P(x_past)⟩ = S(T), and by the same argument −⟨log₂ P(x_future)⟩ = S(T′). Finally, the entropy of the past and the future taken together is the entropy of observations on a window of duration T + T′, so that −⟨log₂ P(x_future, x_past)⟩ = S(T + T′). Putting these equations together, we obtain

$$I_{\rm pred}(T, T') = S(T) + S(T') - S(T + T'). \tag{3.3}$$
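Equation 3.3 can be evaluated in closed form for simple processes. As a sketch (with details that are our own choices, not the text's), the code below uses a stationary gaussian AR(1) process, for which S(T) follows from the determinant of the covariance matrix; the predictive information comes out finite and, for this Markovian example, independent of the window lengths.

```python
import numpy as np

def cov(T, a):
    """Covariance matrix of T consecutive samples of the stationary AR(1)
    process x_t = a * x_{t-1} + noise, normalized to unit variance."""
    idx = np.arange(T)
    return a ** np.abs(idx[:, None] - idx[None, :])

def entropy_bits(T, a):
    """Differential entropy S(T) in bits of T consecutive samples."""
    _, logdet = np.linalg.slogdet(cov(T, a))
    return 0.5 * (T * np.log2(2 * np.pi * np.e) + logdet / np.log(2))

def i_pred(T, Tp, a=0.9):
    """Predictive information of equation 3.3."""
    return entropy_bits(T, a) + entropy_bits(Tp, a) - entropy_bits(T + Tp, a)

for T in (1, 2, 5, 20, 100):
    print(T, round(i_pred(T, T), 4))   # constant ~ 0.5*log2(1/(1-a^2)): finite I_pred
```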

It is important to recall that mutual information is a symmetric quantity. Thus, we can view I_pred(T, T′) as either the information that a data segment of duration T provides about the future of length T′ or the information that a data segment of duration T′ provides about the immediate past of duration T. This is a direct application of the definition of information but seems counterintuitive. Shouldn't it be more difficult to predict than to postdict? One can perhaps recover the correct intuition by thinking about a large ensemble of question-answer pairs. Prediction corresponds to generating the answer to a given question, while postdiction corresponds to generating the question that goes with a given answer. We know that guessing questions given answers is also a hard problem³ and can be just as challenging as the more conventional problem of answering the questions themselves. Our focus here is on prediction because we want to make connections with the phenomenon of generalization in learning, but it is clear that generating a consistent interpretation of observed data may involve elements of both prediction and postdiction (see, for example, Eagleman & Sejnowski, 2000); it is attractive that the information-theoretic formulation treats these problems symmetrically.

³ This is the basis of the popular American television game show Jeopardy, widely viewed as the most "intellectual" of its genre.

In the same way that the entropy of a gas at fixed density is proportional to the volume, the entropy of a time series (asymptotically) is proportional to its duration, so that lim_{T→∞} S(T)/T = S₀; entropy is an extensive quantity. But from equation 3.3, any extensive component of the entropy cancels in the computation of the predictive information: predictability is a deviation from extensivity. If we write S(T) = S₀T + S₁(T), then equation 3.3 tells us that the predictive information is related only to the nonextensive term S₁(T). Note that if we are observing a deterministic system, then S₀ = 0, but this is independent of questions about the structure of the subextensive term S₁(T). It is attractive that information theory gives us a unified discussion of prediction in deterministic and probabilistic cases.

We know two general facts about the behavior of S₁(T). First, the corrections to extensive behavior are positive, S₁(T) ≥ 0. Second, the statement that entropy is extensive is the statement that the limit

$$\lim_{T\to\infty} \frac{S(T)}{T} = S_0 \tag{3.4}$$

exists, and for this to be true we must also have

$$\lim_{T\to\infty} \frac{S_1(T)}{T} = 0. \tag{3.5}$$

Thus the nonextensive terms in the entropy must be subextensive, that is, they must grow with T less rapidly than a linear function. Taken together, these facts guarantee that the predictive information is positive and subextensive. Furthermore, if we let the future extend forward for a very long time, T′ → ∞, then we can measure the information that our sample provides about the entire future:

$$I_{\rm pred}(T) = \lim_{T'\to\infty} I_{\rm pred}(T, T') = S_1(T). \tag{3.6}$$

Similarly, instead of increasing the duration of the future to infinity, we could have considered the mutual information between a sample of length T and all of the infinite past. Then the postdictive information also is equal to S₁(T), and the symmetry between prediction and postdiction is even more profound: not only is there symmetry between questions and answers, but observations on a given period of time provide the same amount of information about the historical path that led to our observations as about the


future that will unfold from them. In some cases, this statement becomes even stronger. For example, if the subextensive entropy of a long discontinuous observation of a total length T with a gap of a duration δT ≪ T is equal to S₁(T) + O(δT/T), then the subextensive entropy of the present is not only its information about the past or the future but also the information about the past and the future.

If we have been observing a time series for a (long) time T, then the total amount of data we have collected is measured by the entropy S(T), and at large T this is given approximately by S₀T. But the predictive information that we have gathered cannot grow linearly with time, even if we are making predictions about a future that stretches out to infinity. As a result, of the total information we have taken in by observing x_past, only a vanishing fraction is of relevance to the prediction:

$$\lim_{T\to\infty}\frac{\text{Predictive information}}{\text{Total information}} = \frac{I_{\rm pred}(T)}{S(T)} \to 0. \tag{3.7}$$

In this precise sense, most of what we observe is irrelevant to the problem of predicting the future.⁴

⁴ We can think of equation 3.7 as a law of diminishing returns. Although we collect data in proportion to our observation time T, a smaller and smaller fraction of this information is useful in the problem of prediction. These diminishing returns are not due to a limited lifetime, since we calculate the predictive information assuming that we have a future extending forward to infinity. A senior colleague points out that this is an argument for changing fields before becoming too expert.

Consider the case where time is measured in discrete steps, so that we have seen N time points x₁, x₂, ..., x_N. How much have we learned about the underlying pattern in these data? The more we know, the more effectively we can predict the next data point x_{N+1}, and hence the fewer bits we will need to describe the deviation of this data point from our prediction. Our accumulated knowledge about the time series is measured by the degree to which we can compress the description of new observations. On average, the length of the code word required to describe the point x_{N+1}, given that we have seen the previous N points, is given by

$$\ell(N) = -\langle \log_2 P(x_{N+1} \,|\, x_1, x_2, \ldots, x_N)\rangle \ \text{bits}, \tag{3.8}$$

where the expectation value is taken over the joint distribution of all the N + 1 points, P(x₁, x₂, ..., x_N, x_{N+1}). It is easy to see that

$$\ell(N) = S(N+1) - S(N) \approx \frac{\partial S(N)}{\partial N}. \tag{3.9}$$

As we observe for longer times, we learn more, and this word length decreases. It is natural to define a learning curve that measures this improvement. Usually we define learning curves by measuring the frequency or costs of errors; here the cost is that our encoding of the point x_{N+1} is longer than it could be if we had perfect knowledge. This ideal encoding has a length that we can find by imagining that we observe the time series for an infinitely long time, ℓ_ideal = lim_{N→∞} ℓ(N), but this is just another way of defining the extensive component of the entropy S₀. Thus we can define a learning curve

$$\Lambda(N) \equiv \ell(N) - \ell_{\rm ideal} \tag{3.10}$$
$$= S(N+1) - S(N) - S_0$$
$$= S_1(N+1) - S_1(N)$$
$$\approx \frac{\partial S_1(N)}{\partial N} = \frac{\partial I_{\rm pred}(N)}{\partial N}, \tag{3.11}$$

and we see once again that the extensive component of the entropy cancels.

It is well known that the problems of prediction and compression are related, and what we have done here is to illustrate one aspect of this connection. Specifically, if we ask how much one segment of a time series can tell us about the future, the answer is contained in the subextensive behavior of the entropy. If we ask how much we are learning about the structure of the time series, then the natural and universally defined learning curve is related again to the subextensive entropy; the learning curve is the derivative of the predictive information.
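For a concrete feel for ℓ(N) and Λ(N), the sketch below works out the simplest example we could think of (our choice, not the text's): a coin with unknown bias and a uniform prior, for which the Bayes-mixture predictive probability is the Laplace rule and the learning curve should approach K/(2N ln 2) with K = 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def code_lengths(N, trials=200_000):
    """Monte Carlo estimate of the code length l(N) (equation 3.8) for a
    coin with unknown bias theta and a uniform prior: the Bayes-mixture
    predictive probability is the Laplace rule (n + 1) / (N + 2)."""
    theta = rng.random(trials)
    ones = rng.binomial(N, theta)                 # sufficient statistic of the first N flips
    x_next = rng.random(trials) < theta           # the (N+1)st flip
    p_pred = (ones + 1) / (N + 2)                 # P(x_{N+1} = 1 | data)
    p = np.where(x_next, p_pred, 1 - p_pred)
    l_N = -np.log2(p).mean()
    p_true = np.where(x_next, theta, 1 - theta)   # the ideal coder knows theta
    l_ideal = -np.log2(p_true).mean()
    return l_N, l_ideal

for N in (10, 100, 1000):
    l_N, l_ideal = code_lengths(N)
    # Lambda(N) versus the asymptote K/(2 N ln 2) with K = 1
    print(N, round(l_N - l_ideal, 4), round(1 / (2 * N * np.log(2)), 4))
```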

This universal learning curve is connected to the more conventional learning curves in specific contexts. As an example (cf. section 4.1), consider fitting a set of data points {x_n, y_n} with some class of functions y = f(x; α), where α are unknown parameters that need to be learned; we also allow for some gaussian noise in our observation of the y_n. Here the natural learning curve is the evolution of χ² for generalization as a function of the number of examples. Within the approximations discussed below, it is straightforward to show that as N becomes large,

$$\left\langle \chi^2(N)\right\rangle = \frac{1}{\sigma^2}\left\langle\big[y - f(x;\alpha)\big]^2\right\rangle \to (2\ln 2)\,\Lambda(N) + 1, \tag{3.12}$$

where σ² is the variance of the noise. Thus, a more conventional measure of performance at learning a function is equal to the universal learning curve defined purely by information-theoretic criteria. In other words, if a learning curve is measured in the right units, then its integral represents the amount of the useful information accumulated. Then the subextensivity of S₁ guarantees that the learning curve decreases to zero as N → ∞.
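The asymptotic statement in equation 3.12 is easy to probe by simulation. Since Λ(N) → K/(2N ln 2) for a K-parameter model (equation 4.45 below), the prediction is ⟨χ²(N)⟩ → 1 + K/N. In the sketch below an ordinary least-squares fit stands in for the learning rule and gaussian random basis values stand in for φ_μ(x_i); all of these details are our own choices, and the agreement improves only as N grows.

```python
import numpy as np

rng = np.random.default_rng(2)
K, sigma, trials = 5, 0.1, 2000

def chi2_generalization(N):
    """Monte Carlo estimate of <chi^2(N)>: fit the K-coefficient linear
    model of equation 4.2 by least squares and test on fresh data."""
    errs = []
    for _ in range(trials):
        alpha = rng.normal(size=K)                     # "true" parameters alpha-bar
        phi = rng.normal(size=(N, K))                  # basis values at the training x's
        y = phi @ alpha + sigma * rng.normal(size=N)
        a_hat = np.linalg.lstsq(phi, y, rcond=None)[0]
        phi_test = rng.normal(size=(2000, K))          # fresh samples for generalization
        y_test = phi_test @ alpha + sigma * rng.normal(size=2000)
        errs.append(np.mean((y_test - phi_test @ a_hat) ** 2) / sigma**2)
    return np.mean(errs)

for N in (20, 50, 100, 200):
    print(N, round(chi2_generalization(N), 3), round(1 + K / N, 3))   # vs (2 ln 2) Lambda(N) + 1
```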

Different quantities related to the subextensive entropy have been discussed in several contexts. For example, the code length ℓ(N) has been defined as a learning curve in the specific case of neural networks (Opper & Haussler, 1995) and has been termed the "thermodynamic dive" (Crutchfield & Shalizi, 1999) and "Nth order block entropy" (Grassberger, 1986).


The universal learning curve Λ(N) has been studied as the expected instantaneous information gain by Haussler, Kearns, and Schapire (1994). Mutual information between all of the past and all of the future (both semi-infinite) is known also as the excess entropy, effective measure complexity, stored information, and so on (see Shalizi & Crutchfield, 1999, and references therein, as well as the discussion below). If the data allow a description by a model with a finite (and in some cases also infinite) number of parameters, then mutual information between the data and the parameters is of interest. This is easily shown to be equal to the predictive information about all of the future, and it is also the cumulative information gain (Haussler, Kearns, & Schapire, 1994) or the cumulative relative entropy risk (Haussler & Opper, 1997). Investigation of this problem can be traced back to Renyi (1964) and Ibragimov and Hasminskii (1972), and some particular topics are still being discussed (Haussler & Opper, 1995; Opper & Haussler, 1995; Herschkowitz & Nadal, 1999). In certain limits, decoding a signal from a population of N neurons can be thought of as "learning" a parameter from N observations, with a parallel notion of information transmission (Brunel & Nadal, 1998; Kang & Sompolinsky, 2001). In addition, the subextensive component of the description length (Rissanen, 1978, 1989, 1996; Clarke & Barron, 1990), averaged over a class of allowed models, also is similar to the predictive information. What is important is that the predictive information or subextensive entropy is related to all these quantities and that it can be defined for any process without a reference to a class of models. It is this universality that we find appealing, and this universality is strongest if we focus on the limit of long observation times. Qualitatively, in this regime (T → ∞), we expect the predictive information to behave in one of three different ways, as illustrated by the Ising models above: it may either stay finite or grow to infinity together with T; in the latter case, the rate of growth may be slow (logarithmic) or fast (sublinear power) (see Barron and Cover, 1991, for a similar classification in the framework of the minimal description length [MDL] analysis).

The first possibility, lim_{T→∞} I_pred(T) = constant, means that no matter how long we observe, we gain only a finite amount of information about the future. This situation prevails, for example, when the dynamics are too regular. For a purely periodic system, complete prediction is possible once we know the phase, and if we sample the data at discrete times, this is a finite amount of information; longer period orbits intuitively are more complex and also have larger I_pred, but this does not change the limiting behavior lim_{T→∞} I_pred(T) = constant.

Alternatively, the predictive information can be small when the dynamics are irregular, but the best predictions are controlled only by the immediate past, so that the correlation times of the observable data are finite (see, for example, Crutchfield & Feldman, 1997, and the fixed J case in Figure 3). Imagine, for example, that we observe x(t) at a series of discrete times {t_n}, and that at each time point we find the value x_n. Then we always can write


the joint distribution of the N data points as a product

$$P(x_1, x_2, \ldots, x_N) = P(x_1)\,P(x_2|x_1)\,P(x_3|x_2, x_1)\cdots. \tag{3.13}$$

For Markov processes, what we observe at t_n depends only on events at the previous time step t_{n−1}, so that

$$P(x_n | \{x_{1\le i\le n-1}\}) = P(x_n | x_{n-1}), \tag{3.14}$$

and hence the predictive information reduces to

$$I_{\rm pred} = \left\langle \log_2\!\left[\frac{P(x_n|x_{n-1})}{P(x_n)}\right]\right\rangle. \tag{3.15}$$

The maximum possible predictive information in this case is the entropy of the distribution of states at one time step, which is bounded by the logarithm of the number of accessible states. To approach this bound, the system must maintain memory for a long time, since the predictive information is reduced by the entropy of the transition probabilities. Thus, systems with more states and longer memories have larger values of I_pred.
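Equation 3.15 can be evaluated directly once a transition matrix is given. The sketch below does this for a two-state chain with a "stickiness" parameter of our own choosing; as the discussion above suggests, I_pred approaches the one-step entropy bound (here log₂ 2 = 1 bit) only when the transition entropy is small, that is, when the memory is long.

```python
import numpy as np

def markov_ipred(P):
    """Predictive information of a stationary finite-state Markov chain,
    equation 3.15: I_pred = H(stationary dist.) - <entropy of transition rows>."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()                              # stationary distribution
    H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return H(pi) - np.sum(pi * np.array([H(row) for row in P]))

# A sticky two-state chain: more persistent states -> larger I_pred,
# approaching the one-step entropy bound log2(2) = 1 bit.
for stay in (0.6, 0.9, 0.99):
    P = np.array([[stay, 1 - stay], [1 - stay, stay]])
    print(stay, round(markov_ipred(P), 3))
```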

More interesting are those cases in which I_pred(T) diverges at large T. In physical systems, we know that there are critical points where correlation times become infinite, so that optimal predictions will be influenced by events in the arbitrarily distant past. Under these conditions, the predictive information can grow without bound as T becomes large; for many systems, the divergence is logarithmic, I_pred(T → ∞) ∝ ln T, as for the variable J_ij, short-range Ising model of Figures 2 and 3. Long-range correlations also are important in a time series where we can learn some underlying rules. It will turn out that when the set of possible rules can be described by a finite number of parameters, the predictive information again diverges logarithmically, and the coefficient of this divergence counts the number of parameters. Finally, a faster growth is also possible, so that I_pred(T → ∞) ∝ T^α, as for the variable J_ij, long-range Ising model, and we shall see that this behavior emerges from, for example, nonparametric learning problems.

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations or associations persist over long periods of time. In the usual theoretical models, there is some rule underlying the observable data, and this rule is valid forever; examples seen at one time inform us about the rule, and this information can be used to make predictions or generalizations. The predictive information quantifies the average generalization power of examples, and we shall see that there is a direct connection between the predictive information and the complexity of the possible underlying rules.


4.1 A Test Case. Let us begin with a simple example already referred to above. We observe two streams of data x and y, or equivalently a stream of pairs (x₁, y₁), (x₂, y₂), ..., (x_N, y_N). Assume that we know in advance that the x's are drawn independently and at random from a distribution P(x), while the y's are noisy versions of some function acting on x,

$$y_n = f(x_n; \alpha) + \eta_n, \tag{4.1}$$

where f(x; α) is a class of functions parameterized by α, and η_n is noise, which for simplicity we will assume is gaussian with known standard deviation σ. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

$$f(x; \alpha) = \sum_{\mu=1}^{K} \alpha_\mu\, \phi_\mu(x). \tag{4.2}$$

The usual problem is to estimate, from N pairs {x_i, y_i}, the values of the parameters α. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy S(N). We go through the calculation in some detail because it provides a model for the more general case.

To evaluate the entropy S(N), we first construct the probability distribution P(x₁, y₁, x₂, y₂, ..., x_N, y_N). The same set of rules applies to the whole data stream, which here means that the same parameters α apply for all pairs {x_i, y_i}, but these parameters are chosen at random from a distribution P(α) at the start of the stream. Thus we write

$$P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \int d^K\alpha\; P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N | \alpha)\,\mathcal{P}(\alpha), \tag{4.3}$$

and now we need to construct the conditional distributions for fixed α. By hypothesis, each x is chosen independently, and once we fix α, each y_i is correlated only with the corresponding x_i, so that we have

$$P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N | \alpha) = \prod_{i=1}^{N} \big[P(x_i)\, P(y_i | x_i; \alpha)\big]. \tag{4.4}$$

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of y_i has the form

$$P(y_i | x_i; \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left[-\frac{1}{2\sigma^2}\left(y_i - \sum_{\mu=1}^{K}\alpha_\mu\phi_\mu(x_i)\right)^{\!2}\right]. \tag{4.5}$$


Putting all these factors together,

$$P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \left[\prod_{i=1}^{N} P(x_i)\right]\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{\!N}\int d^K\alpha\, \mathcal{P}(\alpha)\, \exp\!\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{N} y_i^2\right]$$
$$\times\ \exp\!\left[-\frac{N}{2}\sum_{\mu,\nu=1}^{K} A_{\mu\nu}(\{x_i\})\,\alpha_\mu\alpha_\nu + N\sum_{\mu=1}^{K} B_\mu(\{x_i, y_i\})\,\alpha_\mu\right], \tag{4.6}$$

where

$$A_{\mu\nu}(\{x_i\}) = \frac{1}{\sigma^2 N}\sum_{i=1}^{N}\phi_\mu(x_i)\,\phi_\nu(x_i), \quad\text{and} \tag{4.7}$$

$$B_\mu(\{x_i, y_i\}) = \frac{1}{\sigma^2 N}\sum_{i=1}^{N} y_i\,\phi_\mu(x_i). \tag{4.8}$$

Our placement of the factors of N means that both A_{μν} and B_μ are of order unity as N → ∞. These quantities are empirical averages over the samples {x_i, y_i}, and if the φ_μ are well behaved, we expect that these empirical means converge to expectation values for most realizations of the series {x_i}:

$$\lim_{N\to\infty} A_{\mu\nu}(\{x_i\}) = A^\infty_{\mu\nu} = \frac{1}{\sigma^2}\int dx\, P(x)\,\phi_\mu(x)\,\phi_\nu(x), \tag{4.9}$$

$$\lim_{N\to\infty} B_\mu(\{x_i, y_i\}) = B^\infty_\mu = \sum_{\nu=1}^{K} A^\infty_{\mu\nu}\,\bar\alpha_\nu, \tag{4.10}$$

where ᾱ are the parameters that actually gave rise to the data stream {x_i, y_i}. In fact, we can make the same argument about the terms in Σ_i y_i²,

$$\lim_{N\to\infty}\sum_{i=1}^{N} y_i^2 = N\sigma^2\left[\sum_{\mu,\nu=1}^{K}\bar\alpha_\mu A^\infty_{\mu\nu}\bar\alpha_\nu + 1\right]. \tag{4.11}$$

Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for and implications of this convergence breaking down.
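A few lines of simulation make the convergence in equations 4.9 through 4.11 concrete. In the sketch below the basis functions, the input distribution, and the noise level are illustrative choices that the text does not specify (monomials φ_μ(x) = x^μ, x uniform on [−1, 1]); the residual B − A·ᾱ shrinks roughly as 1/√N, which is the sense in which the empirical means approach their limits.

```python
import numpy as np

rng = np.random.default_rng(3)
K, sigma = 3, 0.1
alpha_bar = rng.normal(size=K)        # the parameters that generate the stream

def empirical_AB(N):
    """Empirical averages of equations 4.7-4.8 for the model of equations
    4.1-4.2, with hypothetical monomial basis functions phi_mu(x) = x**mu
    and x drawn uniformly from [-1, 1]."""
    x = rng.uniform(-1, 1, size=N)
    phi = np.stack([x**mu for mu in range(K)], axis=1)
    y = phi @ alpha_bar + sigma * rng.normal(size=N)
    A = phi.T @ phi / (sigma**2 * N)
    B = phi.T @ y / (sigma**2 * N)
    return A, B

for N in (100, 10_000, 1_000_000):
    A, B = empirical_AB(N)
    print(N, float(np.max(np.abs(B - A @ alpha_bar))))   # residual of equation 4.10
```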

Putting the different factors together, we obtain

$$P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) \simeq \left[\prod_{i=1}^{N} P(x_i)\right]\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{\!N}\int d^K\alpha\,\mathcal{P}(\alpha)\,\exp\big[-N E_N(\alpha;\{x_i,y_i\})\big], \tag{4.12}$$


where the effective "energy" per sample is given by

$$E_N(\alpha;\{x_i,y_i\}) = \frac{1}{2} + \frac{1}{2}\sum_{\mu,\nu=1}^{K}(\alpha_\mu - \bar\alpha_\mu)\,A^\infty_{\mu\nu}\,(\alpha_\nu - \bar\alpha_\nu). \tag{4.13}$$

Here we use the symbol ≃ to indicate that we not only take the limit of large N but also neglect the fluctuations. Note that in this approximation, the dependence on the sample points themselves is hidden in the definition of ᾱ as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor N in the exponent; the energy E_N is of order unity as N → ∞. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

$$\int d^K\alpha\,\mathcal{P}(\alpha)\,\exp\big[-N E_N(\alpha;\{x_i,y_i\})\big]$$
$$\approx \mathcal{P}(\alpha_{\rm cl})\,\exp\!\left[-N E_N(\alpha_{\rm cl};\{x_i,y_i\}) - \frac{K}{2}\ln\frac{N}{2\pi} - \frac{1}{2}\ln\det F_N + \cdots\right], \tag{4.14}$$

where α_cl is the "classical" value of α determined by the extremal conditions

$$\frac{\partial E_N(\alpha;\{x_i,y_i\})}{\partial\alpha_\mu}\bigg|_{\alpha=\alpha_{\rm cl}} = 0, \tag{4.15}$$

the matrix F_N consists of the second derivatives of E_N,

$$F_N = \frac{\partial^2 E_N(\alpha;\{x_i,y_i\})}{\partial\alpha_\mu\,\partial\alpha_\nu}\bigg|_{\alpha=\alpha_{\rm cl}}, \tag{4.16}$$

and · · · denotes terms that vanish as N → ∞. If we formulate the problem of estimating the parameters α from the samples {x_i, y_i}, then as N → ∞, the matrix NF_N is the Fisher information matrix (Cover & Thomas, 1991); the eigenvectors of this matrix give the principal axes for the error ellipsoid in parameter space, and the (inverse) eigenvalues give the variances of parameter estimates along each of these directions. The classical α_cl differs from ᾱ only in terms of order 1/N; we neglect this difference and further simplify the calculation of leading terms as N becomes large. After a little more algebra, we find the probability distribution we have been looking for:

$$P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) \simeq \left[\prod_{i=1}^{N}P(x_i)\right]\frac{1}{Z_A}\,\mathcal{P}(\bar\alpha)\,\exp\!\left[-\frac{N}{2}\ln(2\pi e\sigma^2) - \frac{K}{2}\ln N + \cdots\right], \tag{4.17}$$


where the normalization constant

$$Z_A = \sqrt{(2\pi)^K \det A^\infty}. \tag{4.18}$$

Again we note that the sample points {x_i, y_i} are hidden in the value of ᾱ that gave rise to these points.⁵

To evaluate the entropy S(N), we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the x_i, and because these are chosen independently from the distribution P(x), the average again is easy to evaluate. The third term involves ᾱ, and we need to average this over the joint distribution P(x₁, y₁, x₂, y₂, ..., x_N, y_N). As above, we can evaluate this average in steps: First, we choose a value of the parameters ᾱ; then we average over the samples given these parameters; and finally we average over parameters. But because ᾱ is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

$$S(N) = N\left[S_x + \frac{1}{2}\log_2\!\left(2\pi e\sigma^2\right)\right] + \frac{K}{2}\log_2 N + S_\alpha + \left\langle\log_2 Z_A\right\rangle_{\alpha} + \cdots, \tag{4.19}$$

where ⟨· · ·⟩_α means averaging over parameters, S_x is the entropy of the distribution of x,

$$S_x = -\int dx\, P(x)\,\log_2 P(x), \tag{4.20}$$

and similarly for the entropy of the distribution of parameters,

$$S_\alpha = -\int d^K\alpha\, \mathcal{P}(\alpha)\,\log_2\mathcal{P}(\alpha). \tag{4.21}$$

⁵ We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points {x_i, y_i}. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as N → ∞, so that this approach should yield the asymptotic behavior of the distribution and hence the subextensive entropy at large N. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as N → ∞, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.


The different terms in the entropy, equation 4.19, have a straightforward interpretation. First, we see that the extensive term in the entropy,

$$S_0 = S_x + \frac{1}{2}\log_2(2\pi e\sigma^2), \tag{4.22}$$

reflects contributions from the random choice of x and from the gaussian noise in y; these extensive terms are independent of the variations in parameters α, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, S_α. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term ⟨log₂ Z_A⟩_α compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

$$S_1(N) \to \frac{K}{2}\log_2 N \ \ \text{(bits)}. \tag{4.23}$$

The coefficient of this divergence counts the number of parameters independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions {φ_μ(x)}. This is a hint that the result in equation 4.23 is more universal than our simple example.
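Because the model of equations 4.1 and 4.2 is linear and gaussian, the (K/2) log₂ N behavior can also be checked without any saddle-point machinery: as noted above, the subextensive entropy equals the mutual information between the data and the parameters, and for a gaussian prior on α (an assumption we add here for convenience) this mutual information has the closed form I = (1/2) log₂ det(𝟙 + ΦᵀΦ/σ²), conditioned on the observed x's. The sketch below, with hypothetical basis functions and input distribution, shows the increase per decade of N approaching (K/2) log₂ 10.

```python
import numpy as np

rng = np.random.default_rng(4)
K, sigma = 4, 0.1

def mi_data_params(N):
    """Mutual information (bits) between N observations and the parameters
    of the linear model of equation 4.2, assuming a standard gaussian prior
    on alpha, monomial basis functions, and x uniform on [-1, 1]."""
    x = rng.uniform(-1, 1, size=N)
    phi = np.stack([x**mu for mu in range(K)], axis=1)
    _, logdet = np.linalg.slogdet(np.eye(K) + phi.T @ phi / sigma**2)
    return 0.5 * logdet / np.log(2)

mi = {N: mi_data_params(N) for N in (100, 1_000, 10_000, 100_000)}
for N in (1_000, 10_000, 100_000):
    # growth per decade of N approaches (K/2) * log2(10) ~ 6.64 bits
    print(N, round(mi[N] - mi[N // 10], 2))
```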

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning: We are given examples of how the points x_n map into y_n, and from these examples we are to induce the association or functional relation between x and y. An alternative view is that the pair of points (x, y) should be viewed as a vector x⃗, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables {x⃗_i} is drawn independently from the same probability distribution Q(x⃗|α), and this distribution depends on a (potentially infinite dimensional) vector of parameters α. As above, the parameters are unknown, and before the series starts, they are chosen randomly from a distribution P(α). With no constraints on the densities P(α) or Q(x⃗|α), it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters α, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997, and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).


We begin with the definition of entropy:

$$S(N) \equiv S[\{\vec{x}_i\}] = -\int d\vec{x}_1\cdots d\vec{x}_N\, P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N)\,\log_2 P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N). \tag{4.24}$$

By analogy with equation 4.3, we then write

$$P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) = \int d^K\alpha\,\mathcal{P}(\alpha)\prod_{i=1}^{N} Q(\vec{x}_i|\alpha). \tag{4.25}$$

Next, combining the equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite S(N) as

$$S(N) = -\int d^K\bar\alpha\,\mathcal{P}(\bar\alpha)\left\{\int d\vec{x}_1\cdots d\vec{x}_N \prod_{j=1}^{N} Q(\vec{x}_j|\bar\alpha)\,\log_2 P(\{\vec{x}_i\})\right\}. \tag{4.26}$$

Equation 4.26 allows an easy interpretation. There is the "true" set of parameters ᾱ that gave rise to the data sequence x⃗₁, ..., x⃗_N with the probability ∏_{j=1}^{N} Q(x⃗_j|ᾱ). We need to average log₂ P(x⃗₁, ..., x⃗_N) first over all possible realizations of the data, keeping the true parameters fixed, and then over the parameters ᾱ themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

$$P(\vec{x}_1, \ldots, \vec{x}_N) = \prod_{j=1}^{N} Q(\vec{x}_j|\bar\alpha)\int d^K\alpha\,\mathcal{P}(\alpha)\prod_{i=1}^{N}\left[\frac{Q(\vec{x}_i|\alpha)}{Q(\vec{x}_i|\bar\alpha)}\right]$$
$$= \prod_{j=1}^{N} Q(\vec{x}_j|\bar\alpha)\int d^K\alpha\,\mathcal{P}(\alpha)\,\exp\big[-N E_N(\alpha;\{\vec{x}_i\})\big], \tag{4.27}$$

$$E_N(\alpha;\{\vec{x}_i\}) = -\frac{1}{N}\sum_{i=1}^{N}\ln\left[\frac{Q(\vec{x}_i|\alpha)}{Q(\vec{x}_i|\bar\alpha)}\right]. \tag{4.28}$$

Since by our interpretation ᾱ are the true parameters that gave rise to the particular data {x⃗_i}, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

$$E_N(\alpha;\{\vec{x}_i\}) = -\int d^D x\; Q(\vec{x}|\bar\alpha)\,\ln\left[\frac{Q(\vec{x}|\alpha)}{Q(\vec{x}|\bar\alpha)}\right] - \psi(\alpha, \bar\alpha;\{\vec{x}_i\}), \tag{4.29}$$

where ψ → 0 as N → ∞; here we neglect ψ and return to this term below.


The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, D_KL(ᾱ‖α), between the true distribution characterized by parameters ᾱ and the possible distribution characterized by α. Thus, at large N, we have

$$P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \simeq \prod_{j=1}^{N} Q(\vec{x}_j|\bar\alpha)\int d^K\alpha\,\mathcal{P}(\alpha)\,\exp\big[-N D_{\rm KL}(\bar\alpha\|\alpha)\big], \tag{4.30}$$

where again the notation ≃ reminds us that we are not only taking the limit of large N but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find


$$S(N) \approx S_0\cdot N + S_1^{(a)}(N), \tag{4.31}$$

$$S_0 = \int d^K\alpha\,\mathcal{P}(\alpha)\left[-\int d^D x\; Q(\vec{x}|\alpha)\,\log_2 Q(\vec{x}|\alpha)\right], \quad\text{and} \tag{4.32}$$

$$S_1^{(a)}(N) = -\int d^K\bar\alpha\,\mathcal{P}(\bar\alpha)\,\log_2\left[\int d^K\alpha\,\mathcal{P}(\alpha)\,e^{-N D_{\rm KL}(\bar\alpha\|\alpha)}\right]. \tag{4.33}$$

Here S₁^(a) is an approximation to S₁ that neglects fluctuations ψ. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence x⃗₁, ..., x⃗_N with the disorder, E_N(α; {x⃗_i}) with the energy of the quenched system, and D_KL(ᾱ‖α) with its annealed analog.

The extensive term S₀, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the N dependence of the partition function

$$Z(\bar\alpha; N) = \int d^K\alpha\,\mathcal{P}(\alpha)\,\exp\big[-N D_{\rm KL}(\bar\alpha\|\alpha)\big], \tag{4.34}$$

and S₁^(a)(N) = −⟨log₂ Z(ᾱ; N)⟩_ᾱ is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are KL divergence D away from the target ᾱ,

$$\rho(D;\bar\alpha) = \int d^K\alpha\,\mathcal{P}(\alpha)\,\delta\big[D - D_{\rm KL}(\bar\alpha\|\alpha)\big]. \tag{4.35}$$


This density could be very different for different targets.⁶ The density of divergences is normalized because the original distribution over parameter space, P(α), is normalized:

$$\int dD\,\rho(D;\bar\alpha) = \int d^K\alpha\,\mathcal{P}(\alpha) = 1. \tag{4.36}$$

Finally, the partition function takes the simple form

$$Z(\bar\alpha; N) = \int dD\,\rho(D;\bar\alpha)\,\exp[-ND]. \tag{4.37}$$

We recall that in statistical mechanics, the partition function is given by

$$Z(\beta) = \int dE\,\rho(E)\,\exp[-\beta E], \tag{4.38}$$

where ρ(E) is the density of states that have energy E, and β is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length N of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of ρ(D; ᾱ) as D → 0.⁷

⁶ If parameter space is compact, then a related description of the space of targets, based on metric entropy, also called Kolmogorov's ε-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of size not greater than ε into which the target space can be partitioned.

⁷ It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small D, and the density term, which favors values of D that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with D → 0. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).
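This competition is easy to visualize numerically. The sketch below evaluates equation 4.37 for a toy density of states with the small-D power-law form derived in equation 4.42 below (prefactors dropped, K chosen arbitrarily): the integrand peaks at D* = (K − 2)/(2N), so the typical model indeed "cools" toward the target, and −log₂ Z grows as (K/2) log₂ N.

```python
import numpy as np
from scipy.integrate import quad

K = 4   # number of parameters; near D = 0, rho(D) ~ D**((K - 2) / 2)

def Z(N):
    """Partition function of equation 4.37 for the toy density of states
    rho(D) = D**((K-2)/2) (an illustrative choice with prefactors dropped)."""
    return quad(lambda D: D ** ((K - 2) / 2) * np.exp(-N * D), 0, np.inf)[0]

for N in (10, 100, 1_000, 10_000):
    # -log2 Z grows by (K/2)*log2(10) per decade; the integrand peaks at D*.
    print(N, round(-np.log2(Z(N)), 2), round((K - 2) / (2 * N), 5))
```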

The behavior of the density of states, ρ(D; ᾱ), at small D is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

$$D_{\rm KL}(\bar\alpha\|\alpha) \approx \frac{1}{2}\sum_{\mu\nu}\left(\bar\alpha_\mu - \alpha_\mu\right)F_{\mu\nu}\left(\bar\alpha_\nu - \alpha_\nu\right) + \cdots, \tag{4.39}$$


where NF is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low D limit of the density ρ(D; ᾱ):

$$\rho(D;\bar\alpha) = \int d^K\alpha\,\mathcal{P}(\alpha)\,\delta\big[D - D_{\rm KL}(\bar\alpha\|\alpha)\big]$$
$$\approx \int d^K\alpha\,\mathcal{P}(\alpha)\,\delta\!\left[D - \frac{1}{2}\sum_{\mu\nu}\left(\bar\alpha_\mu - \alpha_\mu\right)F_{\mu\nu}\left(\bar\alpha_\nu - \alpha_\nu\right)\right]$$
$$= \int d^K\xi\;\mathcal{P}(\bar\alpha + U\cdot\xi)\,\delta\!\left[D - \frac{1}{2}\sum_{\mu}\Lambda_\mu\,\xi_\mu^2\right], \tag{4.40}$$

where U is a matrix that diagonalizes F,

$$(U^T\cdot F\cdot U)_{\mu\nu} = \Lambda_\mu\,\delta_{\mu\nu}. \tag{4.41}$$

The delta function restricts the components of ξ in equation 4.40 to be of order √D or less, and so, if P(α) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

$$\rho(D\to 0;\bar\alpha) \approx \mathcal{P}(\bar\alpha)\cdot\frac{(2\pi)^{K/2}}{\Gamma(K/2)}\,(\det F)^{-1/2}\, D^{(K-2)/2}. \tag{4.42}$$

Here as before K is the dimensionality of the parameter vector Computingthe partition function from equation 437 we nd

Z( NregI N 1) frac14 f ( Nreg) cent C (K2)

NK2 (443)

where f ( Nreg) is some function of the target parameter values Finally thisallows us to evaluate the subextensive entropy from equations 433 and 434

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2 Z( NregI N) (444)

K2

log2 N C cent cent cent (bits) (445)

where cent cent cent are nite as N 1 Thus general K-parameter model classeshave the same subextensive entropy as for the simplest example consideredin the previous section To the leading order this result is independent even

Predictability Complexity and Learning 2431

of the prior distribution P (reg) on the parameter space so that the predictiveinformation seems to count the number of parameters under some verygeneral conditions (cf Figure 3 for a very different numerical example ofthe logarithmic behavior)

Although equation 445 is true under a wide range of conditions thiscannot be the whole story Much of modern learning theory is concernedwith the fact that counting parameters is not quite enough to characterize thecomplexity of a model class the naive dimension of the parameter space Kshould be viewed in conjunction with the pseudodimension (also known asthe shattering dimension or Vapnik-Chervonenkis dimension dVC) whichmeasures capacity of the model class and with the phase-space dimension dwhich accounts for volumes in the space of models (Vapnik 1998 Opper1994) Both of these dimensions can differ from the number of parametersOne possibility is that dVC is innite when the number of parameters isnitea problem discussed below Another possibility is that the determinant of Fis zero and hence both dVC and d are smaller than the number of parametersbecause we have adopted a redundant description This sort of degeneracycould occur over a nite fraction but not all of the parameter space andthis is one way to generate an effective fractional dimensionality One canimagine multifractal models such that the effective dimensionality variescontinuously over the parameter space but it is not obvious where thiswould be relevant Finally models with d lt dVC lt 1 also are possible (seefor example Opper 1994) and this list probably is not exhaustive

Equation 442 lets us actually dene the phase-space dimension throughthe exponent in the small DKL behavior of the model density

$$
\rho(D \to 0; \bar{\alpha}) \propto D^{(d-2)/2}, \tag{4.46}
$$

and then $d$ appears in place of $K$ as the coefficient of the log divergence in $S_1(N)$ (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density $\rho(D; \bar{\alpha})$ accumulates a macroscopic weight at some nonzero value of $D$, so that the small $D$ behavior is irrelevant for the large $N$ asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of $d_{\rm VC}$ (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small $D$ behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of $I_{\rm pred}$ measures the phase-space or the scaling dimension $d$ and nothing else. This asymptote is valid, however, only for $N \gg d_{\rm VC}$. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.
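
The exponent in equation 4.46 can also be read off numerically: sample models from the prior and measure how much prior weight lies within a small KL distance $D$ of the target. This weight scales as $D^{d/2}$, so $d$ is twice the log-log slope. A minimal sketch (our illustration, assuming a family of unit-variance gaussians with a $K$-dimensional mean vector and a standard normal prior):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3                          # number of parameters in the illustrative model family
alpha_bar = np.zeros(K)        # target model: unit-variance Gaussian with this mean

# For unit-variance Gaussians, D_KL(alpha_bar || alpha) = |alpha_bar - alpha|^2 / 2 (nats)
alpha = rng.standard_normal((2_000_000, K))          # models drawn from the prior
D = 0.5 * np.sum((alpha - alpha_bar) ** 2, axis=1)

# Small-D behavior: Prob(D <= eps) ~ eps^(d/2), so d is twice the log-log slope
eps = np.logspace(-2, -0.7, 8)
cdf = np.array([(D <= e).mean() for e in eps])
slope = np.polyfit(np.log(eps), np.log(cdf), 1)[0]
print("rough estimate of phase-space dimension d:", 2 * slope, " (expected", K, ")")
```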

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution $Q(\vec{x}_1, \ldots, \vec{x}_N|\alpha)$. Again, $\alpha$ is an unknown parameter vector chosen randomly at the beginning of the series. If $\alpha$ is a $K$-dimensional vector, then one still tries to learn just $K$ numbers, and there are still $N$ examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by $(K/2)\log_2 N$, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution $Q$ is a finite-order Markov model. In fact, all we really need are the following two conditions:

$$
S[\{\vec{x}_i\}\,|\,\alpha] \equiv -\int d^N\vec{x}\; Q(\{\vec{x}_i\}|\alpha)\, \log_2 Q(\{\vec{x}_i\}|\alpha) \to N S_0 + S_0^*, \quad S_0^* = O(1); \tag{4.47}
$$
$$
D_{\rm KL}\!\left[Q(\{\vec{x}_i\}|\bar{\alpha})\,\|\,Q(\{\vec{x}_i\}|\alpha)\right] \to N D_{\rm KL}(\bar{\alpha}\|\alpha) + o(N). \tag{4.48}
$$

Here the quantities $S_0$, $S_0^*$, and $D_{\rm KL}(\bar{\alpha}\|\alpha)$ are defined by taking limits $N \to \infty$ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if $\alpha$ is known, there are no long-range correlations in the data; all of the long-range predictability is associated with learning the parameters.$^{8}$ The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between $K$, $d$, and $d_{\rm VC}$, one arrives at the result:

$$
S(N) = S_0 \cdot N + S_1^{(a)}(N), \tag{4.49}
$$
$$
S_1^{(a)}(N) = \frac{K}{2}\log_2 N + \cdots \quad \text{(bits)}, \tag{4.50}
$$

where $\cdots$ stands for terms of order one as $N \to \infty$. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes $1/f$ noise, then the condition may not be met. For more on long-range correlations, see below.

It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution $Q(\vec{x}_1, \ldots, \vec{x}_N|\alpha)$. Suppose, for example, that $S_0^* = (K^*/2)\log_2 N$. Then the subextensive entropy becomes

$$
S_1^{(a)}(N) = \frac{K + K^*}{2}\log_2 N + \cdots \quad \text{(bits)}. \tag{4.51}
$$

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same $K + K^*$ are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to $\alpha$; there are just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation, but an exact statement. Thus, we can trade a description in terms of long-range interactions ($K^* \neq 0$, but $K = 0$) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations ($K \neq 0$, $K^* = 0$). The two descriptions are equivalent, and this is captured by the subextensive entropy.$^{9}$

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy $S_1$ are worthless unless we prove that the fluctuations $\psi$ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between $d$ and $d_{\rm VC}$. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.

Returning to equations 4.27 and 4.29 and the definition of entropy, we can write the entropy $S(N)$ exactly as

$$
S(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \int \left[\prod_{j=1}^{N} d\vec{x}_j\, Q(\vec{x}_j|\bar{\alpha})\right]
\log_2\left\{\left[\prod_{i=1}^{N} Q(\vec{x}_i|\bar{\alpha})\right]\int d^K\alpha\, P(\alpha)\, e^{-N D_{\rm KL}(\bar{\alpha}\|\alpha)+N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})}\right\}. \tag{4.52}
$$

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, $S_1^{(f)}(N)$:

$$
S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N), \tag{4.53}
$$
$$
S_1^{(f)} = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \int \left[\prod_{j=1}^{N} d\vec{x}_j\, Q(\vec{x}_j|\bar{\alpha})\right]
\log_2\left[\int \frac{d^K\alpha\, P(\alpha)}{Z(\bar{\alpha};N)}\, e^{-N D_{\rm KL}(\bar{\alpha}\|\alpha)+N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})}\right], \tag{4.54}
$$

where $\psi$ is defined as in equation 4.29, and $Z$ as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of $-S_1^{(a)}$ as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that $S^{(f)}$ is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total $S_1(N)$ is bounded,

$$
S_1(N) \le N \int d^K\alpha \int d^K\bar{\alpha}\; P(\alpha)\, P(\bar{\alpha})\, D_{\rm KL}(\bar{\alpha}\|\alpha) \equiv \left\langle N D_{\rm KL}(\bar{\alpha}\|\alpha)\right\rangle_{\bar{\alpha}\alpha}, \tag{4.55}
$$

so that if we have a class of models (and a prior $P(\alpha)$) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that $\langle D_{\rm KL}(\bar{\alpha}\|\alpha)\rangle_{\bar{\alpha}\alpha}$ includes $S_0$ as one of its terms, so that usually $S_0$ and $S_1$ are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if $\psi$ were zero, and $\psi$ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension ($d_{\rm VC}$) of the set of probability densities $\{Q(\vec{x}|\alpha)\}$ is finite. Then for any reasonable function $F(\vec{x};\beta)$, deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

$$
P\left\{ \sup_{\beta} \left| \frac{\frac{1}{N}\sum_j F(\vec{x}_j;\beta) - \int d\vec{x}\, Q(\vec{x}|\bar{\alpha})\, F(\vec{x};\beta)}{L[F]} \right| > \epsilon \right\} < M(\epsilon, N, d_{\rm VC})\, e^{-cN\epsilon^2}, \tag{4.56}
$$

where $c$ and $L[F]$ depend on the details of the particular bound used. Typically, $c$ is a constant of order one, and $L[F]$ either is some moment of $F$ or the range of its variation. In our case, $F$ is the log ratio of two densities, so that $L[F]$ may be assumed bounded for almost all $\beta$ without loss of generality, in view of equation 4.55. In addition, $M(\epsilon, N, d_{\rm VC})$ is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on $d_{\rm VC}$. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review, see Vapnik, 1998, and the references therein).
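
To see the kind of uniform convergence that bounds like equation 4.56 describe, one can simulate a finite-$d_{\rm VC}$ case directly. The sketch below (our own illustration) takes the unit-variance gaussian location family with $\bar{a}=0$, uses $F(x;\beta) = \log[Q(x|\bar{a})/Q(x|\beta)] = \beta^2/2 - \beta x$, and measures the typical supremum over $\beta \in [-1,1]$ of the gap between the empirical mean of $F$ and its expectation; the gap shrinks roughly as $1/\sqrt{N}$, consistent with the exponential bound.

```python
import numpy as np

rng = np.random.default_rng(1)
betas = np.linspace(-1.0, 1.0, 201)   # grid over the comparison parameter beta

def sup_deviation(N, trials=200):
    """Typical sup_beta |empirical mean of F - expectation of F| for N samples."""
    devs = []
    for _ in range(trials):
        x = rng.standard_normal(N)                 # samples from the target Q(x | a_bar = 0)
        emp = betas ** 2 / 2 - betas * x.mean()    # empirical mean of F(x; beta)
        expect = betas ** 2 / 2                    # exact expectation under the target
        devs.append(np.max(np.abs(emp - expect)))
    return np.mean(devs)

for N in [100, 400, 1600, 6400]:
    print(N, sup_deviation(N), "  compare 1/sqrt(N):", 1 / np.sqrt(N))
```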

To start the proof of finiteness of $S_1^{(f)}$ in this case, we first show that only the region $\alpha \approx \bar{\alpha}$ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that, at large values of $\alpha - \bar{\alpha}$, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of $\{\vec{x}_i\}$ with atypically large fluctuations is negligible (atypicality is defined as $\psi \ge \delta$, where $\delta$ is some small constant independent of $N$). Since the fluctuations decrease as $1/\sqrt{N}$ (see equation 4.56) and $D_{\rm KL}$ is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by $N$ times the supremum value of $\psi$. Then we realize that the averaging over $\bar{\alpha}$ and $\{\vec{x}_i\}$ is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to $\epsilon$ (this brings down an extra factor of $N\epsilon$). Thus the worst-case contribution of these atypical sequences is

$$
S_1^{(f),\,{\rm atypical}} \sim \int_{\delta}^{\infty} d\epsilon\; N^2\epsilon^2\, M(\epsilon)\, e^{-cN\epsilon^2} \sim e^{-cN\delta^2} \ll 1 \quad \text{for large } N. \tag{4.57}
$$

This bound lets us focus our attention on the region $\alpha \approx \bar{\alpha}$. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and, throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

$$
S_1^{(f),\,{\rm typical}} \sim \int d^K\bar{\alpha}\, P(\bar{\alpha})\, d^N\vec{x}\, \prod_j Q(\vec{x}_j|\bar{\alpha})\; N\left(B\, A^{-1} B\right);
$$
$$
B_\mu = \frac{1}{N}\sum_i \frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial\bar{\alpha}_\mu}, \qquad \langle B\rangle_{\vec{x}} = 0;
$$
$$
(A)_{\mu\nu} = -\frac{1}{N}\sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar{\alpha})}{\partial\bar{\alpha}_\mu\,\partial\bar{\alpha}_\nu}, \qquad \langle A\rangle_{\vec{x}} = F. \tag{4.58}
$$

Here $\langle \cdots \rangle_{\vec{x}}$ means an averaging with respect to all $\vec{x}_i$'s, keeping $\bar{\alpha}$ constant. One immediately recognizes that $B$ and $A$ are, respectively, the first and the second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of $A$ from $F$ are not allowed, and we may bound equation 4.58 by replacing $A^{-1}$ with $F^{-1}(1+\delta)$, where $\delta$ again is independent of $N$. Now we have to average a bunch of products like

$$
\frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial\bar{\alpha}_\mu}\, \left(F^{-1}\right)_{\mu\nu}\, \frac{\partial \log Q(\vec{x}_j|\bar{\alpha})}{\partial\bar{\alpha}_\nu} \tag{4.59}
$$

over all $\vec{x}_i$'s. Only the terms with $i = j$ survive the averaging. There are $K^2 N$ such terms, each contributing of order $N^{-1}$. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with $N$.

This concludes the proof of controllability of fluctuations for $d_{\rm VC} < \infty$. One may notice that we never used the specific form of $M(\epsilon, N, d_{\rm VC})$, which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of $\alpha$ and $\bar{\alpha}$ that lead to violation of the bound. That is, if the prior $P(\alpha)$ has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure $\mathcal{C}$ of nested subsets $C_1 \subset C_2 \subset C_3 \subset \cdots$ can be imposed onto the set $C$ of all admissible solutions of a learning problem, such that $C = \bigcup C_n$. The idea is that, having a finite number of observations $N$, one is confined to the choices within some particular structure element $C_n$, $n = n(N)$, when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, $n$ grows. If $d_{\rm VC}$ for learning in any $C_n$, $n < \infty$, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set $C$ is not described by a finite $d_{\rm VC}$.

In the context of SRM, the role of the prior $P(\alpha)$ is to induce a structure on the set of all admissible densities, and the fight between the number of samples $N$ and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, $C_n$, $n = n(N)$, grows with $N$. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable $x$ is distributed according to the following probability density functions:

$$
Q(x|a) = \frac{1}{\sqrt{2\pi}}\, \exp\!\left[-\frac{1}{2}(x-a)^2\right], \qquad x \in (-\infty; +\infty); \tag{4.60}
$$
$$
Q(x|a) = \frac{\exp(-\sin ax)}{\int_0^{2\pi} dx\, \exp(-\sin ax)}, \qquad x \in [0; 2\pi). \tag{4.61}
$$

Learning the parameter in the first case is a $d_{\rm VC} = 1$ problem; in the second case, $d_{\rm VC} = \infty$. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior $P(\alpha)$. The second one does not allow this. Suppose that the prior is uniform in a box $0 < a < a_{\max}$ and zero elsewhere, with $a_{\max}$ rather large. Then, for not too many sample points $N$, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of $-\sin ax$. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than $a_{\max}$.$^{10}$ So the fluctuations are large, and the predictive information is

10 Interestingly, since for the model equation 4.61 the KL divergence is bounded from below and from above, for $a_{\max} \to \infty$ the weight in $\rho(D; \bar{a})$ at small $D_{\rm KL}$ vanishes, and a finite weight accumulates at some nonzero value of $D$. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.

small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of $a$ would swiftly approach the actual value. At this point, the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of $(1/2)\log N$. On the other hand, there is no uniform bound on the value of $N$ for which this convergence will occur; it is guaranteed only for $N \gg d_{\rm VC}$, which is never true if $d_{\rm VC} = \infty$. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to $d_{\rm VC}$.
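
This overfitting scenario is easy to reproduce numerically. In the sketch below (our illustration, with arbitrary parameter choices), the true model is equation 4.61 with $a = 3$, the prior box is $0 < a \le 500$, and for simplicity the fit is restricted to integer $a$, for which the normalization constant is identical for every candidate; for small $N$ the maximum-likelihood value of $a$ is typically some large spurious value, and only with many more samples does it settle onto the true one.

```python
import numpy as np

rng = np.random.default_rng(2)
a_true, a_max = 3, 500          # true parameter and size of the prior box (illustrative)

def sample(N, a):
    """Rejection-sample N points from Q(x|a) ~ exp(-sin(a x)) on [0, 2*pi)."""
    out = []
    while len(out) < N:
        x = rng.uniform(0, 2 * np.pi, size=4 * N)
        keep = rng.uniform(0, 1, size=x.size) < np.exp(-np.sin(a * x)) / np.e
        out.extend(x[keep])
    return np.array(out[:N])

# For integer a, the normalization of Q(x|a) is the same for every candidate,
# so the maximum-likelihood fit over the box just maximizes sum_i [-sin(a x_i)].
candidates = np.arange(1, a_max + 1)
for N in [5, 20, 100, 1000]:
    x = sample(N, a_true)
    loglike = -np.sin(np.outer(candidates, x)).sum(axis=1)
    print(f"N = {N:4d}   best-fit a = {candidates[np.argmax(loglike)]}")
```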

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly, in the present context, there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite $d_{\rm VC}$ fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned and, correspondingly, the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with $N$ as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with $N$ more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as $\rho \sim D^{(d-2)/2}$. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density $\rho(D \to 0)$ vanishes more rapidly than any power of $D$. As an example, we could have

$$
\rho(D \to 0) \approx A\, \exp\!\left[-\frac{B}{D^{\mu}}\right], \qquad \mu > 0; \tag{4.62}
$$

that is, an essential singularity at $D = 0$. For simplicity, we assume that the constants $A$ and $B$ can depend on the target model, but that the nature of the essential singularity ($\mu$) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

$$
Z(\bar{\alpha}; N) = \int dD\, \rho(D;\bar{\alpha})\, \exp[-ND]
\approx A(\bar{\alpha}) \int dD\, \exp\!\left[-\frac{B(\bar{\alpha})}{D^{\mu}} - ND\right]
\approx \tilde{A}(\bar{\alpha})\, \exp\!\left[-\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N - C(\bar{\alpha})\, N^{\mu/(\mu+1)}\right], \tag{4.63}
$$

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large $N$, and the coefficients are

$$
\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[\frac{2\pi\,\mu^{1/(\mu+1)}}{\mu+1}\right]^{1/2} [B(\bar{\alpha})]^{1/(2\mu+2)}, \tag{4.64}
$$
$$
C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[\frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)}\right]. \tag{4.65}
$$

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large $N$:

$$
S_1^{(a)}(N) \approx \frac{1}{\ln 2}\, \langle C(\bar{\alpha})\rangle_{\bar{\alpha}}\, N^{\mu/(\mu+1)} \quad \text{(bits)}, \tag{4.66}
$$

where $\langle \cdots \rangle_{\bar{\alpha}}$ denotes an average over all the target models.
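
Equations 4.63 through 4.66 follow from a standard saddle-point evaluation, and they are easy to check numerically. The short sketch below (our own check, with arbitrary illustrative constants $A = B = 1$ and $\mu = 1$, assuming NumPy and SciPy) compares $\ln Z$ obtained by direct quadrature with the saddle-point expression; the agreement improves as $N$ grows.

```python
import numpy as np
from scipy.integrate import quad

A, B, mu = 1.0, 1.0, 1.0        # illustrative constants in rho(D -> 0) ~ A * exp(-B / D^mu)

def logZ_numeric(N):
    D_star = (mu * B / N) ** (1 / (mu + 1))          # location of the saddle point
    val, _ = quad(lambda D: A * np.exp(-B / D ** mu - N * D),
                  1e-9, 5.0, points=[D_star], limit=200)
    return np.log(val)

def logZ_saddle(N):
    A_t = A * (2 * np.pi * mu ** (1 / (mu + 1)) / (mu + 1)) ** 0.5 * B ** (1 / (2 * mu + 2))
    C = B ** (1 / (mu + 1)) * (mu ** (-mu / (mu + 1)) + mu ** (1 / (mu + 1)))
    return np.log(A_t) - 0.5 * (mu + 2) / (mu + 1) * np.log(N) - C * N ** (mu / (mu + 1))

for N in [10, 100, 1000, 10000]:
    print(N, logZ_numeric(N), logZ_saddle(N))
```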

This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with $K$ bins, and we let $K$ grow with the quantity of available samples as $K \sim N^{\nu}$. A straightforward generalization of equation 4.45 seems to imply then that $S_1(N) \sim N^{\nu}\log N$ (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large $N$ than about those that have existed since the very first measurements. Indeed, in a $K$-parameter system, the $N$th sample point contributes $\sim K/2N$ bits to the subextensive entropy (cf. equation 4.45). If $K$ changes as mentioned, the $N$th example then carries $\sim N^{\nu-1}$ bits. Summing this up over all samples, we find $S_1^{(a)} \sim N^{\nu}$, and, if we let $\nu = \mu/(\mu+1)$, we obtain equation 4.66; note that the growth of the number of parameters is slower than $N$ ($\nu = \mu/(\mu+1) < 1$), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with an increasing number of parameters, predictive codes, which are optimal at each particular $N$ (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
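
The counting argument in the preceding paragraph can be checked in a few lines (our own sketch): if the $n$th sample carries $\sim K(n)/2n$ bits and $K(n) \sim n^{\nu}$, then summing over samples gives $S_1^{(a)} \sim N^{\nu}/2\nu$.

```python
import numpy as np

nu = 0.5                                   # K(n) ~ n^nu bins once n samples have been seen
for N in [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    n = np.arange(1, N + 1)
    S1 = np.sum(n ** nu / (2 * n))         # each new sample carries ~ K(n)/(2n) bits
    print(N, S1, N ** nu / (2 * nu))       # compare with the predicted N^nu / (2 nu)
```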

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As $\mu$ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large $N$, the predictive information increases with $\mu$. However, if $\mu \to \infty$, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, $S_1$ are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description (effectively increasing the number of parameters) as we collect more data. One example of such a problem is learning the distribution $Q(x)$ for a continuous variable $x$, but rather than writing a parametric form of $Q(x)$, we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write $Q(x) = (1/l_0)\exp[-\phi(x)]$, so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function $\phi(x)$ is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

$$
\mathcal{P}[\phi(x)] = \frac{1}{\mathcal{Z}} \exp\!\left[-\frac{\ell}{2}\int dx \left(\frac{\partial\phi}{\partial x}\right)^{\!2}\right]
\delta\!\left[\frac{1}{l_0}\int dx\, e^{-\phi(x)} - 1\right], \tag{4.67}
$$

where $\mathcal{Z}$ is the normalization constant for $\mathcal{P}[\phi]$, the delta function ensures that each distribution $Q(x)$ is normalized, and $\ell$ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of $Q(x)$ from a set of samples $x_1, x_2, \ldots, x_N$ converges to the correct answer, and the distribution at finite $N$ is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of $\mathcal{P}[\phi(x)]$, have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.$^{11}$
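
To give a concrete feeling for this class of priors, the sketch below (our own illustration, not the calculation of Bialek et al., 1996) computes the discretized MAP or "classical" estimate implied by equation 4.67: it minimizes $(\ell/2)\int dx\,(\partial_x\phi)^2 - \sum_i \log Q(x_i)$ on a grid by gradient descent. All parameter values, and the beta-distributed "true" density used to generate samples, are arbitrary choices; the full theory averages over fluctuations around this solution, which we do not attempt here.

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(3)
L, ell, N, M = 1.0, 0.05, 400, 200        # box size, smoothness scale, samples, grid points
dx = L / M

x_data = rng.beta(2, 5, size=N) * L       # samples from an arbitrary smooth "true" density
counts, _ = np.histogram(x_data, bins=M, range=(0, L))

phi = np.zeros(M)                         # initial guess: uniform density
rate = 1e-3
for _ in range(20000):
    w = np.exp(-phi) * dx
    q = w / w.sum()                       # probability mass per bin under the current estimate
    grad = counts - N * q                 # gradient of -sum_i log Q(x_i) with respect to phi
    lap = np.zeros(M)                     # gradient of the smoothness term (discrete Laplacian)
    lap[1:-1] = 2 * phi[1:-1] - phi[:-2] - phi[2:]
    lap[0], lap[-1] = phi[0] - phi[1], phi[-1] - phi[-2]
    phi -= rate * (grad + (ell / dx) * lap)

Q_hat = np.exp(-phi) / (np.exp(-phi).sum() * dx)   # normalized density estimate
for xi in [0.1, 0.3, 0.5, 0.7]:
    k = min(int(xi / dx), M - 1)
    print(f"x = {xi:.1f}   estimate = {Q_hat[k]:.2f}   true = {beta_dist.pdf(xi, 2, 5):.2f}")
```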

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus we calculate $\rho(D, \bar{\phi})$ in the model defined by equation 4.67.

With $Q(x) = (1/l_0)\exp[-\phi(x)]$, we can write the KL divergence as

$$
D_{\rm KL}[\bar{\phi}(x)\|\phi(x)] = \frac{1}{l_0}\int dx\, \exp[-\bar{\phi}(x)]\, [\phi(x) - \bar{\phi}(x)]. \tag{4.68}
$$

We want to compute the density

$$
\rho(D;\bar{\phi}) = \int [d\phi(x)]\, \mathcal{P}[\phi(x)]\, \delta\!\left(D - D_{\rm KL}[\bar{\phi}(x)\|\phi(x)]\right) \tag{4.69}
$$
$$
= M \int [d\phi(x)]\, \mathcal{P}[\phi(x)]\, \delta\!\left(MD - M D_{\rm KL}[\bar{\phi}(x)\|\phi(x)]\right), \tag{4.70}
$$

where we introduce a factor $M$, which we will allow to become large so that we can focus our attention on the interesting limit $D \to 0$. To compute this integral over all functions $\phi(x)$, we introduce a Fourier representation for the delta function and then rearrange the terms:

$$
\rho(D;\bar{\phi}) = M\int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, \mathcal{P}[\phi(x)]\, \exp(-izM D_{\rm KL}) \tag{4.71}
$$
$$
= M\int \frac{dz}{2\pi}\, \exp\!\left(izMD + \frac{izM}{l_0}\int dx\, \bar{\phi}(x)\, e^{-\bar{\phi}(x)}\right)
\times \int [d\phi(x)]\, \mathcal{P}[\phi(x)]\, \exp\!\left(-\frac{izM}{l_0}\int dx\, \phi(x)\, e^{-\bar{\phi}(x)}\right). \tag{4.72}
$$

The inner integral over the functions $\phi(x)$ is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that $zM$ is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

around the saddle point. The result is that

$$
\int [d\phi(x)]\, \mathcal{P}[\phi(x)]\, \exp\!\left(-\frac{izM}{l_0}\int dx\, \phi(x)\, e^{-\bar{\phi}(x)}\right)
= \exp\!\left(-\frac{izM}{l_0}\int dx\, \bar{\phi}(x)\, e^{-\bar{\phi}(x)} - S_{\rm eff}[\bar{\phi}(x); zM]\right), \tag{4.73}
$$
$$
S_{\rm eff}[\bar{\phi}; zM] = \frac{\ell}{2}\int dx \left(\frac{\partial\bar{\phi}}{\partial x}\right)^{\!2}
+ \frac{1}{2}\left(\frac{izM}{\ell\, l_0}\right)^{\!1/2}\int dx\, \exp[-\bar{\phi}(x)/2]. \tag{4.74}
$$

Now we can do the integral over $z$, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that $D \to 0$ and $MD^{3/2} \to \infty$; we are interested precisely in the first limit, and we are free to set $M$ as we wish, so this gives us a good approximation for $\rho(D \to 0; \bar{\phi})$. This results in

$$
\rho(D \to 0;\bar{\phi}) = A[\bar{\phi}(x)]\, D^{-3/2} \exp\!\left(-\frac{B[\bar{\phi}(x)]}{D}\right), \tag{4.75}
$$
$$
A[\bar{\phi}(x)] = \frac{1}{\sqrt{16\pi\,\ell\, l_0}}\, \exp\!\left[-\frac{\ell}{2}\int dx\, (\partial_x\bar{\phi})^2\right] \int dx\, \exp[-\bar{\phi}(x)/2], \tag{4.76}
$$
$$
B[\bar{\phi}(x)] = \frac{1}{16\,\ell\, l_0}\left(\int dx\, \exp[-\bar{\phi}(x)/2]\right)^{\!2}. \tag{4.77}
$$

Except for the factor of $D^{-3/2}$, this is exactly the sort of essential singularity that we considered in the previous section, with $\mu = 1$. The $D^{-3/2}$ prefactor does not change the leading large $N$ behavior of the predictive information, and we find that

$$
S_1^{(a)}(N) \sim \frac{1}{2\ln 2\,\sqrt{\ell\, l_0}} \left\langle \int dx\, \exp[-\bar{\phi}(x)/2]\right\rangle_{\bar{\phi}} N^{1/2}, \tag{4.78}
$$

where $\langle \cdots \rangle_{\bar{\phi}}$ denotes an average over the target distributions $\bar{\phi}(x)$, weighted once again by $\mathcal{P}[\bar{\phi}(x)]$. Notice that if $x$ is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable $x$ range from 0 to $L$, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

$$
S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\sqrt{N}\left(\frac{L}{\ell}\right)^{\!1/2} \ \text{bits}. \tag{4.79}
$$

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of $x$ into bins of local size $\sqrt{\ell/(N Q(x))}$ (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of $x$. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
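
To make the bin-counting argument explicit (a small check of ours, not in the original text): with bins of local size $\Delta x(x) = \sqrt{\ell/(N Q(x))}$, the number of bins is

$$
N_{\rm bins} = \int_0^L \frac{dx}{\Delta x(x)} = \sqrt{\frac{N}{\ell}} \int_0^L dx\, \sqrt{Q(x)} = \sqrt{\frac{NL}{\ell}} \quad \text{for } Q(x) = 1/L,
$$

which reproduces the $\sqrt{N}\,(L/\ell)^{1/2}$ scaling of equation 4.79 up to the constant prefactor.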

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step $\sim\sqrt{N}$ bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure $\mathcal{C}$ on the set of all possible densities, so that the set $C_n$ is formed of all continuous piecewise linear functions that have not more than $n$ kinks. Learning such functions for finite $n$ is a $d_{\rm VC} = n$ problem. Now, as $N$ grows, the elements with higher capacities, $n \sim \sqrt{N}$, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter $\ell$ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of $\ell$ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing $\ell$ to vary with $x$. In this case, the whole theory has no length scale, and there also is no need to confine the variable $x$ to a box (here of size $L$). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying $\sqrt{N}$ in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information (all have power-law growth; smoother functions have smaller powers, with less to learn; and higher-dimensional problems have larger powers, with more to learn), but real calculations for these cases remain challenging.

5 $I_{\rm pred}$ as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.$^{12}$ Rissanen (1984) and Clarke and Barron (1990), in full

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution $P$ is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream $x_1, x_2, \ldots, x_N$ will require an amount of space equal to the entropy of the joint distribution $P(x_1, x_2, \ldots, x_N)$. The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have $N$ symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution $P(x_1, x_2, \ldots, x_N)$.

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of $N$ points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a $K$-parameter model, both quantities are equal to $(K/2)\log_2 N$ bits to the leading order.
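
A small numerical illustration of this agreement (ours, using one simple realization of such a code): for a coin with unknown bias, the Bayes mixture code with a uniform prior is one way of realizing a stochastic-complexity-style code, and its length exceeds the ideal code length at the true parameter by roughly $(1/2)\log_2 N$ bits, the parameter-coding cost for $K = 1$. NumPy and SciPy are assumed; the bias $0.3$ is an arbitrary choice.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(4)
q_true = 0.3
for N in [100, 1000, 10000]:
    x = rng.random(N) < q_true
    n = int(x.sum())
    # ideal code length if the parameter were known exactly
    L_known = -(n * np.log2(q_true) + (N - n) * np.log2(1 - q_true))
    # Bayes mixture code length with a uniform prior on the bias
    L_mix = -(gammaln(n + 1) + gammaln(N - n + 1) - gammaln(N + 2)) / np.log(2)
    print(N, L_mix - L_known, 0.5 * np.log2(N))
```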

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that $I_{\rm pred}$ provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy), since the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or of not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density $S_0$ and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) $I_{\rm pred}(T \to \infty) = \text{constant}$.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.

5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal $x(0 < t < T)$ drawn from a distribution $P[x(t)]$. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.$^{13}$ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal $x(t)$ or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble $P[x(t)]$.

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong, in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.$^{14}$ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution $P[x(t)]$.

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration $T$ of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.

Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are $N$ equally likely signals, then the measure should be monotonic in $N$; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, $S(T)$. We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal $x$ is continuous, then the entropy is not invariant under transformations of $x$. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function $S(T)$ that depends on the coordinate system for $x$;$^{15}$ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.
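
A one-line check of this point (our own, for an instantaneous rescaling): if we record the signal in units such that $y = \lambda x$, the differential entropy of each sample shifts,

$$
S(y) = -\int dy\, P_y(y)\, \log_2 P_y(y) = S(x) + \log_2 \lambda,
$$

which changes only the extensive term, while the mutual information between any two segments of the stream, $I = S(\text{past}) + S(\text{future}) - S(\text{past}, \text{future})$, is unchanged because the $\log_2\lambda$ contributions cancel.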

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large $T$. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

15 Here we consider instantaneous transformations of $x$, not filtering or other transformations that mix points at different times.

other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream $x(t)$ that mix points within a temporal window of size $\tau$, then for $T \gg \tau$, the entropy $S(T)$ may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.$^{16}$ So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number $\mathcal{N}$ of walks with $N$ steps, we find that $\mathcal{N}(N) \sim A N^{\gamma} z^N$ (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy $S_0 = \log_2 z$, a constant subextensive entropy $\log_2 A$, and a divergent subextensive term $S_1(N) \to \gamma \log_2 N$. Of these three terms, only the divergent subextensive term (related to the critical exponent $\gamma$) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
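
This decomposition can be seen directly by exact enumeration (our own sketch). The code below counts self-avoiding walks on the square lattice up to a modest length and fits $\log_2\mathcal{N}(N) = N\log_2 z + \gamma\log_2 N + \log_2 A$; at these short lengths the fitted values are only rough, but the split into extensive, logarithmically divergent, and constant pieces is already visible (the polymer literature gives $z \approx 2.64$ and $\gamma = 11/32$ in two dimensions).

```python
import numpy as np

def count_saws(n_max):
    """Exact enumeration of self-avoiding walks on the square lattice, lengths 1..n_max."""
    counts = np.zeros(n_max + 1, dtype=np.int64)
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    def extend(path, visited, n):
        if n == n_max:
            return
        x, y = path[-1]
        for dx, dy in steps:
            p = (x + dx, y + dy)
            if p not in visited:
                counts[n + 1] += 1
                visited.add(p)
                path.append(p)
                extend(path, visited, n + 1)
                path.pop()
                visited.remove(p)
    extend([(0, 0)], {(0, 0)}, 0)
    return counts[1:]

n_max = 10
c = count_saws(n_max)
n = np.arange(1, n_max + 1)
X = np.column_stack([n, np.log2(n), np.ones(n_max)])
coef, *_ = np.linalg.lstsq(X, np.log2(c), rcond=None)
print("log2 z =", coef[0], "  gamma =", coef[1], "  log2 A =", coef[2])
```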

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process $x(t)$ distributed according to $P[x(t)]$ as

$$
S_{\rm cont} = -\int dx(t)\; P[x(t)]\, \log_2 P[x(t)], \tag{5.1}
$$

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution $P[x(t)]$ and some reference distribution $Q[x(t)]$,

$$
S_{\rm rel} = -\int dx(t)\; P[x(t)]\, \log_2\!\left(\frac{P[x(t)]}{Q[x(t)]}\right), \tag{5.2}
$$

16 Throughout this discussion, we assume that the signal $x$ at one point in time is finite dimensional. There are subtleties if we allow $x$ to represent the configuration of a spatially infinite system.

which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution $Q[x(t)]$, and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution $Q[x(t)]$.

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about the dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).

In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ≈ A N^ν and one with K parameters so that I_pred(N) ≈ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c ≈ [ (K / (2 A ν)) log_2 (K / (2A)) ]^{1/ν}.   (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
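A back-of-the-envelope sketch (ours, not from the original text) of the crossover in equation 6.1: for assumed values of K, A, and ν, compare the two growth laws and locate the crossover numerically. The numbers are placeholders, and since equation 6.1 is only an asymptotic estimate, the analytic and numerical values need only agree roughly.

```python
import numpy as np

# Two growth laws for the predictive information: a K-parameter model,
# I_pred ~ (K/2) log2 N, versus a power law I_pred ~ A * N**nu.
K, A, nu = 100.0, 1.0, 0.5          # placeholder values
Nc_analytic = (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1.0 / nu)   # equation 6.1

# Numerical crossover: smallest N on a coarse grid where the power law overtakes the log.
N = np.unique(np.logspace(1, 8, 4000).astype(np.int64))
nonparam, param = A * N**nu, 0.5 * K * np.log2(N)
crossover = int(N[np.argmax(nonparam > param)])
print(f"analytic N_c ~ {Nc_analytic:.3g}, numerical crossover ~ {crossover}")
```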

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ≈ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps ten years of waking life, during which we master the art of object recognition. This limits us to at most N ≈ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy ℓ(N) (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
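A minimal sketch (our illustration, not Shannon's original protocol or data) of how the guessing-game bound is computed: given recorded guess counts for each letter at context length N, the entropy of the guess-count distribution bounds the conditional entropy ℓ(N). The guess-count records below are hypothetical placeholders.

```python
import numpy as np
from collections import Counter

def guessing_game_bound(guess_counts):
    """Shannon's guessing-game bound: the entropy (bits) of the distribution of
    'number of guesses needed for the next letter' upper-bounds the conditional
    entropy ell(N) of the text at that context length."""
    total = len(guess_counts)
    probs = np.array([c / total for c in Counter(guess_counts).values()])
    return -np.sum(probs * np.log2(probs))

# Hypothetical records: how many guesses a reader needed, given N letters of context.
records = {1: [1, 2, 1, 3, 1, 5, 2, 1, 4, 1],
           8: [1, 1, 2, 1, 1, 3, 1, 1, 2, 1],
           32: [1, 1, 1, 2, 1, 1, 1, 1, 1, 2]}
for N, guesses in records.items():
    print(f"context N={N:3d}: ell(N) <= {guessing_game_bound(guesses):.2f} bits/letter")
```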

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.
Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity". Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An Introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.


[Figure 3: Subextensive part of the entropy, S_1, as a function of the word length N. The plotted cases are fixed J; variable J, short-range interactions; and variable J's, long-range decaying interactions, together with fits of the forms S_1 = const., S_1 = const_1 + (1/2) log N, and S_1 = const_1 + const_2 N^{0.5}.]

... distances. So intuitively, the qualitatively different behaviors of S_1(N) in the three plotted cases are correlated with differences in the problem of learning the underlying dynamics of the spin chains from observations on samples of the spins themselves. Much of this article can be seen as expanding on and quantifying this intuitive observation.

3 Fundamentals

The problem of prediction comes in various forms, as noted in the introduction. Information theory allows us to treat the different notions of prediction on the same footing. The first step is to recognize that all predictions are probabilistic. Even if we can predict the temperature at noon tomorrow, we should provide error bars or confidence limits on our prediction. The next step is to remember that even before we look at the data, we know that certain futures are more likely than others, and we can summarize this knowledge by a prior probability distribution for the future. Our observations on the past lead us to a new, more tightly concentrated distribution, the distribution of futures conditional on the past data. Different kinds of predictions are different slices through or averages over this conditional distribution, but information theory quantifies the "concentration" of the distribution without making any commitment as to which averages will be most interesting.

Imagine that we observe a stream of data x(t) over a time interval −T < t < 0. Let all of these past data be denoted by the shorthand x_past. We are interested in saying something about the future, so we want to know about the data x(t) that will be observed in the time interval 0 < t < T'; let these future data be called x_future. In the absence of any other knowledge, futures are drawn from the probability distribution P(x_future), and observations of particular past data x_past tell us that futures will be drawn from the conditional distribution P(x_future | x_past). The greater concentration of the conditional distribution can be quantified by the fact that it has smaller entropy than the prior distribution, and this reduction in entropy is Shannon's definition of the information that the past provides about the future. We can write the average of this predictive information as

I_pred(T, T') = ⟨ log_2 [ P(x_future | x_past) / P(x_future) ] ⟩   (3.1)
             = −⟨log_2 P(x_future)⟩ − ⟨log_2 P(x_past)⟩ − [ −⟨log_2 P(x_future, x_past)⟩ ],   (3.2)

where ⟨···⟩ denotes an average over the joint distribution of the past and the future, P(x_future, x_past).

Each of the terms in equation 3.2 is an entropy. Since we are interested in predictability or generalization, which are associated with some features of the signal persisting forever, we may assume stationarity or invariance under time translations. Then the entropy of the past data depends only on the duration of our observations, so we can write −⟨log_2 P(x_past)⟩ = S(T), and by the same argument −⟨log_2 P(x_future)⟩ = S(T'). Finally, the entropy of the past and the future taken together is the entropy of observations on a window of duration T + T', so that −⟨log_2 P(x_future, x_past)⟩ = S(T + T'). Putting these equations together, we obtain

I_pred(T, T') = S(T) + S(T') − S(T + T').   (3.3)
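As an illustration of how equation 3.3 can be used in practice (our sketch, not code from the paper), one can estimate block entropies S(T) of a discretized time series by plug-in counting of words of length T and combine them into the predictive information. For long blocks the plug-in estimate is biased, so this only shows the bookkeeping.

```python
import numpy as np
from collections import Counter

def block_entropy(seq, T):
    """Plug-in estimate of S(T): entropy (bits) of words of length T in the sequence."""
    words = [tuple(seq[i:i + T]) for i in range(len(seq) - T + 1)]
    counts = np.array(list(Counter(words).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def predictive_information(seq, T):
    """I_pred(T, T) = S(T) + S(T) - S(2T), cf. equation 3.3 with T' = T."""
    return 2 * block_entropy(seq, T) - block_entropy(seq, 2 * T)

# Example: a binary Markov chain with persistence (placeholder transition probabilities).
rng = np.random.default_rng(0)
x = [0]
for _ in range(200_000):
    stay = 0.9 if x[-1] == 0 else 0.8
    x.append(x[-1] if rng.random() < stay else 1 - x[-1])
for T in (1, 2, 4):
    print(f"T={T}: I_pred ~ {predictive_information(x, T):.3f} bits")
```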

It is important to recall that mutual information is a symmetric quantity. Thus we can view I_pred(T, T') as either the information that a data segment of duration T provides about the future of length T' or the information that a data segment of duration T' provides about the immediate past of duration T. This is a direct application of the definition of information but seems counterintuitive. Shouldn't it be more difficult to predict than to postdict? One can perhaps recover the correct intuition by thinking about a large ensemble of question-answer pairs. Prediction corresponds to generating the answer to a given question, while postdiction corresponds to generating the question that goes with a given answer. We know that guessing questions given answers is also a hard problem,3 and can be just as challenging as the more conventional problem of answering the questions themselves. Our focus here is on prediction because we want to make connections with the phenomenon of generalization in learning, but it is clear that generating a consistent interpretation of observed data may involve elements of both prediction and postdiction (see, for example, Eagleman & Sejnowski, 2000); it is attractive that the information-theoretic formulation treats these problems symmetrically.

3 This is the basis of the popular American television game show Jeopardy, widely viewed as the most "intellectual" of its genre.

In the same way that the entropy of a gas at fixed density is proportional to the volume, the entropy of a time series (asymptotically) is proportional to its duration, so that lim_{T→∞} S(T)/T = S_0; entropy is an extensive quantity. But from equation 3.3, any extensive component of the entropy cancels in the computation of the predictive information: predictability is a deviation from extensivity. If we write S(T) = S_0 T + S_1(T), then equation 3.3 tells us that the predictive information is related only to the nonextensive term S_1(T). Note that if we are observing a deterministic system, then S_0 = 0, but this is independent of questions about the structure of the subextensive term S_1(T). It is attractive that information theory gives us a unified discussion of prediction in deterministic and probabilistic cases.

We know two general facts about the behavior of S_1(T). First, the corrections to extensive behavior are positive, S_1(T) ≥ 0. Second, the statement that entropy is extensive is the statement that the limit

lim_{T→∞} S(T)/T = S_0   (3.4)

exists, and for this to be true we must also have

lim_{T→∞} S_1(T)/T = 0.   (3.5)

Thus the nonextensive terms in the entropy must be subextensive, that is, they must grow with T less rapidly than a linear function. Taken together, these facts guarantee that the predictive information is positive and subextensive. Furthermore, if we let the future extend forward for a very long time, T' → ∞, then we can measure the information that our sample provides about the entire future,

I_pred(T) = lim_{T'→∞} I_pred(T, T') = S_1(T).   (3.6)

Similarly, instead of increasing the duration of the future to infinity, we could have considered the mutual information between a sample of length T and all of the infinite past. Then the postdictive information also is equal to S_1(T), and the symmetry between prediction and postdiction is even more profound: not only is there symmetry between questions and answers, but observations on a given period of time provide the same amount of information about the historical path that led to our observations as about the future that will unfold from them. In some cases this statement becomes even stronger. For example, if the subextensive entropy of a long discontinuous observation of a total length T with a gap of a duration δT ≪ T is equal to S_1(T) + O(δT/T), then the subextensive entropy of the present is not only its information about the past or the future but also the information about the past and the future.

If we have been observing a time series for a (long) time T, then the total amount of data we have collected is measured by the entropy S(T), and at large T this is given approximately by S_0 T. But the predictive information that we have gathered cannot grow linearly with time, even if we are making predictions about a future that stretches out to infinity. As a result, of the total information we have taken in by observing x_past, only a vanishing fraction is of relevance to the prediction:

lim_{T→∞} (Predictive information / Total information) = I_pred(T)/S(T) → 0.   (3.7)

In this precise sense, most of what we observe is irrelevant to the problem of predicting the future.4

Consider the case where time is measured in discrete steps, so that we have seen N time points x_1, x_2, ..., x_N. How much have we learned about the underlying pattern in these data? The more we know, the more effectively we can predict the next data point x_{N+1}, and hence the fewer bits we will need to describe the deviation of this data point from our prediction. Our accumulated knowledge about the time series is measured by the degree to which we can compress the description of new observations. On average, the length of the code word required to describe the point x_{N+1}, given that we have seen the previous N points, is given by

ℓ(N) = −⟨log_2 P(x_{N+1} | x_1, x_2, ..., x_N)⟩ bits,   (3.8)

where the expectation value is taken over the joint distribution of all the N + 1 points, P(x_1, x_2, ..., x_N, x_{N+1}). It is easy to see that

ℓ(N) = S(N + 1) − S(N) ≈ ∂S(N)/∂N.   (3.9)

As we observe for longer times, we learn more, and this word length decreases. It is natural to define a learning curve that measures this improvement.

4 We can think of equation 3.7 as a law of diminishing returns. Although we collect data in proportion to our observation time T, a smaller and smaller fraction of this information is useful in the problem of prediction. These diminishing returns are not due to a limited lifetime, since we calculate the predictive information assuming that we have a future extending forward to infinity. A senior colleague points out that this is an argument for changing fields before becoming too expert.

Usually we define learning curves by measuring the frequency or costs of errors; here the cost is that our encoding of the point x_{N+1} is longer than it could be if we had perfect knowledge. This ideal encoding has a length that we can find by imagining that we observe the time series for an infinitely long time, ℓ_ideal = lim_{N→∞} ℓ(N), but this is just another way of defining the extensive component of the entropy S_0. Thus we can define a learning curve

Λ(N) ≡ ℓ(N) − ℓ_ideal   (3.10)
     = S(N + 1) − S(N) − S_0
     = S_1(N + 1) − S_1(N)
     ≈ ∂S_1(N)/∂N = ∂I_pred(N)/∂N,   (3.11)

and we see once again that the extensive component of the entropy cancels.

It is well known that the problems of prediction and compression are related, and what we have done here is to illustrate one aspect of this connection. Specifically, if we ask how much one segment of a time series can tell us about the future, the answer is contained in the subextensive behavior of the entropy. If we ask how much we are learning about the structure of the time series, then the natural and universally defined learning curve is related again to the subextensive entropy: the learning curve is the derivative of the predictive information.

This universal learning curve is connected to the more conventional learning curves in specific contexts. As an example (cf. section 4.1), consider fitting a set of data points {x_n, y_n} with some class of functions y = f(x; α), where α are unknown parameters that need to be learned; we also allow for some gaussian noise in our observation of the y_n. Here the natural learning curve is the evolution of χ² for generalization as a function of the number of examples. Within the approximations discussed below, it is straightforward to show that as N becomes large,

⟨χ²(N)⟩ = (1/σ²) ⟨[y − f(x; α)]²⟩ → (2 ln 2) Λ(N) + 1,   (3.12)

where σ² is the variance of the noise. Thus a more conventional measure of performance at learning a function is equal to the universal learning curve defined purely by information-theoretic criteria. In other words, if a learning curve is measured in the right units, then its integral represents the amount of the useful information accumulated. Then the subextensivity of S_1 guarantees that the learning curve decreases to zero as N → ∞.

Different quantities related to the subextensive entropy have been discussed in several contexts. For example, the code length ℓ(N) has been defined as a learning curve in the specific case of neural networks (Opper & Haussler, 1995) and has been termed the "thermodynamic dive" (Crutchfield & Shalizi, 1999) and "Nth order block entropy" (Grassberger, 1986).

The universal learning curve Λ(N) has been studied as the expected instantaneous information gain by Haussler, Kearns, and Schapire (1994). Mutual information between all of the past and all of the future (both semi-infinite) is known also as the excess entropy, effective measure complexity, stored information, and so on (see Shalizi & Crutchfield, 1999, and references therein, as well as the discussion below). If the data allow a description by a model with a finite (and in some cases also infinite) number of parameters, then mutual information between the data and the parameters is of interest. This is easily shown to be equal to the predictive information about all of the future, and it is also the cumulative information gain (Haussler, Kearns, & Schapire, 1994) or the cumulative relative entropy risk (Haussler & Opper, 1997). Investigation of this problem can be traced back to Renyi (1964) and Ibragimov and Hasminskii (1972), and some particular topics are still being discussed (Haussler & Opper, 1995; Opper & Haussler, 1995; Herschkowitz & Nadal, 1999). In certain limits, decoding a signal from a population of N neurons can be thought of as "learning" a parameter from N observations, with a parallel notion of information transmission (Brunel & Nadal, 1998; Kang & Sompolinsky, 2001). In addition, the subextensive component of the description length (Rissanen, 1978, 1989, 1996; Clarke & Barron, 1990), averaged over a class of allowed models, also is similar to the predictive information. What is important is that the predictive information or subextensive entropy is related to all these quantities and that it can be defined for any process without a reference to a class of models. It is this universality that we find appealing, and this universality is strongest if we focus on the limit of long observation times. Qualitatively, in this regime (T → ∞) we expect the predictive information to behave in one of three different ways, as illustrated by the Ising models above: it may either stay finite or grow to infinity together with T; in the latter case, the rate of growth may be slow (logarithmic) or fast (sublinear power) (see Barron and Cover, 1991, for a similar classification in the framework of the minimal description length [MDL] analysis).

The first possibility, lim_{T→∞} I_pred(T) = constant, means that no matter how long we observe, we gain only a finite amount of information about the future. This situation prevails, for example, when the dynamics are too regular. For a purely periodic system, complete prediction is possible once we know the phase, and if we sample the data at discrete times, this is a finite amount of information; longer period orbits intuitively are more complex and also have larger I_pred, but this does not change the limiting behavior lim_{T→∞} I_pred(T) = constant.

Alternatively, the predictive information can be small when the dynamics are irregular, but the best predictions are controlled only by the immediate past, so that the correlation times of the observable data are finite (see, for example, Crutchfield & Feldman, 1997, and the fixed J case in Figure 3). Imagine, for example, that we observe x(t) at a series of discrete times {t_n}, and that at each time point we find the value x_n. Then we always can write the joint distribution of the N data points as a product,

P(x_1, x_2, ..., x_N) = P(x_1) P(x_2 | x_1) P(x_3 | x_2, x_1) ···.   (3.13)

For Markov processes, what we observe at t_n depends only on events at the previous time step t_{n−1}, so that

P(x_n | {x_{1≤i≤n−1}}) = P(x_n | x_{n−1}),   (3.14)

and hence the predictive information reduces to

I_pred = ⟨ log_2 [ P(x_n | x_{n−1}) / P(x_n) ] ⟩.   (3.15)

The maximum possible predictive information in this case is the entropy of the distribution of states at one time step, which is bounded by the logarithm of the number of accessible states. To approach this bound, the system must maintain memory for a long time, since the predictive information is reduced by the entropy of the transition probabilities. Thus systems with more states and longer memories have larger values of I_pred.
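For a concrete check of equation 3.15 (our example, not from the paper), take a two-state Markov chain and evaluate I_pred exactly from the transition matrix and its stationary distribution; the result stays finite no matter how long the chain is observed.

```python
import numpy as np

def markov_ipred(P):
    """Exact I_pred of equation 3.15 for a Markov chain with transition matrix P:
    mutual information between successive states under the stationary distribution."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    pi /= pi.sum()                                   # stationary distribution
    joint = pi[:, None] * P                          # P(x_{n-1}, x_n)
    marg = joint.sum(axis=0)                         # P(x_n) = pi by stationarity
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (pi[:, None] * marg[None, :])[nz]))

# Two-state chain with strong persistence: I_pred approaches, but stays below,
# the one-step state entropy (at most 1 bit for two states).
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])
print(f"I_pred = {markov_ipred(P):.3f} bits")
```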

More interesting are those cases in which I_pred(T) diverges at large T. In physical systems, we know that there are critical points where correlation times become infinite, so that optimal predictions will be influenced by events in the arbitrarily distant past. Under these conditions, the predictive information can grow without bound as T becomes large; for many systems the divergence is logarithmic, I_pred(T → ∞) ∝ ln T, as for the variable J_{ij}, short-range Ising model of Figures 2 and 3. Long-range correlations also are important in a time series where we can learn some underlying rules. It will turn out that when the set of possible rules can be described by a finite number of parameters, the predictive information again diverges logarithmically, and the coefficient of this divergence counts the number of parameters. Finally, a faster growth is also possible, so that I_pred(T → ∞) ∝ T^a, as for the variable J_{ij}, long-range Ising model, and we shall see that this behavior emerges from, for example, nonparametric learning problems.

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations or associations persist over long periods of time. In the usual theoretical models, there is some rule underlying the observable data, and this rule is valid forever; examples seen at one time inform us about the rule, and this information can be used to make predictions or generalizations. The predictive information quantifies the average generalization power of examples, and we shall see that there is a direct connection between the predictive information and the complexity of the possible underlying rules.

4.1 A Test Case. Let us begin with a simple example, already referred to above. We observe two streams of data x and y, or equivalently a stream of pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N). Assume that we know in advance that the x's are drawn independently and at random from a distribution P(x), while the y's are noisy versions of some function acting on x,

y_n = f(x_n; α) + η_n,   (4.1)

where f(x; α) is a class of functions parameterized by α, and η_n is noise, which for simplicity we will assume is gaussian with known standard deviation σ. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

f(x; α) = Σ_{μ=1}^{K} α_μ φ_μ(x).   (4.2)

The usual problem is to estimate, from N pairs {x_i, y_i}, the values of the parameters α. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy S(N). We go through the calculation in some detail because it provides a model for the more general case.

To evaluate the entropy S(N), we first construct the probability distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). The same set of rules applies to the whole data stream, which here means that the same parameters α apply for all pairs {x_i, y_i}, but these parameters are chosen at random from a distribution P(α) at the start of the stream. Thus we write

P(x_1, y_1, x_2, y_2, ..., x_N, y_N) = ∫ d^K α P(x_1, y_1, x_2, y_2, ..., x_N, y_N | α) P(α),   (4.3)

and now we need to construct the conditional distributions for fixed α. By hypothesis, each x is chosen independently, and once we fix α, each y_i is correlated only with the corresponding x_i, so that we have

P(x_1, y_1, x_2, y_2, ..., x_N, y_N | α) = Π_{i=1}^{N} [ P(x_i) P(y_i | x_i; α) ].   (4.4)

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of y_i has the form

P(y_i | x_i; α) = (1/√(2πσ²)) exp[ −(1/(2σ²)) ( y_i − Σ_{μ=1}^{K} α_μ φ_μ(x_i) )² ].   (4.5)

Putting all these factors together,

P(x_1, y_1, x_2, y_2, ..., x_N, y_N)
  = [ Π_{i=1}^{N} P(x_i) ] (1/√(2πσ²))^N ∫ d^K α P(α) exp[ −(1/(2σ²)) Σ_{i=1}^{N} y_i² ]
    × exp[ −(N/2) Σ_{μ,ν=1}^{K} A_{μν}({x_i}) α_μ α_ν + N Σ_{μ=1}^{K} B_μ({x_i, y_i}) α_μ ],   (4.6)

where

A_{μν}({x_i}) = (1/(σ²N)) Σ_{i=1}^{N} φ_μ(x_i) φ_ν(x_i), and   (4.7)

B_μ({x_i, y_i}) = (1/(σ²N)) Σ_{i=1}^{N} y_i φ_μ(x_i).   (4.8)

Our placement of the factors of N means that both A_{μν} and B_μ are of order unity as N → ∞. These quantities are empirical averages over the samples {x_i, y_i}, and if the φ_μ are well behaved, we expect that these empirical means converge to expectation values for most realizations of the series {x_i}:

lim_{N→∞} A_{μν}({x_i}) = A^∞_{μν} = (1/σ²) ∫ dx P(x) φ_μ(x) φ_ν(x),   (4.9)

lim_{N→∞} B_μ({x_i, y_i}) = B^∞_μ = Σ_{ν=1}^{K} A^∞_{μν} ᾱ_ν,   (4.10)

where ᾱ are the parameters that actually gave rise to the data stream {x_i, y_i}. In fact, we can make the same argument about the terms in Σ y_i²,

lim_{N→∞} Σ_{i=1}^{N} y_i² = N σ² [ Σ_{μ,ν=1}^{K} ᾱ_μ A^∞_{μν} ᾱ_ν + 1 ].   (4.11)

Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for, and implications of, this convergence breaking down.
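A small numerical sketch (ours, with arbitrary basis functions and parameters) of the empirical matrices in equations 4.7 and 4.8 converging to the expectation values in equations 4.9 and 4.10 as N grows; since B^∞ = A^∞ ᾱ, solving the linear system at large N recovers the generating parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
phis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]   # assumed basis functions
alpha_true = np.array([0.2, -1.0, 0.7])                           # parameters generating the data

def empirical_AB(N):
    """A_{mu nu} and B_mu of equations 4.7-4.8 from N samples with x ~ Uniform(-1, 1)."""
    x = rng.uniform(-1, 1, N)
    Phi = np.column_stack([phi(x) for phi in phis])                # N x K design matrix
    y = Phi @ alpha_true + sigma * rng.standard_normal(N)
    return Phi.T @ Phi / (sigma**2 * N), Phi.T @ y / (sigma**2 * N)

A_small, _ = empirical_AB(100)
A_large, B_large = empirical_AB(1_000_000)
print("||A(100) - A(10^6)|| =", np.linalg.norm(A_small - A_large).round(3))
print("alpha recovered at large N:", np.linalg.solve(A_large, B_large).round(3))
```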

Putting the different factors together, we obtain

P(x_1, y_1, x_2, y_2, ..., x_N, y_N) ≃ [ Π_{i=1}^{N} P(x_i) ] (1/√(2πσ²))^N ∫ d^K α P(α) exp[ −N E_N(α; {x_i, y_i}) ],   (4.12)

where the effective "energy" per sample is given by

E_N(α; {x_i, y_i}) = 1/2 + (1/2) Σ_{μ,ν=1}^{K} (α_μ − ᾱ_μ) A^∞_{μν} (α_ν − ᾱ_ν).   (4.13)

Here we use the symbol ≃ to indicate that we not only take the limit of large N but also neglect the fluctuations. Note that in this approximation, the dependence on the sample points themselves is hidden in the definition of ᾱ as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor N in the exponent; the energy E_N is of order unity as N → ∞. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

∫ d^K α P(α) exp[ −N E_N(α; {x_i, y_i}) ]
  ≈ P(α_cl) exp[ −N E_N(α_cl; {x_i, y_i}) − (K/2) ln(N/2π) − (1/2) ln det F_N + ··· ],   (4.14)

where α_cl is the "classical" value of α determined by the extremal conditions

∂E_N(α; {x_i, y_i}) / ∂α_μ |_{α=α_cl} = 0,   (4.15)

the matrix F_N consists of the second derivatives of E_N,

(F_N)_{μν} = ∂²E_N(α; {x_i, y_i}) / ∂α_μ ∂α_ν |_{α=α_cl},   (4.16)

and cent cent cent denotes terms that vanish as N 1 If we formulate the problemof estimating the parameters reg from the samples fxi yig then as N 1the matrix NFN is the Fisher information matrix (Cover amp Thomas 1991)the eigenvectors of this matrix give the principal axes for the error ellipsoidin parameter space and the (inverse) eigenvalues give the variances ofparameter estimates along each of these directions The classical regcl differsfrom Nreg only in terms of order 1N we neglect this difference and furthersimplify the calculation of leading terms as N becomes large After a littlemore algebra we nd the probability distribution we have been lookingfor

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)
  \simeq \left[ \prod_{i=1}^{N} P(x_i) \right] \frac{1}{Z_A}\, P(\bar\alpha)\, \exp\left[ -\frac{N}{2} \ln(2\pi e \sigma^2) - \frac{K}{2} \ln N + \cdots \right],   (4.17)


where the normalization constant

Z_A = \sqrt{(2\pi)^K \det A^\infty}.   (4.18)

Again we note that the sample points {x_i, y_i} are hidden in the value of ᾱ that gave rise to these points.⁵

To evaluate the entropy S(N), we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the x_i, and because these are chosen independently from the distribution P(x), the average again is easy to evaluate. The third term involves ᾱ, and we need to average this over the joint distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). As above, we can evaluate this average in steps: first, we choose a value of the parameters ᾱ; then we average over the samples given these parameters; and finally we average over parameters. But because ᾱ is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

S(N) = N \left[ S_x + \frac{1}{2} \log_2 \left( 2\pi e \sigma^2 \right) \right] + \frac{K}{2} \log_2 N + S_\alpha + \left\langle \log_2 Z_A \right\rangle_\alpha + \cdots,   (4.19)

where ⟨···⟩_α means averaging over parameters, S_x is the entropy of the distribution of x,

S_x = -\int dx\, P(x) \log_2 P(x),   (4.20)

and similarly for the entropy of the distribution of parameters,

S_\alpha = -\int d^K\alpha\, P(\alpha) \log_2 P(\alpha).   (4.21)

⁵ We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points {x_i, y_i}. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as N → ∞, so that this approach should yield the asymptotic behavior of the distribution and, hence, the subextensive entropy at large N. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as N → ∞, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.


The different terms in the entropy equation 4.19 have a straightforward interpretation. First, we see that the extensive term in the entropy,

S_0 = S_x + \frac{1}{2} \log_2 (2\pi e \sigma^2),   (4.22)

reflects contributions from the random choice of x and from the gaussian noise in y; these extensive terms are independent of the variations in parameters α, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, S_α. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term ⟨log₂ Z_A⟩_α compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

S_1(N) \rightarrow \frac{K}{2} \log_2 N \quad \text{(bits)}.   (4.23)

The coefficient of this divergence counts the number of parameters independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions {φ_μ(x)}. This is a hint that the result in equation 4.23 is more universal than our simple example.
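The logarithmic growth in equation 4.23 can be verified exactly in the simplest case of a single unknown location parameter with gaussian noise of variance σ² and a gaussian prior of variance τ² on the parameter; these distributional choices are our own illustrative assumptions, made so that the joint distribution of the data is gaussian and S(N) has a closed form. The sketch below isolates the subextensive part and compares it with (K/2) log₂ N for K = 1.

import numpy as np

sigma2, tau2 = 1.0, 4.0     # assumed noise variance and prior variance of the one parameter

def S_total(N):
    """Exact entropy (bits) of N samples x_i = alpha + noise with alpha ~ N(0, tau2):
    the covariance is sigma2*I + tau2*ones, so det C = sigma2**(N-1) * (sigma2 + N*tau2)."""
    logdet = (N - 1) * np.log(sigma2) + np.log(sigma2 + N * tau2)
    return 0.5 * (N * np.log(2 * np.pi * np.e) + logdet) / np.log(2)

S0 = 0.5 * np.log2(2 * np.pi * np.e * sigma2)     # extensive entropy per sample

for N in [10, 100, 1000, 10000]:
    S1 = S_total(N) - N * S0                      # subextensive part
    print(N, S1, 0.5 * np.log2(N))
# S1 approaches (1/2) log2 N plus a constant, i.e. (K/2) log2 N with K = 1.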

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning. We are given examples of how the points x_n map into y_n, and from these examples we are to induce the association or functional relation between x and y. An alternative view is that the pair of points (x, y) should be viewed as a vector \vec{x}, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables {\vec{x}_i} is drawn independently from the same probability distribution Q(\vec{x}|α), and this distribution depends on a (potentially infinite-dimensional) vector of parameters α. As above, the parameters are unknown, and before the series starts, they are chosen randomly from a distribution P(α). With no constraints on the densities P(α) or Q(\vec{x}|α), it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters α, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997; and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).


We begin with the definition of entropy:

S(N) \equiv S[\{\vec{x}_i\}]
  = -\int d\vec{x}_1 \cdots d\vec{x}_N\; P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \log_2 P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N).   (4.24)

By analogy with equation 4.3, we then write

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) = \int d^K\alpha\, P(\alpha) \prod_{i=1}^{N} Q(\vec{x}_i \,|\, \alpha).   (4.25)

Next, combining the equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite S(N) as

S(N) = -\int d^K\bar\alpha\, P(\bar\alpha)
  \times \left\{ \int d\vec{x}_1 \cdots d\vec{x}_N \left[ \prod_{j=1}^{N} Q(\vec{x}_j \,|\, \bar\alpha) \right] \log_2 P(\{\vec{x}_i\}) \right\}.   (4.26)

Equation 4.26 allows an easy interpretation. There is the "true" set of parameters ᾱ that gave rise to the data sequence \vec{x}_1, ..., \vec{x}_N with the probability \prod_{j=1}^{N} Q(\vec{x}_j|ᾱ). We need to average log₂ P(\vec{x}_1, ..., \vec{x}_N) first over all possible realizations of the data, keeping the true parameters fixed, and then over the parameters ᾱ themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

P(\vec{x}_1, \ldots, \vec{x}_N) = \prod_{j=1}^{N} Q(\vec{x}_j \,|\, \bar\alpha) \int d^K\alpha\, P(\alpha) \prod_{i=1}^{N} \left[ \frac{Q(\vec{x}_i \,|\, \alpha)}{Q(\vec{x}_i \,|\, \bar\alpha)} \right]
  = \prod_{j=1}^{N} Q(\vec{x}_j \,|\, \bar\alpha) \int d^K\alpha\, P(\alpha)\, \exp\left[ -N E_N(\alpha; \{\vec{x}_i\}) \right],   (4.27)

E_N(\alpha; \{\vec{x}_i\}) = -\frac{1}{N} \sum_{i=1}^{N} \ln \left[ \frac{Q(\vec{x}_i \,|\, \alpha)}{Q(\vec{x}_i \,|\, \bar\alpha)} \right].   (4.28)

Since, by our interpretation, ᾱ are the true parameters that gave rise to the particular data {\vec{x}_i}, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

E_N(\alpha; \{\vec{x}_i\}) = -\int d^D x\, Q(\vec{x} \,|\, \bar\alpha) \ln \left[ \frac{Q(\vec{x} \,|\, \alpha)}{Q(\vec{x} \,|\, \bar\alpha)} \right] - \psi(\alpha, \bar\alpha; \{\vec{x}_i\}),   (4.29)

where ψ → 0 as N → ∞; here we neglect ψ and return to this term below.
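The convergence of the empirical "energy" of equation 4.28 to a KL divergence plus a vanishing correction ψ is easy to see numerically; the one-dimensional gaussian location family below is an assumed example chosen because the divergence has a closed form.

import numpy as np

rng = np.random.default_rng(1)
alpha_bar, alpha = 0.0, 0.7          # assumed true and trial values of a single parameter

def log_q(x, a):                     # log Q(x|alpha) for a unit-variance gaussian location family
    return -0.5 * (x - a) ** 2 - 0.5 * np.log(2 * np.pi)

D_exact = 0.5 * (alpha_bar - alpha) ** 2     # closed-form KL divergence for this family

for N in [10, 100, 1000, 10000, 100000]:
    x = rng.normal(alpha_bar, 1.0, size=N)   # samples from Q(x|alpha_bar)
    E_N = -np.mean(log_q(x, alpha) - log_q(x, alpha_bar))   # equation 4.28
    print(N, E_N, D_exact)
# The difference psi = D_exact - E_N shrinks roughly like 1/sqrt(N).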


The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, D_KL(ᾱ‖α), between the true distribution characterized by parameters ᾱ and the possible distribution characterized by α. Thus, at large N, we have

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N)
  \simeq \prod_{j=1}^{N} Q(\vec{x}_j \,|\, \bar\alpha) \int d^K\alpha\, P(\alpha)\, \exp\left[ -N D_{\rm KL}(\bar\alpha \| \alpha) \right],   (4.30)

where again the notation ≃ reminds us that we are not only taking the limit of large N but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find

S(N) \approx S_0 \cdot N + S^{(a)}_1(N),   (4.31)

S_0 = \int d^K\alpha\, P(\alpha) \left[ -\int d^D x\, Q(\vec{x}|\alpha) \log_2 Q(\vec{x}|\alpha) \right], \quad\text{and}   (4.32)

S^{(a)}_1(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \log_2 \left[ \int d^K\alpha\, P(\alpha)\, e^{-N D_{\rm KL}(\bar\alpha \| \alpha)} \right].   (4.33)

Here S^{(a)}_1 is an approximation to S_1 that neglects fluctuations ψ. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence \vec{x}_1, ..., \vec{x}_N with the disorder, E_N(α; {\vec{x}_i}) with the energy of the quenched system, and D_KL(ᾱ‖α) with its annealed analog.

The extensive term S_0, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the N dependence of the partition function

Z(\bar\alpha; N) = \int d^K\alpha\, P(\alpha)\, \exp\left[ -N D_{\rm KL}(\bar\alpha \| \alpha) \right],   (4.34)

and S^{(a)}_1(N) = -⟨log₂ Z(ᾱ; N)⟩_ᾱ is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are at KL divergence D away from the target ᾱ,

\rho(D; \bar\alpha) = \int d^K\alpha\, P(\alpha)\, \delta\left[ D - D_{\rm KL}(\bar\alpha \| \alpha) \right].   (4.35)


This density could be very different for different targets.⁶ The density of divergences is normalized because the original distribution over parameter space, P(α), is normalized,

\int dD\, \rho(D; \bar\alpha) = \int d^K\alpha\, P(\alpha) = 1.   (4.36)

Finally, the partition function takes the simple form

Z(\bar\alpha; N) = \int dD\, \rho(D; \bar\alpha)\, \exp[-N D].   (4.37)

We recall that in statistical mechanics, the partition function is given by

Z(\beta) = \int dE\, \rho(E)\, \exp[-\beta E],   (4.38)

where ρ(E) is the density of states that have energy E and β is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model and temperature is inverse to the number of examples. As we increase the length N of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of ρ(D; ᾱ) as D → 0.⁷
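The density of divergences is easy to construct numerically for a simple family. The sketch below assumes a two-parameter model in which Q(\vec{x}|α) is a unit-covariance gaussian with mean α, so that D_KL(ᾱ‖α) = |α − ᾱ|²/2, and a standard normal prior P(α); these choices are ours, made only for concreteness. Near D → 0 the histogram of divergences flattens, consistent with the D^{(K−2)/2} behavior derived shortly (equation 4.42 with K = 2), and the partition function of equation 4.37 can be estimated as a prior average of exp(−ND).

import numpy as np

rng = np.random.default_rng(2)
K = 2
alpha_bar = np.array([0.3, -0.2])              # assumed target parameters

alphas = rng.normal(0.0, 1.0, size=(200000, K))        # samples from the prior P(alpha)
D = 0.5 * np.sum((alphas - alpha_bar) ** 2, axis=1)    # KL divergences to the target

# rho(D; alpha_bar), equation 4.35, estimated as a normalized histogram.
hist, edges = np.histogram(D, bins=50, range=(0.0, 2.0), density=True)
print(hist[:5])                    # roughly constant near D -> 0, i.e. D^{(K-2)/2} with K = 2

# Partition function of equation 4.37, estimated as <exp(-N D)> over the prior.
for N in [10, 100, 1000]:
    Z = np.mean(np.exp(-N * D))
    print(N, -np.log2(Z))          # grows roughly like (K/2) log2 N, anticipating equation 4.45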

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small D, and the density term, which favors values of D that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with D → 0. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states ρ(D; ᾱ) at small D is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

D_{\rm KL}(\bar\alpha \| \alpha) \approx \frac{1}{2} \sum_{\mu\nu} (\bar\alpha_\mu - \alpha_\mu)\, \mathcal{F}_{\mu\nu}\, (\bar\alpha_\nu - \alpha_\nu) + \cdots,   (4.39)

⁶ If parameter space is compact, then a related description of the space of targets, based on metric entropy, also called Kolmogorov's ε-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of size not greater than ε into which the target space can be partitioned.

⁷ It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.


where N F is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low D limit of the density ρ(D; ᾱ):

\rho(D; \bar\alpha) = \int d^K\alpha\, P(\alpha)\, \delta[D - D_{\rm KL}(\bar\alpha\|\alpha)]
  \approx \int d^K\alpha\, P(\alpha)\, \delta\!\left[ D - \frac{1}{2} \sum_{\mu\nu} (\bar\alpha_\mu - \alpha_\mu)\, \mathcal{F}_{\mu\nu}\, (\bar\alpha_\nu - \alpha_\nu) \right]
  = \int d^K\xi\, P(\bar\alpha + \mathcal{U} \cdot \xi)\, \delta\!\left[ D - \frac{1}{2} \sum_{\mu} \Lambda_\mu \xi_\mu^2 \right],   (4.40)

where U is a matrix that diagonalizes F,

(\mathcal{U}^T \cdot \mathcal{F} \cdot \mathcal{U})_{\mu\nu} = \Lambda_\mu \delta_{\mu\nu}.   (4.41)

The delta function restricts the components of ξ in equation 4.40 to be of order √D or less, and so, if P(α) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

\rho(D \to 0; \bar\alpha) \approx P(\bar\alpha) \cdot \frac{(2\pi)^{K/2}}{\Gamma(K/2)}\, (\det \mathcal{F})^{-1/2}\, D^{(K-2)/2}.   (4.42)

Here, as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

Z(\bar\alpha; N \to \infty) \approx f(\bar\alpha) \cdot \frac{\Gamma(K/2)}{N^{K/2}},   (4.43)

where f(ᾱ) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

S^{(a)}_1(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \log_2 Z(\bar\alpha; N)   (4.44)
  \rightarrow \frac{K}{2} \log_2 N + \cdots \quad \text{(bits)},   (4.45)

where ··· are finite as N → ∞. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even


of the prior distribution P(α) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
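The prior independence of the leading term can also be checked numerically. The sketch below uses the same assumed two-parameter gaussian-mean family as above and estimates equation 4.44 by Monte Carlo under two different (assumed) priors; the slope with respect to log₂ N is essentially the same in both cases, with only the additive constant changing.

import numpy as np

rng = np.random.default_rng(3)
K = 2      # assumed model: K-dimensional gaussian with unknown mean, unit covariance,
           # so D_KL(abar || a) = |a - abar|^2 / 2

def S1_annealed(N, prior_sampler, n_targets=100, n_models=200000):
    """Monte Carlo estimate of equation 4.44: -<log2 Z(alpha_bar; N)>_alpha_bar."""
    vals = []
    for _ in range(n_targets):
        abar = prior_sampler(1)[0]
        a = prior_sampler(n_models)
        D = 0.5 * np.sum((a - abar) ** 2, axis=1)
        vals.append(-np.log2(np.mean(np.exp(-N * D))))
    return np.mean(vals)

gauss = lambda n: rng.normal(0.0, 1.0, size=(n, K))     # prior 1 (assumed)
box = lambda n: rng.uniform(-3.0, 3.0, size=(n, K))     # prior 2 (assumed)

for N in [100, 400, 1600]:
    print(N, S1_annealed(N, gauss), S1_annealed(N, box))
# Successive differences approach (K/2) * log2(4) = 2 bits for both priors.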

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension, d_VC), which measures capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of F is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < ∞ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small D_KL behavior of the model density,

\rho(D \to 0; \bar\alpha) \propto D^{(d-2)/2},   (4.46)

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density ρ(D; ᾱ) accumulates a macroscopic weight at some nonzero value of D, so that the small D behavior is irrelevant for the large N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_pred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N ≫ d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.
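The distinction between K and d can be made concrete with a deliberately redundant parameterization, an assumption constructed purely for illustration: below, two parameters enter the model only through their sum, so K = 2 but det F = 0, the effective phase-space dimension is d = 1, and the coefficient of log₂ N is 1/2 rather than 1.

import numpy as np

rng = np.random.default_rng(4)

# Redundant model: Q(x | a1, a2) is a unit-variance gaussian with mean (a1 + a2),
# so the KL divergence depends on the parameters only through their sum.
def S1_annealed(N, n_targets=100, n_models=200000):
    vals = []
    for _ in range(n_targets):
        abar = rng.normal(0.0, 1.0, size=2)
        a = rng.normal(0.0, 1.0, size=(n_models, 2))
        D = 0.5 * (a.sum(axis=1) - abar.sum()) ** 2
        vals.append(-np.log2(np.mean(np.exp(-N * D))))
    return np.mean(vals)

for N in [100, 400, 1600]:
    print(N, S1_annealed(N))
# Successive differences approach (d/2) * log2(4) = 1 bit, not (K/2) * log2(4) = 2 bits.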

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution Q(\vec{x}_1, ..., \vec{x}_N | α).


Again, α is an unknown parameter vector chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log₂ N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S[\{\vec{x}_i\} \,|\, \alpha] \equiv -\int d^N\vec{x}\; Q(\{\vec{x}_i\} \,|\, \alpha) \log_2 Q(\{\vec{x}_i\} \,|\, \alpha)
  \rightarrow N S_0 + S^*_0, \qquad S^*_0 = O(1);   (4.47)

D_{\rm KL}\left[ Q(\{\vec{x}_i\} \,|\, \bar\alpha)\, \| \, Q(\{\vec{x}_i\} \,|\, \alpha) \right] \rightarrow N D_{\rm KL}(\bar\alpha \| \alpha) + o(N).   (4.48)

Here the quantities S_0, S^*_0, and D_KL(ᾱ‖α) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data. All of the long-range predictability is associated with learning the parameters.⁸ The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result:

S(N) = S_0 \cdot N + S^{(a)}_1(N),   (4.49)

S^{(a)}_1(N) = \frac{K}{2} \log_2 N + \cdots \quad \text{(bits)},   (4.50)

where ··· stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

⁸ Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(\vec{x}_1, ..., \vec{x}_N | α). Suppose, for example, that S^*_0 = (K^*/2) log₂ N. Then the subextensive entropy becomes

S^{(a)}_1(N) = \frac{K + K^*}{2} \log_2 N + \cdots \quad \text{(bits)}.   (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K ≠ 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.⁹

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

⁹ There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j \,|\, \bar\alpha) \right]
  \times \log_2 \left\{ \prod_{i=1}^{N} Q(\vec{x}_i \,|\, \bar\alpha) \int d^K\alpha\, P(\alpha)\, e^{-N D_{\rm KL}(\bar\alpha\|\alpha) + N\psi(\alpha, \bar\alpha; \{\vec{x}_i\})} \right\}.   (4.52)

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S^{(f)}_1(N):

S(N) = S_0 \cdot N + S^{(a)}_1(N) + S^{(f)}_1(N),   (4.53)

S^{(f)}_1 = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j \,|\, \bar\alpha) \right]
  \times \log_2 \left[ \int \frac{d^K\alpha\, P(\alpha)}{Z(\bar\alpha; N)}\, e^{-N D_{\rm KL}(\bar\alpha\|\alpha) + N\psi(\alpha, \bar\alpha; \{\vec{x}_i\})} \right],   (4.54)

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S^{(a)}_1 as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that, for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) \le N \int d^K\alpha \int d^K\bar\alpha\; P(\alpha)\, P(\bar\alpha)\, D_{\rm KL}(\bar\alpha \| \alpha)
  \equiv \left\langle N D_{\rm KL}(\bar\alpha \| \alpha) \right\rangle_{\bar\alpha, \alpha},   (4.55)

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average


KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱ,α} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.
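For the gaussian location model used in the earlier sketches (an assumption, with prior variance τ² and noise variance σ²), the bound of equation 4.55 is easy to evaluate: D_KL(ᾱ‖α) = (ᾱ − α)²/(2σ²), so the pairwise average divergence is τ²/σ² nats and the bound is finite. The short check below compares this loose, linear-in-N bound with the exact logarithmic growth of S_1.

import numpy as np

sigma2, tau2 = 1.0, 4.0          # same assumed gaussian location model as before

for N in [10, 100, 1000]:
    S1 = 0.5 * np.log2(1.0 + N * tau2 / sigma2)     # exact subextensive entropy, in bits
    bound = N * tau2 / sigma2 / np.log(2)           # <N D_KL> over pairs, converted to bits
    print(N, S1, bound)
# The average divergence is finite, so the bound holds; being linear in N,
# it is very loose compared with the logarithmic growth of S1.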

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(\vec{x}|α)} is finite. Then for any reasonable function F(\vec{x}; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{ \sup_\beta \frac{\left| \frac{1}{N} \sum_j F(\vec{x}_j; \beta) - \int d\vec{x}\, Q(\vec{x}\,|\,\bar\alpha)\, F(\vec{x}; \beta) \right|}{L[F]} > \epsilon \right\}
  < M(\epsilon, N, d_{\rm VC})\, e^{-c N \epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S^{(f)}_1 in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term, that is, the contribution of sequences of {\vec{x}_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {\vec{x}_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus the worst-case contribution of these atypical sequences is

S^{(f),\,\rm atypical}_1 \sim \int_\delta^\infty d\epsilon\; N^2 \epsilon^2\, M(\epsilon)\, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \quad \text{for large } N.   (4.57)


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and, throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S^{(f),\,\rm typical}_1 \sim \int d^K\bar\alpha\, P(\bar\alpha) \int d^N\vec{x} \prod_j Q(\vec{x}_j \,|\, \bar\alpha)\; N\, (B \cdot A^{-1} \cdot B);

B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i \,|\, \bar\alpha)}{\partial \bar\alpha_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0;

(A)_{\mu\nu} = -\frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i \,|\, \bar\alpha)}{\partial \bar\alpha_\mu\, \partial \bar\alpha_\nu}, \qquad \langle A \rangle_{\vec{x}} = \mathcal{F}.   (4.58)

Here ⟨···⟩_x means an averaging with respect to all the \vec{x}_i's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i \,|\, \bar\alpha)}{\partial \bar\alpha_\mu}\, (\mathcal{F}^{-1})_{\mu\nu}\, \frac{\partial \log Q(\vec{x}_j \,|\, \bar\alpha)}{\partial \bar\alpha_\nu}   (4.59)

over all \vec{x}_i's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite


structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ ··· can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪_n C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2} (x - \alpha)^2 \right], \quad x \in (-\infty, +\infty);   (4.60)

Q(x|\alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \quad x \in [0, 2\pi).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < α_max and zero elsewhere, with α_max rather large. Then, for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin αx. Adding a new data point would not help until that best, but wrong, parameter estimate is less than α_max.¹⁰ So the fluctuations are large, and the predictive information is small in this case.

¹⁰ Interestingly, since for the model equation 4.61 the KL divergence is bounded from below and from above, for α_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
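The pathology of the d_VC = ∞ example, equation 4.61, is easy to see in a small simulation. The sample sizes, the box size α_max, and the grids below are our own assumed settings; the point is only that for small N the maximum-likelihood estimate typically sits at a large, spurious α, and only with more data does it settle near the true value.

import numpy as np

rng = np.random.default_rng(5)
alpha_true, alpha_max = 1.0, 200.0

grid = (np.arange(4000) + 0.5) * (2 * np.pi / 4000)       # quadrature grid on [0, 2*pi)
alphas = np.linspace(0.01, alpha_max, 8000)               # candidate parameters in the box
logZ = np.array([np.log(2 * np.pi * np.mean(np.exp(-np.sin(a * grid)))) for a in alphas])

def sample(n):
    """Draw n points from Q(x|alpha_true) by rejection sampling (the density is <= e/Z)."""
    out = np.empty(0)
    while out.size < n:
        x = rng.uniform(0.0, 2 * np.pi, size=4 * n)
        accept = rng.uniform(0.0, np.e, size=x.size) < np.exp(-np.sin(alpha_true * x))
        out = np.concatenate([out, x[accept]])
    return out[:n]

def ml_estimate(x):
    loglike = -np.sin(np.outer(alphas, x)).sum(axis=1) - x.size * logZ
    return alphas[np.argmax(loglike)]

for N in [10, 30, 100]:
    print(N, ml_estimate(sample(N)))
# For small N the maximizer is typically far from alpha_true = 1; with more data it locks on.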

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario in which the number of examples eventually overwhelms the number of parameters or dimensions is too weak to deal with many real-world problems. Certainly in the present context, there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens: eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which


the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right], \quad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar\alpha; N) = \int dD\, \rho(D; \bar\alpha)\, \exp[-N D]
  \approx A(\bar\alpha) \int dD\, \exp\left[ -\frac{B(\bar\alpha)}{D^\mu} - N D \right]
  \approx \tilde{A}(\bar\alpha)\, \exp\left[ -\frac{1}{2}\, \frac{\mu+2}{\mu+1} \ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar\alpha) = A(\bar\alpha) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} \cdot [B(\bar\alpha)]^{1/(2\mu+2)},   (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S^{(a)}_1(N) \rightarrow \frac{1}{\ln 2}\, \langle C(\bar\alpha) \rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \quad \text{(bits)},   (4.66)

where ⟨···⟩_ᾱ denotes an average over all the target models.
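The saddle-point estimate of equations 4.63 through 4.66 can be checked against direct numerical integration; the values of A, B, and μ below are arbitrary assumptions.

import numpy as np

A, B, mu = 1.0, 1.0, 1.0       # assumed constants of the density rho(D) = A exp(-B / D^mu)

def minus_log2_Z(N, n_grid=2000000, D_max=5.0):
    """-log2 of Z(N) = integral dD rho(D) exp(-N D), by brute-force quadrature."""
    D = (np.arange(n_grid) + 0.5) * (D_max / n_grid)
    rho = A * np.exp(-B / D ** mu)
    return -np.log2(np.sum(rho * np.exp(-N * D)) * (D_max / n_grid))

C = B ** (1.0 / (mu + 1)) * (mu ** (-mu / (mu + 1)) + mu ** (1.0 / (mu + 1)))   # equation 4.65

for N in [100, 1000, 10000]:
    predicted = C * N ** (mu / (mu + 1)) / np.log(2)    # leading power-law term of equation 4.66
    print(N, minus_log2_Z(N), predicted)
# The two columns agree up to the subleading (log N) correction in equation 4.63.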


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/2N bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^{ν−1} bits. Summing this up over all samples, we find S^{(a)}_1 ∼ N^ν, and if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
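The bookkeeping in the preceding paragraph is simple enough to check directly: if the nth sample contributes roughly K(n)/2n bits and K(n) = n^ν, the accumulated subextensive entropy grows as N^ν. The specific ν below is an assumption made for illustration.

import numpy as np

nu = 0.5                                    # assumed growth exponent of the number of parameters
n = np.arange(1, 10 ** 6 + 1)
contrib = (n ** nu) / (2.0 * n)             # ~ K(n) / 2n bits carried by the nth example
S1 = np.cumsum(contrib)

for N in [10 ** 2, 10 ** 4, 10 ** 6]:
    print(N, S1[N - 1], S1[N - 1] / N ** nu)
# The ratio S1 / N^nu settles to a constant (here 1/(2 nu) = 1), confirming S1 ~ N^nu.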

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar


to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l_0) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{\mathcal{Z}} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^{\!2} \right] \delta\left[ \frac{1}{l_0} \int dx\, e^{-\phi(x)} - 1 \right],   (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the


development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹

From the discussion above, we know that the predictive information is related to the density of KL divergences, and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l_0) exp[−φ(x)], we can write the KL divergence as

D_{\rm KL}[\bar\phi(x) \| \phi(x)] = \frac{1}{l_0} \int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)].   (4.68)

We want to compute the density

\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\left( D - D_{\rm KL}[\bar\phi(x)\|\phi(x)] \right)   (4.69)
  = M \int [d\phi(x)]\, P[\phi(x)]\, \delta\left( MD - M D_{\rm KL}[\bar\phi(x)\|\phi(x)] \right),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar\phi) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)] \exp(-izM D_{\rm KL})   (4.71)

  = M \int \frac{dz}{2\pi} \exp\left( izMD + \frac{izM}{l_0} \int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] \right)
    \times \int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right).   (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point.

¹¹ We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


The result is that

\int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar\phi(x)] \right)
  = \exp\left( -\frac{izM}{l_0} \int dx\, \bar\phi(x) \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM] \right),   (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{\ell}{2} \int dx \left( \frac{\partial\bar\phi}{\partial x} \right)^{\!2} + \frac{1}{2} \left( \frac{izM}{\ell l_0} \right)^{\!1/2} \int dx\, \exp[-\bar\phi(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar\phi(x)]}{D} \right),   (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi \ell l_0}} \exp\left[ -\frac{\ell}{2} \int dx\, (\partial_x \bar\phi)^2 \right] \int dx\, \exp[-\bar\phi(x)/2],   (4.76)

B[\bar\phi(x)] = \frac{1}{16 \ell l_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^{\!2}.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large N behavior of the predictive information, and we find that

S^{(a)}_1(N) \sim \frac{1}{2 \ln 2\, \sqrt{\ell l_0}} \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} N^{1/2},   (4.78)

where ⟨···⟩_φ̄ denotes an average over the target distributions φ̄(x) weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S^{(a)}_1(N) \sim \frac{1}{2 \ln 2} \sqrt{N} \left( \frac{L}{\ell} \right)^{\!1/2} \quad \text{(bits)}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(ℓ/(N Q(x))) (Bialek, Callan, & Strong, 1996).


From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
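The bin counting can be spelled out explicitly: with local bin size √(ℓ/(N Q(x))), the number of bins on [0, L] is ∫dx √(N Q(x)/ℓ), which for a roughly uniform distribution is √(N L/ℓ), matching the N and L dependence of equation 4.79. The specific Q(x), L, ℓ, and N below are assumptions.

import numpy as np

L, ell, N = 10.0, 0.5, 10000          # assumed box size, smoothness scale, and sample count

x = (np.arange(10000) + 0.5) * (L / 10000)           # midpoint grid on [0, L]
Q = np.ones_like(x) / L                               # a uniform target distribution (assumed)

n_bins = np.sum(np.sqrt(N * Q / ell)) * (L / 10000)   # integral of dx / (local bin size)
print(n_bins, np.sqrt(N * L / ell))                   # both equal sqrt(N L / ell)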

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities n ∼ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn), but real calculations for these cases remain challenging.

5 I_pred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹²

¹² Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


generality were able to prove that the optimalencoding of a model requires acode with length asymptotically proportional to the number of independentparameters and logarithmically dependent on the number of data points wehave observed The minimal amount of space one needs to encode a datastring (minimum description length or MDL) within a certain assumedmodel class was termed by Rissanen stochastic complexity and in recentwork he refers to the piece of the stochastic complexity required for cod-ing the parameters as the model complexity (Rissanen 1996) This approachwas further strengthened by a recent result (Vit Acircanyi amp Li 2000) that an es-timation of parameters using the MDL principle is equivalent to Bayesianparameter estimations with a ldquouniversalrdquo prior (Li amp Vit Acircanyi 1993)

There should be a close connection between Rissanenrsquos ideas of encodingthe data stream and the subextensive entropy We are accustomed to the ideathat the average length of a code word for symbols drawn froma distributionP isgiven by the entropyof that distribution thus it is tempting to say that anencoding of a stream x1 x2 xN will require an amount of space equal tothe entropy of the joint distribution P(x1 x2 xN ) The situation here is abit moresubtle because the usual proofsof equivalence between code lengthand entropy rely on notions of typicality and asymptotics as we try to encodesequences of many symbols Here we already have N symbols and so it doesnot really make sense to talk about a stream of streams One can argue how-ever that atypical sequences are not truly random within a considered distri-bution since their codingby the methods optimized for the distribution is notoptimal So atypical sequences are better considered as typical ones comingfrom a different distribution (a point also made by Grassberger 1986) Thisallows us to identify properties of an observed (long) string with the proper-ties of the distribution it comes fromas Vit Acircanyi and Li (2000) did Ifwe acceptthis identication of entropy with code length then Rissanenrsquos stochasticcomplexity should be the entropy of the distribution P(x1 x2 xN )

As emphasized by Balasubramanian (1997) the entropy of the joint dis-tribution of N points can be decomposed into pieces that represent noiseor errors in the modelrsquos local predictionsmdashan extensive entropymdashand thespace required to encode the model itself which must be the subexten-sive entropy Since in the usual formulation all long-term predictions areassociated with the continued validity of the model parameters the domi-nant component of the subextensive entropy must be this parameter coding(or model complexity in Rissanenrsquos terminology) Thus the subextensiveentropy should be the model complexity and in simple cases where wecan describe the data by a K-parameter model both quantities are equal to(K2) log2 N bits to the leading order

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy), the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S0 and thus are functions only of disorder, which is not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.
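To make the grouping of histories into causal states concrete, here is a minimal sketch (our own toy example, not taken from Crutchfield and Young) for a two-state Markov chain with an assumed transition matrix W: histories are grouped together whenever they give the same conditional distribution over future blocks, and the statistical complexity is the entropy of the resulting groups.

```python
import numpy as np
from itertools import product

# Toy binary Markov chain; W[i, j] = P(next = j | current = i) (values assumed).
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
evals, evecs = np.linalg.eig(W.T)
p_stat = np.real(evecs[:, np.argmax(np.real(evals))])
p_stat /= p_stat.sum()                       # stationary distribution

def history_prob(hist):
    # Stationary probability of observing a particular history.
    p = p_stat[hist[0]]
    for a, b in zip(hist, hist[1:]):
        p *= W[a, b]
    return p

def future_dist(hist, future_len=2):
    # P(future block | history); for a Markov chain it depends only on hist[-1].
    dist = []
    for fut in product([0, 1], repeat=future_len):
        p, prev = 1.0, hist[-1]
        for b in fut:
            p *= W[prev, b]
            prev = b
        dist.append(round(p, 10))
    return tuple(dist)

# Group all length-3 histories by the future distribution they predict.
states = {}
for hist in product([0, 1], repeat=3):
    states.setdefault(future_dist(hist), []).append(hist)
print(len(states), "causal states")          # 2: only the last symbol matters
probs = [sum(history_prob(h) for h in members) for members in states.values()]
print("statistical complexity:", -sum(p * np.log2(p) for p in probs), "bits")
```

For this chain only the last symbol matters, so two causal states emerge and the statistical complexity equals the entropy of the stationary distribution; the excess entropy (mutual information between past and future) is a different, generally smaller, finite number.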

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.

Predictability Complexity and Learning 2449

5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure the complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.[13] There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.[14] Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

[13] The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

[14] This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;[15] it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.[16] So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

[15] Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

[16] Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number \mathcal{N}(N) of walks with N steps, we find that \mathcal{N}(N) \approx A N^{\gamma} z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = \log_2 z, a constant subextensive entropy \log_2 A, and a divergent subextensive term S_1(N) \propto \gamma \log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],   (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = \int dx(t)\, P[x(t)] \log_2\left( \frac{P[x(t)]}{Q[x(t)]} \right),   (5.2)


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].
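A small numerical check of this invariance argument (an assumed example, not from the paper): under the rescaling x → 2x, the differential entropy of a Gaussian shifts by exactly one bit, while the relative entropy between two Gaussians is unchanged. The particular variances below are arbitrary choices.

```python
import numpy as np

def gauss_entropy_bits(sigma):
    # Differential entropy of N(0, sigma^2), in bits.
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

def gauss_kl_bits(sigma_p, sigma_q):
    # D_KL( N(0, sigma_p^2) || N(0, sigma_q^2) ), in bits.
    return (np.log(sigma_q / sigma_p) + sigma_p**2 / (2 * sigma_q**2) - 0.5) / np.log(2)

# P = N(0, 1), Q = N(0, 2^2) in the original coordinates; rescale x -> 2x.
print(gauss_entropy_bits(1.0), gauss_entropy_bits(2.0))   # differs by exactly 1 bit
print(gauss_kl_bits(1.0, 2.0), gauss_kl_bits(2.0, 4.0))   # identical: coordinate invariant
```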

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures: the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = xpast as the past data and y = xfuture as the future. When we compress xpast into some reduced description x̂past, we keep a certain amount of information about the past, I(x̂past; xpast), and we also preserve a certain amount of information about the future, I(x̂past; xfuture). There is no single correct compression xpast → x̂past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂past; xfuture) versus I(x̂past; xpast).

The predictive information preserved by compression must be less than the total, so that I(x̂past; xfuture) ≤ Ipred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂past such that I(x̂past; xfuture) = Ipred(T). The entropy of this minimal x̂past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.[17] It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

[17] If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,[18] and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

[18] As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ~ A N^ν and one with K parameters so that Ipred(N) ~ (K/2) log2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2 A \nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}.   (6.1)

In the regime N ≪ Nc, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ Nc, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ Nc is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ≈ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ≈ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < Nc. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ Nc is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
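For readers who want to see the numbers, here is a rough evaluation of equation 6.1. K is the synapse count quoted above, while the amplitude A and exponent ν of the assumed power law Ipred(N) ~ A N^ν are hypothetical values inserted only to show how large Nc is compared with N ≈ 10⁹.

```python
import numpy as np

def n_crossover(K, A, nu):
    # Equation 6.1: N_c ~ [ K/(2*A*nu) * log2( K/(2*A) ) ]**(1/nu)
    return (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1.0 / nu)

K, N = 5e9, 1e9           # K from the inferotemporal-cortex estimate in the text
for A, nu in [(1.0, 0.5), (10.0, 0.5), (1.0, 0.9)]:   # illustrative (A, nu) pairs only
    Nc = n_crossover(K, A, nu)
    print(f"A={A:5.1f}, nu={nu}:  N_c ~ {Nc:.1e}   N << N_c? {N < Nc}")
```

For any reasonable choice of A and ν < 1, Nc comes out many orders of magnitude larger than 10⁹, consistent with the argument in the text.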

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy ℓ(N) (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.[19]
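A crude sketch of the guessing-game idea (our own illustration, not Shannon's protocol): replace each character by the rank that a simple frequency-based predictor assigns to it, and use the entropy of the rank distribution as an upper bound on the conditional entropy per character. Training and scoring on the same text makes the bound optimistic, so the number it returns should be read only as an illustration of the method; the function name and parameters are ours.

```python
from collections import Counter, defaultdict
from math import log2

def guess_rank_entropy(text, order=3):
    """Rank each character by an order-`order` frequency predictor trained on the
    same text, then return the entropy (bits/character) of the rank distribution,
    which upper-bounds the conditional entropy achievable by this predictor."""
    ctx_counts = defaultdict(Counter)
    for i in range(len(text) - order):
        ctx_counts[text[i:i + order]][text[i + order]] += 1
    ranks = Counter()
    for i in range(len(text) - order):
        ctx, true_char = text[i:i + order], text[i + order]
        ordering = [c for c, _ in ctx_counts[ctx].most_common()]
        ranks[ordering.index(true_char)] += 1
    total = sum(ranks.values())
    return -sum((n / total) * log2(n / total) for n in ranks.values())

# Example usage (any long plain-text file will do; the filename is hypothetical):
# text = open("moby_dick.txt").read().lower()
# print(guess_rank_entropy(text, order=4))
```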

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

[19] Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/ followed by the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.

Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



Imagine that we observe a stream of data x(t) over a time interval −T < t < 0. Let all of these past data be denoted by the shorthand xpast. We are interested in saying something about the future, so we want to know about the data x(t) that will be observed in the time interval 0 < t < T′; let these future data be called xfuture. In the absence of any other knowledge, futures are drawn from the probability distribution P(xfuture), and observations of particular past data xpast tell us that futures will be drawn from the conditional distribution P(xfuture | xpast). The greater concentration of the conditional distribution can be quantified by the fact that it has smaller entropy than the prior distribution, and this reduction in entropy is Shannon's definition of the information that the past provides about the future. We can write the average of this predictive information as

I_{\rm pred}(T, T') = \left\langle \log_2\!\left[ \frac{P(x_{\rm future} \mid x_{\rm past})}{P(x_{\rm future})} \right] \right\rangle   (3.1)

= -\langle \log_2 P(x_{\rm future}) \rangle - \langle \log_2 P(x_{\rm past}) \rangle - \big[ -\langle \log_2 P(x_{\rm future}, x_{\rm past}) \rangle \big],   (3.2)

where ⟨· · ·⟩ denotes an average over the joint distribution of the past and the future, P(xfuture, xpast).

Each of the terms in equation 3.2 is an entropy. Since we are interested in predictability or generalization, which are associated with some features of the signal persisting forever, we may assume stationarity or invariance under time translations. Then the entropy of the past data depends only on the duration of our observations, so we can write −⟨log2 P(xpast)⟩ = S(T), and by the same argument −⟨log2 P(xfuture)⟩ = S(T′). Finally, the entropy of the past and the future taken together is the entropy of observations on a window of duration T + T′, so that −⟨log2 P(xfuture, xpast)⟩ = S(T + T′). Putting these equations together, we obtain

I_{\rm pred}(T, T') = S(T) + S(T') - S(T + T').   (3.3)
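Equation 3.3 can be checked directly on a toy example (an assumed example, not one worked out in the paper): for a two-state Markov chain with an assumed transition matrix, the block entropies S(T) can be computed by brute-force enumeration, and the resulting Ipred(T, T) saturates at a finite value, as expected for a process specified by finitely many parameters.

```python
import numpy as np
from itertools import product

W = np.array([[0.9, 0.1],        # W[i, j] = P(next = j | current = i), values assumed
              [0.2, 0.8]])
evals, evecs = np.linalg.eig(W.T)
p_stat = np.real(evecs[:, np.argmax(np.real(evals))])
p_stat /= p_stat.sum()           # stationary distribution

def block_entropy(n):
    # S(n) = -sum over all 2^n sequences of P(x_1..x_n) log2 P(x_1..x_n)
    S = 0.0
    for seq in product([0, 1], repeat=n):
        p = p_stat[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= W[a, b]
        S -= p * np.log2(p)
    return S

def predictive_info(T, Tprime):
    # Equation 3.3
    return block_entropy(T) + block_entropy(Tprime) - block_entropy(T + Tprime)

for T in [1, 2, 4, 8]:
    print(T, predictive_info(T, T))   # saturates quickly: finite I_pred for a Markov chain
```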

It is important to recall that mutual information is a symmetric quantity. Thus, we can view Ipred(T, T′) as either the information that a data segment of duration T provides about the future of length T′ or the information that a data segment of duration T′ provides about the immediate past of duration T. This is a direct application of the definition of information but seems counterintuitive. Shouldn't it be more difficult to predict than to postdict? One can perhaps recover the correct intuition by thinking about a large ensemble of question-answer pairs. Prediction corresponds to generating the answer to a given question, while postdiction corresponds to generating the question that goes with a given answer. We know that guessing questions given answers is also a hard problem,[3] and can be just as challenging as the more conventional problem of answering the questions themselves. Our focus here is on prediction because we want to make connections with the phenomenon of generalization in learning, but it is clear that generating a consistent interpretation of observed data may involve elements of both prediction and postdiction (see, for example, Eagleman & Sejnowski, 2000); it is attractive that the information-theoretic formulation treats these problems symmetrically.

[3] This is the basis of the popular American television game show Jeopardy, widely viewed as the most "intellectual" of its genre.

In the same way that the entropy of a gas at fixed density is proportional to the volume, the entropy of a time series (asymptotically) is proportional to its duration, so that lim_{T→∞} S(T)/T = S0; entropy is an extensive quantity. But from equation 3.3, any extensive component of the entropy cancels in the computation of the predictive information: predictability is a deviation from extensivity. If we write S(T) = S0 T + S1(T), then equation 3.3 tells us that the predictive information is related only to the nonextensive term S1(T). Note that if we are observing a deterministic system, then S0 = 0, but this is independent of questions about the structure of the subextensive term S1(T). It is attractive that information theory gives us a unified discussion of prediction in deterministic and probabilistic cases.

We know two general facts about the behavior of S1(T). First, the corrections to extensive behavior are positive, S1(T) ≥ 0. Second, the statement that entropy is extensive is the statement that the limit

\lim_{T \to \infty} \frac{S(T)}{T} = S_0   (3.4)

exists, and for this to be true we must also have

\lim_{T \to \infty} \frac{S_1(T)}{T} = 0.   (3.5)

Thus, the nonextensive terms in the entropy must be subextensive, that is, they must grow with T less rapidly than a linear function. Taken together, these facts guarantee that the predictive information is positive and subextensive. Furthermore, if we let the future extend forward for a very long time, T′ → ∞, then we can measure the information that our sample provides about the entire future,

I_{\rm pred}(T) = \lim_{T' \to \infty} I_{\rm pred}(T, T') = S_1(T).   (3.6)

Similarly, instead of increasing the duration of the future to infinity, we could have considered the mutual information between a sample of length T and all of the infinite past. Then the postdictive information also is equal to S1(T), and the symmetry between prediction and postdiction is even more profound: not only is there symmetry between questions and answers, but observations on a given period of time provide the same amount of information about the historical path that led to our observations as about the future that will unfold from them. In some cases this statement becomes even stronger. For example, if the subextensive entropy of a long discontinuous observation of a total length T with a gap of a duration δT ≪ T is equal to S1(T) + O(δT/T), then the subextensive entropy of the present is not only its information about the past or the future but also the information about the past and the future.

If we have been observing a time series for a (long) time T, then the total amount of data we have collected is measured by the entropy S(T), and at large T this is given approximately by S0 T. But the predictive information that we have gathered cannot grow linearly with time, even if we are making predictions about a future that stretches out to infinity. As a result, of the total information we have taken in by observing xpast, only a vanishing fraction is of relevance to the prediction:

limT1

Predictive informationTotal information

DIpred(T)

S(T) 0 (37)

In this precise sense most of what we observe is irrelevant to the problemof predicting the future4

Consider the case where time is measured in discrete steps, so that we have seen N time points x_1, x_2, ..., x_N. How much have we learned about the underlying pattern in these data? The more we know, the more effectively we can predict the next data point x_{N+1}, and hence the fewer bits we will need to describe the deviation of this data point from our prediction. Our accumulated knowledge about the time series is measured by the degree to which we can compress the description of new observations. On average, the length of the code word required to describe the point x_{N+1}, given that we have seen the previous N points, is given by

\ell(N) = -\langle \log_2 P(x_{N+1} \,|\, x_1, x_2, \ldots, x_N) \rangle \ \mbox{bits},    (3.8)

where the expectation value is taken over the joint distribution of all the N + 1 points, P(x_1, x_2, \ldots, x_N, x_{N+1}). It is easy to see that

\ell(N) = S(N+1) - S(N) \approx \frac{\partial S(N)}{\partial N}.    (3.9)

4 We can think of equation 3.7 as a law of diminishing returns. Although we collect data in proportion to our observation time T, a smaller and smaller fraction of this information is useful in the problem of prediction. These diminishing returns are not due to a limited lifetime, since we calculate the predictive information assuming that we have a future extending forward to infinity. A senior colleague points out that this is an argument for changing fields before becoming too expert.

As we observe for longer times, we learn more, and this word length decreases. It is natural to define a learning curve that measures this improvement. Usually we define learning curves by measuring the frequency or costs of errors; here the cost is that our encoding of the point x_{N+1} is longer than it could be if we had perfect knowledge. This ideal encoding has a length that we can find by imagining that we observe the time series for an infinitely long time, \ell_{\rm ideal} = \lim_{N\to\infty} \ell(N), but this is just another way of defining the extensive component of the entropy S_0. Thus we can define a learning curve

\Lambda(N) \equiv \ell(N) - \ell_{\rm ideal}    (3.10)
= S(N+1) - S(N) - S_0
= S_1(N+1) - S_1(N)
\approx \frac{\partial S_1(N)}{\partial N} = \frac{\partial I_{\rm pred}(N)}{\partial N},    (3.11)

and we see once again that the extensive component of the entropy cancels.

It is well known that the problems of prediction and compression are related, and what we have done here is to illustrate one aspect of this connection. Specifically, if we ask how much one segment of a time series can tell us about the future, the answer is contained in the subextensive behavior of the entropy. If we ask how much we are learning about the structure of the time series, then the natural and universally defined learning curve is related again to the subextensive entropy: the learning curve is the derivative of the predictive information.

This universal learning curve is connected to the more conventional learning curves in specific contexts. As an example (cf. section 4.1), consider fitting a set of data points {x_n, y_n} with some class of functions y = f(x; α), where α are unknown parameters that need to be learned; we also allow for some gaussian noise in our observation of the y_n. Here the natural learning curve is the evolution of χ² for generalization as a function of the number of examples. Within the approximations discussed below, it is straightforward to show that as N becomes large,

\left\langle \chi^2(N) \right\rangle = \frac{1}{\sigma^2} \left\langle \left[ y - f(x; \alpha) \right]^2 \right\rangle \to (2 \ln 2)\, \Lambda(N) + 1,    (3.12)

where σ² is the variance of the noise. Thus a more conventional measure of performance at learning a function is equal to the universal learning curve defined purely by information-theoretic criteria. In other words, if a learning curve is measured in the right units, then its integral represents the amount of the useful information accumulated. Then the subextensivity of S_1 guarantees that the learning curve decreases to zero as N → ∞.

Different quantities related to the subextensive entropy have been discussed in several contexts. For example, the code length \ell(N) has been defined as a learning curve in the specific case of neural networks (Opper & Haussler, 1995) and has been termed the "thermodynamic dive" (Crutchfield & Shalizi, 1999) and "Nth order block entropy" (Grassberger, 1986).

The universal learning curve Λ(N) has been studied as the expected instantaneous information gain by Haussler, Kearns, and Schapire (1994). Mutual information between all of the past and all of the future (both semi-infinite) is known also as the excess entropy, effective measure complexity, stored information, and so on (see Shalizi & Crutchfield, 1999, and references therein, as well as the discussion below). If the data allow a description by a model with a finite (and in some cases also infinite) number of parameters, then mutual information between the data and the parameters is of interest. This is easily shown to be equal to the predictive information about all of the future, and it is also the cumulative information gain (Haussler, Kearns, & Schapire, 1994) or the cumulative relative entropy risk (Haussler & Opper, 1997). Investigation of this problem can be traced back to Renyi (1964) and Ibragimov and Hasminskii (1972), and some particular topics are still being discussed (Haussler & Opper, 1995; Opper & Haussler, 1995; Herschkowitz & Nadal, 1999). In certain limits, decoding a signal from a population of N neurons can be thought of as "learning" a parameter from N observations, with a parallel notion of information transmission (Brunel & Nadal, 1998; Kang & Sompolinsky, 2001). In addition, the subextensive component of the description length (Rissanen, 1978, 1989, 1996; Clarke & Barron, 1990), averaged over a class of allowed models, also is similar to the predictive information. What is important is that the predictive information or subextensive entropy is related to all these quantities and that it can be defined for any process without a reference to a class of models. It is this universality that we find appealing, and this universality is strongest if we focus on the limit of long observation times. Qualitatively, in this regime (T → ∞), we expect the predictive information to behave in one of three different ways, as illustrated by the Ising models above: it may either stay finite or grow to infinity together with T; in the latter case, the rate of growth may be slow (logarithmic) or fast (sublinear power) (see Barron and Cover, 1991, for a similar classification in the framework of the minimal description length [MDL] analysis).

The first possibility, lim_{T→∞} I_pred(T) = constant, means that no matter how long we observe, we gain only a finite amount of information about the future. This situation prevails, for example, when the dynamics are too regular. For a purely periodic system, complete prediction is possible once we know the phase, and if we sample the data at discrete times, this is a finite amount of information; longer period orbits intuitively are more complex and also have larger I_pred, but this does not change the limiting behavior lim_{T→∞} I_pred(T) = constant.

Alternatively, the predictive information can be small when the dynamics are irregular, but the best predictions are controlled only by the immediate past, so that the correlation times of the observable data are finite (see, for example, Crutchfield & Feldman, 1997, and the fixed J case in Figure 3). Imagine, for example, that we observe x(t) at a series of discrete times {t_n}, and that at each time point we find the value x_n. Then we always can write


the joint distribution of the N data points as a product,

P(x_1, x_2, \ldots, x_N) = P(x_1)\, P(x_2 | x_1)\, P(x_3 | x_2, x_1) \cdots.    (3.13)

For Markov processes, what we observe at t_n depends only on events at the previous time step t_{n-1}, so that

P(x_n | \{x_{1 \le i \le n-1}\}) = P(x_n | x_{n-1}),    (3.14)

and hence the predictive information reduces to

I_{\rm pred} = \left\langle \log_2 \left[ \frac{P(x_n | x_{n-1})}{P(x_n)} \right] \right\rangle.    (3.15)

The maximum possible predictive information in this case is the entropy of the distribution of states at one time step, which is bounded by the logarithm of the number of accessible states. To approach this bound, the system must maintain memory for a long time, since the predictive information is reduced by the entropy of the transition probabilities. Thus, systems with more states and longer memories have larger values of I_pred.
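As a concrete illustration of equation 3.15 (our addition, not part of the original argument), the short sketch below computes I_pred for a two-state Markov chain with a hand-picked transition matrix and compares it with the bound set by the entropy of the one-time-step state distribution; the transition probability p and all variable names are illustrative choices.

```python
import numpy as np

# Two-state Markov chain: T[i, j] = P(x_{n+1} = j | x_n = i).
# The single number p controls how long the chain "remembers" its state.
p = 0.05
T = np.array([[1 - p, p],
              [p, 1 - p]])

# Stationary distribution: left eigenvector of T with eigenvalue 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Predictive information for a Markov chain (equation 3.15):
# I_pred = sum_{i,j} pi_i T_ij log2[ T_ij / pi_j ]
I_pred = sum(pi[i] * T[i, j] * np.log2(T[i, j] / pi[j])
             for i in range(2) for j in range(2))

# Upper bound: entropy of the one-time-step state distribution.
S_states = -np.sum(pi * np.log2(pi))

print(f"I_pred   = {I_pred:.3f} bits")
print(f"S_states = {S_states:.3f} bits (upper bound)")
# As p -> 0 the memory gets longer and I_pred approaches the bound;
# as p -> 0.5 successive states decouple and I_pred -> 0.
```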

More interesting are those cases in which I_pred(T) diverges at large T. In physical systems, we know that there are critical points where correlation times become infinite, so that optimal predictions will be influenced by events in the arbitrarily distant past. Under these conditions, the predictive information can grow without bound as T becomes large; for many systems the divergence is logarithmic, I_pred(T → ∞) ∝ ln T, as for the variable J_ij, short-range Ising model of Figures 2 and 3. Long-range correlations also are important in a time series where we can learn some underlying rules. It will turn out that when the set of possible rules can be described by a finite number of parameters, the predictive information again diverges logarithmically, and the coefficient of this divergence counts the number of parameters. Finally, a faster growth is also possible, so that I_pred(T → ∞) ∝ T^α, as for the variable J_ij, long-range Ising model, and we shall see that this behavior emerges from, for example, nonparametric learning problems.

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations or associations persist over long periods of time. In the usual theoretical models, there is some rule underlying the observable data, and this rule is valid forever; examples seen at one time inform us about the rule, and this information can be used to make predictions or generalizations. The predictive information quantifies the average generalization power of examples, and we shall see that there is a direct connection between the predictive information and the complexity of the possible underlying rules.


4.1 A Test Case. Let us begin with a simple example already referred to above. We observe two streams of data x and y, or equivalently a stream of pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N). Assume that we know in advance that the x's are drawn independently and at random from a distribution P(x), while the y's are noisy versions of some function acting on x,

y_n = f(x_n; \alpha) + \eta_n,    (4.1)

where f(x; α) is a class of functions parameterized by α, and η_n is noise, which for simplicity we will assume is gaussian with known standard deviation σ. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

f(x; \alpha) = \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x).    (4.2)

The usual problem is to estimate, from N pairs {x_i, y_i}, the values of the parameters α. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy S(N). We go through the calculation in some detail because it provides a model for the more general case.
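Before computing S(N), it may help to see the test case numerically. The sketch below (ours; the basis, dimensions, and sample sizes are arbitrary choices) generates data from equations 4.1 and 4.2, fits the parameters by least squares, and checks the large-N behavior that follows from combining equation 3.12 with the (K/2) log_2 N result derived below (equation 4.23), namely ⟨χ²(N)⟩ ≈ 1 + K/N.

```python
import numpy as np

rng = np.random.default_rng(0)
K, sigma, n_trials = 5, 1.0, 200

def phi(x):
    # K polynomial basis functions; any well-behaved basis would do.
    return np.vander(x, K, increasing=True)

def chi2_generalization(N):
    """Average test error / sigma^2 after fitting N examples by least squares."""
    out = []
    for _ in range(n_trials):
        alpha_bar = rng.normal(size=K)          # "true" parameters
        x = rng.uniform(-1, 1, size=N)
        y = phi(x) @ alpha_bar + sigma * rng.normal(size=N)
        a_hat, *_ = np.linalg.lstsq(phi(x), y, rcond=None)
        # fresh test data generated by the same rule
        xt = rng.uniform(-1, 1, size=2000)
        yt = phi(xt) @ alpha_bar + sigma * rng.normal(size=2000)
        out.append(np.mean((yt - phi(xt) @ a_hat) ** 2) / sigma**2)
    return np.mean(out)

for N in (25, 50, 100, 200, 400):
    print(f"N={N:4d}  <chi^2> = {chi2_generalization(N):.3f}   1 + K/N = {1 + K/N:.3f}")
# The two columns agree progressively better as N grows.
```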

To evaluate the entropy S(N), we first construct the probability distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). The same set of rules applies to the whole data stream, which here means that the same parameters α apply for all pairs {x_i, y_i}, but these parameters are chosen at random from a distribution P(α) at the start of the stream. Thus we write

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \int d^K\alpha\; P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N | \alpha)\, P(\alpha),    (4.3)

and now we need to construct the conditional distributions for fixed α. By hypothesis, each x is chosen independently, and once we fix α, each y_i is correlated only with the corresponding x_i, so that we have

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N | \alpha) = \prod_{i=1}^{N} \left[ P(x_i)\, P(y_i | x_i; \alpha) \right].    (4.4)

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of y_i has the form

P(y_i | x_i; \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} \left( y_i - \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x_i) \right)^2 \right].    (4.5)


Putting all these factors together,

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)
= \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^N \int d^K\alpha\; P(\alpha) \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} y_i^2 \right]
\times \exp\left[ -\frac{N}{2} \sum_{\mu,\nu=1}^{K} A_{\mu\nu}(\{x_i\})\, \alpha_\mu \alpha_\nu + N \sum_{\mu=1}^{K} B_\mu(\{x_i, y_i\})\, \alpha_\mu \right],    (4.6)

where

A_{\mu\nu}(\{x_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} \phi_\mu(x_i)\, \phi_\nu(x_i), \ \mbox{and}    (4.7)

B_\mu(\{x_i, y_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} y_i\, \phi_\mu(x_i).    (4.8)

Our placement of the factors of N means that both A_{μν} and B_μ are of order unity as N → ∞. These quantities are empirical averages over the samples {x_i, y_i}, and if the φ_μ are well behaved, we expect that these empirical means converge to expectation values for most realizations of the series {x_i}:

\lim_{N \to \infty} A_{\mu\nu}(\{x_i\}) = A^\infty_{\mu\nu} = \frac{1}{\sigma^2} \int dx\, P(x)\, \phi_\mu(x)\, \phi_\nu(x), \ \mbox{and}    (4.9)

\lim_{N \to \infty} B_\mu(\{x_i, y_i\}) = B^\infty_\mu = \sum_{\nu=1}^{K} A^\infty_{\mu\nu}\, \bar{\alpha}_\nu,    (4.10)

where ᾱ are the parameters that actually gave rise to the data stream {x_i, y_i}. In fact, we can make the same argument about the terms in \sum y_i^2,

\lim_{N \to \infty} \sum_{i=1}^{N} y_i^2 = N \sigma^2 \left[ \sum_{\mu,\nu=1}^{K} \bar{\alpha}_\mu A^\infty_{\mu\nu} \bar{\alpha}_\nu + 1 \right].    (4.11)

Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for and implications of this convergence breaking down.

Putting the different factors together, we obtain

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) \simeq \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^N \int d^K\alpha\; P(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right],    (4.12)


where the effective "energy" per sample is given by

E_N(\alpha; \{x_i, y_i\}) = \frac{1}{2} + \frac{1}{2} \sum_{\mu,\nu=1}^{K} (\alpha_\mu - \bar{\alpha}_\mu)\, A^\infty_{\mu\nu}\, (\alpha_\nu - \bar{\alpha}_\nu).    (4.13)

Here we use the symbol ≃ to indicate that we not only take the limit of large N but also neglect the fluctuations. Note that in this approximation, the dependence on the sample points themselves is hidden in the definition of ᾱ as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor N in the exponent; the energy E_N is of order unity as N → ∞. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

\int d^K\alpha\; P(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right]
\approx P(\alpha_{\rm cl}) \exp\left[ -N E_N(\alpha_{\rm cl}; \{x_i, y_i\}) - \frac{K}{2} \ln \frac{N}{2\pi} - \frac{1}{2} \ln \det F_N + \cdots \right],    (4.14)

where α_cl is the "classical" value of α determined by the extremal conditions

\left. \frac{\partial E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu} \right|_{\alpha = \alpha_{\rm cl}} = 0,    (4.15)

the matrix F_N consists of the second derivatives of E_N,

F_N = \left. \frac{\partial^2 E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu \partial \alpha_\nu} \right|_{\alpha = \alpha_{\rm cl}},    (4.16)

and ⋯ denotes terms that vanish as N → ∞. If we formulate the problem of estimating the parameters α from the samples {x_i, y_i}, then as N → ∞, the matrix N F_N is the Fisher information matrix (Cover & Thomas, 1991); the eigenvectors of this matrix give the principal axes for the error ellipsoid in parameter space, and the (inverse) eigenvalues give the variances of parameter estimates along each of these directions. The classical α_cl differs from ᾱ only in terms of order 1/N; we neglect this difference and further simplify the calculation of leading terms as N becomes large. After a little more algebra, we find the probability distribution we have been looking for:

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)
\simeq \left[ \prod_{i=1}^{N} P(x_i) \right] \frac{1}{Z_A}\, P(\bar{\alpha}) \exp\left[ -\frac{N}{2} \ln(2\pi e \sigma^2) - \frac{K}{2} \ln N + \cdots \right],    (4.17)


where the normalization constant

Z_A = \sqrt{(2\pi)^K \det A^\infty}.    (4.18)

Again we note that the sample points {x_i, y_i} are hidden in the value of ᾱ that gave rise to these points.5

To evaluate the entropy S(N), we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the x_i, and because these are chosen independently from the distribution P(x), the average again is easy to evaluate. The third term involves ᾱ, and we need to average this over the joint distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). As above, we can evaluate this average in steps: first, we choose a value of the parameters ᾱ; then we average over the samples given these parameters; and finally we average over parameters. But because ᾱ is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

S(N) = N \left[ S_x + \frac{1}{2} \log_2 (2\pi e \sigma^2) \right] + \frac{K}{2} \log_2 N + S_\alpha + \left\langle \log_2 Z_A \right\rangle_\alpha + \cdots,    (4.19)

where ⟨⋯⟩_α means averaging over parameters, S_x is the entropy of the distribution of x,

S_x = -\int dx\, P(x) \log_2 P(x),    (4.20)

and similarly for the entropy of the distribution of parameters,

S_\alpha = -\int d^K\alpha\; P(\alpha) \log_2 P(\alpha).    (4.21)

5 We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points {x_i, y_i}. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as N → ∞, so that this approach should yield the asymptotic behavior of the distribution and, hence, the subextensive entropy at large N. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as N → ∞, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.


The different terms in the entropy, equation 4.19, have a straightforward interpretation. First, we see that the extensive term in the entropy,

S_0 = S_x + \frac{1}{2} \log_2 (2\pi e \sigma^2),    (4.22)

reflects contributions from the random choice of x and from the gaussian noise in y; these extensive terms are independent of the variations in parameters α, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, S_α. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term ⟨log_2 Z_A⟩_α compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

S_1(N) \to \frac{K}{2} \log_2 N \ \ \mbox{(bits)}.    (4.23)

The coefficient of this divergence counts the number of parameters independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions {φ_μ(x)}. This is a hint that the result in equation 4.23 is more universal than our simple example.
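To see where the (K/2) log_2 N term comes from numerically, here is a brute-force sketch (ours; the dimensions, prior, and sample sizes are arbitrary choices). It evaluates the integral of equation 4.12 by Monte Carlo for a quadratic energy of the form in equation 4.13 and checks that the α-dependent factor shrinks as N^{-K/2}, which is the −(K/2) ln N term of equation 4.17.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 2                                  # number of parameters
A_inf = np.eye(K)                      # A^infinity of equation 4.13 (identity for simplicity)
alpha_bar = np.array([0.3, -0.7])      # "true" parameters that generated the data

def E_N(alpha):
    """Effective energy per sample, equation 4.13 (the constant 1/2 is dropped;
    it only contributes to the extensive term)."""
    d = alpha - alpha_bar
    return 0.5 * np.einsum('...i,ij,...j->...', d, A_inf, d)

def log2_integral(N, n_samples=2_000_000):
    """log2 of  int d^K alpha P(alpha) exp[-N E_N(alpha)], prior = standard normal."""
    alpha = rng.normal(size=(n_samples, K))
    return np.log2(np.mean(np.exp(-N * E_N(alpha))))

Ns = np.array([50, 100, 200, 400, 800])
vals = np.array([log2_integral(N) for N in Ns])
slope = np.polyfit(np.log2(Ns), vals, 1)[0]
print("log2 integral:", np.round(vals, 2))
print(f"slope vs log2 N = {slope:.2f}   (saddle point predicts -K/2 = {-K/2})")
```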

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning. We are given examples of how the points x_n map into y_n, and from these examples we are to induce the association or functional relation between x and y. An alternative view is that the pair of points (x, y) should be viewed as a vector x⃗, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables {x⃗_i} is drawn independently from the same probability distribution Q(x⃗|α), and this distribution depends on a (potentially infinite dimensional) vector of parameters α. As above, the parameters are unknown, and before the series starts, they are chosen randomly from a distribution P(α). With no constraints on the densities P(α) or Q(x⃗|α), it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters α, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997, and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).


We begin with the definition of entropy,

S(N) \equiv S[\{\vec{x}_i\}] = -\int d\vec{x}_1 \cdots d\vec{x}_N\; P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \log_2 P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N).    (4.24)

By analogy with equation 4.3, we then write

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) = \int d^K\alpha\; P(\alpha) \prod_{i=1}^{N} Q(\vec{x}_i | \alpha).    (4.25)

Next, combining the equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite S(N) as

S(N) = -\int d^K\bar{\alpha}\; P(\bar{\alpha}) \left\{ \int d\vec{x}_1 \cdots d\vec{x}_N \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \log_2 P(\{\vec{x}_i\}) \right\}.    (4.26)

Equation 4.26 allows an easy interpretation. There is the "true" set of parameters ᾱ that gave rise to the data sequence x⃗_1, ..., x⃗_N with the probability \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}). We need to average log_2 P(x⃗_1, ..., x⃗_N) first over all possible realizations of the data, keeping the true parameters fixed, and then over the parameters ᾱ themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

P(\vec{x}_1, \ldots, \vec{x}_N) = \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\; P(\alpha) \prod_{i=1}^{N} \left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right]
= \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\; P(\alpha) \exp\left[ -N E_N(\alpha; \{\vec{x}_i\}) \right],    (4.27)

E_N(\alpha; \{\vec{x}_i\}) = -\frac{1}{N} \sum_{i=1}^{N} \ln \left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right].    (4.28)

Since by our interpretation, ᾱ are the true parameters that gave rise to the particular data {x⃗_i}, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

E_N(\alpha; \{\vec{x}_i\}) = -\int d^D x\; Q(\vec{x} | \bar{\alpha}) \ln \left[ \frac{Q(\vec{x} | \alpha)}{Q(\vec{x} | \bar{\alpha})} \right] - \psi(\alpha, \bar{\alpha}; \{\vec{x}_i\}),    (4.29)

where ψ → 0 as N → ∞; here we neglect ψ and return to this term below.


The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, D_KL(ᾱ‖α), between the true distribution characterized by parameters ᾱ and the possible distribution characterized by α. Thus, at large N we have

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \simeq \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\; P(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha} \| \alpha) \right],    (4.30)

where again the notation ≃ reminds us that we are not only taking the limit of large N but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find

S(N) \approx S_0 \cdot N + S_1^{(a)}(N),    (4.31)

S_0 = \int d^K\alpha\; P(\alpha) \left[ -\int d^D x\; Q(\vec{x} | \alpha) \log_2 Q(\vec{x} | \alpha) \right], \ \mbox{and}    (4.32)

S_1^{(a)}(N) = -\int d^K\bar{\alpha}\; P(\bar{\alpha}) \log_2 \left[ \int d^K\alpha\; P(\alpha)\, e^{-N D_{\rm KL}(\bar{\alpha} \| \alpha)} \right].    (4.33)

Here S_1^{(a)} is an approximation to S_1 that neglects fluctuations ψ. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence x⃗_1, ..., x⃗_N with the disorder, E_N(α; {x⃗_i}) with the energy of the quenched system, and D_KL(ᾱ‖α) with its annealed analog.

The extensive term S_0, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the N dependence of the partition function

Z(\bar{\alpha}; N) = \int d^K\alpha\; P(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha} \| \alpha) \right],    (4.34)

and S_1^{(a)}(N) = -\langle \log_2 Z(\bar{\alpha}; N) \rangle_{\bar{\alpha}} is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are KL divergence D away from the target ᾱ:

\rho(D; \bar{\alpha}) = \int d^K\alpha\; P(\alpha)\, \delta\left[ D - D_{\rm KL}(\bar{\alpha} \| \alpha) \right].    (4.35)


This density could be very different for different targets.6 The density of divergences is normalized because the original distribution over parameter space, P(α), is normalized,

\int dD\, \rho(D; \bar{\alpha}) = \int d^K\alpha\; P(\alpha) = 1.    (4.36)

Finally, the partition function takes the simple form

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-ND].    (4.37)

We recall that in statistical mechanics, the partition function is given by

Z(\beta) = \int dE\, \rho(E) \exp[-\beta E],    (4.38)

where ρ(E) is the density of states that have energy E, and β is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length N of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of ρ(D; ᾱ) as D → 0.7

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small D, and the density term, which favors values of D that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with D → 0. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states, ρ(D; ᾱ), at small D is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

D_{\rm KL}(\bar{\alpha} \| \alpha) \approx \frac{1}{2} \sum_{\mu\nu} \left( \bar{\alpha}_\mu - \alpha_\mu \right) F_{\mu\nu} \left( \bar{\alpha}_\nu - \alpha_\nu \right) + \cdots,    (4.39)

6 If parameter space is compact, then a related description of the space of targets, based on metric entropy, also called Kolmogorov's ε-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of size not greater than ε into which the target space can be partitioned.

7 It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.


where F is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low D limit of the density ρ(D; ᾱ):

\rho(D; \bar{\alpha}) = \int d^K\alpha\; P(\alpha)\, \delta\left[ D - D_{\rm KL}(\bar{\alpha} \| \alpha) \right]
\approx \int d^K\alpha\; P(\alpha)\, \delta\left[ D - \frac{1}{2} \sum_{\mu\nu} \left( \bar{\alpha}_\mu - \alpha_\mu \right) F_{\mu\nu} \left( \bar{\alpha}_\nu - \alpha_\nu \right) \right]
= \int d^K\xi\; P(\bar{\alpha} + U \cdot \xi)\, \delta\left[ D - \frac{1}{2} \sum_{\mu} \Lambda_\mu \xi_\mu^2 \right],    (4.40)

where U is a matrix that diagonalizes F,

\left( U^{\rm T} \cdot F \cdot U \right)_{\mu\nu} = \Lambda_\mu \delta_{\mu\nu}.    (4.41)

The delta function restricts the components of ξ in equation 4.40 to be of order √D or less, and so if P(α) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

\rho(D \to 0; \bar{\alpha}) \approx P(\bar{\alpha}) \cdot \frac{(2\pi)^{K/2}}{\Gamma(K/2)} \left( \det F \right)^{-1/2} D^{(K-2)/2}.    (4.42)

Here, as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

Z(\bar{\alpha}; N \to \infty) \approx f(\bar{\alpha}) \cdot \frac{\Gamma(K/2)}{N^{K/2}},    (4.43)

where f(ᾱ) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

S_1^{(a)}(N) = -\int d^K\bar{\alpha}\; P(\bar{\alpha}) \log_2 Z(\bar{\alpha}; N)    (4.44)
\to \frac{K}{2} \log_2 N + \cdots \ \ \mbox{(bits)},    (4.45)

where ⋯ are finite as N → ∞. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even of the prior distribution P(α) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
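A quick Monte Carlo illustration (ours; the model family and prior are arbitrary choices) of equation 4.42: draw parameters from the prior, compute their KL divergences to a target, and estimate the exponent of ρ(D → 0) from the smallest divergences. For a K-dimensional gaussian location family with unit covariance, D_KL(ᾱ‖α) = |α − ᾱ|²/2, and the estimated exponent should be close to (K − 2)/2.

```python
import numpy as np

rng = np.random.default_rng(2)

def small_D_exponent(K, n_samples=2_000_000, m=1000):
    """Rough estimate of x in rho(D -> 0) ~ D^x for a K-dimensional gaussian
    location family (D_KL = |alpha - alpha_bar|^2 / 2) with a standard-normal
    prior over the location parameters and target alpha_bar = 0."""
    alpha = rng.normal(size=(n_samples, K))          # samples from the prior
    D = 0.5 * np.sum(alpha ** 2, axis=1)             # divergences to the target
    D_small = np.sort(D)[:m]                         # the m smallest divergences
    ranks = np.arange(1, m + 1)
    # If rho(D) ~ D^x, the cumulative count grows as D^(x+1); fit on log-log scale.
    slope = np.polyfit(np.log(D_small), np.log(ranks), 1)[0]
    return slope - 1.0

for K in (1, 2, 3, 4):
    print(f"K={K}: fitted exponent {small_D_exponent(K):+.2f}, "
          f"prediction (K-2)/2 = {(K-2)/2:+.1f}")
```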

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space, K, should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension, d_VC), which measures capacity of the model class, and with the phase-space dimension, d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of F is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < ∞ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small D_KL behavior of the model density,

\rho(D \to 0; \bar{\alpha}) \propto D^{(d-2)/2},    (4.46)

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density ρ(D; ᾱ) accumulates a macroscopic weight at some nonzero value of D, so that the small D behavior is irrelevant for the large N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_pred measures the phase-space or the scaling dimension, d, and nothing else. This asymptote is valid, however, only for N ≫ d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution Q(x⃗_1, ..., x⃗_N | α). Again, α is an unknown parameter vector chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log_2 N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S\left[ \{\vec{x}_i\} \,|\, \alpha \right] \equiv -\int d^N\vec{x}\; Q(\{\vec{x}_i\} | \alpha) \log_2 Q(\{\vec{x}_i\} | \alpha) \to N S_0 + S^*_0, \quad S^*_0 = O(1);    (4.47)

D_{\rm KL}\left[ Q(\{\vec{x}_i\} | \bar{\alpha}) \,\|\, Q(\{\vec{x}_i\} | \alpha) \right] \to N D_{\rm KL}(\bar{\alpha} \| \alpha) + o(N).    (4.48)

Here the quantities S_0, S^*_0, and D_KL(ᾱ‖α) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data. All of the long-range predictability is associated with learning the parameters.8 The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result

S(N) = S_0 \cdot N + S_1^{(a)}(N),    (4.49)

S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \ \ \mbox{(bits)},    (4.50)

where ⋯ stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(x⃗_1, ..., x⃗_N | α). Suppose, for example, that S^*_0 = (K^*/2) \log_2 N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \ \ \mbox{(bits)}.    (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation, but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K ≠ 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar{\alpha}\; P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j | \bar{\alpha}) \right]
\times \log_2 \left\{ \prod_{i=1}^{N} Q(\vec{x}_i | \bar{\alpha}) \int d^K\alpha\; P(\alpha)\, e^{-N D_{\rm KL}(\bar{\alpha} \| \alpha) + N \psi(\alpha, \bar{\alpha}; \{\vec{x}_i\})} \right\}.    (4.52)

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),    (4.53)

S_1^{(f)} = -\int d^K\bar{\alpha}\; P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j | \bar{\alpha}) \right]
\times \log_2 \left[ \int \frac{d^K\alpha\; P(\alpha)}{Z(\bar{\alpha}; N)}\, e^{-N D_{\rm KL}(\bar{\alpha} \| \alpha) + N \psi(\alpha, \bar{\alpha}; \{\vec{x}_i\})} \right],    (4.54)

where ψ is defined as in equation 4.29, and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) \le N \int d^K\alpha \int d^K\bar{\alpha}\; P(\alpha)\, P(\bar{\alpha})\, D_{\rm KL}(\bar{\alpha} \| \alpha) \equiv \left\langle N D_{\rm KL}(\bar{\alpha} \| \alpha) \right\rangle_{\bar{\alpha}\alpha},    (4.55)

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence, or of similar quantities, is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱα} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(x⃗ | α)} is finite. Then for any reasonable function F(x⃗; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P \left\{ \sup_{\beta} \frac{ \left| \frac{1}{N} \sum_j F(\vec{x}_j; \beta) - \int d\vec{x}\, Q(\vec{x} | \bar{\alpha})\, F(\vec{x}; \beta) \right| }{ L[F] } > \epsilon \right\} < M(\epsilon, N, d_{\rm VC})\, e^{-c N \epsilon^2},    (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review, see Vapnik, 1998, and the references therein).
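As an aside (ours, not part of the proof), the kind of uniform convergence that equation 4.56 describes can be seen at work in the simplest finite-d_VC setting, a unit-variance gaussian location family: the supremum over a grid of candidate parameters of the deviation between the empirical mean of ln[Q(x|ᾱ)/Q(x|α)] and its expectation, D_KL(ᾱ‖α), decays roughly as 1/√N. The grid, sample sizes, and trial counts below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_bar = 0.0                      # parameters that actually generate the data
alpha_grid = np.linspace(-2, 2, 81)  # a grid standing in for "all" competing models

def sup_deviation(N, n_trials=200):
    """sup over the grid of |empirical mean - expectation| of
    F(x; alpha) = ln[Q(x|alpha_bar) / Q(x|alpha)] for unit-variance gaussians."""
    devs = []
    for _ in range(n_trials):
        x = rng.normal(loc=alpha_bar, size=N)
        # ln Q(x|a) = -(x - a)^2 / 2 + const, so the log ratio is simple:
        emp = np.array([np.mean(0.5 * (x - a) ** 2 - 0.5 * (x - alpha_bar) ** 2)
                        for a in alpha_grid])
        exact = 0.5 * (alpha_grid - alpha_bar) ** 2   # D_KL(alpha_bar || alpha)
        devs.append(np.max(np.abs(emp - exact)))
    return np.mean(devs)

for N in (100, 400, 1600, 6400):
    print(f"N={N:5d}  mean sup deviation = {sup_deviation(N):.4f}  "
          f"(1/sqrt(N) = {1/np.sqrt(N):.4f})")
```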

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {x⃗_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x⃗_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus the worst-case contribution of these atypical sequences is

S_1^{(f),\,{\rm atypical}} \sim \int_{\delta}^{\infty} d\epsilon\; N^2 \epsilon^2\, M(\epsilon)\, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \quad \mbox{for large } N.    (4.57)


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and, throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\,{\rm typical}} \sim \int d^K\bar{\alpha}\; P(\bar{\alpha})\, d^N\vec{x} \prod_j Q(\vec{x}_j | \bar{\alpha})\; N\, (B \cdot A^{-1} \cdot B);

B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i | \bar{\alpha})}{\partial \bar{\alpha}_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0;

(A)_{\mu\nu} = -\frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i | \bar{\alpha})}{\partial \bar{\alpha}_\mu \partial \bar{\alpha}_\nu}, \qquad \langle A \rangle_{\vec{x}} = F.    (4.58)

Here ⟨⋯⟩_{x⃗} means an averaging with respect to all x⃗_i's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i | \bar{\alpha})}{\partial \bar{\alpha}_\mu} \left( F^{-1} \right)_{\mu\nu} \frac{\partial \log Q(\vec{x}_j | \bar{\alpha})}{\partial \bar{\alpha}_\nu}    (4.59)

over all x⃗_i's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ ⋯ can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x | \alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2} (x - \alpha)^2 \right], \quad x \in (-\infty; +\infty);    (4.60)

Q(x | \alpha) = \frac{ \exp\left( -\sin \alpha x \right) }{ \int_0^{2\pi} dx\, \exp\left( -\sin \alpha x \right) }, \quad x \in [0; 2\pi).    (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < α_max and zero elsewhere, with α_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin αx. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than α_max.10 So the fluctuations are large, and the predictive information is small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point, the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for α_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.
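To see the effect just described, here is a small numerical sketch (ours; the grid sizes, α_max, and sample sizes are arbitrary choices): we draw samples from Q(x|α) of equation 4.61 with α = 1, scan a uniform grid of candidate parameters up to α_max, and report the maximum-likelihood value. For small N, the winner is typically a spuriously large α that places most of the points near the crests of −sin αx; only for larger N does the estimate settle near the true value.

```python
import numpy as np

rng = np.random.default_rng(4)
a_true, a_max = 1.0, 200.0
a_grid = np.linspace(0.01, a_max, 2001)     # candidate parameters in the prior box

def log_likelihood(a, x, n_quad=2001):
    """Mean log Q(x|a) for Q(x|a) = exp(-sin(a x)) / Z(a) on [0, 2 pi)."""
    t = np.linspace(0.0, 2 * np.pi, n_quad)
    Z = np.trapz(np.exp(-np.sin(np.outer(a, t))), t, axis=-1)
    return -np.sin(np.outer(a, x)).mean(axis=-1) - np.log(Z)

def sample_from_true(N):
    """Rejection sampling from Q(x | a_true); the envelope is e = max exp(-sin)."""
    out = []
    while len(out) < N:
        x = rng.uniform(0, 2 * np.pi, size=4 * N)
        keep = rng.uniform(0, np.e, size=x.size) < np.exp(-np.sin(a_true * x))
        out.extend(x[keep].tolist())
    return np.array(out[:N])

for N in (10, 100, 1000, 3000):
    x = sample_from_true(N)
    a_hat = a_grid[np.argmax(log_likelihood(a_grid, x))]
    print(f"N={N:5d}  maximum-likelihood a = {a_hat:8.2f}   (true a = {a_true})")
```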

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context, there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient, and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^{(d-2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right], \quad \mu > 0,    (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-ND]
\approx A(\bar{\alpha}) \int dD \exp\left[ -\frac{B(\bar{\alpha})}{D^\mu} - ND \right]
\approx \tilde{A}(\bar{\alpha}) \exp\left[ -\frac{1}{2}\, \frac{\mu + 2}{\mu + 1} \ln N - C(\bar{\alpha})\, N^{\mu/(\mu+1)} \right],    (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu + 1} \right]^{1/2} \cdot \left[ B(\bar{\alpha}) \right]^{1/(2\mu+2)},    (4.64)

C(\bar{\alpha}) = \left[ B(\bar{\alpha}) \right]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].    (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S_1^{(a)}(N) \to \frac{1}{\ln 2} \left\langle C(\bar{\alpha}) \right\rangle_{\bar{\alpha}}\, N^{\mu/(\mu+1)} \ \ \mbox{(bits)},    (4.66)

where ⟨⋯⟩_ᾱ denotes an average over all the target models.
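A direct numerical check (ours; the constants are arbitrary) of this power law: take ρ(D) of the form in equation 4.62, evaluate Z(N) = ∫ dD ρ(D) e^{−ND} on a grid around the saddle point, and fit the growth of −log_2 Z(N). The fitted exponent sits slightly below μ/(μ+1) at moderate N because of the logarithmic correction in equation 4.63.

```python
import numpy as np

A, B, mu = 1.0, 1.0, 1.0      # density rho(D) = A exp(-B / D^mu), as in equation 4.62

def minus_log2_Z(N, n_grid=200_001):
    """-log2 Z(N) with Z(N) = int dD rho(D) exp(-N D) (cf. equation 4.63), evaluated
    on a grid centered on the saddle point D* = (mu B / N)^(1/(mu+1)), with a
    log-sum-exp trick so that very small values of Z do not underflow."""
    D_star = (mu * B / N) ** (1.0 / (mu + 1.0))
    D = np.linspace(D_star / 50, 50 * D_star, n_grid)
    log_f = np.log(A) - B / D**mu - N * D
    m = log_f.max()
    return -(m + np.log(np.trapz(np.exp(log_f - m), D))) / np.log(2)

Ns = np.array([10, 100, 1000, 10_000, 100_000])
S1_proxy = np.array([minus_log2_Z(N) for N in Ns])   # the per-target part of eq. 4.44
exponent = np.polyfit(np.log(Ns), np.log(S1_proxy), 1)[0]
print("-log2 Z(N):", np.round(S1_proxy, 1))
print(f"growth exponent ~ {exponent:.2f}   (prediction mu/(mu+1) = {mu/(mu+1):.2f})")
```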


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/2N bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^{ν-1} bits. Summing this up over all samples, we find S_1^{(a)} ∼ N^ν, and if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
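The counting argument in the preceding paragraph can be checked directly (a minimal sketch, ours, with an arbitrary choice of ν): if the nth example contributes ≈ K(n)/(2n ln 2) bits and K(n) = n^ν, the accumulated total indeed grows as N^ν.

```python
import numpy as np

nu = 0.5                                      # number of histogram bins K(n) = n^nu
n = np.arange(1, 10**6 + 1)
# each new example contributes ~ K(n) / (2 n ln 2) bits (cf. equation 4.45)
S1 = np.cumsum(n**nu / (2 * n * np.log(2)))

# the accumulated subextensive entropy should grow as N^nu at large N
fit = np.polyfit(np.log(n[10**4:]), np.log(S1[10**4:]), 1)[0]
print(f"fitted growth exponent = {fit:.3f}   (nu = {nu})")
```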

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l₀) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^2 \right] \delta\left[ \frac{1}{l_0} \int dx\, e^{-\phi(x)} - 1 \right],   (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x1, x2, ..., xN converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.11

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l₀) exp[−φ(x)], we can write the KL divergence as

D_{KL}[\bar{\phi}(x) \| \phi(x)] = \frac{1}{l_0} \int dx\, \exp[-\bar{\phi}(x)]\, [\phi(x) - \bar{\phi}(x)].   (4.68)
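For completeness (and using natural logarithms, so that D is measured in nats), equation 4.68 follows directly from the definitions:

D_{KL}[\bar{Q} \| Q] = \int dx\, \bar{Q}(x) \ln\frac{\bar{Q}(x)}{Q(x)} = \int dx\, \frac{1}{l_0} e^{-\bar{\phi}(x)} \ln\frac{e^{-\bar{\phi}(x)}}{e^{-\phi(x)}} = \frac{1}{l_0} \int dx\, e^{-\bar{\phi}(x)} [\phi(x) - \bar{\phi}(x)],

where \bar{Q}(x) = (1/l_0)\exp[-\bar{\phi}(x)] denotes the target distribution.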

We want to compute the density

\rho(D; \bar{\phi}) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( D - D_{KL}[\bar{\phi}(x) \| \phi(x)] \big)   (4.69)

= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( MD - M D_{KL}[\bar{\phi}(x) \| \phi(x)] \big),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar{\phi}) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)] \exp(-izM D_{KL})   (4.71)

= M \int \frac{dz}{2\pi} \exp\left( izMD + \frac{izM}{l_0} \int dx\, \bar{\phi}(x) \exp[-\bar{\phi}(x)] \right)

\quad\times \int [d\phi(x)]\, P[\phi(x)] \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar{\phi}(x)] \right).   (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)] \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar{\phi}(x)] \right)

= \exp\left( -\frac{izM}{l_0} \int dx\, \bar{\phi}(x) \exp[-\bar{\phi}(x)] - S_{\rm eff}[\bar{\phi}(x); zM] \right),   (4.73)

S_{\rm eff}[\bar{\phi}; zM] = \frac{\ell}{2} \int dx \left( \frac{\partial\bar{\phi}}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{\ell l_0} \right)^{1/2} \int dx\, \exp[-\bar{\phi}(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} ≫ 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar{\phi}) = A[\bar{\phi}(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar{\phi}(x)]}{D} \right),   (4.75)

A[\bar{\phi}(x)] = \frac{1}{\sqrt{16\pi \ell l_0}} \exp\left[ -\frac{\ell}{2} \int dx\, (\partial_x \bar{\phi})^2 \right] \times \int dx\, \exp[-\bar{\phi}(x)/2],   (4.76)

B[\bar{\phi}(x)] = \frac{1}{16 \ell l_0} \left( \int dx\, \exp[-\bar{\phi}(x)/2] \right)^2.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with m = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \approx \frac{1}{2\ln 2} \sqrt{\frac{\pi}{\ell l_0}} \left\langle \int dx\, \exp[-\bar{\phi}(x)/2] \right\rangle_{\bar{\phi}} N^{1/2},   (4.78)

where ⟨· · ·⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \approx \frac{1}{2\ln 2} \sqrt{\pi N} \left( \frac{L}{\ell} \right)^{1/2} \ {\rm bits}.   (4.79)
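To see schematically why the essential singularity in equation 4.75 produces this N^{1/2} growth, one can use a Laplace (saddle-point) estimate. Writing the leading relation between the subextensive entropy and the density of divergences in the schematic form used for the finite-dimensional examples (constant prefactors, including those coming from the D^{-3/2} factor, are not tracked here),

S_1(N) \sim -\log_2 \int_0^\infty dD\, \rho(D; \bar{\phi})\, e^{-ND}, \qquad \rho(D \to 0; \bar{\phi}) \sim \exp\left( -\frac{B[\bar{\phi}]}{D} \right),

the exponent -B/D - ND is extremal at D^* = \sqrt{B/N}, so that S_1(N) \sim 2\sqrt{B[\bar{\phi}]\,N}/\ln 2. Since \sqrt{B[\bar{\phi}]} \propto \int dx\, \exp[-\bar{\phi}(x)/2] by equation 4.77, this reproduces both the N^{1/2} scaling and the dependence on the target distribution seen in equation 4.78.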

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(ℓ/(NQ(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
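A minimal numerical sketch of this bin-counting argument (an illustration, not part of the original calculation; the grid, range L, and value of ℓ are arbitrary choices):

import numpy as np

# Adaptive binning: local bin size ~ sqrt(ell / (N * Q(x))), so the number of
# bins that fit in the range of x is roughly  integral dx sqrt(N * Q(x) / ell),
# which grows as sqrt(N) for any fixed density Q.
def n_bins(N, Q, x, ell=1.0):
    dx = x[1] - x[0]
    return np.sum(np.sqrt(N * Q / ell)) * dx

L = 10.0
x = np.linspace(0.0, L, 10_000)
Q = np.full_like(x, 1.0 / L)  # uniform density on [0, L]

for N in (10**2, 10**4, 10**6):
    # the two printed numbers agree up to grid error, and both scale as sqrt(N)
    print(N, n_bins(N, Q, x), np.sqrt(N * L))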

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities n ∼ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). Real calculations for these cases, however, remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.12 Rissanen (1984) and Clarke and Barron (1990), in full generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimation with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x1, x2, ..., xN will require an amount of space equal to the entropy of the joint distribution P(x1, x2, ..., xN). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x1, x2, ..., xN).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log₂ N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or of not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S₀ and thus are functions only of disorder, which is not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number 𝒩 of walks with N steps, we find that 𝒩(N) ≈ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S₀ = log₂ z, a constant subextensive entropy log₂ A, and a divergent subextensive term S₁(N) → γ log₂ N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],   (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2\left( \frac{P[x(t)]}{Q[x(t)]} \right),   (5.2)

which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures: the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).
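In the formulation of Tishby, Pereira, and Bialek (1999), one way to parameterize this family (a sketch in our notation; the trade-off parameter β is not defined in the text above) is the variational problem

\min_{P(\hat{x}_{\rm past} \mid x_{\rm past})} \Big[ I(\hat{X}_{\rm past}; X_{\rm past}) - \beta\, I(\hat{X}_{\rm past}; X_{\rm future}) \Big],

where sweeping β > 0 from small to large values traces out the optimal curve of I(x̂_past; x_future) versus I(x̂_past; x_past) referred to above.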

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ Ipred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = Ipred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning, in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ≈ AN^ν and one with K parameters so that Ipred(N) ≈ (K/2) log₂ N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2A\nu} \log_2\left( \frac{K}{2A} \right) \right]^{1/\nu}.   (6.1)
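This expression can be checked by equating the two asymptotic forms quoted above (a sketch; only logarithmic accuracy is claimed). Setting A N_c^\nu = (K/2)\log_2 N_c and substituting the lowest-order estimate \log_2 N_c \approx (1/\nu)\log_2(K/2A) on the right-hand side gives

N_c \approx \left[ \frac{K}{2A\nu} \log_2\left( \frac{K}{2A} \right) \right]^{1/\nu},

which is equation 6.1.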

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ≈ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ≈ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
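As a rough numerical illustration, equation 6.1 can be evaluated directly; only K and N are quoted above, so the values of A and ν below are hypothetical choices made for the sake of the example:

import numpy as np

def N_c(K, A, nu):
    """Critical sample size from equation 6.1: [K/(2 A nu) * log2(K/(2A))]**(1/nu)."""
    return (K / (2.0 * A * nu) * np.log2(K / (2.0 * A))) ** (1.0 / nu)

K = 5e9   # synapses in ~10 mm^2 of inferotemporal cortex (quoted above)
N = 1e9   # rough bound on the number of fixations in 10 years of waking life

for nu in (0.5, 0.75):
    for A in (1.0, 10.0, 100.0):
        print(f"nu={nu}, A={A}: N_c ~ {N_c(K, A, nu):.2e}, N/N_c ~ {N / N_c(K, A, nu):.1e}")
# For these (hypothetical) choices, N_c exceeds N by several orders of magnitude,
# consistent with the claim that biology operates in the regime N << N_c.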

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the preceding N letters (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S₀. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S₁(N) ∼ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
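A minimal sketch of Shannon's guessing-game bound, with the context length fixed at two letters and a simple frequency model standing in for the human reader (the model, toy corpus, and context length are arbitrary choices made for illustration):

from collections import Counter, defaultdict
from math import log2

def guess_entropy(text, order=2):
    """Entropy (bits/letter) of the guess-number distribution in Shannon's game,
    using an order-`order` letter-frequency model as the guesser."""
    split = len(text) // 2
    train, test = text[:split], text[split:]
    context_counts = defaultdict(Counter)
    global_counts = Counter(train)
    for i in range(order, len(train)):
        context_counts[train[i - order:i]][train[i]] += 1

    alphabet = sorted(global_counts)
    guesses = Counter()
    for i in range(order, len(test)):
        context, actual = test[i - order:i], test[i]
        if actual not in global_counts:
            continue  # skip letters never seen in training
        # Rank candidates: by count in this context, then by global frequency.
        ranking = sorted(alphabet, key=lambda c: (-context_counts[context][c],
                                                  -global_counts[c], c))
        guesses[ranking.index(actual) + 1] += 1  # number of guesses needed

    total = sum(guesses.values())
    return -sum((n / total) * log2(n / total) for n in guesses.values())

text = "the quick brown fox jumps over the lazy dog " * 200  # toy corpus
print(guess_entropy(text, order=2), "bits per letter (upper bound)")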

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/ followed by the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.

Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.

Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.

Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.

Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.

Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.

Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.

Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.

Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.

Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.

Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.

Bennett, C. H. (1985). Dissipation, information, computational complexity, and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.

Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.

Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.

Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.

Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.

Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.

Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.

Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.

Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.

Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.

Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.

Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.

Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.

Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.

Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.

Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.

Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.

Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.

Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.

Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.

Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.

Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.

de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.

Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.

Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.

Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.

Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.

Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.

Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.

Haussler, D., & Opper, M. (1997). Mutual information, metric entropy, and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.

Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.

Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.

Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.


as the more conventional problem of answering the questions themselves. Our focus here is on prediction because we want to make connections with the phenomenon of generalization in learning, but it is clear that generating a consistent interpretation of observed data may involve elements of both prediction and postdiction (see, for example, Eagleman & Sejnowski, 2000); it is attractive that the information-theoretic formulation treats these problems symmetrically.

In the same way that the entropy of a gas at fixed density is proportional to the volume, the entropy of a time series (asymptotically) is proportional to its duration, so that \lim_{T\to\infty} S(T)/T = S_0; entropy is an extensive quantity. But from equation 3.3, any extensive component of the entropy cancels in the computation of the predictive information: predictability is a deviation from extensivity. If we write S(T) = S_0 T + S_1(T), then equation 3.3 tells us that the predictive information is related only to the nonextensive term S_1(T). Note that if we are observing a deterministic system, then S_0 = 0, but this is independent of questions about the structure of the subextensive term S_1(T). It is attractive that information theory gives us a unified discussion of prediction in deterministic and probabilistic cases.

We know two general facts about the behavior of S_1(T). First, the corrections to extensive behavior are positive, S_1(T) \geq 0. Second, the statement that entropy is extensive is the statement that the limit

\lim_{T\to\infty} \frac{S(T)}{T} = S_0    (3.4)

exists, and for this to be true we must also have

\lim_{T\to\infty} \frac{S_1(T)}{T} = 0.    (3.5)

Thus, the nonextensive terms in the entropy must be subextensive, that is, they must grow with T less rapidly than a linear function. Taken together, these facts guarantee that the predictive information is positive and subextensive. Furthermore, if we let the future extend forward for a very long time, T' \to \infty, then we can measure the information that our sample provides about the entire future:

I_{\rm pred}(T) = \lim_{T'\to\infty} I_{\rm pred}(T, T') = S_1(T).    (3.6)

Similarly, instead of increasing the duration of the future to infinity, we could have considered the mutual information between a sample of length T and all of the infinite past. Then the postdictive information also is equal to S_1(T), and the symmetry between prediction and postdiction is even more profound: not only is there symmetry between questions and answers, but observations on a given period of time provide the same amount of information about the historical path that led to our observations as about the future that will unfold from them. In some cases, this statement becomes even stronger. For example, if the subextensive entropy of a long discontinuous observation of a total length T with a gap of a duration \delta T \ll T is equal to S_1(T) + O(\delta T / T), then the subextensive entropy of the present is not only its information about the past or the future but also the information about the past and the future.

If we have been observing a time series for a (long) time T, then the total amount of data we have collected is measured by the entropy S(T), and at large T this is given approximately by S_0 T. But the predictive information that we have gathered cannot grow linearly with time, even if we are making predictions about a future that stretches out to infinity. As a result, of the total information we have taken in by observing x_{\rm past}, only a vanishing fraction is of relevance to the prediction:

\lim_{T\to\infty} \frac{\text{Predictive information}}{\text{Total information}} = \frac{I_{\rm pred}(T)}{S(T)} \to 0.    (3.7)

In this precise sense, most of what we observe is irrelevant to the problem of predicting the future.^4

Consider the case where time is measured in discrete steps, so that we have seen N time points x_1, x_2, \ldots, x_N. How much have we learned about the underlying pattern in these data? The more we know, the more effectively we can predict the next data point x_{N+1}, and hence the fewer bits we will need to describe the deviation of this data point from our prediction. Our accumulated knowledge about the time series is measured by the degree to which we can compress the description of new observations. On average, the length of the code word required to describe the point x_{N+1}, given that we have seen the previous N points, is given by

\ell(N) = -\langle \log_2 P(x_{N+1} | x_1, x_2, \ldots, x_N) \rangle \ \text{bits},    (3.8)

where the expectation value is taken over the joint distribution of all the N+1 points, P(x_1, x_2, \ldots, x_N, x_{N+1}). It is easy to see that

\ell(N) = S(N+1) - S(N) \approx \frac{\partial S(N)}{\partial N}.    (3.9)

As we observe for longer times, we learn more, and this word length decreases. It is natural to define a learning curve that measures this improvement. Usually we define learning curves by measuring the frequency or costs of errors; here the cost is that our encoding of the point x_{N+1} is longer than it could be if we had perfect knowledge. This ideal encoding has a length that we can find by imagining that we observe the time series for an infinitely long time, \ell_{\rm ideal} = \lim_{N\to\infty} \ell(N), but this is just another way of defining the extensive component of the entropy S_0. Thus we can define a learning curve

\Lambda(N) \equiv \ell(N) - \ell_{\rm ideal}    (3.10)
          = S(N+1) - S(N) - S_0
          = S_1(N+1) - S_1(N)
          \approx \frac{\partial S_1(N)}{\partial N} = \frac{\partial I_{\rm pred}(N)}{\partial N},    (3.11)

^4 We can think of equation 3.7 as a law of diminishing returns. Although we collect data in proportion to our observation time T, a smaller and smaller fraction of this information is useful in the problem of prediction. These diminishing returns are not due to a limited lifetime, since we calculate the predictive information assuming that we have a future extending forward to infinity. A senior colleague points out that this is an argument for changing fields before becoming too expert.

and we see once again that the extensive component of the entropy cancels.

It is well known that the problems of prediction and compression are related, and what we have done here is to illustrate one aspect of this connection. Specifically, if we ask how much one segment of a time series can tell us about the future, the answer is contained in the subextensive behavior of the entropy. If we ask how much we are learning about the structure of the time series, then the natural and universally defined learning curve is related again to the subextensive entropy; the learning curve is the derivative of the predictive information.
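
The following is a minimal numerical sketch (ours, not from the original) of equations 3.8 through 3.11 for the simplest nontrivial source we could think of: a binary sequence whose bias is drawn from a uniform prior. The Bayesian predictive probability after seeing n1 ones in N flips is (n1 + 1)/(N + 2), and the learning curve should fall off roughly as 1/(2 N ln 2), the derivative of (1/2) log2 N.

    # Sketch (ours): Monte Carlo estimate of the learning curve Lambda(N) of
    # equations 3.10-3.11 for a coin whose bias is drawn from a uniform prior.
    import numpy as np

    rng = np.random.default_rng(0)

    def h2(p):
        """Binary entropy in bits, safe at p = 0 or 1."""
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    def learning_curve(N, trials=200000):
        theta = rng.uniform(size=trials)              # parameter drawn from the prior
        n1 = rng.binomial(N, theta)                   # sufficient statistic of the past
        x_next = rng.uniform(size=trials) < theta     # the next observation
        p_pred = (n1 + 1.0) / (N + 2.0)               # Bayesian predictive probability
        codelen = -np.log2(np.where(x_next, p_pred, 1 - p_pred))   # ell(N) per sample
        return np.mean(codelen - h2(theta))           # Lambda(N) = ell(N) - ell_ideal

    for N in [10, 30, 100, 300, 1000]:
        print(N, learning_curve(N), 1.0 / (2 * N * np.log(2)))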

This universal learning curve is connected to the more conventional learning curves in specific contexts. As an example (cf. section 4.1), consider fitting a set of data points \{x_n, y_n\} with some class of functions y = f(x; \alpha), where \alpha are unknown parameters that need to be learned; we also allow for some gaussian noise in our observation of the y_n. Here the natural learning curve is the evolution of \chi^2 for generalization as a function of the number of examples. Within the approximations discussed below, it is straightforward to show that as N becomes large,

\langle \chi^2(N) \rangle = \frac{1}{\sigma^2} \left\langle [y - f(x; \alpha)]^2 \right\rangle \to (2 \ln 2)\,\Lambda(N) + 1,    (3.12)

where \sigma^2 is the variance of the noise. Thus, a more conventional measure of performance at learning a function is equal to the universal learning curve defined purely by information-theoretic criteria. In other words, if a learning curve is measured in the right units, then its integral represents the amount of the useful information accumulated. Then the subextensivity of S_1 guarantees that the learning curve decreases to zero as N \to \infty.
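
One way to see where equation 3.12 comes from (this short derivation is our reconstruction, under the same approximations used in section 4.1: gaussian noise and a posterior sharply peaked at an estimate \hat{\alpha}):

    % Code length of a new pair (x, y) given the estimate \hat{\alpha} built from N examples:
    \ell(N) \approx S_x + \tfrac{1}{2}\log_2(2\pi\sigma^2)
            + \frac{\langle [y - f(x;\hat{\alpha})]^2 \rangle}{2\sigma^2 \ln 2},
    % while the ideal encoding uses the true parameters, for which
    % \langle [y - f(x;\bar{\alpha})]^2 \rangle = \sigma^2.  Subtracting,
    \Lambda(N) = \ell(N) - \ell_{\rm ideal}
               = \frac{1}{2\ln 2}\left[ \frac{\langle [y - f(x;\hat{\alpha})]^2 \rangle}{\sigma^2} - 1 \right]
    \quad\Longrightarrow\quad
    \langle \chi^2(N) \rangle \approx (2\ln 2)\,\Lambda(N) + 1 .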

Different quantities related to the subextensive entropy have been discussed in several contexts. For example, the code length \ell(N) has been defined as a learning curve in the specific case of neural networks (Opper & Haussler, 1995) and has been termed the "thermodynamic dive" (Crutchfield & Shalizi, 1999) and "Nth order block entropy" (Grassberger, 1986). The universal learning curve \Lambda(N) has been studied as the expected instantaneous information gain by Haussler, Kearns, & Schapire (1994). Mutual information between all of the past and all of the future (both semi-infinite) is known also as the excess entropy, effective measure complexity, stored information, and so on (see Shalizi & Crutchfield, 1999, and references therein, as well as the discussion below). If the data allow a description by a model with a finite (and in some cases also infinite) number of parameters, then mutual information between the data and the parameters is of interest. This is easily shown to be equal to the predictive information about all of the future, and it is also the cumulative information gain (Haussler, Kearns, & Schapire, 1994) or the cumulative relative entropy risk (Haussler & Opper, 1997). Investigation of this problem can be traced back to Renyi (1964) and Ibragimov and Hasminskii (1972), and some particular topics are still being discussed (Haussler & Opper, 1995; Opper & Haussler, 1995; Herschkowitz & Nadal, 1999). In certain limits, decoding a signal from a population of N neurons can be thought of as "learning" a parameter from N observations, with a parallel notion of information transmission (Brunel & Nadal, 1998; Kang & Sompolinsky, 2001). In addition, the subextensive component of the description length (Rissanen, 1978, 1989, 1996; Clarke & Barron, 1990) averaged over a class of allowed models also is similar to the predictive information. What is important is that the predictive information or subextensive entropy is related to all these quantities and that it can be defined for any process without a reference to a class of models. It is this universality that we find appealing, and this universality is strongest if we focus on the limit of long observation times. Qualitatively, in this regime (T \to \infty) we expect the predictive information to behave in one of three different ways, as illustrated by the Ising models above: it may either stay finite or grow to infinity together with T; in the latter case, the rate of growth may be slow (logarithmic) or fast (sublinear power) (see Barron and Cover, 1991, for a similar classification in the framework of the minimal description length [MDL] analysis).

The first possibility, \lim_{T\to\infty} I_{\rm pred}(T) = {\rm constant}, means that no matter how long we observe, we gain only a finite amount of information about the future. This situation prevails, for example, when the dynamics are too regular. For a purely periodic system, complete prediction is possible once we know the phase, and if we sample the data at discrete times, this is a finite amount of information; longer period orbits intuitively are more complex and also have larger I_{\rm pred}, but this does not change the limiting behavior \lim_{T\to\infty} I_{\rm pred}(T) = {\rm constant}.

Alternatively, the predictive information can be small when the dynamics are irregular, but the best predictions are controlled only by the immediate past, so that the correlation times of the observable data are finite (see, for example, Crutchfield & Feldman, 1997, and the fixed J case in Figure 3). Imagine, for example, that we observe x(t) at a series of discrete times \{t_n\}, and that at each time point we find the value x_n. Then we always can write the joint distribution of the N data points as a product,

P(x_1, x_2, \ldots, x_N) = P(x_1) P(x_2 | x_1) P(x_3 | x_2, x_1) \cdots.    (3.13)

For Markov processes, what we observe at t_n depends only on events at the previous time step t_{n-1}, so that

P(x_n | \{x_{1 \le i \le n-1}\}) = P(x_n | x_{n-1}),    (3.14)

and hence the predictive information reduces to

I_{\rm pred} = \left\langle \log_2 \frac{P(x_n | x_{n-1})}{P(x_n)} \right\rangle.    (3.15)

The maximum possible predictive information in this case is the entropy of the distribution of states at one time step, which is bounded by the logarithm of the number of accessible states. To approach this bound, the system must maintain memory for a long time, since the predictive information is reduced by the entropy of the transition probabilities. Thus, systems with more states and longer memories have larger values of I_{\rm pred}.
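
As a concrete illustration (ours, not from the original), equation 3.15 is easy to evaluate for a two-state Markov chain with a transition matrix of our choosing; the result is bounded by the one-step state entropy, as the text describes.

    # Sketch (ours): evaluate equation 3.15 for a two-state Markov chain.
    import numpy as np

    T = np.array([[0.9, 0.2],      # T[x', x] = P(x_n = x' | x_{n-1} = x); columns sum to 1
                  [0.1, 0.8]])

    # stationary distribution: eigenvector of T with eigenvalue 1
    w, v = np.linalg.eig(T)
    p = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    p /= p.sum()

    I_pred = sum(p[x] * T[xn, x] * np.log2(T[xn, x] / p[xn])
                 for x in range(2) for xn in range(2))
    H_state = -np.sum(p * np.log2(p))      # upper bound on I_pred
    print(I_pred, H_state)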

More interesting are those cases in which I_{\rm pred}(T) diverges at large T. In physical systems, we know that there are critical points where correlation times become infinite, so that optimal predictions will be influenced by events in the arbitrarily distant past. Under these conditions, the predictive information can grow without bound as T becomes large; for many systems the divergence is logarithmic, I_{\rm pred}(T \to \infty) \propto \ln T, as for the variable J_{ij}, short-range Ising model of Figures 2 and 3. Long-range correlations also are important in a time series where we can learn some underlying rules. It will turn out that when the set of possible rules can be described by a finite number of parameters, the predictive information again diverges logarithmically, and the coefficient of this divergence counts the number of parameters. Finally, a faster growth is also possible, so that I_{\rm pred}(T \to \infty) \propto T^a, as for the variable J_{ij} long-range Ising model, and we shall see that this behavior emerges from, for example, nonparametric learning problems.

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations or associations persist over long periods of time. In the usual theoretical models, there is some rule underlying the observable data, and this rule is valid forever; examples seen at one time inform us about the rule, and this information can be used to make predictions or generalizations. The predictive information quantifies the average generalization power of examples, and we shall see that there is a direct connection between the predictive information and the complexity of the possible underlying rules.


4.1 A Test Case. Let us begin with a simple example, already referred to above. We observe two streams of data x and y, or equivalently a stream of pairs (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N). Assume that we know in advance that the x's are drawn independently and at random from a distribution P(x), while the y's are noisy versions of some function acting on x,

y_n = f(x_n; \alpha) + \eta_n,    (4.1)

where f(x; \alpha) is a class of functions parameterized by \alpha, and \eta_n is noise, which for simplicity we will assume is gaussian with known standard deviation \sigma. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

f(x; \alpha) = \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x).    (4.2)
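
A small generative sketch of equations 4.1 and 4.2 (ours; the basis functions, the prior over \alpha, and P(x) are illustrative choices, not specified in the text):

    # Sketch (ours) of the generative model in equations 4.1-4.2.
    import numpy as np

    rng = np.random.default_rng(1)
    K, sigma = 3, 0.1

    def phi(x):
        """Basis functions phi_mu(x), mu = 1..K (here: low-order cosines)."""
        return np.stack([np.cos(mu * x) for mu in range(1, K + 1)], axis=-1)

    def generate(N, alpha_bar):
        x = rng.uniform(0.0, 2 * np.pi, size=N)                # x ~ P(x)
        y = phi(x) @ alpha_bar + sigma * rng.normal(size=N)    # eq. 4.1 with eq. 4.2
        return x, y

    alpha_bar = rng.normal(size=K)   # parameters drawn once, at the start of the stream
    x, y = generate(1000, alpha_bar)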

The usual problem is to estimate, from N pairs \{x_i, y_i\}, the values of the parameters \alpha. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy S(N). We go through the calculation in some detail because it provides a model for the more general case.

To evaluate the entropy S(N), we first construct the probability distribution P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N). The same set of rules applies to the whole data stream, which here means that the same parameters \alpha apply for all pairs \{x_i, y_i\}, but these parameters are chosen at random from a distribution \mathcal{P}(\alpha) at the start of the stream. Thus we write

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \int d^K\alpha \, P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N | \alpha) \, \mathcal{P}(\alpha),    (4.3)

and now we need to construct the conditional distributions for fixed \alpha. By hypothesis, each x is chosen independently, and once we fix \alpha, each y_i is correlated only with the corresponding x_i, so that we have

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N | \alpha) = \prod_{i=1}^{N} \left[ P(x_i) \, P(y_i | x_i; \alpha) \right].    (4.4)

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of y_i has the form

P(y_i | x_i; \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} \left( y_i - \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x_i) \right)^2 \right].    (4.5)


Putting all these factors together,

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)
  = \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \int d^K\alpha \, \mathcal{P}(\alpha) \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} y_i^2 \right]
  \times \exp\left[ -\frac{N}{2} \sum_{\mu,\nu=1}^{K} A_{\mu\nu}(\{x_i\}) \alpha_\mu \alpha_\nu + N \sum_{\mu=1}^{K} B_\mu(\{x_i, y_i\}) \alpha_\mu \right],    (4.6)

where

A_{\mu\nu}(\{x_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} \phi_\mu(x_i) \phi_\nu(x_i), \quad \text{and}    (4.7)

B_\mu(\{x_i, y_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} y_i \phi_\mu(x_i).    (4.8)

Our placement of the factors of N means that both A_{\mu\nu} and B_\mu are of order unity as N \to \infty. These quantities are empirical averages over the samples \{x_i, y_i\}, and if the \phi_\mu are well behaved, we expect that these empirical means converge to expectation values for most realizations of the series \{x_i\}:

\lim_{N\to\infty} A_{\mu\nu}(\{x_i\}) = A^{\infty}_{\mu\nu} = \frac{1}{\sigma^2} \int dx\, P(x)\, \phi_\mu(x) \phi_\nu(x), \quad \text{and}    (4.9)

\lim_{N\to\infty} B_\mu(\{x_i, y_i\}) = B^{\infty}_\mu = \sum_{\nu=1}^{K} A^{\infty}_{\mu\nu} \bar{\alpha}_\nu,    (4.10)

where \bar{\alpha} are the parameters that actually gave rise to the data stream \{x_i, y_i\}. In fact, we can make the same argument about the terms in \sum y_i^2,

\lim_{N\to\infty} \sum_{i=1}^{N} y_i^2 = N \sigma^2 \left[ \sum_{\mu,\nu=1}^{K} \bar{\alpha}_\mu A^{\infty}_{\mu\nu} \bar{\alpha}_\nu + 1 \right].    (4.11)

Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for and implications of this convergence breaking down.
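
The convergence in equations 4.9 and 4.10 is easy to watch numerically; the sketch below (ours) uses the same illustrative cosine basis as above, for which A^{\infty} works out to (1/2\sigma^2) times the identity.

    # Sketch (ours): empirical A and B of equations 4.7-4.8 converge to the
    # expectation values of equations 4.9-4.10 as N grows.
    import numpy as np

    rng = np.random.default_rng(2)
    K, sigma = 3, 0.1
    phi = lambda x: np.stack([np.cos(m * x) for m in range(1, K + 1)], axis=-1)

    alpha_bar = rng.normal(size=K)
    A_inf = 0.5 * np.eye(K) / sigma**2       # (1/sigma^2) Int dx P(x) phi_mu(x) phi_nu(x)
    B_inf = A_inf @ alpha_bar                # eq. 4.10

    for N in [100, 10000, 1000000]:
        x = rng.uniform(0.0, 2 * np.pi, size=N)
        y = phi(x) @ alpha_bar + sigma * rng.normal(size=N)
        P = phi(x)
        A_emp = P.T @ P / (sigma**2 * N)     # eq. 4.7
        B_emp = P.T @ y / (sigma**2 * N)     # eq. 4.8
        print(N, np.abs(A_emp - A_inf).max(), np.abs(B_emp - B_inf).max())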

Putting the different factors together, we obtain

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) \simeq \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \int d^K\alpha \, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right],    (4.12)

where the effective "energy" per sample is given by

E_N(\alpha; \{x_i, y_i\}) = \frac{1}{2} + \frac{1}{2} \sum_{\mu,\nu=1}^{K} (\alpha_\mu - \bar{\alpha}_\mu) A^{\infty}_{\mu\nu} (\alpha_\nu - \bar{\alpha}_\nu).    (4.13)

Here we use the symbol \simeq to indicate that we not only take the limit of large N but also neglect the fluctuations. Note that in this approximation, the dependence on the sample points themselves is hidden in the definition of \bar{\alpha} as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor N in the exponent; the energy E_N is of order unity as N \to \infty. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

\int d^K\alpha \, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right]
  \approx \mathcal{P}(\alpha_{\rm cl}) \exp\left[ -N E_N(\alpha_{\rm cl}; \{x_i, y_i\}) - \frac{K}{2} \ln\frac{N}{2\pi} - \frac{1}{2} \ln\det \mathcal{F}_N + \cdots \right],    (4.14)

where \alpha_{\rm cl} is the "classical" value of \alpha determined by the extremal conditions

\left. \frac{\partial E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu} \right|_{\alpha = \alpha_{\rm cl}} = 0,    (4.15)

the matrix \mathcal{F}_N consists of the second derivatives of E_N,

\mathcal{F}_N = \left. \frac{\partial^2 E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu \partial \alpha_\nu} \right|_{\alpha = \alpha_{\rm cl}},    (4.16)

and \cdots denotes terms that vanish as N \to \infty. If we formulate the problem of estimating the parameters \alpha from the samples \{x_i, y_i\}, then as N \to \infty the matrix N\mathcal{F}_N is the Fisher information matrix (Cover & Thomas, 1991); the eigenvectors of this matrix give the principal axes for the error ellipsoid in parameter space, and the (inverse) eigenvalues give the variances of parameter estimates along each of these directions. The classical \alpha_{\rm cl} differs from \bar{\alpha} only in terms of order 1/N; we neglect this difference and further simplify the calculation of leading terms as N becomes large. After a little more algebra, we find the probability distribution we have been looking for:

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)
  \simeq \left[ \prod_{i=1}^{N} P(x_i) \right] \frac{1}{Z_A}\, \mathcal{P}(\bar{\alpha}) \exp\left[ -\frac{N}{2} \ln(2\pi e \sigma^2) - \frac{K}{2} \ln N + \cdots \right],    (4.17)

where the normalization constant

Z_A = \sqrt{(2\pi)^K \det A^{\infty}}.    (4.18)

Again we note that the sample points \{x_i, y_i\} are hidden in the value of \bar{\alpha} that gave rise to these points.^5

To evaluate the entropy S(N), we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the x_i, and because these are chosen independently from the distribution P(x), the average again is easy to evaluate. The third term involves \bar{\alpha}, and we need to average this over the joint distribution P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N). As above, we can evaluate this average in steps: first, we choose a value of the parameters \bar{\alpha}, then we average over the samples given these parameters, and finally we average over parameters. But because \bar{\alpha} is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

S(N) = N \left[ S_x + \frac{1}{2} \log_2\!\left( 2\pi e \sigma^2 \right) \right] + \frac{K}{2} \log_2 N + S_\alpha + \left\langle \log_2 Z_A \right\rangle_\alpha + \cdots,    (4.19)

where \langle \cdots \rangle_\alpha means averaging over parameters, S_x is the entropy of the distribution of x,

S_x = -\int dx \, P(x) \log_2 P(x),    (4.20)

and similarly for the entropy of the distribution of parameters,

S_\alpha = -\int d^K\alpha \, \mathcal{P}(\alpha) \log_2 \mathcal{P}(\alpha).    (4.21)

^5 We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points \{x_i, y_i\}. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as N \to \infty, so that this approach should yield the asymptotic behavior of the distribution and hence the subextensive entropy at large N. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as N \to \infty, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.


The different terms in the entropy, equation 4.19, have a straightforward interpretation. First, we see that the extensive term in the entropy,

S_0 = S_x + \frac{1}{2} \log_2\!\left( 2\pi e \sigma^2 \right),    (4.22)

reflects contributions from the random choice of x and from the gaussian noise in y; these extensive terms are independent of the variations in parameters \alpha, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, S_\alpha. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term \langle \log_2 Z_A \rangle_\alpha compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

S_1(N) \to \frac{K}{2} \log_2 N \ \text{(bits)}.    (4.23)

The coefficient of this divergence counts the number of parameters independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions \{\phi_\mu(x)\}. This is a hint that the result in equation 4.23 is more universal than our simple example.
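
The (K/2) \log_2 N behavior can be checked directly. The sketch below is ours: it adds a gaussian prior on \alpha (our assumption, made only so that the mutual information between the parameters and the data, which the text identifies with the predictive information about all of the future, is available in closed form), and it uses the same illustrative cosine basis as above.

    # Sketch (ours): for the linear-gaussian model with a gaussian prior
    # alpha ~ N(0, s_a^2 I), the mutual information between parameters and N
    # examples is (1/2) log2 det(I + (s_a/sigma)^2 Phi^T Phi); it grows as
    # (K/2) log2 N, matching equation 4.23.
    import numpy as np

    rng = np.random.default_rng(3)
    K, sigma, s_a = 4, 0.1, 1.0
    phi = lambda x: np.stack([np.cos(m * x) for m in range(1, K + 1)], axis=-1)

    def I_pred_bits(N):
        x = rng.uniform(0.0, 2 * np.pi, size=N)
        P = phi(x)
        M = np.eye(K) + (s_a / sigma) ** 2 * (P.T @ P)
        return 0.5 * np.linalg.slogdet(M)[1] / np.log(2)

    for N in [100, 1000, 10000, 100000]:
        print(N, I_pred_bits(N), 0.5 * K * np.log2(N))   # same slope, different offset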

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning: we are given examples of how the points x_n map into y_n, and from these examples we are to induce the association or functional relation between x and y. An alternative view is that the pair of points (x, y) should be viewed as a vector \vec{x}, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables \{\vec{x}_i\} is drawn independently from the same probability distribution Q(\vec{x} | \alpha), and this distribution depends on a (potentially infinite dimensional) vector of parameters \alpha. As above, the parameters are unknown, and before the series starts, they are chosen randomly from a distribution \mathcal{P}(\alpha). With no constraints on the densities \mathcal{P}(\alpha) or Q(\vec{x} | \alpha), it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters \alpha, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997, and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).


We begin with the definition of entropy,

S(N) \equiv S[\{\vec{x}_i\}] = -\int d\vec{x}_1 \cdots d\vec{x}_N \, P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \log_2 P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N).    (4.24)

By analogy with equation 4.3, we then write

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) = \int d^K\alpha \, \mathcal{P}(\alpha) \prod_{i=1}^{N} Q(\vec{x}_i | \alpha).    (4.25)

Next, combining the equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite S(N) as

S(N) = -\int d^K\bar{\alpha} \, \mathcal{P}(\bar{\alpha}) \left\{ \int d\vec{x}_1 \cdots d\vec{x}_N \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \log_2 P(\{\vec{x}_i\}) \right\}.    (4.26)

Equation 4.26 allows an easy interpretation. There is the "true" set of parameters \bar{\alpha} that gave rise to the data sequence \vec{x}_1, \ldots, \vec{x}_N with the probability \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}). We need to average \log_2 P(\vec{x}_1, \ldots, \vec{x}_N) first over all possible realizations of the data keeping the true parameters fixed, and then over the parameters \bar{\alpha} themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

P(\vec{x}_1, \ldots, \vec{x}_N) = \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha \, \mathcal{P}(\alpha) \prod_{i=1}^{N} \left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right]
  = \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha \, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{\vec{x}_i\}) \right],    (4.27)

E_N(\alpha; \{\vec{x}_i\}) = -\frac{1}{N} \sum_{i=1}^{N} \ln\left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right].    (4.28)

Since by our interpretation \bar{\alpha} are the true parameters that gave rise to the particular data \{\vec{x}_i\}, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

E_N(\alpha; \{\vec{x}_i\}) = -\int d^D x \, Q(\vec{x} | \bar{\alpha}) \ln\left[ \frac{Q(\vec{x} | \alpha)}{Q(\vec{x} | \bar{\alpha})} \right] - \psi(\alpha, \bar{\alpha}; \{\vec{x}_i\}),    (4.29)

where \psi \to 0 as N \to \infty; here we neglect \psi and return to this term below.


The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, D_{\rm KL}(\bar{\alpha} \| \alpha), between the true distribution characterized by parameters \bar{\alpha} and the possible distribution characterized by \alpha. Thus, at large N we have

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \simeq \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha \, \mathcal{P}(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha} \| \alpha) \right],    (4.30)

where again the notation \simeq reminds us that we are not only taking the limit of large N but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find

S(N) \approx S_0 \cdot N + S_1^{(a)}(N),    (4.31)

S_0 = \int d^K\alpha \, \mathcal{P}(\alpha) \left[ -\int d^D x \, Q(\vec{x} | \alpha) \log_2 Q(\vec{x} | \alpha) \right], \quad \text{and}    (4.32)

S_1^{(a)}(N) = -\int d^K\bar{\alpha} \, \mathcal{P}(\bar{\alpha}) \log_2\left[ \int d^K\alpha \, \mathcal{P}(\alpha)\, e^{-N D_{\rm KL}(\bar{\alpha} \| \alpha)} \right].    (4.33)

Here S_1^{(a)} is an approximation to S_1 that neglects fluctuations \psi. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence \vec{x}_1, \ldots, \vec{x}_N with the disorder, E_N(\alpha; \{\vec{x}_i\}) with the energy of the quenched system, and D_{\rm KL}(\bar{\alpha} \| \alpha) with its annealed analog.

The extensive term S_0, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the N dependence of the partition function

Z(\bar{\alpha}; N) = \int d^K\alpha \, \mathcal{P}(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha} \| \alpha) \right],    (4.34)

and S_1^{(a)}(N) = -\langle \log_2 Z(\bar{\alpha}; N) \rangle_{\bar{\alpha}} is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are KL divergence D away from the target \bar{\alpha},

\rho(D; \bar{\alpha}) = \int d^K\alpha \, \mathcal{P}(\alpha) \, \delta\!\left[ D - D_{\rm KL}(\bar{\alpha} \| \alpha) \right].    (4.35)


This density could be very different for different targets.^6 The density of divergences is normalized because the original distribution over parameter space, \mathcal{P}(\alpha), is normalized,

\int dD \, \rho(D; \bar{\alpha}) = \int d^K\alpha \, \mathcal{P}(\alpha) = 1.    (4.36)

Finally, the partition function takes the simple form

Z(\bar{\alpha}; N) = \int dD \, \rho(D; \bar{\alpha}) \exp[-N D].    (4.37)

We recall that in statistical mechanics, the partition function is given by

Z(\beta) = \int dE \, \rho(E) \exp[-\beta E],    (4.38)

where \rho(E) is the density of states that have energy E, and \beta is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length N of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of \rho(D; \bar{\alpha}) as D \to 0.^7
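
This annealing picture is easy to make concrete. The sketch below (ours) assumes the small-D form of the density derived below in equation 4.42, with an arbitrary cutoff of our choosing, and watches the "free energy" grow as (K/2) \log_2 N.

    # Sketch (ours): take rho(D) proportional to D^{(K-2)/2} up to a cutoff D_max,
    # compute Z(N) from equation 4.37 by numerical integration, and watch
    # -log2 Z(N) grow as (K/2) log2 N once N >> 1/D_max.
    import numpy as np

    K, D_max = 4, 1.0
    D = np.linspace(1e-8, D_max, 200000)
    dD = D[1] - D[0]
    rho = D ** ((K - 2) / 2.0)
    rho /= rho.sum() * dD                    # normalize the density

    for N in [10, 100, 1000, 10000]:
        Z = np.sum(rho * np.exp(-N * D)) * dD        # eq. 4.37
        print(N, -np.log2(Z), 0.5 * K * np.log2(N))  # same slope at large N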

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small D, and the density term, which favors values of D that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with D \to 0. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states, \rho(D; \bar{\alpha}), at small D is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

D_{\rm KL}(\bar{\alpha} \| \alpha) \approx \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu) \mathcal{F}_{\mu\nu} (\bar{\alpha}_\nu - \alpha_\nu) + \cdots,    (4.39)

^6 If parameter space is compact, then a related description of the space of targets based on metric entropy, also called Kolmogorov's \epsilon-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of the size not greater than \epsilon into which the target space can be partitioned.

^7 It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.


where \mathcal{F} is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low D limit of the density \rho(D; \bar{\alpha}):

\rho(D; \bar{\alpha}) = \int d^K\alpha \, \mathcal{P}(\alpha) \, \delta\!\left[ D - D_{\rm KL}(\bar{\alpha} \| \alpha) \right]
  \approx \int d^K\alpha \, \mathcal{P}(\alpha) \, \delta\!\left[ D - \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu) \mathcal{F}_{\mu\nu} (\bar{\alpha}_\nu - \alpha_\nu) \right]
  = \int d^K\xi \, \mathcal{P}(\bar{\alpha} + \mathcal{U} \cdot \xi) \, \delta\!\left[ D - \frac{1}{2} \sum_{\mu} \Lambda_\mu \xi_\mu^2 \right],    (4.40)

where \mathcal{U} is a matrix that diagonalizes \mathcal{F},

(\mathcal{U}^T \cdot \mathcal{F} \cdot \mathcal{U})_{\mu\nu} = \Lambda_\mu \delta_{\mu\nu}.    (4.41)

The delta function restricts the components of \xi in equation 4.40 to be of order \sqrt{D} or less, and so, if \mathcal{P}(\alpha) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

\rho(D \to 0; \bar{\alpha}) \approx \mathcal{P}(\bar{\alpha}) \, \frac{(2\pi)^{K/2}}{\Gamma(K/2)} \, (\det \mathcal{F})^{-1/2} \, D^{(K-2)/2}.    (4.42)

Here, as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

Z(\bar{\alpha}; N \to \infty) \approx f(\bar{\alpha}) \cdot \frac{\Gamma(K/2)}{N^{K/2}},    (4.43)

where f(\bar{\alpha}) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

S_1^{(a)}(N) = -\int d^K\bar{\alpha} \, \mathcal{P}(\bar{\alpha}) \log_2 Z(\bar{\alpha}; N)    (4.44)
  \to \frac{K}{2} \log_2 N + \cdots \ \text{(bits)},    (4.45)

where \cdots are finite as N \to \infty. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even of the prior distribution \mathcal{P}(\alpha) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
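
The small-D behavior in equation 4.42 can also be checked by direct sampling. The sketch below is ours: it takes Q(\vec{x}|\alpha) to be a K-dimensional gaussian with mean \alpha and unit covariance, so that D_{\rm KL}(\bar{\alpha}\|\alpha) = |\bar{\alpha}-\alpha|^2/2 exactly, and draws \alpha from a broad gaussian prior of our choosing.

    # Sketch (ours): histogram of small KL divergences should scale as D^{(K-2)/2}.
    import numpy as np

    rng = np.random.default_rng(4)
    K, samples = 4, 2000000
    alpha_bar = np.zeros(K)
    alpha = rng.normal(scale=1.0, size=(samples, K))       # prior P(alpha)
    D = 0.5 * np.sum((alpha - alpha_bar) ** 2, axis=1)     # exact KL divergence (nats)

    bins = np.linspace(0.0, 0.2, 21)
    hist, _ = np.histogram(D, bins=bins, density=True)
    centers = 0.5 * (bins[1:] + bins[:-1])
    ok = hist > 0
    slope = np.polyfit(np.log(centers[ok]), np.log(hist[ok]), 1)[0]
    print("measured exponent:", slope, " expected:", (K - 2) / 2.0)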

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension d_{\rm VC}), which measures capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_{\rm VC} is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of \mathcal{F} is zero, and hence both d_{\rm VC} and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction but not all of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_{\rm VC} < \infty also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small D_{\rm KL} behavior of the model density,

\rho(D \to 0; \bar{\alpha}) \propto D^{(d-2)/2},    (4.46)

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density \rho(D; \bar{\alpha}) accumulates a macroscopic weight at some nonzero value of D, so that the small D behavior is irrelevant for the large N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_{\rm VC} (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_{\rm pred} measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N \gg d_{\rm VC}. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution Q(\vec{x}_1, \ldots, \vec{x}_N | \alpha). Again, \alpha is an unknown parameter vector chosen randomly at the beginning of the series. If \alpha is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) \log_2 N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S[\{\vec{x}_i\} | \alpha] \equiv -\int d^N\vec{x} \, Q(\{\vec{x}_i\} | \alpha) \log_2 Q(\{\vec{x}_i\} | \alpha) \to N S_0 + S_0^*; \qquad S_0^* = O(1),    (4.47)

D_{\rm KL}\left[ Q(\{\vec{x}_i\} | \bar{\alpha}) \,\|\, Q(\{\vec{x}_i\} | \alpha) \right] \to N D_{\rm KL}(\bar{\alpha} \| \alpha) + o(N).    (4.48)

Here the quantities S_0, S_0^*, and D_{\rm KL}(\bar{\alpha} \| \alpha) are defined by taking limits N \to \infty in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if \alpha is known, there are no long-range correlations in the data; all of the long-range predictability is associated with learning the parameters.^8 The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_{\rm VC}, one arrives at the result

S(N) = S_0 \cdot N + S_1^{(a)}(N),    (4.49)

S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \ \text{(bits)},    (4.50)

where \cdots stands for terms of order one as N \to \infty. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

^8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(\vec{x}_1, \ldots, \vec{x}_N | \alpha). Suppose, for example, that S_0^* = (K^*/2) \log_2 N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \ \text{(bits)}.    (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to \alpha, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* \neq 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K \neq 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.^9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations \psi are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_{\rm VC}. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

^9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar{\alpha} \, \mathcal{P}(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j \, Q(\vec{x}_j | \bar{\alpha}) \right]
  \times \log_2\left\{ \prod_{i=1}^{N} Q(\vec{x}_i | \bar{\alpha}) \int d^K\alpha \, \mathcal{P}(\alpha) \, e^{-N D_{\rm KL}(\bar{\alpha} \| \alpha) + N\psi(\alpha, \bar{\alpha}; \{\vec{x}_i\})} \right\}.    (4.52)

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),    (4.53)

S_1^{(f)} = -\int d^K\bar{\alpha} \, \mathcal{P}(\bar{\alpha}) \prod_{j=1}^{N} \left[ d\vec{x}_j \, Q(\vec{x}_j | \bar{\alpha}) \right]
  \times \log_2\left[ \int \frac{d^K\alpha \, \mathcal{P}(\alpha)}{Z(\bar{\alpha}; N)} \, e^{-N D_{\rm KL}(\bar{\alpha} \| \alpha) + N\psi(\alpha, \bar{\alpha}; \{\vec{x}_i\})} \right],    (4.54)

where \psi is defined as in equation 4.29, and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) \le N \int d^K\alpha \int d^K\bar{\alpha} \, \mathcal{P}(\alpha) \mathcal{P}(\bar{\alpha}) \, D_{\rm KL}(\bar{\alpha} \| \alpha) \equiv \left\langle N D_{\rm KL}(\bar{\alpha} \| \alpha) \right\rangle_{\bar{\alpha}, \alpha},    (4.55)

so that if we have a class of models (and a prior \mathcal{P}(\alpha)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that \langle D_{\rm KL}(\bar{\alpha} \| \alpha) \rangle_{\bar{\alpha}, \alpha} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if \psi were zero, and \psi is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_{\rm VC}) of the set of probability densities \{Q(\vec{x} | \alpha)\} is finite. Then for any reasonable function F(\vec{x}; \beta), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{ \sup_\beta \left| \frac{\frac{1}{N}\sum_j F(\vec{x}_j; \beta) - \int d\vec{x}\, Q(\vec{x} | \bar{\alpha})\, F(\vec{x}; \beta)}{L[F]} \right| > \epsilon \right\} < M(\epsilon, N, d_{\rm VC}) \, e^{-c N \epsilon^2},    (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all \beta without loss of generality in view of equation 4.55. In addition, M(\epsilon, N, d_{\rm VC}) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_{\rm VC}. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region \alpha \approx \bar{\alpha} is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of \alpha - \bar{\alpha}, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of \{\vec{x}_i\} with atypically large fluctuations is negligible (atypicality is defined as \psi \geq \delta, where \delta is some small constant independent of N). Since the fluctuations decrease as 1/\sqrt{N} (see equation 4.56) and D_{\rm KL} is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of \psi. Then we realize that the averaging over \bar{\alpha} and \{\vec{x}_i\} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to \epsilon (this brings down an extra factor of N\epsilon). Thus, the worst-case contribution of these atypical sequences is

S_1^{(f),\,{\rm atypical}} \sim \int_{\delta}^{\infty} d\epsilon \, N^2 \epsilon^2 \, M(\epsilon) \, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \quad \text{for large } N.    (4.57)


This bound lets us focus our attention on the region \alpha \approx \bar{\alpha}. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and, throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\,{\rm typical}} \sim \int d^K\bar{\alpha} \, \mathcal{P}(\bar{\alpha}) \int d^N\vec{x} \prod_j Q(\vec{x}_j | \bar{\alpha}) \; N\,(B^T A^{-1} B);

B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i | \bar{\alpha})}{\partial \bar{\alpha}_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0;

(A)_{\mu\nu} = \frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i | \bar{\alpha})}{\partial \bar{\alpha}_\mu \, \partial \bar{\alpha}_\nu}, \qquad \langle A \rangle_{\vec{x}} = \mathcal{F}.    (4.58)

Here \langle \cdots \rangle_{\vec{x}} means an averaging with respect to all \vec{x}_i's keeping \bar{\alpha} constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from \mathcal{F} are not allowed, and we may bound equation 4.58 by replacing A^{-1} with \mathcal{F}^{-1}(1 + \delta), where \delta again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i | \bar{\alpha})}{\partial \bar{\alpha}_\mu} \, (\mathcal{F}^{-1})_{\mu\nu} \, \frac{\partial \log Q(\vec{x}_j | \bar{\alpha})}{\partial \bar{\alpha}_\nu}    (4.59)

over all \vec{x}_i's. Only the terms with i = j survive the averaging. There are K^2 N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_{\rm VC} < \infty. One may notice that we never used the specific form of M(\epsilon, N, d_{\rm VC}), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of \alpha and \bar{\alpha} that lead to violation of the bound. That is, if the prior \mathcal{P}(\alpha) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C_1 \subset C_2 \subset C_3 \subset \cdots can be imposed onto the set C of all admissible solutions of a learning problem, such that C = \bigcup C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_{\rm VC} for learning in any C_n, n < \infty, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_{\rm VC}.

In the context of SRM, the role of the prior \mathcal{P}(\alpha) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|a) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x-a)^2 \right], \qquad x \in (-\infty; +\infty);   (4.60)

Q(x|a) = \frac{\exp(-\sin ax)}{\int_0^{2\pi} dx \, \exp(-\sin ax)}, \qquad x \in [0; 2\pi).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < a < a_max and zero elsewhere, with a_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin ax. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than a_max.^{10} So the fluctuations are large, and the predictive information is

^{10} Interestingly, since for the model equation 4.61 the KL divergence is bounded from below and from above, for a_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of a would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, & Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
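To make the large-fluctuation scenario of equation 4.61 concrete, the following sketch (ours, not from the original paper; the true parameter, the box size a_max, the grid resolution, and the random seed are all arbitrary illustrative choices) scans the likelihood over a wide box and shows how, for small N, the maximum typically lands at a spurious large value of a:

    import numpy as np
    rng = np.random.default_rng(0)

    def sample_Q(a, n):
        """Rejection-sample from Q(x|a) proportional to exp(-sin(a x)) on [0, 2*pi)."""
        out = []
        while len(out) < n:
            x = rng.uniform(0.0, 2.0 * np.pi, size=4 * n)
            keep = rng.uniform(0.0, 1.0, size=x.size) < np.exp(-np.sin(a * x) - 1.0)
            out.extend(x[keep][: n - len(out)])
        return np.array(out)

    def log_likelihood(alpha, x):
        """Per-sample log-likelihood, with the normalization integral done on a grid."""
        grid = np.linspace(0.0, 2.0 * np.pi, 2048)
        logZ = np.log(np.trapz(np.exp(-np.sin(alpha * grid)), grid))
        return np.mean(-np.sin(alpha * x)) - logZ

    a_true, a_max = 1.0, 100.0
    alphas = np.linspace(0.05, a_max, 8000)
    for N in (5, 50, 2000):
        x = sample_Q(a_true, N)
        a_hat = alphas[np.argmax([log_likelihood(al, x) for al in alphas])]
        print(f"N = {N:5d}: maximum-likelihood estimate of a = {a_hat:.2f}")

As N grows, the spurious maxima at the crests are suppressed and the estimate locks onto the true value, in line with the discussion above.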

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which


the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ~ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^{\mu}} \right], \qquad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar\alpha; N) = \int dD \, \rho(D; \bar\alpha) \exp[-ND]

\approx A(\bar\alpha) \int dD \, \exp\left[ -\frac{B(\bar\alpha)}{D^{\mu}} - ND \right]

\approx \tilde{A}(\bar\alpha) \exp\left[ -\frac{1}{2}\,\frac{\mu+2}{\mu+1} \ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar\alpha) = A(\bar\alpha) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} \cdot [B(\bar\alpha)]^{1/(2\mu+2)},   (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N:

S^{(a)}_1(N) \approx \frac{1}{\ln 2}\, \langle C(\bar\alpha) \rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \quad \text{(bits)},   (4.66)

where ⟨···⟩_ᾱ denotes an average over all the target models.
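For readers who want the intermediate step, the saddle point behind equations 4.63 through 4.66 can be written out explicitly; this is a short consistency check assembled from the definitions above rather than a quotation from the original derivation:

\[
-\ln Z(\bar\alpha; N) \approx \min_{D} \left[ \frac{B(\bar\alpha)}{D^{\mu}} + ND \right],
\qquad
D_{*} = \left( \frac{\mu B(\bar\alpha)}{N} \right)^{1/(\mu+1)},
\]
\[
\frac{B(\bar\alpha)}{D_{*}^{\mu}} + N D_{*}
= [B(\bar\alpha)]^{1/(\mu+1)} \left( \mu^{-\mu/(\mu+1)} + \mu^{1/(\mu+1)} \right) N^{\mu/(\mu+1)}
= C(\bar\alpha)\, N^{\mu/(\mu+1)},
\]

and the Gaussian fluctuations around D_* supply the (1/2)[(μ+2)/(μ+1)] ln N correction in equation 4.63.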


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ~ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ~ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters keeps growing, then the scenario in which the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ~ K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ~ N^{ν−1} bits. Summing this up over all samples, we find S^{(a)}_1 ~ N^ν, and, if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
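The summation argument in the previous paragraph is easy to check numerically. The sketch below (an illustration of ours; the prefactor K0 and the particular values of ν are arbitrary) accumulates the ~K(n)/(2n) bits carried by the nth sample when the number of bins grows as K(n) = K0 n^ν and verifies that the total grows as N^ν rather than N^ν log N:

    import numpy as np

    def subextensive_sum(N_max, nu, K0=1.0):
        """Accumulate the ~K(n)/(2n) bits carried by the n-th sample when the
        number of histogram bins grows as K(n) = K0 * n**nu."""
        n = np.arange(1, N_max + 1)
        contributions = K0 * n**nu / (2.0 * n)      # ~ n**(nu - 1) bits each
        return np.cumsum(contributions)

    for nu in (0.3, 0.5, 0.7):
        S1 = subextensive_sum(10**6, nu)
        # Measure the growth exponent over the last two decades; it should be close to nu.
        N_lo, N_hi = 10**4, 10**6
        slope = np.log(S1[N_hi - 1] / S1[N_lo - 1]) / np.log(N_hi / N_lo)
        print(f"nu = {nu}: measured growth exponent = {slope:.3f}")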

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar


to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/ℓ_0) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial \phi}{\partial x} \right)^2 \right] \delta\!\left[ \frac{1}{\ell_0} \int dx \, e^{-\phi(x)} - 1 \right],   (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the


development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.^{11}

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/ℓ_0) exp[−φ(x)], we can write the KL divergence as

D_{KL}[\bar\phi(x) \| \phi(x)] = \frac{1}{\ell_0} \int dx \, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)].   (4.68)

We want to compute the density,

\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( D - D_{KL}[\bar\phi(x) \| \phi(x)] \big)   (4.69)

= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( MD - M D_{KL}[\bar\phi(x) \| \phi(x)] \big),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar\phi) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)] \exp(-izM D_{KL})   (4.71)

= M \int \frac{dz}{2\pi} \exp\left( izMD + \frac{izM}{\ell_0} \int dx\, \bar\phi(x) \exp[-\bar\phi(x)] \right)

\quad \times \int [d\phi(x)]\, P[\phi(x)] \exp\left( -\frac{izM}{\ell_0} \int dx\, \phi(x) \exp[-\bar\phi(x)] \right).   (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

^{11} We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)] \exp\left( -\frac{izM}{\ell_0} \int dx\, \phi(x) \exp[-\bar\phi(x)] \right)

= \exp\left( -\frac{izM}{\ell_0} \int dx\, \bar\phi(x) \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM] \right),   (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{\ell}{2} \int dx \left( \frac{\partial \bar\phi}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{\ell \ell_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar\phi(x)]}{D} \right),   (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi \ell \ell_0}} \exp\left[ -\frac{\ell}{2} \int dx\, (\partial_x \bar\phi)^2 \right] \times \int dx\, \exp[-\bar\phi(x)/2],   (4.76)

B[\bar\phi(x)] = \frac{1}{16 \ell \ell_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^2.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S^{(a)}_1(N) \sim \frac{1}{2 \ln 2\, \sqrt{\ell \ell_0}} \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} N^{1/2},   (4.78)

where ⟨···⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S^{(a)}_1(N) \sim \frac{1}{2 \ln 2}\, \sqrt{N}\, \left( \frac{L}{\ell} \right)^{1/2} \quad \text{bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure, in which we divide the range of x into bins of local size √(ℓ/(N Q(x))) (Bialek,


Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
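A small numerical sketch of this bin-counting picture (ours, not from the original; the values of L, ℓ, and the uniform test density are arbitrary) walks across the interval laying down bins of local size √(ℓ/(N Q(x))) and confirms that the count scales as √(NL/ℓ), as equation 4.79 requires:

    import numpy as np

    def count_adaptive_bins(Q, x_min, x_max, N, ell):
        """Lay down bins of local width sqrt(ell / (N * Q(x))) across [x_min, x_max]
        and count how many fit; this mimics the adaptive binning picture."""
        bins, x = 0, x_min
        while x < x_max:
            x += np.sqrt(ell / (N * Q(x)))
            bins += 1
        return bins

    L, ell = 1.0, 1.0e-3                      # box size and smoothness scale (arbitrary)
    Q_uniform = lambda x: 1.0 / L             # the uniform density that dominates eq. 4.79
    for N in (10**2, 10**4, 10**6):
        counted = count_adaptive_bins(Q_uniform, 0.0, L, N, ell)
        print(f"N = {N:7d}: bins = {counted:6d}, sqrt(N L / ell) = {np.sqrt(N * L / ell):9.1f}")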

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ~√N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities, n ~ √N, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). Real calculations for these cases, however, remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.^{12} Rissanen (1984) and Clarke and Barron (1990), in full

^{12} Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.
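The (K/2) log_2 N parameter-coding cost is easy to see in the simplest case, K = 1. The sketch below (our illustration, with an arbitrary choice of the fraction of ones) compares the Bayes-mixture code length for a binary string, with a uniform prior on the coin bias, against the length achievable if the bias were known; the excess tracks (1/2) log_2 N up to an N-independent constant:

    import math

    def mixture_code_length(k, N):
        """Bits to encode a binary string with k ones using the Bayes mixture
        over the coin bias with a uniform prior: -log2 Beta(k+1, N-k+1)."""
        ln_beta = math.lgamma(k + 1) + math.lgamma(N - k + 1) - math.lgamma(N + 2)
        return -ln_beta / math.log(2.0)

    def known_bias_code_length(k, N):
        """Bits if the bias were fixed at its maximum-likelihood value k/N."""
        p = k / N
        return -(k * math.log2(p) + (N - k) * math.log2(1.0 - p))

    for N in (10**2, 10**4, 10**6):
        k = N // 3                         # a string with one-third ones (arbitrary)
        excess = mixture_code_length(k, N) - known_bias_code_length(k, N)
        print(f"N = {N:7d}: excess = {excess:6.2f} bits, (1/2) log2 N = {0.5 * math.log2(N):6.2f}")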

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our


approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy), the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Solé & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Solé & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on


a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, which is not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.^{13} There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986) in particular has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.^{14} Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

^{13} The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

^{14} This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;^{15} it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

^{15} Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.^{16} So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number 𝒩(N) of walks with N steps, we find that 𝒩(N) ~ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) ≈ γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],   (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right),   (5.2)

^{16} Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures: the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have


seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related almost by definition to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
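A minimal sketch of this "compression without loss of predictive power" idea, for the simplest case of a stationary binary Markov chain (the transition probabilities below are arbitrary illustrative values): compressing the entire past down to the most recent symbol already preserves all of the information about the next symbol, since the last symbol is a sufficient statistic for prediction.

    import numpy as np
    from itertools import product

    # Transition matrix of a stationary binary Markov chain (illustrative values).
    T = np.array([[0.9, 0.1],
                  [0.4, 0.6]])
    evals, evecs = np.linalg.eig(T.T)                  # stationary distribution:
    pi = np.real(evecs[:, np.argmax(np.real(evals))])  # eigenvector with eigenvalue 1
    pi = pi / pi.sum()

    def mutual_info(p_xy):
        """I(X;Y) in bits from a joint probability table."""
        px = p_xy.sum(axis=1, keepdims=True)
        py = p_xy.sum(axis=0, keepdims=True)
        mask = p_xy > 0
        return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (px @ py)[mask])))

    # Joint distribution of (most recent symbol, next symbol).
    p_last = np.array([[pi[a] * T[a, b] for b in range(2)] for a in range(2)])

    # Joint distribution of (last two symbols, next symbol).
    p_two = np.zeros((4, 2))
    for i, (a, b) in enumerate(product(range(2), repeat=2)):
        for c in range(2):
            p_two[i, c] = pi[a] * T[a, b] * T[b, c]

    print(mutual_info(p_last))  # I(x_t ; x_{t+1})
    print(mutual_info(p_two))   # I(x_{t-1}, x_t ; x_{t+1}): the same number, since
                                # the last symbol carries all predictive information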


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.^{17} It should be possible to test this directly by

^{17} If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,^{18} and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

^{18} As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series, or learning problems, beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ~ A N^ν and one with K parameters, so that I_pred(N) ~ (K/2) log_2 N, the infinite-parameter model has less predictive information for all N smaller than some critical value,

N_c \sim \left[ \frac{K}{2 A \nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}.   (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
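As a concrete check of this arithmetic, here is a minimal numerical sketch. Equation 6.1 follows, up to constants, from equating the two growth laws AN^ν and (K/2) log₂ N; the code evaluates N_c for the brain-scale K and N quoted above under a few illustrative choices of A and ν, which are assumptions on our part (the argument in the text only requires ν < 1).

```python
import numpy as np

def ipred_powerlaw(N, A, nu):
    """Predictive information of a nonparametric problem, growing as A * N**nu."""
    return A * N**nu

def ipred_parametric(N, K):
    """Predictive information of a K-parameter model, growing as (K/2) * log2(N)."""
    return 0.5 * K * np.log2(N)

def critical_N(K, A, nu):
    """Crossover scale of equation 6.1: Nc ~ [K/(2*A*nu) * log2(K/(2*A))]**(1/nu)."""
    return (K / (2.0 * A * nu) * np.log2(K / (2.0 * A))) ** (1.0 / nu)

K = 5e9   # synapses in ~10 mm^2 of inferotemporal cortex (from the text)
N = 1e9   # ~3 saccades per second over ~10 years of waking life (from the text)
for A, nu in [(1.0, 0.5), (100.0, 0.5), (1.0, 0.9)]:   # illustrative assumptions
    Nc = critical_N(K, A, nu)
    print(f"A={A:6.1f}, nu={nu:.1f}: Nc ~ {Nc:.2e}, N/Nc ~ {N / Nc:.1e}, "
          f"I_pred nonparametric ~ {ipred_powerlaw(N, A, nu):.2e} bits, "
          f"parametric ~ {ipred_parametric(N, K):.2e} bits")
```

For these values, N falls many orders of magnitude below N_c, consistent with the claim that we operate with N < N_c.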

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy ℓ(N) (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S₀. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S₁(N) ∝ N^(1/2), and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.¹⁹
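The guessing-game bound is easy to prototype, which may help in designing such an experiment. In the sketch below a character-level n-gram model stands in for the human reader, which is of course an assumption; moreover, for simplicity the n-gram statistics are estimated from the same toy text that is being predicted, which makes the bound optimistic. For each letter we record how many guesses the model needs, and the entropy of the distribution of guess numbers is an upper bound on the conditional entropy per letter.

```python
from collections import Counter, defaultdict
import math

def guess_counts(text, n):
    """Shannon guessing game with an n-gram 'reader': for each letter, count how
    many guesses are needed when candidate letters are ordered by their frequency
    after the preceding n characters, falling back to overall letter frequency."""
    alphabet = sorted(set(text))
    overall = Counter(text)
    context_counts = defaultdict(Counter)
    for i in range(n, len(text)):
        context_counts[text[i - n:i]][text[i]] += 1

    counts = []
    for i in range(n, len(text)):
        ctx = text[i - n:i]
        ranking = sorted(alphabet,
                         key=lambda c: (-context_counts[ctx][c], -overall[c], c))
        counts.append(ranking.index(text[i]) + 1)
    return counts

def entropy_bits(samples):
    """Plug-in entropy (bits) of the empirical distribution of guess numbers;
    this upper-bounds the conditional entropy per letter."""
    freq = Counter(samples)
    total = len(samples)
    return -sum((cnt / total) * math.log2(cnt / total) for cnt in freq.values())

# Toy corpus; a real analysis would use a long natural text and, ideally,
# n-gram statistics estimated from an independent corpus.
text = ("statistical structure in natural language makes the next letter "
        "partly predictable ") * 40
for n in range(4):
    h = entropy_bits(guess_counts(text, n))
    print(f"context length n={n}: entropy of guess numbers ~ {h:.2f} bits per letter")
```

Tracking how slowly this bound decreases as the context length n grows is the in silico analog of looking for the subextensive terms at large N.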

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19. Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process; of course, it cannot be Markovian if it has a power-law divergence in the predictive information.
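For completeness, here is how one of the measurements mentioned in footnote 19 might be set up: a plug-in estimate of the mutual information between characters separated by a distance d, applied to a long text. The file name is a placeholder, and the plug-in estimator is biased upward when the number of character pairs is small relative to the squared alphabet size, which is one finite-sample reason that such curves may not have reached asymptotia.

```python
from collections import Counter
import math

def char_mutual_information(text, d):
    """Plug-in estimate (bits) of the mutual information I(X_t; X_{t+d})
    between characters separated by d positions."""
    pairs = Counter(zip(text, text[d:]))
    left = Counter(text[:-d])
    right = Counter(text[d:])
    n = len(text) - d
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi

# "corpus.txt" is a placeholder; in practice one would use a long text such as
# Moby Dick and check that the estimate is not dominated by finite-sample bias.
with open("corpus.txt", encoding="utf-8") as f:
    text = f.read().lower()

for d in (1, 2, 4, 8, 16, 32, 64, 128, 256):
    print(f"d = {d:4d}:  I ~ {char_mutual_information(text, d):.4f} bits")
```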

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations, and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeny, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An Introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. N., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.


2418 W Bialek I Nemenman and N Tishby

future that will unfold from them In some cases this statement becomeseven stronger For example if the subextensive entropy of a long discon-tinuous observation of a total length T with a gap of a duration dT iquest T isequal to S1(T) C O(dT

T ) then the subextensive entropy of the present is notonly its information about the past or the future but also the informationabout the past and the future

If we have been observing a time series for a (long) time T then the totalamount of data we have collected is measured by the entropy S(T) and atlarge T this is given approximately by S0T But the predictive informationthat we have gathered cannot grow linearly with time even if we are makingpredictions about a future that stretches out to innity As a result of thetotal information we have taken in by observing xpast only a vanishingfraction is of relevance to the prediction

limT1

Predictive informationTotal information

DIpred(T)

S(T) 0 (37)

In this precise sense most of what we observe is irrelevant to the problemof predicting the future4

Consider the case where time is measured in discrete steps so that wehave seen N time pointsx1 x2 xN How much have we learned about theunderlying pattern in these data The more we know the more effectivelywe can predict the next data point xNC1 and hence the fewer bits we willneed to describe the deviation of this data point from our prediction Ouraccumulated knowledge about the time series is measured by the degree towhich we can compress the description of new observations On averagethe length of the code word required to describe the point xNC1 given thatwe have seen the previous N points is given by

(N) D iexclhlog2 P(xNC1 | x1 x2 xN)i bits (38)

where the expectation value is taken over the joint distribution of all theN C 1 points P(x1 x2 xN xNC1) It is easy to see that

(N) D S(N C 1) iexcl S(N) frac14 S(N)

N (39)

As we observe for longer times we learn more and this word length de-creases It is natural to dene a learning curve that measures this improve-ment Usually we dene learning curves by measuring the frequency or

4 We can think of equation 37 asa law of diminishing returns Although we collect datain proportion to our observation time T a smaller and smaller fraction of this informationis useful in the problem of prediction These diminishing returns are not due to a limitedlifetime since we calculate the predictive information assuming that we have a futureextending forward to innity A senior colleague points out that this is an argument forchanging elds before becoming too expert

Predictability Complexity and Learning 2419

costs of errors here the cost is that our encoding of the point xNC1 is longerthan it could be if we had perfect knowledge This ideal encoding has alength that we can nd by imagining that we observe the time series for aninnitely long time ideal D limN1 (N) but this is just another way ofdening the extensive component of the entropy S0 Thus we can dene alearning curve

L (N) acute (N) iexcl ideal (310)

D S(N C 1) iexcl S(N) iexcl S0

D S1(N C 1) iexcl S1(N)

frac14 S1(N)

ND

Ipred(N)

N (311)

and we see once again that the extensive component of the entropy cancelsIt is well known that the problems of prediction and compression are

related and what we have done here is to illustrate one aspect of this con-nection Specically if we ask how much one segment of a time series cantell us about the future the answer is contained in the subextensive behaviorof the entropy If we ask how much we are learning about the structure ofthe time series then the natural and universally dened learning curve is re-lated again to the subextensive entropy the learning curve is the derivativeof the predictive information

This universal learning curve is connected to the more conventionallearning curves in specic contexts As an example (cf section 41) considertting a set of data points fxn yng with some class of functions y D f (xI reg)where reg are unknown parameters that need to be learned we also allow forsome gaussian noise in our observation of the yn Here the natural learningcurve is the evolution of Acirc

2 for generalization as a function of the number ofexamples Within the approximations discussed below it is straightforwardto show that as N becomes large

DAcirc

2(N)E

D1

s2

Dpoundy iexcl f (xI reg)

curren2E (2 ln 2) L (N) C 1 (312)

where s2 is the variance of the noise Thus a more conventional measure

of performance at learning a function is equal to the universal learningcurve dened purely by information-theoretic criteria In other words if alearning curve is measured in the right units then its integral represents theamount of the useful information accumulated Then the subextensivity ofS1 guarantees that the learning curve decreases to zero as N 1

Different quantities related to the subextensive entropy have been dis-cussed in several contexts For example the code length (N) has beendened as a learning curve in the specic case of neural networks (Opperamp Haussler 1995) and has been termed the ldquothermodynamic diverdquo (Crutch-eld amp Shalizi 1999) and ldquoNth order block entropyrdquo (Grassberger 1986)

2420 W Bialek I Nemenman and N Tishby

The universal learning curve L (N) has been studied as the expected instan-taneous information gain by Haussler Kearns amp Schapire (1994) Mutualinformation between all of the past and all of the future (both semi-innite)is known also as the excess entropy effective measure complexity stored in-formation and so on (see Shalizi amp Crutcheld 1999 and references thereinas well as the discussion below) If the data allow a description by a modelwith a nite (and in some cases also innite) number of parameters thenmutual information between the data and the parameters is of interest Thisis easily shown to be equal to the predictive information about all of thefuture and it is also the cumulative information gain (Haussler Kearns ampSchapire 1994) or the cumulative relative entropy risk (Haussler amp Opper1997) Investigation of this problem can be traced back to Renyi (1964) andIbragimov and Hasminskii (1972) and some particular topics are still beingdiscussed (Haussler amp Opper 1995 Opper amp Haussler 1995 Herschkowitzamp Nadal 1999) In certain limits decoding a signal from a population of Nneurons can be thought of as ldquolearningrdquo a parameter from N observationswith a parallel notion of information transmission (Brunel amp Nadal 1998Kang amp Sompolinsky 2001) In addition the subextensive component ofthe description length (Rissanen 1978 1989 1996 Clarke amp Barron 1990)averaged over a class of allowed models also is similar to the predictiveinformation What is important is that the predictive information or subex-tensive entropy is related to all these quantities and that it can be dened forany process without a reference to a class of models It is this universalitythat we nd appealing and this universality is strongest if we focus on thelimit of long observation times Qualitatively in this regime (T 1) weexpect the predictive information to behave in one of three different waysas illustrated by the Ising models above it may either stay nite or grow toinnity together with T in the latter case the rate of growth may be slow(logarithmic) or fast (sublinear power) (see Barron and Cover 1991 for asimilar classication in the framework of the minimal description length[MDL] analysis)

The rst possibility limT1 Ipred(T) D constant means that no matterhow long we observe we gain only a nite amount of information aboutthe future This situation prevails for example when the dynamics are tooregular For a purely periodic system complete prediction is possible oncewe know the phase and if we sample the data at discrete times this is a niteamount of information longer period orbits intuitively are more complexand also have larger Ipred but this does not change the limiting behaviorlimT1 Ipred(T) D constant

Alternatively the predictive informationcan be small when the dynamicsare irregular but the best predictions are controlled only by the immediatepast so that the correlation times of the observable data are nite (see forexample Crutcheld amp Feldman 1997 and the xed J case in Figure 3)Imagine for example that we observe x(t) at a series of discrete times ftngand that at each time point we nd the value xn Then we always can write

Predictability Complexity and Learning 2421

the joint distribution of the N data points as a product

P(x1 x2 xN) D P(x1)P(x2 |x1)P(x3 |x2 x1) cent cent cent (313)

For Markov processes what we observe at tn depends only on events at theprevious time step tniexcl1 so that

P(xn |fx1middotimiddotniexcl1g) D P(xn |xniexcl1) (314)

and hence the predictive information reduces to

Ipred Dfrac12log2

microP(xn |xniexcl1)P(xn)

parafrac34 (315)

The maximum possible predictive information in this case is the entropy ofthe distribution of states at one time step which is bounded by the loga-rithm of the number of accessible states To approach this bound the systemmust maintain memory for a long time since the predictive information isreduced by the entropy of the transition probabilities Thus systems withmore states and longer memories have larger values of Ipred

More interesting are those cases in which Ipred(T) diverges at large T Inphysical systems we know that there are critical points where correlationtimes become innite so that optimal predictions will be inuenced byevents in the arbitrarily distant past Under these conditions the predictiveinformation can grow without bound as T becomes large for many systemsthe divergence is logarithmic Ipred(T 1) ln T as for the variable Jijshort-range Ising model of Figures 2 and 3 Long-range correlations alsoare important in a time series where we can learn some underlying rulesIt will turn out that when the set of possible rules can be described bya nite number of parameters the predictive information again divergeslogarithmically and the coefcient of this divergence counts the number ofparameters Finally a faster growth is also possible so that Ipred(T 1) Ta as for the variable Jij long-range Ising model and we shall see that thisbehavior emerges from for example nonparametric learning problems

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations orassociations persist over long periods of time In the usual theoretical mod-els there is some rule underlying the observable data and this rule is validforever examples seen at one time inform us about the rule and this infor-mation can be used to make predictions or generalizations The predictiveinformation quanties the average generalization power of examples andwe shall see that there is a direct connection between the predictive infor-mation and the complexity of the possible underlying rules

2422 W Bialek I Nemenman and N Tishby

41 A Test Case Let us begin with a simple example already referred toabove We observe two streams of data x and y or equivalently a stream ofpairs (x1 y1) (x2 y2) (xN yN ) Assume that we know in advance thatthe xrsquos are drawn independently and at random from a distribution P(x)while the yrsquos are noisy versions of some function acting on x

yn D f (xnI reg) C gn (41)

where f (xI reg) is a class of functions parameterized by reg and gn is noisewhich for simplicity we will assume is gaussian with known standard devi-ation s We can even start with a very simple case where the function classis just a linear combination of basis functions so that

f (xI reg) DKX

m D1am wm (x) (42)

The usual problem is to estimate from N pairs fxi yig the values of theparameters reg In favorable cases such as this we might even be able tond an effective regression formula We are interested in evaluating thepredictive information which means that we need to know the entropyS(N) We go through the calculation in some detail because it provides amodel for the more general case

To evaluate the entropy S(N) we rst construct the probability distribu-tion P(x1 y1 x2 y2 xN yN) The same set of rules applies to the wholedata stream which here means that the same parameters reg apply for allpairs fxi yig but these parameters are chosen at random from a distributionP (reg) at the start of the stream Thus we write

P(x1 y1 x2 y2 xN yN)

DZ

dKaP(x1 y1 x2 y2 xN yN |reg)P (reg) (43)

and now we need to construct the conditional distributions for xed reg Byhypothesis each x is chosen independently and once we x reg each yi iscorrelated only with the corresponding xi so that we have

P(x1 y1 x2 y2 xN yN |reg) DNY

iD1

poundP(xi) P(yi | xiI reg)

curren (44)

Furthermore with the simple assumptions above about the class of func-tions and gaussian noise the conditional distribution of yi has the form

P(yi | xiI reg) D1

p2p s2

exp

2

4iexcl1

2s2

sup3yi iexcl

KX

m D1am wm (xi)

23

5 (45)

Predictability Complexity and Learning 2423

Putting all these factors together

P(x1 y1 x2 y2 xN yN )

D

NY

iD1P(xi)

sup3 1p

2p s2

acuteN ZdK

a P (reg) exp

iexcl

12s2

NX

iD1y2

i

pound exp

iexclN

2

KX

m ordmD1Amordm(fxig)am aordm C N

KX

m D1Bm (fxi yig)am

(46)

where

Amordm(fxig) D1

s2N

NX

iD1wm (xi)wordm(xi) and (47)

Bm (fxi yig) D1

s2N

NX

iD1yiwm (xi) (48)

Our placement of the factors of N means that both Amordm and Bm are of orderunity as N 1 These quantities are empirical averages over the samplesfxi yig and if the wm are well behaved we expect that these empirical meansconverge to expectation values for most realizations of the series fxig

limN1

Amordm(fxig) D A1mordm D

1

s2

ZdxP(x)wm (x)wordm(x) (49)

limN1

Bm (fxi yig) D B1m D

KX

ordmD1A1

mordm Naordm (410)

where Nreg are the parameters that actually gave rise to the data stream fxi yigIn fact we can make the same argument about the terms in P

y2i

limN1

NX

iD1y2

i D Ns2

KX

m ordmD1Nam A1

mordm Naordm C 1

(411)

Conditions for this convergence of empirical means to expectation valuesare at the heart of learning theory Our approach here is rst to assume thatthis convergence works then to examine the consequences for the predictiveinformation and nally to address the conditions for and implications ofthis convergence breaking down

Putting the different factors together we obtain

P(x1 y1 x2 y2 xN yN ) f NY

iD1P(xi)

sup3 1p

2p s2

acuteN ZdK

aP (reg)

pound exppoundiexclNEN(regI fxi yig)

curren (412)

2424 W Bialek I Nemenman and N Tishby

where the effective ldquoenergyrdquo per sample is given by

EN(regI fxi yig) D12

C12

KX

m ordmD1

(am iexcl Nam )A1mordm

(aordm iexcl Naordm) (413)

Here we use the symbol f to indicate that we not only take the limit oflarge N but also neglect the uctuations Note that in this approximationthe dependence on the sample points themselves is hidden in the denitionof Nreg as being the parameters that generated the samples

The integral that we need to do in equation 412 involves an exponentialwith a large factor N in the exponent the energy EN is of order unity asN 1 This suggests that we evaluate the integral by a saddle-pointor steepest-descent approximation (similar analyses were performed byClarke amp Barron 1990 MacKay 1992 and Balasubramanian 1997)

ZdK

aP (reg) exppoundiexclNEN (regI fxi yig)

curren

frac14 P (regcl) expmicro

iexclNEN(regclI fxi yig) iexclK2

lnN2p

iexcl12

ln det FN C cent cent centpara

(414)

where regcl is the ldquoclassicalrdquo value of reg determined by the extremal conditions

EN(regI fxi yig)am

shyshyshyshyaDacl

D 0 (415)

the matrix FN consists of the second derivatives of EN

FN D 2EN (regI fxi yig)

am aordm

shyshyshyshyaDacl

(416)

and cent cent cent denotes terms that vanish as N 1 If we formulate the problemof estimating the parameters reg from the samples fxi yig then as N 1the matrix NFN is the Fisher information matrix (Cover amp Thomas 1991)the eigenvectors of this matrix give the principal axes for the error ellipsoidin parameter space and the (inverse) eigenvalues give the variances ofparameter estimates along each of these directions The classical regcl differsfrom Nreg only in terms of order 1N we neglect this difference and furthersimplify the calculation of leading terms as N becomes large After a littlemore algebra we nd the probability distribution we have been lookingfor

P(x1 y1 x2 y2 xN yN)

f NY

iD1P(xi)

1

ZAP ( Nreg) exp

microiexcl

N2

ln(2p es2) iexcl

K2

ln N C cent cent centpara

(417)

Predictability Complexity and Learning 2425

where the normalization constant

ZA Dp

(2p )K det A1 (418)

Again we note that the sample points fxi yig are hidden in the value of Nregthat gave rise to these points5

To evaluate the entropy S(N) we need to compute the expectation valueof the (negative) logarithm of the probability distribution in equation 417there are three terms One is constant so averaging is trivial The secondterm depends only on the xi and because these are chosen independentlyfrom the distribution P(x) the average again is easy to evaluate The thirdterm involves Nreg and we need to average this over the joint distributionP(x1 y1 x2 y2 xN yN) As above we can evaluate this average in stepsFirst we choose a value of the parameters Nreg then we average over thesamples given these parameters and nally we average over parametersBut because Nreg is dened as the parameters that generate the samples thisstepwise procedure simplies enormously The end result is that

S(N) D NmicroSx C

12

log2

plusmn2p es

2sup2para

CK2

log2 N

C Sa Claquolog2 ZA

nota

C cent cent cent (419)

where hcent cent centia means averaging over parameters Sx is the entropy of thedistribution of x

Sx D iexclZ

dx P(x) log2 P(x) (420)

and similarly for the entropy of the distribution of parameters

Sa D iexclZ

dKaP (reg) log2 P (reg) (421)

5 We emphasize once more that there are two approximations leading to equation 417First we have replaced empirical means by expectation values neglecting uctuations as-sociated with the particular set of sample points fxi yig Second we have evaluated theaverage over parameters in a saddle-point approximation At least under some condi-tions both of these approximations become increasingly accurate as N 1 so thatthis approach should yield the asymptotic behavior of the distribution and hence thesubextensive entropy at large N Although we give a more detailed analysis below it isworth noting here how things can go wrong The two approximations are independentand we could imagine that uctuations are important but saddle-point integration stillworks for example Controlling the uctuations turns out to be exactly the question ofwhether our nite parameterization captures the true dimensionality of the class of mod-els as discussed in the classic work of Vapnik Chervonenkis and others (see Vapnik1998 for a review) The saddle-point approximation can break down because the saddlepoint becomes unstable or because multiple saddle points become important It will turnout that instability is exponentially improbable as N 1 while multiple saddle pointsare a real problem in certain classes of models again when counting parameters does notreally measure the complexity of the model class

2426 W Bialek I Nemenman and N Tishby

The different terms in the entropy equation 419 have a straightforwardinterpretation First we see that the extensive term in the entropy

S0 D Sx C12

log2(2p es2) (422)

reects contributions from the random choice of x and from the gaussiannoise in y these extensive terms are independent of the variations in pa-rameters reg and these would be the only terms if the parameters were notvarying (that is if there were nothing to learn) There also is a term thatreects the entropy of variations in the parameters themselves Sa Thisentropy is not invariant with respect to coordinate transformations in theparameter space but the term hlog2 ZAia compensates for this noninvari-ance Finally and most interesting for our purposes the subextensive pieceof the entropy is dominated by a logarithmic divergence

S1(N) K2

log2 N (bits) (423)

The coefcient of this divergence counts the number of parameters indepen-dent of the coordinate system that we choose in the parameter space Fur-thermore this result does not depend on the set of basis functions fwm (x)gThis is a hint that the result in equation 423 is more universal than oursimple example

42 Learning a Parameterized Distribution The problem discussedabove is an example of supervised learning We are given examples of howthe points xn map into yn and from these examples we are to induce theassociation or functional relation between x and y An alternative view isthat the pair of points (x y) should be viewed as a vector Ex and what we arelearning is the distribution of this vector The problem of learning a distri-bution usually is called unsupervised learning but in this case supervisedlearning formally is a special case of unsupervised learning if we admit thatall the functional relations or associations that we are trying to learn have anelement of noise or stochasticity then this connection between supervisedand unsupervised problems is quite general

Suppose a series of random vector variables fExig is drawn independentlyfrom the same probability distribution Q(Ex|reg) and this distribution de-pends on a (potentially innite dimensional) vector of parameters reg Asabove the parameters are unknown and before the series starts they arechosen randomly from a distribution P (reg) With no constraints on the den-sities P (reg) or Q(Ex|reg) it is impossible to derive any regression formulas forparameter estimation but one can still say something about the entropy ofthe data series and thus the predictive information For a nite-dimensionalvector of parameters reg the literature on bounding similar quantities isespecially rich (Haussler Kearns amp Schapire 1994 Wong amp Shen 1995Haussler amp Opper 1995 1997 and references therein) and related asymp-totic calculations have been done (Clarke amp Barron 1990 MacKay 1992Balasubramanian 1997)

Predictability Complexity and Learning 2427

We begin with the denition of entropy

S(N) acute S[fExig]

D iexclZ

dEx1 cent cent cent dExNP(Ex1 Ex2 ExN) log2 P(Ex1 Ex2 ExN) (424)

By analogy with equation 43 we then write

P(Ex1 Ex2 ExN) DZ

dKaP (reg)

NY

iD1Q(Exi |reg) (425)

Next combining the equations 424 and 425 and rearranging the order ofintegration we can rewrite S(N) as

S(N) D iexclZ

dK NregP ( Nreg)

pound

8lt

ZdEx1 cent cent cent dExN

NY

jD1Q(Exj | Nreg) log2 P(fExig)

9=

(426)

Equation 426 allows an easy interpretation There is the ldquotruerdquo set of pa-rameters Nreg that gave rise to the data sequence Ex1 ExN with the probabilityQN

jD1 Q(Exj | Nreg) We need to average log2 P(Ex1 ExN) rst over all possiblerealizations of the data keeping the true parameters xed and then over theparameters Nreg themselves With this interpretation in mind the joint prob-ability density the logarithm of which is being averaged can be rewrittenin the following useful way

P(Ex1 ExN) DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg)NY

iD1

microQ(Exi |reg)

Q(Exi | Nreg)

para

DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNEN(regI fExig)] (427)

EN (regI fExig) D iexcl1N

NX

iD1ln

microQ(Exi |reg)

Q(Exi | Nreg)

para (428)

Since by our interpretation Nreg are the true parameters that gave rise tothe particular data fExig we may expect that the empirical average in equa-tion 428 will converge to the corresponding expectation value so that

EN(regI fExig) D iexclZ

dDxQ(x | Nreg) lnmicroQ(Ex|reg)

Q(Ex| Nreg)

paraiexcl y (reg NregI fxig) (429)

where y 0 as N 1 here we neglect y and return to this term below

2428 W Bialek I Nemenman and N Tishby

The rst term on the right-hand side of equation 429 is the Kullback-Leibler (KL) divergence DKL( Nregkreg) between the true distribution charac-terized by parameters Nreg and the possible distribution characterized by regThus at large N we have

P(Ex1 Ex2 ExN)

fNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNDKL( Nregkreg)] (430)

where again the notation f reminds us that we are not only taking the limitof largeN but also making another approximationinneglecting uctuationsBy the same arguments as above we can proceed (formally) to compute theentropy of this distribution We nd

S(N) frac14 S0 cent N C S(a)1 (N) (431)

S0 DZ

dKaP (reg)

microiexcl

ZdDxQ(Ex|reg) log2 Q(Ex|reg)

para and (432)

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2

microZdK

aP(reg)eiexclNDKL ( Naka)para

(433)

Here S(a)1 is an approximation to S1 that neglects uctuations y This is the

same as the annealed approximation in the statistical mechanics of disor-dered systems as has been used widely in the study of supervised learn-ing problems (Seung Sompolinsky amp Tishby 1992) Thus we can identifythe particular data sequence Ex1 ExN with the disorder EN(regI fExig) withthe energy of the quenched system and DKL( Nregkreg) with its annealed ana-log

The extensive term S0 equation 432 is the average entropy of a dis-tribution in our family of possible distributions generalizing the result ofequation 422 The subextensive terms in the entropy are controlled by theN dependence of the partition function

Z( NregI N) DZ

dKaP (reg) exp [iexclNDKL( Nregkreg)] (434)

and Sa1(N) D iexclhlog2 Z( NregI N)i Na is analogous to the free energy Since what is

important in this integral is the KL divergence between different distribu-tions it is natural to ask about the density of models that are KL divergenceD away from the target Nreg

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)] (435)

Predictability Complexity and Learning 2429

This density could be very different for different targets6 The density ofdivergences is normalized because the original distribution over parameterspace P(reg) is normalized

ZdDr (DI Nreg) D

ZdK

aP (reg) D 1 (436)

Finally the partition function takes the simple form

Z( NregI N) DZ

dDr (DI Nreg) exp[iexclND] (437)

We recall that in statistical mechanics the partition function is given by

Z(b ) DZ

dEr (E) exp[iexclbE] (438)

where r (E) is the density of states that have energy E and b is the inversetemperature Thus the subextensive entropy in our learning problem isanalogous to a system in which energy corresponds to the KL divergencerelative to the target model and temperature is inverse to the number ofexamples As we increase the length N of the time series we have observedwe ldquocoolrdquo the system and hence probe models that approach the targetthe dynamics of this approach is determined by the density of low-energystates that is the behavior of r (DI Nreg) as D 07

The structure of the partition function is determined by a competitionbetween the (Boltzmann) exponential term which favors models with smallD and the density term which favorsvalues of D that can be achieved by thelargest possible number of models Because there (typically) are many pa-rameters there are very few models with D 0 This picture of competitionbetween the Boltzmann factor and a density of states has been emphasizedin previous work on supervised learning (Haussler et al 1996)

The behavior of the density of states r (DI Nreg) at small D is related tothe more intuitive notion of dimensionality In a parameterized family ofdistributions the KL divergence between two distributions with nearbyparameters is approximately a quadratic form

DKL ( Nregkreg) frac1412

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm) C cent cent cent (439)

6 If parameter space is compact then a related description of the space of targets basedon metric entropy also called Kolmogorovrsquos 2 -entropy is used in the literature (see forexample Haussler amp Opper 1997) Metric entropy is the logarithm of the smallest numberof disjoint parts of the size not greater than 2 into which the target space can be partitioned

7 It may be worth emphasizing that this analogy to statistical mechanics emerges fromthe analysis of the relevant probability distributions rather than being imposed on theproblem through some assumptions about the nature of the learning algorithm

2430 W Bialek I Nemenman and N Tishby

where NF is the Fisher information matrix Intuitively if we have a reason-able parameterization of the distributions then similar distributions will benearby in parameter space and more important points that are far apartin parameter space will never correspond to similar distributions Clarkeand Barron (1990) refer to this condition as the parameterization forminga ldquosoundrdquo family of distributions If this condition is obeyed then we canapproximate the low D limit of the density r (DI Nreg)

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)]

frac14Z

dKaP (reg)d

D iexcl

12

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm)

DZ

dKaP ( Nreg C U cent raquo)d

D iexcl

12

X

m

Lmj2m

(440)

where U is a matrix that diagonalizes F

( U T cent F cent U )mordm D Lmdmordm (441)

The delta function restricts the components of raquo in equation 440 to be oforder

pD or less and so if P(reg) is smooth we can make a perturbation

expansion After some algebra the leading term becomes

r (D 0I Nreg) frac14 P ( Nreg)2p

K2

C (K2)(det F )iexcl12 D(Kiexcl2)2 (442)

Here as before K is the dimensionality of the parameter vector Computingthe partition function from equation 437 we nd

Z( NregI N 1) frac14 f ( Nreg) cent C (K2)

NK2 (443)

where f ( Nreg) is some function of the target parameter values Finally thisallows us to evaluate the subextensive entropy from equations 433 and 434

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2 Z( NregI N) (444)

K2

log2 N C cent cent cent (bits) (445)

where cent cent cent are nite as N 1 Thus general K-parameter model classeshave the same subextensive entropy as for the simplest example consideredin the previous section To the leading order this result is independent even

Predictability Complexity and Learning 2431

of the prior distribution P (reg) on the parameter space so that the predictiveinformation seems to count the number of parameters under some verygeneral conditions (cf Figure 3 for a very different numerical example ofthe logarithmic behavior)

Although equation 445 is true under a wide range of conditions thiscannot be the whole story Much of modern learning theory is concernedwith the fact that counting parameters is not quite enough to characterize thecomplexity of a model class the naive dimension of the parameter space Kshould be viewed in conjunction with the pseudodimension (also known asthe shattering dimension or Vapnik-Chervonenkis dimension dVC) whichmeasures capacity of the model class and with the phase-space dimension dwhich accounts for volumes in the space of models (Vapnik 1998 Opper1994) Both of these dimensions can differ from the number of parametersOne possibility is that dVC is innite when the number of parameters isnitea problem discussed below Another possibility is that the determinant of Fis zero and hence both dVC and d are smaller than the number of parametersbecause we have adopted a redundant description This sort of degeneracycould occur over a nite fraction but not all of the parameter space andthis is one way to generate an effective fractional dimensionality One canimagine multifractal models such that the effective dimensionality variescontinuously over the parameter space but it is not obvious where thiswould be relevant Finally models with d lt dVC lt 1 also are possible (seefor example Opper 1994) and this list probably is not exhaustive

Equation 442 lets us actually dene the phase-space dimension throughthe exponent in the small DKL behavior of the model density

r (D 0I Nreg) D(diexcl2)2 (446)

and then d appears in place of K as the coefcient of the log divergencein S1(N) (Clarke amp Barron 1990 Opper 1994) However this simple con-clusion can fail in two ways First it can happen that the density r (DI Nreg)accumulates a macroscopic weight at some nonzero value of D so that thesmall D behavior is irrelevant for the large N asymptotics Second the uc-tuations neglected here may be uncontrollably large so that the asymptoticsare never reached Since controllability of uctuations is a function of dVC(see Vapnik 1998 Haussler Kearns amp Schapire 1994 and later in this arti-cle) we may summarize this in the following way Provided that the smallD behavior of the density function is the relevant one the coefcient ofthe logarithmic divergence of Ipred measures the phase-space or the scalingdimension d and nothing else This asymptote is valid however only forN Agrave dVC It is still an open question whether the two pathologies that canviolate this asymptotic behavior are related

43 Learning a Parameterized Process Consider a process where sam-ples are not independent and our task is to learn their joint distribution

2432 W Bialek I Nemenman and N Tishby

Q(Ex1 ExN |reg) Again reg is an unknown parameter vector chosen ran-domly at the beginning of the series If reg is a K-dimensional vector thenone still tries to learn just K numbers and there are still N examples evenif there are correlations Therefore although such problems are much moregeneral than those already considered it is reasonable to expect that thepredictive information still is measured by (K2) log2 N provided that someconditions are met

One might suppose that conditions for simple results on the predictiveinformation are very strong for example that the distribution Q is a nite-order Markov model In fact all we really need are the following two con-ditions

S [fExig | reg] acute iexclZ

dN Ex Q(fExig | reg) log2 Q(fExig | reg)

NS0 C Scurren0 I Scurren

0 D O(1) (447)

DKL [Q(fExig | Nreg)kQ(fExig|reg)] NDKL ( Nregkreg) C o(N) (448)

Here the quantities S0 Scurren0 and DKL ( Nregkreg) are dened by taking limits

N 1 in both equations The rst of the constraints limits deviations fromextensivity to be of order unity so that if reg is known there are no long-rangecorrelations in the data All of the long-range predictability is associatedwith learning the parameters8 The second constraint equation 448 is aless restrictive one and it ensures that the ldquoenergyrdquo of our statistical systemis an extensive quantity

With these conditions it is straightforward to show that the results of theprevious subsection carry over virtually unchanged With the same cautiousstatements about uctuations and the distinction between K d and dVC onearrives at the result

S(N) D S0 cent N C S(a)1 (N) (449)

S(a)1 (N) D

K2

log2 N C cent cent cent (bits) (450)

where cent cent cent stands for terms of order one as N 1 Note again that for theresult of equation 450 to be valid the process considered is not required tobe a nite-order Markov process Memory of all previous outcomes may bekept provided that the accumulated memory does not contribute a diver-gent term to the subextensive entropy

8 Suppose that we observe a gaussian stochastic process and try to learn the powerspectrum If the class of possible spectra includes ratios of polynomials in the frequency(rational spectra) then this condition is met On the other hand if the class of possiblespectra includes 1 f noise then the condition may not be met For more on long-rangecorrelations see below

Predictability Complexity and Learning 2433

It is interesting to ask what happens if the condition in equation 447 isviolated so that there are long-range correlations even in the conditional dis-tribution Q(Ex1 ExN |reg) Suppose for example that Scurren

0 D (Kcurren2) log2 NThen the subextensive entropy becomes

S(a)1 (N) D

K C Kcurren

2log2 N C cent cent cent (bits) (451)

We see that the subextensive entropy makes no distinction between pre-dictability that comes from unknown parameters and predictability thatcomes from intrinsic correlations in the data in this sense two models withthe same KC Kcurren are equivalent This actually must be so As an example con-sider a chain of Ising spins with long-range interactions in one dimensionThis system can order (magnetize) and exhibit long-range correlations andso the predictive information will diverge at the transition to ordering Inone view there is no global parameter analogous to regmdashjust the long-rangeinteractions On the other hand there are regimes in which we can approxi-mate the effect of these interactions by saying that all the spins experience amean eld that is constant across the whole length of the system and thenformally we can think of the predictive information as being carried by themean eld itself In fact there are situations in which this is not just anapproximation but an exact statement Thus we can trade a description interms of long-range interactions (Kcurren 6D 0 but K D 0) for one in which thereare unknown parameters describing the system but given these parametersthere are no long-range correlations (K 6D 0 Kcurren D 0) The two descriptionsare equivalent and this is captured by the subextensive entropy9

44 Taming the Fluctuations The preceding calculations of the subex-tensive entropy S1 are worthless unless we prove that the uctuations y arecontrollable We are going to discuss when and if this indeed happens Welimit the discussion to the case of nding a probability density (section 42)the case of learning a process (section 43) is very similar

Clarke and Barron (1990) solved essentially the same problem They didnot make a separation into the annealed and the uctuation term and thequantity they were interested in was a bit different from ours but inter-preting loosely they proved that modulo some reasonable technical as-sumptions on differentiability of functions in question the uctuation termalways approaches zero However they did not investigate the speed ofthis approach and we believe that they therefore missed some importantqualitative distinctions between different problems that can arise due to adifference between d and dVC In order to illuminate these distinctions wehere go through the trouble of analyzing uctuations all over again

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27 and 4.29 and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar\alpha) \right]
   \times \log_2 \left[ \prod_{i=1}^{N} Q(\vec{x}_i|\bar\alpha) \int d^K\alpha\, P(\alpha)\, e^{-N D_{\rm KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})} \right].   (4.52)

This expression can be decomposed into the terms identified above plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)}(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar\alpha) \right]
   \times \log_2 \left[ \int \frac{d^K\alpha\, P(\alpha)}{Z(\bar\alpha; N)}\, e^{-N D_{\rm KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})} \right],   (4.54)

where \psi is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S_1^{(f)} is negative. This last statement corresponds to the fact that, for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded:

S_1(N) \le N \int d^K\alpha \int d^K\bar\alpha\, P(\alpha)\, P(\bar\alpha)\, D_{\rm KL}(\bar\alpha\|\alpha) \equiv \langle N D_{\rm KL}(\bar\alpha\|\alpha) \rangle_{\bar\alpha,\alpha},   (4.55)

so that if we have a class of models (and a prior P(\alpha)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average


KL divergence, or of similar quantities, is a usual constraint in learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that \langle D_{\rm KL}(\bar\alpha\|\alpha) \rangle_{\bar\alpha,\alpha} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if \psi were zero, and \psi is the difference between an expectation value (a KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities \{Q(\vec{x}|\alpha)\} is finite. Then for any reasonable function F(\vec{x}; \beta), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P \left\{ \sup_\beta \frac{ \left| \frac{1}{N} \sum_j F(\vec{x}_j; \beta) - \int d\vec{x}\, Q(\vec{x}|\bar\alpha)\, F(\vec{x}; \beta) \right| }{ L[F] } > \epsilon \right\} < M(\epsilon, N, d_{VC})\, e^{-c N \epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all \beta without loss of generality, in view of equation 4.55. In addition, M(\epsilon, N, d_{VC}) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review, see Vapnik, 1998, and the references therein).
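The following sketch (a hypothetical toy setup of our own, not tied to any particular bound from the literature) illustrates the kind of uniform convergence that equation 4.56 quantifies: for a smooth one-parameter family F(x; \beta) = \cos(\beta x), with x drawn from a unit gaussian, the supremum over a grid of \beta of the deviation between the empirical mean and the expectation shrinks roughly as 1/\sqrt{N}.

```python
import numpy as np

# Illustrative sketch (our own, hypothetical setup): the supremum over a
# one-parameter family beta of the deviation between an empirical mean and
# its expectation shrinks roughly as 1/sqrt(N), the behaviour that bounds of
# the form of equation 4.56 quantify.
rng = np.random.default_rng(0)
betas = np.linspace(0.0, 5.0, 101)          # finite grid standing in for sup_beta
true_mean = np.exp(-betas**2 / 2.0)         # E[cos(beta x)] for x ~ N(0, 1)

for N in [100, 1000, 10000, 100000]:
    sups = []
    for _ in range(20):                      # average over independent samples
        x = rng.standard_normal(N)
        emp = np.cos(np.outer(betas, x)).mean(axis=1)
        sups.append(np.max(np.abs(emp - true_mean)))
    print(f"N={N:6d}   mean sup deviation = {np.mean(sups):.4f}   "
          f"sqrt(N)*deviation = {np.mean(sups)*np.sqrt(N):.2f}")
```

The last printed column stays roughly constant, which is the 1/\sqrt{N} behavior used in the argument below.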

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region \alpha \approx \bar\alpha is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of \alpha - \bar\alpha, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of \{\vec{x}_i\} with atypically large fluctuations is negligible (atypicality is defined as \psi \ge \delta, where \delta is some small constant independent of N). Since the fluctuations decrease as 1/\sqrt{N} (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of \psi. Then we realize that the averaging over \bar\alpha and \{\vec{x}_i\} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to \epsilon (this brings down an extra factor of N\epsilon). Thus, the worst-case contribution of these atypical sequences is

S_1^{(f),\,\rm atypical} \sim \int_\delta^\infty d\epsilon\, N^2 \epsilon^2\, M(\epsilon)\, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \ \mbox{for large}\ N.   (4.57)


This bound lets us focus our attention on the region \alpha \approx \bar\alpha. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\,\rm typical} \sim \int d^K\bar\alpha\, P(\bar\alpha)\, d^N\vec{x} \prod_j Q(\vec{x}_j|\bar\alpha)\; N\, (B\, A^{-1} B);

B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i|\bar\alpha)}{\partial \bar\alpha_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0;

(A)_{\mu\nu} = \frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar\alpha)}{\partial \bar\alpha_\mu\, \partial \bar\alpha_\nu}, \qquad \langle A \rangle_{\vec{x}} = F.   (4.58)

Here \langle \cdots \rangle_{\vec{x}} means an averaging with respect to all the \vec{x}_i's, keeping \bar\alpha constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1 + \delta), where \delta again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i|\bar\alpha)}{\partial \bar\alpha_\mu}\, (F^{-1})_{\mu\nu}\, \frac{\partial \log Q(\vec{x}_j|\bar\alpha)}{\partial \bar\alpha_\nu}   (4.59)

over all the \vec{x}_i's. Only the terms with i = j survive the averaging. There are K^2 N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of the fluctuations for d_VC < \infty. One may notice that we never used the specific form of M(\epsilon, N, d_{VC}), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of \alpha and \bar\alpha that lead to violation of the bound. That is, if the prior P(\alpha) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite


structure C of nested subsets C_1 \subset C_2 \subset C_3 \subset \cdots can be imposed onto the set C of all admissible solutions of a learning problem, such that C = \cup_n C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < \infty, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of the fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(\alpha) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of the smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension and one can guarantee eventual convergence of the fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x - \alpha)^2 \right], \qquad x \in (-\infty, +\infty);   (4.60)

Q(x|\alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \qquad x \in [0, 2\pi).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = \infty. In the first example, as we have shown, one may construct a uniform bound on the fluctuations irrespective of the prior P(\alpha). The second one does not allow this. Suppose that the prior is uniform in a box 0 < \alpha < \alpha_{max} and zero elsewhere, with \alpha_{max} rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of -\sin \alpha x. Adding a new data point would not help until that best, but wrong, parameter estimate is less than \alpha_{max}.^{10} So the fluctuations are large, and the predictive information is

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for \alpha_{max} \to \infty the weight in \rho(D; \bar\alpha) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of \alpha would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) \log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N \gg d_{VC}, which is never true if d_{VC} = \infty. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, & Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
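A small simulation (our own illustration; the prior box size \alpha_{max}, the grids, and the sample sizes are arbitrary choices) makes the preceding argument concrete: for the model of equation 4.61 with a wide uniform prior, the maximum-likelihood value of \alpha is typically a spuriously large value when N is small, because some large \alpha places nearly all of the data points at the crests of -\sin \alpha x, and only for larger N does the estimate settle near the true parameter.

```python
import numpy as np

# Hypothetical numerical illustration of the d_VC = infinity example of
# equation 4.61: with a wide uniform prior 0 < alpha < alpha_max, a small
# sample is often fit best by a spuriously large alpha, and only for larger N
# does the maximum-likelihood estimate settle near the true value.
rng = np.random.default_rng(1)
alpha_true, alpha_max = 1.0, 200.0
alphas = np.linspace(0.05, alpha_max, 4000)
xgrid = np.linspace(0.0, 2 * np.pi, 2001)
logZ = np.log(np.trapz(np.exp(-np.sin(np.outer(alphas, xgrid))), xgrid, axis=1))

def sample(alpha, n):
    """Rejection-sample from Q(x|alpha) = exp(-sin(alpha x)) / Z on [0, 2pi)."""
    out = []
    while len(out) < n:
        x = rng.uniform(0.0, 2 * np.pi, 4 * n)
        keep = rng.uniform(0.0, 1.0, 4 * n) < np.exp(-np.sin(alpha * x) - 1.0)
        out.extend(x[keep].tolist())
    return np.array(out[:n])

for N in [3, 10, 30, 100, 1000]:
    x = sample(alpha_true, N)
    loglik = -np.sin(np.outer(alphas, x)).sum(axis=1) - N * logZ
    print(f"N={N:5d}   alpha_ML = {alphas[np.argmax(loglik)]:8.2f}")
```

For the smallest N, the printout typically shows estimates far from \alpha = 1, exactly the large fluctuations described above; the crossover to sensible estimates happens only once N is large enough.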

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly, in the present context, there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which


the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as \rho \sim D^{(d-2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density \rho(D \to 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right], \qquad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (\mu) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar\alpha; N) = \int dD\, \rho(D; \bar\alpha)\, \exp[-ND]
   \approx A(\bar\alpha) \int dD\, \exp\left[ -\frac{B(\bar\alpha)}{D^\mu} - ND \right]
   \approx \tilde{A}(\bar\alpha)\, \exp\left[ -\frac{1}{2}\, \frac{\mu+2}{\mu+1} \ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar\alpha) = A(\bar\alpha) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} \cdot [B(\bar\alpha)]^{1/(2\mu+2)},   (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N:

S_1^{(a)}(N) \approx \frac{1}{\ln 2}\, \langle C(\bar\alpha) \rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \ \mbox{(bits)},   (4.66)

where \langle \cdots \rangle_{\bar\alpha} denotes an average over all the target models.
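As a sanity check on the saddle-point result (a numerical sketch under the assumed density of equation 4.62, with A, B, and \mu set to one for convenience), one can evaluate Z(N) = \int dD\, \rho(D)\, e^{-ND} directly and verify that -\ln Z(N) / N^{\mu/(\mu+1)} approaches the coefficient C of equation 4.65:

```python
import numpy as np

# Hypothetical numerical check of equations 4.63-4.65: with a density of models
# rho(D) = A exp(-B / D^mu), the "partition function" Z(N) = int dD rho(D) e^{-N D}
# obeys -ln Z(N) ~ C N^{mu/(mu+1)} at large N, with
# C = B^{1/(mu+1)} (mu^{-mu/(mu+1)} + mu^{1/(mu+1)}).
A, B, mu = 1.0, 1.0, 1.0
C_pred = B**(1/(mu+1)) * (mu**(-mu/(mu+1)) + mu**(1/(mu+1)))

D = np.logspace(-8, 2, 200001)               # log-spaced integration grid
for N in [10**2, 10**3, 10**4, 10**5, 10**6]:
    log_integrand = np.log(A) - B / D**mu - N * D
    m = log_integrand.max()                  # work in log space to avoid underflow
    logZ = m + np.log(np.trapz(np.exp(log_integrand - m), D))
    print(f"N={N:8d}   -ln Z / N^(mu/(mu+1)) = {-logZ / N**(mu/(mu+1)):.4f}"
          f"   (saddle-point C = {C_pred:.4f})")
```

The ratio drifts slowly toward C as N grows, with the residual drift coming from the logarithmic correction and the prefactor \tilde{A} in equation 4.63.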


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K \sim N^\nu. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) \sim N^\nu \log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes \sim K/2N bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries \sim N^{\nu-1} bits. Summing this up over all samples, we find S_1^{(a)} \sim N^\nu, and if we let \nu = \mu/(\mu+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (\nu = \mu/(\mu+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with an increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As \mu increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with \mu. However, if \mu \to \infty, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to the predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar


to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l_0) \exp[-\phi(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function \phi(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{\mathcal{Z}} \exp\left[ -\frac{l}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^2 \right] \delta\left[ \frac{1}{l_0} \int dx\, e^{-\phi(x)} - 1 \right],   (4.67)

where \mathcal{Z} is the normalization constant for P[\phi], the delta function ensures that each distribution Q(x) is normalized, and l sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, \ldots, x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the


development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[\phi(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.^{11}
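To make the construction concrete, here is a minimal sketch (our own discretization, not the implementation used in the works just cited) of the corresponding MAP estimate: on a grid, we minimize the smoothness action plus the negative log-likelihood, with the normalization of Q(x) = e^{-\phi(x)} / \int e^{-\phi} handled explicitly. The grid size, the value of the smoothness scale l, and the synthetic gaussian data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# A minimal sketch of the MAP density estimate implied by the smoothness prior
# of equation 4.67: Q(x) = exp(-phi(x)) / int exp(-phi), with phi penalized by
# (l/2) int (d phi / dx)^2.
rng = np.random.default_rng(2)
N, L, M, lam = 400, 1.0, 100, 1.0
data = np.clip(0.5 + 0.12 * rng.standard_normal(N), 0.0, L)   # synthetic target density
dx = L / M
edges = np.linspace(0.0, L, M + 1)
counts, _ = np.histogram(data, bins=edges)

def objective(phi):
    d = np.diff(phi)
    smooth = 0.5 * lam * np.sum(d**2) / dx                     # discrete (l/2) int (phi')^2
    norm = N * np.log(np.sum(np.exp(-phi)) * dx)               # enforces normalization of Q
    f = smooth + np.sum(counts * phi) + norm                   # minus log posterior (up to const)
    g = counts.astype(float)                                   # gradient of the data term
    g[:-1] -= lam * d / dx                                     # gradient of the smoothness term
    g[1:] += lam * d / dx
    w = np.exp(-phi); w /= w.sum()
    g -= N * w                                                 # gradient of the normalization term
    return f, g

res = minimize(objective, np.zeros(M), jac=True, method="L-BFGS-B")
Q = np.exp(-res.x); Q /= Q.sum() * dx                          # normalized density on the grid
print("estimated density at a few grid points:", np.round(Q[::20], 2))
```

The objective is convex in \phi, so the optimizer reliably finds the unique MAP profile; making l larger produces a smoother estimate, in line with the role of the smoothness scale in the prior.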

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate \rho(D, \bar\phi) in the model defined by equation 4.67.

With Q(x) = (1/l_0) \exp[-\phi(x)], we can write the KL divergence as

D_{\rm KL}[\bar\phi(x) \| \phi(x)] = \frac{1}{l_0} \int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)].   (4.68)

We want to compute the density

\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\left( D - D_{\rm KL}[\bar\phi(x) \| \phi(x)] \right)   (4.69)
   = M \int [d\phi(x)]\, P[\phi(x)]\, \delta\left( MD - M D_{\rm KL}[\bar\phi(x) \| \phi(x)] \right),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D \to 0. To compute this integral over all functions \phi(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar\phi) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{\rm KL})   (4.71)
   = M \int \frac{dz}{2\pi}\, \exp\left( izMD + \frac{izM}{l_0} \int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] \right)
   \times \int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right).   (4.72)

The inner integral over the functions \phi(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right)
   = \exp\left( -\frac{izM}{l_0} \int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM] \right),   (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{l}{2} \int dx \left( \frac{\partial\bar\phi}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{l\, l_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D \to 0 and MD^{3/2} \to \infty; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for \rho(D \to 0; \bar\phi). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar\phi(x)]}{D} \right),   (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\, l\, l_0}}\, \exp\left[ -\frac{l}{2} \int dx\, (\partial_x\bar\phi)^2 \right] \times \int dx\, \exp[-\bar\phi(x)/2],   (4.76)

B[\bar\phi(x)] = \frac{1}{16\, l\, l_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^2.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with \mu = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \frac{1}{\sqrt{l\, l_0}} \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} N^{1/2},   (4.78)

where \langle \cdots \rangle_{\bar\phi} denotes an average over the target distributions \bar\phi(x), weighted once again by P[\bar\phi(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \sqrt{N} \left( \frac{L}{l} \right)^{1/2} \ \mbox{bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size \sqrt{l/(N Q(x))} (Bialek,


Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of the predictive information arises from the number of parameters in the problem depending on the number of samples.
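A few lines of arithmetic (with a hypothetical example density; the numbers are only meant to show the scaling) confirm this bin-counting picture: with local bin size \sqrt{l/(N Q(x))}, the number of bins that fit in the range of x is \sqrt{N/l} \int dx\, \sqrt{Q(x)}, which grows as \sqrt{N}, in agreement with equations 4.78 and 4.79.

```python
import numpy as np

# Small illustration (hypothetical numbers) of the adaptive-binning reading of
# equation 4.79: with local bin size sqrt(l / (N Q(x))), the number of bins in
# the range of x is sqrt(N / l) * int dx sqrt(Q(x)), i.e. it grows as sqrt(N).
lam, L = 1.0, 1.0
x = np.linspace(0.0, L, 10001)
Q = 2.0 * x                        # an example normalized density on [0, L]
for N in [10**2, 10**4, 10**6]:
    n_bins = np.sqrt(N / lam) * np.trapz(np.sqrt(Q), x)
    print(f"N={N:8d}   number of adaptive bins ~ {n_bins:10.1f}   "
          f"ratio to sqrt(N): {n_bins/np.sqrt(N):.3f}")
```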

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step \sim\sqrt{N} bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities, n \sim \sqrt{N}, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter l describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of l (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing l to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying \sqrt{N} in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distri-

bution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information (all have power-law growth, smoother functions have smaller powers, since there is less to learn, and higher-dimensional problems have larger powers, since there is more to learn), but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.^{12} Rissanen (1984) and Clarke and Barron (1990), in full

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimation with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, \ldots, x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, \ldots, x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, \ldots, x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) \log_2 N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our


approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy), the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on


a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T \to \infty) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure the complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.^{13} There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.^{14} Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;^{15} it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that the extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size \tau, then for T \gg \tau, the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.^{16} So, if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number \mathcal{N}(N) of walks with N steps, we find that \mathcal{N}(N) \sim A N^\gamma z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = \log_2 z, a constant subextensive entropy \log_2 A, and a divergent subextensive term S_1(N) \to \gamma \log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent \gamma) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],   (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right),   (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures: the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have


seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about the dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?
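As a concrete, if elementary, illustration of what I_pred(T) measures (a toy example of our own; the transition matrix is arbitrary), the sketch below computes the mutual information between T past symbols and T future symbols of a two-state Markov chain by explicit enumeration. Because the source is Markov with known parameters, I_pred(T) saturates at a finite value, the first of the three behaviors discussed in this article.

```python
import numpy as np
from itertools import product

# Toy example: predictive information of a known two-state Markov chain,
# computed as the mutual information between a window of T past symbols and
# a window of T future symbols.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                      # transition matrix
pi = np.array([2/3, 1/3])                       # its stationary distribution

def seq_prob(seq):
    p = pi[seq[0]]
    for a, b in zip(seq[:-1], seq[1:]):
        p *= P[a, b]
    return p

for T in [1, 2, 3, 4, 5, 6]:
    joint, past, future = {}, {}, {}
    for seq in product([0, 1], repeat=2 * T):   # past = seq[:T], future = seq[T:]
        joint[(seq[:T], seq[T:])] = seq_prob(seq)
    for (a, b), p in joint.items():
        past[a] = past.get(a, 0.0) + p
        future[b] = future.get(b, 0.0) + p
    I = sum(p * np.log2(p / (past[a] * future[b]))
            for (a, b), p in joint.items() if p > 0)
    print(f"T={T}   I_pred(T) = {I:.4f} bits")
```

For sources with unknown parameters or long-range structure, the same quantity would instead grow logarithmically or as a power law, as derived in section 4.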

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description \hat{x}_past, we keep a certain amount of information about the past, I(\hat{x}_past; x_past), and we also preserve a certain amount of information about the future, I(\hat{x}_past; x_future). There is no single correct compression x_past \to \hat{x}_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(\hat{x}_past; x_future) versus I(\hat{x}_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(\hat{x}_past; x_future) \le I_pred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of \hat{x}_past such that I(\hat{x}_past; x_future) = I_pred(T). The entropy of this minimal \hat{x}_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet, one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997), and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ~ AN^ν and one with K parameters so that Ipred(N) ~ (K/2) log2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2A\nu}\,\log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu} .    (6.1)
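One way to see where this estimate comes from (our own filling in of the intermediate step; the text states the result directly) is to equate the two asymptotic forms of Ipred(N) at the crossover and solve the resulting transcendental equation to leading order when K/2A is large:

A N_c^{\nu} = \frac{K}{2}\,\log_2 N_c
\quad\Longrightarrow\quad
N_c = \left[ \frac{K}{2A}\,\log_2 N_c \right]^{1/\nu},
\qquad
\log_2 N_c \approx \frac{1}{\nu}\,\log_2\!\left( \frac{K}{2A} \right),

and substituting the leading-order estimate of log2 Nc back into the bracket reproduces equation 6.1.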

In the regime N ≪ Nc, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ Nc, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
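To get a feeling for the scales involved, equation 6.1 can be evaluated directly. In the sketch below (our own illustration) the prefactor A = 1 and the particular exponents ν are assumptions made only for this example; the value K ~ 5 × 10⁹ anticipates the synapse count quoted in the next paragraph.

import numpy as np

def n_crossover(K, A, nu):
    # Equation 6.1: N_c ~ [ K/(2*A*nu) * log2(K/(2*A)) ]**(1/nu).
    return (K / (2.0 * A * nu) * np.log2(K / (2.0 * A))) ** (1.0 / nu)

A = 1.0                                  # assumed prefactor of the power-law term
for K in (10.0, 1e3, 5e9):               # 5e9 ~ the synapse count quoted below
    for nu in (0.5, 0.9):
        print(f"K = {K:8.0e}, nu = {nu:.1f}:  N_c ~ {n_crossover(K, A, nu):.2e}")

With K ~ 5 × 10⁹ and ν < 1, Nc comes out well above the N ~ 10⁹ examples estimated below, consistent with the suggestion that biological learning may operate in the regime N ≪ Nc.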

It is tempting to suggest that the regime N ≪ Nc is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ~ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ~ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < Nc. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ Nc is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy ℓ(N) (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S1(N) ∝ N^(1/2), and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
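A rough sketch of the kind of estimate behind these statements, under loudly stated assumptions: it presumes a plain-text file corpus.txt is available (a hypothetical placeholder), and the naive plug-in estimate is trustworthy only for small block lengths N, far from the regime where the subextensive terms matter most; detecting them reliably at large N is exactly the experimental design problem raised in the text.

from collections import Counter
import math

def block_entropy(text, n):
    # Naive plug-in estimate of the block entropy S(n), in bits,
    # from the observed frequencies of n-letter blocks.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

text = open("corpus.txt").read().lower()      # hypothetical corpus file
for n in range(1, 9):                         # larger n is badly undersampled
    s_n = block_entropy(text, n)
    print(f"N={n}  S(N)={s_n:6.2f} bits   S(N)/N={s_n / n:5.3f} bits/letter")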

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations, and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the U.S.-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/ followed by the archive reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.


Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An Introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

Page 11: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as

Predictability Complexity and Learning 2419

costs of errors here the cost is that our encoding of the point xNC1 is longerthan it could be if we had perfect knowledge This ideal encoding has alength that we can nd by imagining that we observe the time series for aninnitely long time ideal D limN1 (N) but this is just another way ofdening the extensive component of the entropy S0 Thus we can dene alearning curve

L (N) acute (N) iexcl ideal (310)

D S(N C 1) iexcl S(N) iexcl S0

D S1(N C 1) iexcl S1(N)

frac14 S1(N)

ND

Ipred(N)

N (311)

and we see once again that the extensive component of the entropy cancelsIt is well known that the problems of prediction and compression are

related and what we have done here is to illustrate one aspect of this con-nection Specically if we ask how much one segment of a time series cantell us about the future the answer is contained in the subextensive behaviorof the entropy If we ask how much we are learning about the structure ofthe time series then the natural and universally dened learning curve is re-lated again to the subextensive entropy the learning curve is the derivativeof the predictive information

This universal learning curve is connected to the more conventionallearning curves in specic contexts As an example (cf section 41) considertting a set of data points fxn yng with some class of functions y D f (xI reg)where reg are unknown parameters that need to be learned we also allow forsome gaussian noise in our observation of the yn Here the natural learningcurve is the evolution of Acirc

2 for generalization as a function of the number ofexamples Within the approximations discussed below it is straightforwardto show that as N becomes large

DAcirc

2(N)E

D1

s2

Dpoundy iexcl f (xI reg)

curren2E (2 ln 2) L (N) C 1 (312)

where s2 is the variance of the noise Thus a more conventional measure

of performance at learning a function is equal to the universal learningcurve dened purely by information-theoretic criteria In other words if alearning curve is measured in the right units then its integral represents theamount of the useful information accumulated Then the subextensivity ofS1 guarantees that the learning curve decreases to zero as N 1

Different quantities related to the subextensive entropy have been dis-cussed in several contexts For example the code length (N) has beendened as a learning curve in the specic case of neural networks (Opperamp Haussler 1995) and has been termed the ldquothermodynamic diverdquo (Crutch-eld amp Shalizi 1999) and ldquoNth order block entropyrdquo (Grassberger 1986)

2420 W Bialek I Nemenman and N Tishby

The universal learning curve L (N) has been studied as the expected instan-taneous information gain by Haussler Kearns amp Schapire (1994) Mutualinformation between all of the past and all of the future (both semi-innite)is known also as the excess entropy effective measure complexity stored in-formation and so on (see Shalizi amp Crutcheld 1999 and references thereinas well as the discussion below) If the data allow a description by a modelwith a nite (and in some cases also innite) number of parameters thenmutual information between the data and the parameters is of interest Thisis easily shown to be equal to the predictive information about all of thefuture and it is also the cumulative information gain (Haussler Kearns ampSchapire 1994) or the cumulative relative entropy risk (Haussler amp Opper1997) Investigation of this problem can be traced back to Renyi (1964) andIbragimov and Hasminskii (1972) and some particular topics are still beingdiscussed (Haussler amp Opper 1995 Opper amp Haussler 1995 Herschkowitzamp Nadal 1999) In certain limits decoding a signal from a population of Nneurons can be thought of as ldquolearningrdquo a parameter from N observationswith a parallel notion of information transmission (Brunel amp Nadal 1998Kang amp Sompolinsky 2001) In addition the subextensive component ofthe description length (Rissanen 1978 1989 1996 Clarke amp Barron 1990)averaged over a class of allowed models also is similar to the predictiveinformation What is important is that the predictive information or subex-tensive entropy is related to all these quantities and that it can be dened forany process without a reference to a class of models It is this universalitythat we nd appealing and this universality is strongest if we focus on thelimit of long observation times Qualitatively in this regime (T 1) weexpect the predictive information to behave in one of three different waysas illustrated by the Ising models above it may either stay nite or grow toinnity together with T in the latter case the rate of growth may be slow(logarithmic) or fast (sublinear power) (see Barron and Cover 1991 for asimilar classication in the framework of the minimal description length[MDL] analysis)

The rst possibility limT1 Ipred(T) D constant means that no matterhow long we observe we gain only a nite amount of information aboutthe future This situation prevails for example when the dynamics are tooregular For a purely periodic system complete prediction is possible oncewe know the phase and if we sample the data at discrete times this is a niteamount of information longer period orbits intuitively are more complexand also have larger Ipred but this does not change the limiting behaviorlimT1 Ipred(T) D constant

Alternatively the predictive informationcan be small when the dynamicsare irregular but the best predictions are controlled only by the immediatepast so that the correlation times of the observable data are nite (see forexample Crutcheld amp Feldman 1997 and the xed J case in Figure 3)Imagine for example that we observe x(t) at a series of discrete times ftngand that at each time point we nd the value xn Then we always can write

Predictability Complexity and Learning 2421

the joint distribution of the N data points as a product

P(x1 x2 xN) D P(x1)P(x2 |x1)P(x3 |x2 x1) cent cent cent (313)

For Markov processes what we observe at tn depends only on events at theprevious time step tniexcl1 so that

P(xn |fx1middotimiddotniexcl1g) D P(xn |xniexcl1) (314)

and hence the predictive information reduces to

Ipred Dfrac12log2

microP(xn |xniexcl1)P(xn)

parafrac34 (315)

The maximum possible predictive information in this case is the entropy ofthe distribution of states at one time step which is bounded by the loga-rithm of the number of accessible states To approach this bound the systemmust maintain memory for a long time since the predictive information isreduced by the entropy of the transition probabilities Thus systems withmore states and longer memories have larger values of Ipred

More interesting are those cases in which Ipred(T) diverges at large T Inphysical systems we know that there are critical points where correlationtimes become innite so that optimal predictions will be inuenced byevents in the arbitrarily distant past Under these conditions the predictiveinformation can grow without bound as T becomes large for many systemsthe divergence is logarithmic Ipred(T 1) ln T as for the variable Jijshort-range Ising model of Figures 2 and 3 Long-range correlations alsoare important in a time series where we can learn some underlying rulesIt will turn out that when the set of possible rules can be described bya nite number of parameters the predictive information again divergeslogarithmically and the coefcient of this divergence counts the number ofparameters Finally a faster growth is also possible so that Ipred(T 1) Ta as for the variable Jij long-range Ising model and we shall see that thisbehavior emerges from for example nonparametric learning problems

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations orassociations persist over long periods of time In the usual theoretical mod-els there is some rule underlying the observable data and this rule is validforever examples seen at one time inform us about the rule and this infor-mation can be used to make predictions or generalizations The predictiveinformation quanties the average generalization power of examples andwe shall see that there is a direct connection between the predictive infor-mation and the complexity of the possible underlying rules

2422 W Bialek I Nemenman and N Tishby

41 A Test Case Let us begin with a simple example already referred toabove We observe two streams of data x and y or equivalently a stream ofpairs (x1 y1) (x2 y2) (xN yN ) Assume that we know in advance thatthe xrsquos are drawn independently and at random from a distribution P(x)while the yrsquos are noisy versions of some function acting on x

yn D f (xnI reg) C gn (41)

where f (xI reg) is a class of functions parameterized by reg and gn is noisewhich for simplicity we will assume is gaussian with known standard devi-ation s We can even start with a very simple case where the function classis just a linear combination of basis functions so that

f (xI reg) DKX

m D1am wm (x) (42)

The usual problem is to estimate from N pairs fxi yig the values of theparameters reg In favorable cases such as this we might even be able tond an effective regression formula We are interested in evaluating thepredictive information which means that we need to know the entropyS(N) We go through the calculation in some detail because it provides amodel for the more general case

To evaluate the entropy S(N) we rst construct the probability distribu-tion P(x1 y1 x2 y2 xN yN) The same set of rules applies to the wholedata stream which here means that the same parameters reg apply for allpairs fxi yig but these parameters are chosen at random from a distributionP (reg) at the start of the stream Thus we write

P(x1 y1 x2 y2 xN yN)

DZ

dKaP(x1 y1 x2 y2 xN yN |reg)P (reg) (43)

and now we need to construct the conditional distributions for xed reg Byhypothesis each x is chosen independently and once we x reg each yi iscorrelated only with the corresponding xi so that we have

P(x1 y1 x2 y2 xN yN |reg) DNY

iD1

poundP(xi) P(yi | xiI reg)

curren (44)

Furthermore with the simple assumptions above about the class of func-tions and gaussian noise the conditional distribution of yi has the form

P(yi | xiI reg) D1

p2p s2

exp

2

4iexcl1

2s2

sup3yi iexcl

KX

m D1am wm (xi)

23

5 (45)

Predictability Complexity and Learning 2423

Putting all these factors together

P(x1 y1 x2 y2 xN yN )

D

NY

iD1P(xi)

sup3 1p

2p s2

acuteN ZdK

a P (reg) exp

iexcl

12s2

NX

iD1y2

i

pound exp

iexclN

2

KX

m ordmD1Amordm(fxig)am aordm C N

KX

m D1Bm (fxi yig)am

(46)

where

Amordm(fxig) D1

s2N

NX

iD1wm (xi)wordm(xi) and (47)

Bm (fxi yig) D1

s2N

NX

iD1yiwm (xi) (48)

Our placement of the factors of N means that both Amordm and Bm are of orderunity as N 1 These quantities are empirical averages over the samplesfxi yig and if the wm are well behaved we expect that these empirical meansconverge to expectation values for most realizations of the series fxig

limN1

Amordm(fxig) D A1mordm D

1

s2

ZdxP(x)wm (x)wordm(x) (49)

limN1

Bm (fxi yig) D B1m D

KX

ordmD1A1

mordm Naordm (410)

where Nreg are the parameters that actually gave rise to the data stream fxi yigIn fact we can make the same argument about the terms in P

y2i

limN1

NX

iD1y2

i D Ns2

KX

m ordmD1Nam A1

mordm Naordm C 1

(411)

Conditions for this convergence of empirical means to expectation valuesare at the heart of learning theory Our approach here is rst to assume thatthis convergence works then to examine the consequences for the predictiveinformation and nally to address the conditions for and implications ofthis convergence breaking down

Putting the different factors together we obtain

P(x1 y1 x2 y2 xN yN ) f NY

iD1P(xi)

sup3 1p

2p s2

acuteN ZdK

aP (reg)

pound exppoundiexclNEN(regI fxi yig)

curren (412)

2424 W Bialek I Nemenman and N Tishby

where the effective ldquoenergyrdquo per sample is given by

EN(regI fxi yig) D12

C12

KX

m ordmD1

(am iexcl Nam )A1mordm

(aordm iexcl Naordm) (413)

Here we use the symbol f to indicate that we not only take the limit oflarge N but also neglect the uctuations Note that in this approximationthe dependence on the sample points themselves is hidden in the denitionof Nreg as being the parameters that generated the samples

The integral that we need to do in equation 412 involves an exponentialwith a large factor N in the exponent the energy EN is of order unity asN 1 This suggests that we evaluate the integral by a saddle-pointor steepest-descent approximation (similar analyses were performed byClarke amp Barron 1990 MacKay 1992 and Balasubramanian 1997)

ZdK

aP (reg) exppoundiexclNEN (regI fxi yig)

curren

frac14 P (regcl) expmicro

iexclNEN(regclI fxi yig) iexclK2

lnN2p

iexcl12

ln det FN C cent cent centpara

(414)

where regcl is the ldquoclassicalrdquo value of reg determined by the extremal conditions

EN(regI fxi yig)am

shyshyshyshyaDacl

D 0 (415)

the matrix FN consists of the second derivatives of EN

FN D 2EN (regI fxi yig)

am aordm

shyshyshyshyaDacl

(416)

and cent cent cent denotes terms that vanish as N 1 If we formulate the problemof estimating the parameters reg from the samples fxi yig then as N 1the matrix NFN is the Fisher information matrix (Cover amp Thomas 1991)the eigenvectors of this matrix give the principal axes for the error ellipsoidin parameter space and the (inverse) eigenvalues give the variances ofparameter estimates along each of these directions The classical regcl differsfrom Nreg only in terms of order 1N we neglect this difference and furthersimplify the calculation of leading terms as N becomes large After a littlemore algebra we nd the probability distribution we have been lookingfor

P(x1 y1 x2 y2 xN yN)

f NY

iD1P(xi)

1

ZAP ( Nreg) exp

microiexcl

N2

ln(2p es2) iexcl

K2

ln N C cent cent centpara

(417)

Predictability Complexity and Learning 2425

where the normalization constant

ZA Dp

(2p )K det A1 (418)

Again we note that the sample points fxi yig are hidden in the value of Nregthat gave rise to these points5

To evaluate the entropy S(N) we need to compute the expectation valueof the (negative) logarithm of the probability distribution in equation 417there are three terms One is constant so averaging is trivial The secondterm depends only on the xi and because these are chosen independentlyfrom the distribution P(x) the average again is easy to evaluate The thirdterm involves Nreg and we need to average this over the joint distributionP(x1 y1 x2 y2 xN yN) As above we can evaluate this average in stepsFirst we choose a value of the parameters Nreg then we average over thesamples given these parameters and nally we average over parametersBut because Nreg is dened as the parameters that generate the samples thisstepwise procedure simplies enormously The end result is that

S(N) D NmicroSx C

12

log2

plusmn2p es

2sup2para

CK2

log2 N

C Sa Claquolog2 ZA

nota

C cent cent cent (419)

where hcent cent centia means averaging over parameters Sx is the entropy of thedistribution of x

Sx D iexclZ

dx P(x) log2 P(x) (420)

and similarly for the entropy of the distribution of parameters

Sa D iexclZ

dKaP (reg) log2 P (reg) (421)

5 We emphasize once more that there are two approximations leading to equation 417First we have replaced empirical means by expectation values neglecting uctuations as-sociated with the particular set of sample points fxi yig Second we have evaluated theaverage over parameters in a saddle-point approximation At least under some condi-tions both of these approximations become increasingly accurate as N 1 so thatthis approach should yield the asymptotic behavior of the distribution and hence thesubextensive entropy at large N Although we give a more detailed analysis below it isworth noting here how things can go wrong The two approximations are independentand we could imagine that uctuations are important but saddle-point integration stillworks for example Controlling the uctuations turns out to be exactly the question ofwhether our nite parameterization captures the true dimensionality of the class of mod-els as discussed in the classic work of Vapnik Chervonenkis and others (see Vapnik1998 for a review) The saddle-point approximation can break down because the saddlepoint becomes unstable or because multiple saddle points become important It will turnout that instability is exponentially improbable as N 1 while multiple saddle pointsare a real problem in certain classes of models again when counting parameters does notreally measure the complexity of the model class

2426 W Bialek I Nemenman and N Tishby

The different terms in the entropy equation 419 have a straightforwardinterpretation First we see that the extensive term in the entropy

S0 D Sx C12

log2(2p es2) (422)

reects contributions from the random choice of x and from the gaussiannoise in y these extensive terms are independent of the variations in pa-rameters reg and these would be the only terms if the parameters were notvarying (that is if there were nothing to learn) There also is a term thatreects the entropy of variations in the parameters themselves Sa Thisentropy is not invariant with respect to coordinate transformations in theparameter space but the term hlog2 ZAia compensates for this noninvari-ance Finally and most interesting for our purposes the subextensive pieceof the entropy is dominated by a logarithmic divergence

S1(N) K2

log2 N (bits) (423)

The coefcient of this divergence counts the number of parameters indepen-dent of the coordinate system that we choose in the parameter space Fur-thermore this result does not depend on the set of basis functions fwm (x)gThis is a hint that the result in equation 423 is more universal than oursimple example

42 Learning a Parameterized Distribution The problem discussedabove is an example of supervised learning We are given examples of howthe points xn map into yn and from these examples we are to induce theassociation or functional relation between x and y An alternative view isthat the pair of points (x y) should be viewed as a vector Ex and what we arelearning is the distribution of this vector The problem of learning a distri-bution usually is called unsupervised learning but in this case supervisedlearning formally is a special case of unsupervised learning if we admit thatall the functional relations or associations that we are trying to learn have anelement of noise or stochasticity then this connection between supervisedand unsupervised problems is quite general

Suppose a series of random vector variables fExig is drawn independentlyfrom the same probability distribution Q(Ex|reg) and this distribution de-pends on a (potentially innite dimensional) vector of parameters reg Asabove the parameters are unknown and before the series starts they arechosen randomly from a distribution P (reg) With no constraints on the den-sities P (reg) or Q(Ex|reg) it is impossible to derive any regression formulas forparameter estimation but one can still say something about the entropy ofthe data series and thus the predictive information For a nite-dimensionalvector of parameters reg the literature on bounding similar quantities isespecially rich (Haussler Kearns amp Schapire 1994 Wong amp Shen 1995Haussler amp Opper 1995 1997 and references therein) and related asymp-totic calculations have been done (Clarke amp Barron 1990 MacKay 1992Balasubramanian 1997)

Predictability Complexity and Learning 2427

We begin with the denition of entropy

S(N) acute S[fExig]

D iexclZ

dEx1 cent cent cent dExNP(Ex1 Ex2 ExN) log2 P(Ex1 Ex2 ExN) (424)

By analogy with equation 43 we then write

P(Ex1 Ex2 ExN) DZ

dKaP (reg)

NY

iD1Q(Exi |reg) (425)

Next combining the equations 424 and 425 and rearranging the order ofintegration we can rewrite S(N) as

S(N) D iexclZ

dK NregP ( Nreg)

pound

8lt

ZdEx1 cent cent cent dExN

NY

jD1Q(Exj | Nreg) log2 P(fExig)

9=

(426)

Equation 426 allows an easy interpretation There is the ldquotruerdquo set of pa-rameters Nreg that gave rise to the data sequence Ex1 ExN with the probabilityQN

jD1 Q(Exj | Nreg) We need to average log2 P(Ex1 ExN) rst over all possiblerealizations of the data keeping the true parameters xed and then over theparameters Nreg themselves With this interpretation in mind the joint prob-ability density the logarithm of which is being averaged can be rewrittenin the following useful way

P(Ex1 ExN) DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg)NY

iD1

microQ(Exi |reg)

Q(Exi | Nreg)

para

DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNEN(regI fExig)] (427)

EN (regI fExig) D iexcl1N

NX

iD1ln

microQ(Exi |reg)

Q(Exi | Nreg)

para (428)

Since by our interpretation Nreg are the true parameters that gave rise tothe particular data fExig we may expect that the empirical average in equa-tion 428 will converge to the corresponding expectation value so that

EN(regI fExig) D iexclZ

dDxQ(x | Nreg) lnmicroQ(Ex|reg)

Q(Ex| Nreg)

paraiexcl y (reg NregI fxig) (429)

where y 0 as N 1 here we neglect y and return to this term below

2428 W Bialek I Nemenman and N Tishby

The rst term on the right-hand side of equation 429 is the Kullback-Leibler (KL) divergence DKL( Nregkreg) between the true distribution charac-terized by parameters Nreg and the possible distribution characterized by regThus at large N we have

P(Ex1 Ex2 ExN)

fNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNDKL( Nregkreg)] (430)

where again the notation f reminds us that we are not only taking the limitof largeN but also making another approximationinneglecting uctuationsBy the same arguments as above we can proceed (formally) to compute theentropy of this distribution We nd

S(N) frac14 S0 cent N C S(a)1 (N) (431)

S0 DZ

dKaP (reg)

microiexcl

ZdDxQ(Ex|reg) log2 Q(Ex|reg)

para and (432)

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2

microZdK

aP(reg)eiexclNDKL ( Naka)para

(433)

Here S(a)1 is an approximation to S1 that neglects uctuations y This is the

same as the annealed approximation in the statistical mechanics of disor-dered systems as has been used widely in the study of supervised learn-ing problems (Seung Sompolinsky amp Tishby 1992) Thus we can identifythe particular data sequence Ex1 ExN with the disorder EN(regI fExig) withthe energy of the quenched system and DKL( Nregkreg) with its annealed ana-log

The extensive term S0 equation 432 is the average entropy of a dis-tribution in our family of possible distributions generalizing the result ofequation 422 The subextensive terms in the entropy are controlled by theN dependence of the partition function

Z( NregI N) DZ

dKaP (reg) exp [iexclNDKL( Nregkreg)] (434)

and Sa1(N) D iexclhlog2 Z( NregI N)i Na is analogous to the free energy Since what is

important in this integral is the KL divergence between different distribu-tions it is natural to ask about the density of models that are KL divergenceD away from the target Nreg

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)] (435)

Predictability Complexity and Learning 2429

This density could be very different for different targets6 The density ofdivergences is normalized because the original distribution over parameterspace P(reg) is normalized

ZdDr (DI Nreg) D

ZdK

aP (reg) D 1 (436)

Finally the partition function takes the simple form

Z( NregI N) DZ

dDr (DI Nreg) exp[iexclND] (437)

We recall that in statistical mechanics the partition function is given by

Z(b ) DZ

dEr (E) exp[iexclbE] (438)

where r (E) is the density of states that have energy E and b is the inversetemperature Thus the subextensive entropy in our learning problem isanalogous to a system in which energy corresponds to the KL divergencerelative to the target model and temperature is inverse to the number ofexamples As we increase the length N of the time series we have observedwe ldquocoolrdquo the system and hence probe models that approach the targetthe dynamics of this approach is determined by the density of low-energystates that is the behavior of r (DI Nreg) as D 07

The structure of the partition function is determined by a competitionbetween the (Boltzmann) exponential term which favors models with smallD and the density term which favorsvalues of D that can be achieved by thelargest possible number of models Because there (typically) are many pa-rameters there are very few models with D 0 This picture of competitionbetween the Boltzmann factor and a density of states has been emphasizedin previous work on supervised learning (Haussler et al 1996)

The behavior of the density of states r (DI Nreg) at small D is related tothe more intuitive notion of dimensionality In a parameterized family ofdistributions the KL divergence between two distributions with nearbyparameters is approximately a quadratic form

DKL ( Nregkreg) frac1412

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm) C cent cent cent (439)

6 If parameter space is compact then a related description of the space of targets basedon metric entropy also called Kolmogorovrsquos 2 -entropy is used in the literature (see forexample Haussler amp Opper 1997) Metric entropy is the logarithm of the smallest numberof disjoint parts of the size not greater than 2 into which the target space can be partitioned

7 It may be worth emphasizing that this analogy to statistical mechanics emerges fromthe analysis of the relevant probability distributions rather than being imposed on theproblem through some assumptions about the nature of the learning algorithm

2430 W Bialek I Nemenman and N Tishby

where NF is the Fisher information matrix Intuitively if we have a reason-able parameterization of the distributions then similar distributions will benearby in parameter space and more important points that are far apartin parameter space will never correspond to similar distributions Clarkeand Barron (1990) refer to this condition as the parameterization forminga ldquosoundrdquo family of distributions If this condition is obeyed then we canapproximate the low D limit of the density r (DI Nreg)

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)]

frac14Z

dKaP (reg)d

D iexcl

12

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm)

DZ

dKaP ( Nreg C U cent raquo)d

D iexcl

12

X

m

Lmj2m

(440)

where U is a matrix that diagonalizes F

( U T cent F cent U )mordm D Lmdmordm (441)

The delta function restricts the components of raquo in equation 440 to be oforder

pD or less and so if P(reg) is smooth we can make a perturbation

expansion After some algebra the leading term becomes

r (D 0I Nreg) frac14 P ( Nreg)2p

K2

C (K2)(det F )iexcl12 D(Kiexcl2)2 (442)

Here as before K is the dimensionality of the parameter vector Computingthe partition function from equation 437 we nd

Z( NregI N 1) frac14 f ( Nreg) cent C (K2)

NK2 (443)

where f ( Nreg) is some function of the target parameter values Finally thisallows us to evaluate the subextensive entropy from equations 433 and 434

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2 Z( NregI N) (444)

K2

log2 N C cent cent cent (bits) (445)

where cent cent cent are nite as N 1 Thus general K-parameter model classeshave the same subextensive entropy as for the simplest example consideredin the previous section To the leading order this result is independent even

Predictability Complexity and Learning 2431

of the prior distribution P (reg) on the parameter space so that the predictiveinformation seems to count the number of parameters under some verygeneral conditions (cf Figure 3 for a very different numerical example ofthe logarithmic behavior)

Although equation 445 is true under a wide range of conditions thiscannot be the whole story Much of modern learning theory is concernedwith the fact that counting parameters is not quite enough to characterize thecomplexity of a model class the naive dimension of the parameter space Kshould be viewed in conjunction with the pseudodimension (also known asthe shattering dimension or Vapnik-Chervonenkis dimension dVC) whichmeasures capacity of the model class and with the phase-space dimension dwhich accounts for volumes in the space of models (Vapnik 1998 Opper1994) Both of these dimensions can differ from the number of parametersOne possibility is that dVC is innite when the number of parameters isnitea problem discussed below Another possibility is that the determinant of Fis zero and hence both dVC and d are smaller than the number of parametersbecause we have adopted a redundant description This sort of degeneracycould occur over a nite fraction but not all of the parameter space andthis is one way to generate an effective fractional dimensionality One canimagine multifractal models such that the effective dimensionality variescontinuously over the parameter space but it is not obvious where thiswould be relevant Finally models with d lt dVC lt 1 also are possible (seefor example Opper 1994) and this list probably is not exhaustive

Equation 442 lets us actually dene the phase-space dimension throughthe exponent in the small DKL behavior of the model density

r (D 0I Nreg) D(diexcl2)2 (446)

and then d appears in place of K as the coefcient of the log divergencein S1(N) (Clarke amp Barron 1990 Opper 1994) However this simple con-clusion can fail in two ways First it can happen that the density r (DI Nreg)accumulates a macroscopic weight at some nonzero value of D so that thesmall D behavior is irrelevant for the large N asymptotics Second the uc-tuations neglected here may be uncontrollably large so that the asymptoticsare never reached Since controllability of uctuations is a function of dVC(see Vapnik 1998 Haussler Kearns amp Schapire 1994 and later in this arti-cle) we may summarize this in the following way Provided that the smallD behavior of the density function is the relevant one the coefcient ofthe logarithmic divergence of Ipred measures the phase-space or the scalingdimension d and nothing else This asymptote is valid however only forN Agrave dVC It is still an open question whether the two pathologies that canviolate this asymptotic behavior are related

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution Q(\vec{x}_1, \ldots, \vec{x}_N | \alpha). Again, \alpha is an unknown parameter vector chosen randomly at the beginning of the series. If \alpha is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) \log_2 N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S[\{\vec{x}_i\}\,|\,\alpha] \equiv -\int d^N\vec{x}\, Q(\{\vec{x}_i\}|\alpha) \log_2 Q(\{\vec{x}_i\}|\alpha) \to N S_0 + S_0^*;\quad S_0^* = O(1),   (4.47)

D_{KL}\left[Q(\{\vec{x}_i\}|\bar\alpha)\,\|\,Q(\{\vec{x}_i\}|\alpha)\right] \to N D_{KL}(\bar\alpha\|\alpha) + o(N).   (4.48)

Here the quantities S_0, S_0^*, and D_{KL}(\bar\alpha\|\alpha) are defined by taking limits N \to \infty in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if \alpha is known, there are no long-range correlations in the data; all of the long-range predictability is associated with learning the parameters.^8 The second constraint, equation 4.48, is a less restrictive one; it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result

S(N) = S_0 \cdot N + S_1^{(a)}(N),   (4.49)

S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \ \text{(bits)},   (4.50)

where \cdots stands for terms of order one as N \to \infty. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(\vec{x}_1, \ldots, \vec{x}_N|\alpha). Suppose, for example, that S_0^* = (K^*/2) \log_2 N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \ \text{(bits)}.   (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to \alpha, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation, but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* \neq 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K \neq 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.^9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations \psi are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N}\left[d\vec{x}_j\, Q(\vec{x}_j|\bar\alpha)\right]
  \times \log_2\left[\prod_{i=1}^{N} Q(\vec{x}_i|\bar\alpha) \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})}\right].   (4.52)

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)} = -\int d^K\bar\alpha\, P(\bar\alpha) \prod_{j=1}^{N}\left[d\vec{x}_j\, Q(\vec{x}_j|\bar\alpha)\right]
  \times \log_2\left[\frac{1}{Z(\bar\alpha;N)} \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})}\right],   (4.54)

where \psi is defined as in equation 4.29 and Z as in equation 4.34. Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that, for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded:

S_1(N) \leq N \int d^K\alpha \int d^K\bar\alpha\, P(\alpha)\, P(\bar\alpha)\, D_{KL}(\bar\alpha\|\alpha) \equiv \langle N D_{KL}(\bar\alpha\|\alpha)\rangle_{\bar\alpha\alpha},   (4.55)

so that if we have a class of models (and a prior P(\alpha)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence, or of similar quantities, is a usual constraint in learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that \langle D_{KL}(\bar\alpha\|\alpha)\rangle_{\bar\alpha\alpha} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if \psi were zero, and \psi is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities \{Q(\vec{x}\,|\,\alpha)\} is finite. Then for any reasonable function F(\vec{x};\beta), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{\sup_\beta \frac{\left|\frac{1}{N}\sum_j F(\vec{x}_j;\beta) - \int d\vec{x}\, Q(\vec{x}\,|\,\bar\alpha)\, F(\vec{x};\beta)\right|}{L[F]} > \epsilon\right\} < M(\epsilon, N, d_{VC})\, e^{-cN\epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all \beta without loss of generality, in view of equation 4.55. In addition, M(\epsilon, N, d_{VC}) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region \alpha \approx \bar\alpha is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of \alpha - \bar\alpha, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of \{\vec{x}_i\} with atypically large fluctuations is negligible (atypicality is defined as \psi \geq \delta, where \delta is some small constant independent of N). Since the fluctuations decrease as 1/\sqrt{N} (see equation 4.56) and D_{KL} is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of \psi. Then we realize that the averaging over \bar\alpha and \{\vec{x}_i\} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to \epsilon (this brings down an extra factor of N\epsilon). Thus the worst-case contribution of these atypical sequences is

S_1^{(f),\,\mathrm{atypical}} \sim \int_\delta^\infty d\epsilon\, N^2\epsilon^2\, M(\epsilon)\, e^{-cN\epsilon^2} \sim e^{-cN\delta^2} \ll 1 \quad \text{for large } N.   (4.57)


This bound lets us focus our attention on the region \alpha \approx \bar\alpha. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\,\mathrm{typical}} \sim \int d^K\bar\alpha\, P(\bar\alpha)\, d^N\vec{x} \prod_j Q(\vec{x}_j|\bar\alpha)\; N\,(B\, A^{-1} B);

B_\mu = \frac{1}{N}\sum_i \frac{\partial \log Q(\vec{x}_i|\bar\alpha)}{\partial\bar\alpha_\mu}, \qquad \langle B\rangle_{\vec{x}} = 0;

(A)_{\mu\nu} = \frac{1}{N}\sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar\alpha)}{\partial\bar\alpha_\mu\,\partial\bar\alpha_\nu}, \qquad \langle A\rangle_{\vec{x}} = F.   (4.58)

Here \langle\cdots\rangle_{\vec{x}} means an averaging with respect to all the \vec{x}_i's, keeping \bar\alpha constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1+\delta), where \delta again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i|\bar\alpha)}{\partial\bar\alpha_\mu}\,(F^{-1})_{\mu\nu}\,\frac{\partial \log Q(\vec{x}_j|\bar\alpha)}{\partial\bar\alpha_\nu}   (4.59)

over all the \vec{x}_i's. Only the terms with i = j survive the averaging. There are K^2 N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < \infty. One may notice that we never used the specific form of M(\epsilon, N, d_{VC}), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of \alpha and \bar\alpha that lead to violation of the bound. That is, if the prior P(\alpha) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C_1 \subset C_2 \subset C_3 \subset \cdots can be imposed onto the set C of all admissible solutions of a learning problem, such that C = \bigcup C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < \infty, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(\alpha) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\alpha)^2\right], \quad x \in (-\infty; +\infty);   (4.60)

Q(x|\alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \quad x \in [0; 2\pi).   (4.61)

Learning the parameter in the first case is a d_{VC} = 1 problem; in the second case, d_{VC} = \infty. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(\alpha). The second one does not allow this. Suppose that the prior is uniform in a box 0 < \alpha < \alpha_{max} and zero elsewhere, with \alpha_{max} rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of -\sin \alpha x. Adding a new data point would not help until that best, but wrong, parameter estimate is less than \alpha_{max}.^{10} So the fluctuations are large, and the predictive information is

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for \alpha_{max} \to \infty the weight in \rho(D; \bar\alpha) at small D_{KL} vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of \alpha would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) \log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N \gg d_VC, which is never true if d_VC = \infty. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.
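To make the difference between these universality classes concrete, the following rough numerical sketch (Python; the helper functions, the true parameter, and the box size \alpha_{max} are all illustrative choices, not taken from the text) fits the d_{VC} = \infty model of equation 4.61 by maximum likelihood under a wide box prior. For small N the best-fitting \alpha tends to be a spuriously large value that puts the observed points near the crests of -\sin \alpha x, and only for larger N does the estimate approach the true value, which is the failure of the asymptotics for N not much larger than d_VC described above.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_q(alpha, n):
        # rejection sampling from Q(x|alpha) ~ exp(-sin(alpha x)) on [0, 2*pi)
        out = np.empty(0)
        while out.size < n:
            x = rng.uniform(0.0, 2.0 * np.pi, size=4 * n)
            keep = rng.uniform(0.0, np.e, size=x.size) < np.exp(-np.sin(alpha * x))
            out = np.concatenate([out, x[keep]])
        return out[:n]

    def log_lik(alpha, xs, grid):
        # log-likelihood of the data, with the normalization done numerically
        z = np.exp(-np.sin(alpha * grid)).mean() * 2.0 * np.pi
        return -np.sin(alpha * xs).sum() - xs.size * np.log(z)

    alpha_true, alpha_max = 1.0, 1000.0
    grid = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
    alphas = np.linspace(0.01, alpha_max, 5_000)
    for n in (10, 100, 1000):
        xs = sample_q(alpha_true, n)
        best = alphas[np.argmax([log_lik(a, xs, grid) for a in alphas])]
        print(n, round(best, 2))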

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as \rho \sim D^{(d-2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density \rho(D \to 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[-\frac{B}{D^\mu}\right], \quad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (\mu) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar\alpha; N) = \int dD\, \rho(D; \bar\alpha) \exp[-ND]
  \approx A(\bar\alpha) \int dD \exp\left[-\frac{B(\bar\alpha)}{D^\mu} - ND\right]
  \approx \tilde{A}(\bar\alpha) \exp\left[-\frac{1}{2}\,\frac{\mu+2}{\mu+1}\ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)}\right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar\alpha) = A(\bar\alpha)\left[\frac{2\pi\mu^{1/(\mu+1)}}{\mu+1}\right]^{1/2} \cdot [B(\bar\alpha)]^{1/(2\mu+2)},   (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)}\left[\frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)}\right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S_1^{(a)}(N) \to \frac{1}{\ln 2}\,\langle C(\bar\alpha)\rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \ \text{(bits)},   (4.66)

where \langle\cdots\rangle_{\bar\alpha} denotes an average over all the target models.
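The saddle-point estimate in equation 4.63 is easy to check numerically. The sketch below (Python with NumPy; B and \mu are arbitrary illustrative constants, and the dependence on the target model is suppressed) compares -\log_2 Z(N), computed by direct numerical integration, with the leading term C N^{\mu/(\mu+1)}/\ln 2 that feeds into equation 4.66; the two agree to leading order at large N.

    import numpy as np

    B, mu = 1.0, 1.0   # illustrative constants in rho(D -> 0) ~ A exp(-B / D**mu)

    def neg_log2_Z(N):
        # -log2 of Z(N) = int dD exp(-B/D**mu - N*D), cf. equation 4.63, evaluated
        # on a log-spaced grid and in the log domain to avoid underflow at large N
        D = np.logspace(-8, 2, 200_000)
        log_f = -B / D**mu - N * D
        m = log_f.max()
        return -(m + np.log(np.sum(np.gradient(D) * np.exp(log_f - m)))) / np.log(2.0)

    C = B ** (1.0 / (mu + 1.0)) * (mu ** (-mu / (mu + 1.0)) + mu ** (1.0 / (mu + 1.0)))
    for N in (10**2, 10**3, 10**4, 10**5):
        print(N, round(neg_log2_Z(N), 1), round(C * N ** (mu / (mu + 1.0)) / np.log(2.0), 1))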


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K \sim N^\nu. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) \sim N^\nu \log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes \sim K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries \sim N^{\nu-1} bits. Summing this up over all samples, we find S_1^{(a)} \sim N^\nu, and if we let \nu = \mu/(\mu+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (\nu = \mu/(\mu+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with an increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
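The summation step quoted above can be written out explicitly; taking K(n) = n^\nu for illustration (any constant prefactor only changes the coefficient),

\sum_{n=1}^{N} \frac{K(n)}{2n} \approx \frac{1}{2}\sum_{n=1}^{N} n^{\nu-1} \approx \frac{1}{2}\int_{1}^{N} dn\, n^{\nu-1} = \frac{N^{\nu}-1}{2\nu} \sim N^{\nu},

and setting \nu = \mu/(\mu+1) reproduces the N^{\mu/(\mu+1)} growth of equation 4.66.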

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As \mu increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with \mu. However, if \mu \to \infty, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l_0) \exp[-\phi(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function \phi(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{\mathcal{Z}} \exp\left[-\frac{l}{2}\int dx \left(\frac{\partial\phi}{\partial x}\right)^2\right] \delta\left[\frac{1}{l_0}\int dx\, e^{-\phi(x)} - 1\right],   (4.67)

where \mathcal{Z} is the normalization constant for P[\phi], the delta function ensures that each distribution Q(x) is normalized, and l sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, \ldots, x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[\phi(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.^{11}

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate \rho(D, \bar\phi) in the model defined by equation 4.67.

With Q(x) = (1/l_0) \exp[-\phi(x)], we can write the KL divergence as

D_{KL}[\bar\phi(x)\|\phi(x)] = \frac{1}{l_0}\int dx\, \exp[-\bar\phi(x)]\,[\phi(x) - \bar\phi(x)].   (4.68)

We want to compute the density

\rho(D;\bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\left(D - D_{KL}[\bar\phi(x)\|\phi(x)]\right)   (4.69)

= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\left(MD - M D_{KL}[\bar\phi(x)\|\phi(x)]\right),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D \to 0. To compute this integral over all functions \phi(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D;\bar\phi) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)] \exp(-izM D_{KL})   (4.71)

= M \int \frac{dz}{2\pi} \exp\left(izMD + \frac{izM}{l_0}\int dx\, \bar\phi(x)\exp[-\bar\phi(x)]\right)
  \times \int [d\phi(x)]\, P[\phi(x)] \exp\left(-\frac{izM}{l_0}\int dx\, \phi(x)\exp[-\bar\phi(x)]\right).   (4.72)

The inner integral over the functions \phi(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)] \exp\left(-\frac{izM}{l_0}\int dx\, \phi(x)\exp[-\bar\phi(x)]\right)
  = \exp\left(-\frac{izM}{l_0}\int dx\, \bar\phi(x)\exp[-\bar\phi(x)] - S_{\mathrm{eff}}[\bar\phi(x); zM]\right),   (4.73)

S_{\mathrm{eff}}[\bar\phi; zM] = \frac{l}{2}\int dx\left(\frac{\partial\bar\phi}{\partial x}\right)^2 + \frac{1}{2}\left(\frac{izM}{l\, l_0}\right)^{1/2}\int dx\, \exp[-\bar\phi(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D \to 0 and M D^{3/2} \to \infty; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for \rho(D \to 0; \bar\phi). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\left(-\frac{B[\bar\phi(x)]}{D}\right),   (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\, l\, l_0}} \exp\left[-\frac{l}{2}\int dx\, (\partial_x\bar\phi)^2\right] \times \int dx\, \exp[-\bar\phi(x)/2],   (4.76)

B[\bar\phi(x)] = \frac{1}{16\, l\, l_0}\left(\int dx\, \exp[-\bar\phi(x)/2]\right)^2.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with \mu = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\frac{1}{\sqrt{l\, l_0}}\left\langle \int dx\, \exp[-\bar\phi(x)/2]\right\rangle_{\bar\phi} N^{1/2},   (4.78)

where \langle\cdots\rangle_{\bar\phi} denotes an average over the target distributions \bar\phi(x), weighted once again by P[\bar\phi(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\sqrt{N}\left(\frac{L}{l}\right)^{1/2} \ \text{bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure, in which we divide the range of x into bins of local size \sqrt{l/[N Q(x)]} (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
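The bin counting behind this statement is a one-line estimate; assuming the local bin size \sqrt{l/[N Q(x)]} quoted above, the number of bins is

N_{\mathrm{bins}} \approx \int dx\, \sqrt{\frac{N Q(x)}{l}} \;\to\; \sqrt{\frac{N L}{l}} \quad \text{for } Q(x) = 1/L,

which reproduces the N and L dependence of equation 4.79 up to the overall 1/(2\ln 2) factor.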

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step \sim\sqrt{N} bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_{VC} = n problem. Now, as N grows, the elements with higher capacities, n \sim \sqrt{N}, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter l describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of l (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing l to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying \sqrt{N} in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information (all have power-law growth; smoother functions have smaller powers, since there is less to learn, and higher-dimensional problems have larger powers, since there is more to learn), but real calculations for these cases remain challenging.

5 I_pred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.^{12} Rissanen (1984) and Clarke and Barron (1990) in full generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, \ldots, x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, \ldots, x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, \ldots, x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) \log_2 N bits to the leading order.
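A minimal numerical illustration of this identification (not from the original text): for a sequence of N binary symbols with k ones, the code length under the best fixed parameter is the extensive piece N H(k/N), while the length of the Bayes-mixture code with a uniform prior on the parameter is -\log_2 \int_0^1 p^k (1-p)^{N-k} dp = \log_2[(N+1)\binom{N}{k}]. The Python sketch below (the empirical frequency 0.3 is an arbitrary illustrative choice) shows that their difference, the space attributable to coding the single parameter, grows as (1/2)\log_2 N, matching the subextensive entropy for K = 1.

    from math import lgamma, log, log2

    def log2_binom(n, k):
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2.0)

    def extensive_bits(n, k):
        # N * H(k/N): ideal code length when the parameter is known exactly
        p = k / n
        return -n * (p * log2(p) + (1.0 - p) * log2(1.0 - p))

    def bayes_bits(n, k):
        # -log2 of the marginal likelihood, int_0^1 p**k (1-p)**(n-k) dp,
        # which equals 1 / [(n+1) * binom(n, k)] for a uniform prior on p
        return log2(n + 1.0) + log2_binom(n, k)

    for n in (10**2, 10**3, 10**4, 10**5):
        k = int(0.3 * n)   # illustrative data with empirical frequency 0.3
        print(n, round(bayes_bits(n, k) - extensive_bits(n, k), 2), round(0.5 * log2(n), 2))

The two printed columns differ only by a constant of order one, as expected.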

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy). The complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or of not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T \to \infty) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure the complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.^{13} There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.^{14} Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;^{15} it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.
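The split between the noninvariant extensive piece and the invariant mutual information can be seen already for gaussian signals. In the sketch below (Python; the correlation coefficient and the rescaling factors are arbitrary illustrative choices), the differential entropy of a gaussian variable shifts by \log_2 a when the variable is rescaled by a, while the mutual information between two jointly gaussian variables, -(1/2)\log_2(1-\rho^2), depends only on the correlation \rho, which rescaling leaves unchanged.

    import numpy as np

    rho = 0.8   # illustrative correlation between the two signals

    def h_gauss(var):
        # differential entropy of a gaussian with variance var, in bits
        return 0.5 * np.log2(2.0 * np.pi * np.e * var)

    def mi_gauss(rho):
        # mutual information of two jointly gaussian variables with correlation rho, in bits
        return -0.5 * np.log2(1.0 - rho**2)

    for a in (1.0, 10.0, 1000.0):
        # rescaling x -> a*x changes the entropy but leaves rho, and hence the
        # mutual information, unchanged
        print(a, round(h_gauss(a**2), 3), round(mi_gauss(rho), 3))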

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size \tau, then for T \gg \tau, the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.^{16} So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number \mathcal{N} of walks with N steps, we find that \mathcal{N}(N) \sim A N^\gamma z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = \log_2 z, a constant subextensive entropy \log_2 A, and a divergent subextensive term S_1(N) \to \gamma \log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent \gamma) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
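Written out, the decomposition used here is just the logarithm of the quoted asymptotic count:

S(N) = \log_2 \mathcal{N}(N) \to N \log_2 z + \gamma \log_2 N + \log_2 A,

with the three terms being, respectively, the extensive entropy, the divergent (and universal) subextensive term, and the constant subextensive term.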

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\mathrm{cont}} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],   (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\mathrm{rel}} = -\int dx(t)\, P[x(t)] \log_2\left(\frac{P[x(t)]}{Q[x(t)]}\right),   (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language, we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters; again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data; using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.^17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,^18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ~ A N^ν and one with K parameters so that I_pred(N) ~ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}.   (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
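
A short numerical sketch (with illustrative parameter values chosen here, not taken from the paper) makes the crossover of equation 6.1 concrete: it compares the two growth laws and reports both the formula's estimate of N_c and the point where the curves actually cross.

# Sketch: crossover between (K/2) log2 N and A N^nu predictive information.
import numpy as np

K, A, nu = 1000.0, 1.0, 0.5          # hypothetical values for illustration
N = np.logspace(1, 12, 400)

I_param = 0.5 * K * np.log2(N)        # finite-parameter model
I_nonparam = A * N**nu                # nonparametric model

Nc = (K / (2 * A * nu) * np.log2(K / (2 * A)))**(1.0 / nu)   # equation 6.1 (approximate)
crossing = N[np.argmin(np.abs(I_param - I_nonparam))]

print("N_c from eq. 6.1: %.3g" % Nc)
print("numerical crossing of the two curves: %.3g" % crossing)
# For N well below N_c the power-law (nonparametric) model carries less
# predictive information, as discussed in the text.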

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ~ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ~ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.^19
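
A toy version of Shannon's guessing game (not from the paper) can be sketched in a few lines. Here a bigram model, estimated from the same short example string purely for illustration, plays the role of the native speaker: it ranks candidate next letters, the rank of the true next letter is recorded, and the entropy of the rank distribution upper-bounds the conditional entropy. Because the "speaker" is trained on the text it is asked to predict, the bound here is optimistic; the point is only the mechanics of the transform.

# Sketch: Shannon-style guessing-number transform with a bigram predictor.
from collections import Counter, defaultdict
import math

text = ("the predictive information measures the amount of structure "
        "in a time series and the structure is related to predictability")

# Bigram counts: how often each letter follows each letter.
follow = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    follow[a][b] += 1

alphabet = sorted(set(text))
ranks = []
for a, b in zip(text, text[1:]):
    # Guess letters in order of decreasing bigram count (ties alphabetical).
    order = sorted(alphabet, key=lambda c: (-follow[a][c], c))
    ranks.append(order.index(b) + 1)

# Entropy of the distribution of guess numbers (bits per letter).
counts = Counter(ranks)
n = len(ranks)
H_guess = -sum((c / n) * math.log2(c / n) for c in counts.values())
print("upper bound on conditional entropy: %.2f bits per letter" % H_guess)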

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/, where the e-print number given below is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



The universal learning curve Λ(N) has been studied as the expected instantaneous information gain by Haussler, Kearns, & Schapire (1994). Mutual information between all of the past and all of the future (both semi-infinite) is known also as the excess entropy, effective measure complexity, stored information, and so on (see Shalizi & Crutchfield, 1999, and references therein, as well as the discussion below). If the data allow a description by a model with a finite (and in some cases also infinite) number of parameters, then mutual information between the data and the parameters is of interest. This is easily shown to be equal to the predictive information about all of the future, and it is also the cumulative information gain (Haussler, Kearns, & Schapire, 1994) or the cumulative relative entropy risk (Haussler & Opper, 1997). Investigation of this problem can be traced back to Renyi (1964) and Ibragimov and Hasminskii (1972), and some particular topics are still being discussed (Haussler & Opper, 1995; Opper & Haussler, 1995; Herschkowitz & Nadal, 1999). In certain limits, decoding a signal from a population of N neurons can be thought of as "learning" a parameter from N observations, with a parallel notion of information transmission (Brunel & Nadal, 1998; Kang & Sompolinsky, 2001). In addition, the subextensive component of the description length (Rissanen, 1978, 1989, 1996; Clarke & Barron, 1990) averaged over a class of allowed models also is similar to the predictive information. What is important is that the predictive information or subextensive entropy is related to all these quantities and that it can be defined for any process without a reference to a class of models. It is this universality that we find appealing, and this universality is strongest if we focus on the limit of long observation times. Qualitatively, in this regime (T → ∞), we expect the predictive information to behave in one of three different ways, as illustrated by the Ising models above: it may either stay finite or grow to infinity together with T; in the latter case, the rate of growth may be slow (logarithmic) or fast (sublinear power) (see Barron and Cover, 1991, for a similar classification in the framework of the minimal description length [MDL] analysis).

The first possibility, lim_{T→∞} I_pred(T) = constant, means that no matter how long we observe, we gain only a finite amount of information about the future. This situation prevails, for example, when the dynamics are too regular. For a purely periodic system, complete prediction is possible once we know the phase, and if we sample the data at discrete times, this is a finite amount of information; longer period orbits intuitively are more complex and also have larger I_pred, but this does not change the limiting behavior lim_{T→∞} I_pred(T) = constant.

Alternatively, the predictive information can be small when the dynamics are irregular, but the best predictions are controlled only by the immediate past, so that the correlation times of the observable data are finite (see, for example, Crutchfield & Feldman, 1997, and the fixed J case in Figure 3). Imagine, for example, that we observe x(t) at a series of discrete times {t_n}, and that at each time point we find the value x_n. Then we always can write the joint distribution of the N data points as a product,

P(x_1, x_2, \ldots, x_N) = P(x_1)\, P(x_2 | x_1)\, P(x_3 | x_2, x_1) \cdots.   (3.13)

For Markov processes, what we observe at t_n depends only on events at the previous time step t_{n-1}, so that

P(x_n | \{x_{1 \le i \le n-1}\}) = P(x_n | x_{n-1}),   (3.14)

and hence the predictive information reduces to

I_{\rm pred} = \left\langle \log_2 \left[ \frac{P(x_n | x_{n-1})}{P(x_n)} \right] \right\rangle.   (3.15)

The maximum possible predictive information in this case is the entropy of the distribution of states at one time step, which is bounded by the logarithm of the number of accessible states. To approach this bound, the system must maintain memory for a long time, since the predictive information is reduced by the entropy of the transition probabilities. Thus, systems with more states and longer memories have larger values of I_pred.
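
For a concrete example (not from the paper), the sketch below evaluates equation 3.15 for a two-state Markov chain with an arbitrary illustrative transition matrix and compares the result with the entropy of the one-step state distribution, the bound discussed above.

# Sketch: one-step predictive information of a two-state Markov chain.
import numpy as np

P = np.array([[0.9, 0.1],      # P[i, j] = Prob(next state = j | current state = i)
              [0.2, 0.8]])

# Stationary distribution: left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

# Ipred = sum_i pi_i sum_j P_ij log2( P_ij / pi_j ), cf. equation 3.15
Ipred = sum(pi[i] * P[i, j] * np.log2(P[i, j] / pi[j])
            for i in range(2) for j in range(2))

# Bound: entropy of the distribution of states at one time step.
S_states = -np.sum(pi * np.log2(pi))
print("Ipred = %.4f bits, bounded by S = %.4f bits" % (Ipred, S_states))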

More interesting are those cases in which I_pred(T) diverges at large T. In physical systems, we know that there are critical points where correlation times become infinite, so that optimal predictions will be influenced by events in the arbitrarily distant past. Under these conditions, the predictive information can grow without bound as T becomes large; for many systems the divergence is logarithmic, I_pred(T → ∞) ∝ ln T, as for the variable J_ij, short-range Ising model of Figures 2 and 3. Long-range correlations also are important in a time series where we can learn some underlying rules. It will turn out that when the set of possible rules can be described by a finite number of parameters, the predictive information again diverges logarithmically, and the coefficient of this divergence counts the number of parameters. Finally, a faster growth is also possible, so that I_pred(T → ∞) ∝ T^α, as for the variable J_ij long-range Ising model, and we shall see that this behavior emerges from, for example, nonparametric learning problems.

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations or associations persist over long periods of time. In the usual theoretical models, there is some rule underlying the observable data, and this rule is valid forever; examples seen at one time inform us about the rule, and this information can be used to make predictions or generalizations. The predictive information quantifies the average generalization power of examples, and we shall see that there is a direct connection between the predictive information and the complexity of the possible underlying rules.


4.1 A Test Case. Let us begin with a simple example, already referred to above. We observe two streams of data x and y, or equivalently a stream of pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N). Assume that we know in advance that the x's are drawn independently and at random from a distribution P(x), while the y's are noisy versions of some function acting on x,

y_n = f(x_n; \alpha) + \eta_n,   (4.1)

where f(x; α) is a class of functions parameterized by α, and η_n is noise, which for simplicity we will assume is gaussian with known standard deviation σ. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

f(x; \alpha) = \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x).   (4.2)

The usual problem is to estimate, from N pairs {x_i, y_i}, the values of the parameters α. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy S(N). We go through the calculation in some detail because it provides a model for the more general case.
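
The following sketch (not from the paper) simulates this test case for an assumed polynomial basis φ_μ(x) = x^μ, with illustrative choices of K, σ, the prior, and P(x), and recovers the parameters by ordinary least squares, the "regression formula" mentioned in the text.

# Sketch: generate data from equations 4.1-4.2 and estimate the parameters.
import numpy as np

rng = np.random.default_rng(0)
K, N, sigma = 3, 500, 0.1
alpha_true = rng.normal(size=K)            # parameters drawn once from a prior

x = rng.uniform(-1.0, 1.0, size=N)          # x_i drawn i.i.d. from P(x)
Phi = np.column_stack([x**mu for mu in range(K)])   # Phi[i, mu] = phi_mu(x_i)
y = Phi @ alpha_true + sigma * rng.normal(size=N)   # equation 4.1

# Least-squares (maximum-likelihood) estimate of alpha.
alpha_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("true alpha:     ", np.round(alpha_true, 3))
print("estimated alpha:", np.round(alpha_hat, 3))
# The estimate compresses the past data while preserving its predictive power,
# as discussed in the text around the information bottleneck.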

To evaluate the entropy S(N), we first construct the probability distribution P(x_1, y_1, x_2, y_2, ..., x_N, y_N). The same set of rules applies to the whole data stream, which here means that the same parameters α apply for all pairs {x_i, y_i}, but these parameters are chosen at random from a distribution P(α) at the start of the stream. Thus we write

P(x_1, y_1, \ldots, x_N, y_N) = \int d^K\alpha\, P(x_1, y_1, \ldots, x_N, y_N | \alpha)\, \mathcal{P}(\alpha),   (4.3)

and now we need to construct the conditional distributions for fixed α. By hypothesis, each x is chosen independently, and once we fix α, each y_i is correlated only with the corresponding x_i, so that we have

P(x_1, y_1, \ldots, x_N, y_N | \alpha) = \prod_{i=1}^{N} \left[ P(x_i)\, P(y_i | x_i; \alpha) \right].   (4.4)

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of y_i has the form

P(y_i | x_i; \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} \left( y_i - \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x_i) \right)^{2} \right].   (4.5)


Putting all these factors together,

P(x_1, y_1, \ldots, x_N, y_N) = \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \int d^K\alpha\, \mathcal{P}(\alpha)\, \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} y_i^2 \right]
\qquad \times \exp\left[ -\frac{N}{2} \sum_{\mu,\nu=1}^{K} A_{\mu\nu}(\{x_i\}) \alpha_\mu \alpha_\nu + N \sum_{\mu=1}^{K} B_\mu(\{x_i, y_i\}) \alpha_\mu \right],   (4.6)

where

A_{\mu\nu}(\{x_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} \phi_\mu(x_i) \phi_\nu(x_i), \quad {\rm and}   (4.7)

B_\mu(\{x_i, y_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} y_i \phi_\mu(x_i).   (4.8)

Our placement of the factors of N means that both A_{μν} and B_μ are of order unity as N → ∞. These quantities are empirical averages over the samples {x_i, y_i}, and if the φ_μ are well behaved, we expect that these empirical means converge to expectation values for most realizations of the series {x_i}:

\lim_{N \to \infty} A_{\mu\nu}(\{x_i\}) = A^{\infty}_{\mu\nu} = \frac{1}{\sigma^2} \int dx\, P(x)\, \phi_\mu(x) \phi_\nu(x), \quad {\rm and}   (4.9)

\lim_{N \to \infty} B_\mu(\{x_i, y_i\}) = B^{\infty}_\mu = \sum_{\nu=1}^{K} A^{\infty}_{\mu\nu} \bar{\alpha}_\nu,   (4.10)

where ᾱ are the parameters that actually gave rise to the data stream {x_i, y_i}. In fact, we can make the same argument about the terms in Σ_i y_i²:

\lim_{N \to \infty} \sum_{i=1}^{N} y_i^2 = N \sigma^2 \left[ \sum_{\mu,\nu=1}^{K} \bar{\alpha}_\mu A^{\infty}_{\mu\nu} \bar{\alpha}_\nu + 1 \right].   (4.11)

Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for, and implications of, this convergence breaking down.
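
The convergence assumed here is easy to see numerically. The sketch below (not from the paper) reuses the illustrative polynomial basis and uniform P(x) from the earlier sketch and compares the empirical A_{μν} and B_μ of equations 4.7 and 4.8 with the expectation values of equations 4.9 and 4.10 as N grows.

# Sketch: empirical A, B versus their N -> infinity limits.
import numpy as np

rng = np.random.default_rng(1)
K, sigma = 3, 0.1
alpha_bar = rng.normal(size=K)

def empirical_A_B(N):
    x = rng.uniform(-1.0, 1.0, size=N)
    Phi = np.column_stack([x**mu for mu in range(K)])
    y = Phi @ alpha_bar + sigma * rng.normal(size=N)
    A = Phi.T @ Phi / (sigma**2 * N)            # equation 4.7
    B = Phi.T @ y / (sigma**2 * N)              # equation 4.8
    return A, B

# Expectation values for P(x) uniform on [-1, 1]:
#   A_inf[mu, nu] = (1/sigma^2) * integral of (dx/2) x^(mu+nu) over [-1, 1]
A_inf = np.array([[(1.0 / sigma**2) * ((1 - (-1)**(m + n + 1)) / (2 * (m + n + 1)))
                   for n in range(K)] for m in range(K)])
B_inf = A_inf @ alpha_bar                        # equation 4.10

for N in (100, 10_000, 1_000_000):
    A, B = empirical_A_B(N)
    print("N=%7d  |A - A_inf|_max = %.4f   |B - B_inf|_max = %.4f"
          % (N, np.max(np.abs(A - A_inf)), np.max(np.abs(B - B_inf))))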

Putting the different factors together, we obtain

P(x_1, y_1, \ldots, x_N, y_N) \simeq \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \int d^K\alpha\, \mathcal{P}(\alpha)\, \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right],   (4.12)


where the effective "energy" per sample is given by

E_N(\alpha; \{x_i, y_i\}) = \frac{1}{2} + \frac{1}{2} \sum_{\mu,\nu=1}^{K} (\alpha_\mu - \bar{\alpha}_\mu)\, A^{\infty}_{\mu\nu}\, (\alpha_\nu - \bar{\alpha}_\nu).   (4.13)

Here we use the symbol ≃ to indicate that we not only take the limit of large N but also neglect the fluctuations. Note that in this approximation, the dependence on the sample points themselves is hidden in the definition of ᾱ as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor N in the exponent; the energy E_N is of order unity as N → ∞. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

\int d^K\alpha\, \mathcal{P}(\alpha)\, \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right]
\approx \mathcal{P}(\alpha_{\rm cl})\, \exp\left[ -N E_N(\alpha_{\rm cl}; \{x_i, y_i\}) - \frac{K}{2} \ln \frac{N}{2\pi} - \frac{1}{2} \ln \det \mathcal{F}_N + \cdots \right],   (4.14)

where α_cl is the "classical" value of α determined by the extremal conditions

\left. \frac{\partial E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu} \right|_{\alpha = \alpha_{\rm cl}} = 0,   (4.15)

the matrix F_N consists of the second derivatives of E_N,

\mathcal{F}_N = \left. \frac{\partial^2 E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu \partial \alpha_\nu} \right|_{\alpha = \alpha_{\rm cl}},   (4.16)

and ··· denotes terms that vanish as N → ∞. If we formulate the problem of estimating the parameters α from the samples {x_i, y_i}, then as N → ∞ the matrix N F_N is the Fisher information matrix (Cover & Thomas, 1991); the eigenvectors of this matrix give the principal axes for the error ellipsoid in parameter space, and the (inverse) eigenvalues give the variances of parameter estimates along each of these directions. The classical α_cl differs from ᾱ only in terms of order 1/N; we neglect this difference and further simplify the calculation of leading terms as N becomes large. After a little more algebra, we find the probability distribution we have been looking for:

P(x_1, y_1, \ldots, x_N, y_N) \simeq \left[ \prod_{i=1}^{N} P(x_i) \right] \frac{1}{Z_A}\, \mathcal{P}(\bar{\alpha})\, \exp\left[ -\frac{N}{2} \ln(2\pi e \sigma^2) - \frac{K}{2} \ln N + \cdots \right],   (4.17)


where the normalization constant

Z_A = \sqrt{(2\pi)^K \det A^{\infty}}.   (4.18)

Again we note that the sample points {x_i, y_i} are hidden in the value of ᾱ that gave rise to these points.^5

To evaluate the entropy S(N), we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the x_i, and because these are chosen independently from the distribution P(x), the average again is easy to evaluate. The third term involves ᾱ, and we need to average this over the joint distribution P(x_1, y_1, ..., x_N, y_N). As above, we can evaluate this average in steps: first we choose a value of the parameters ᾱ, then we average over the samples given these parameters, and finally we average over parameters. But because ᾱ is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

S(N) = N \left[ S_x + \frac{1}{2} \log_2 \left( 2\pi e \sigma^2 \right) \right] + \frac{K}{2} \log_2 N + S_\alpha + \left\langle \log_2 Z_A \right\rangle_\alpha + \cdots,   (4.19)

where ⟨···⟩_α means averaging over parameters, S_x is the entropy of the distribution of x,

S_x = -\int dx\, P(x) \log_2 P(x),   (4.20)

and similarly for the entropy of the distribution of parameters,

S_\alpha = -\int d^K\alpha\, \mathcal{P}(\alpha) \log_2 \mathcal{P}(\alpha).   (4.21)

5 We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points {x_i, y_i}. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as N → ∞, so that this approach should yield the asymptotic behavior of the distribution and hence the subextensive entropy at large N. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as N → ∞, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.


The different terms in the entropy, equation 4.19, have a straightforward interpretation. First, we see that the extensive term in the entropy,

S_0 = S_x + \frac{1}{2} \log_2 (2\pi e \sigma^2),   (4.22)

reflects contributions from the random choice of x and from the gaussian noise in y; these extensive terms are independent of the variations in parameters α, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, S_α. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term ⟨log_2 Z_A⟩_α compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

S_1(N) \to \frac{K}{2} \log_2 N \ \ {\rm (bits)}.   (4.23)

The coefficient of this divergence counts the number of parameters, independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions {φ_μ(x)}. This is a hint that the result in equation 4.23 is more universal than our simple example.

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning: we are given examples of how the points x_n map into y_n, and from these examples we are to induce the association or functional relation between x and y. An alternative view is that the pair of points (x, y) should be viewed as a vector x⃗, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables {x⃗_i} is drawn independently from the same probability distribution Q(x⃗|α), and this distribution depends on a (potentially infinite dimensional) vector of parameters α. As above, the parameters are unknown, and before the series starts, they are chosen randomly from a distribution P(α). With no constraints on the densities P(α) or Q(x⃗|α), it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters α, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997, and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).


We begin with the definition of entropy,

S(N) \equiv S[\{\vec{x}_i\}] = -\int d\vec{x}_1 \cdots d\vec{x}_N\, P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \log_2 P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N).   (4.24)

By analogy with equation 4.3, we then write

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) = \int d^K\alpha\, \mathcal{P}(\alpha) \prod_{i=1}^{N} Q(\vec{x}_i | \alpha).   (4.25)

Next, combining the equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite S(N) as

S(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \left\{ \int d\vec{x}_1 \cdots d\vec{x}_N \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \log_2 P(\{\vec{x}_i\}) \right\}.   (4.26)

Equation 4.26 allows an easy interpretation. There is the "true" set of parameters ᾱ that gave rise to the data sequence x⃗_1, ..., x⃗_N with the probability Π_{j=1}^{N} Q(x⃗_j|ᾱ). We need to average log_2 P(x⃗_1, ..., x⃗_N) first over all possible realizations of the data keeping the true parameters fixed, and then over the parameters ᾱ themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

P(\vec{x}_1, \ldots, \vec{x}_N) = \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \prod_{i=1}^{N} \left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right]
\qquad = \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N E_N(\alpha; \{\vec{x}_i\}) \right],   (4.27)

E_N(\alpha; \{\vec{x}_i\}) = -\frac{1}{N} \sum_{i=1}^{N} \ln \left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right].   (4.28)

Since by our interpretation, ᾱ are the true parameters that gave rise to the particular data {x⃗_i}, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

E_N(\alpha; \{\vec{x}_i\}) = -\int d^D x\, Q(\vec{x} | \bar{\alpha}) \ln \left[ \frac{Q(\vec{x} | \alpha)}{Q(\vec{x} | \bar{\alpha})} \right] - \psi(\alpha, \bar{\alpha}; \{\vec{x}_i\}),   (4.29)

where ψ → 0 as N → ∞; here we neglect ψ and return to this term below.


The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, D_KL(ᾱ‖α), between the true distribution characterized by parameters ᾱ and the possible distribution characterized by α. Thus, at large N, we have

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \simeq \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha} \| \alpha) \right],   (4.30)

where again the notation ≃ reminds us that we are not only taking the limit of large N but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find

S(N) \approx S_0 \cdot N + S_1^{(a)}(N),   (4.31)

S_0 = \int d^K\alpha\, \mathcal{P}(\alpha) \left[ -\int d^D x\, Q(\vec{x}|\alpha) \log_2 Q(\vec{x}|\alpha) \right], \quad {\rm and}   (4.32)

S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, \mathcal{P}(\bar{\alpha}) \log_2 \left[ \int d^K\alpha\, \mathcal{P}(\alpha)\, e^{-N D_{\rm KL}(\bar{\alpha} \| \alpha)} \right].   (4.33)

Here S_1^{(a)} is an approximation to S_1 that neglects fluctuations ψ. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence x⃗_1, ..., x⃗_N with the disorder, E_N(α; {x⃗_i}) with the energy of the quenched system, and D_KL(ᾱ‖α) with its annealed analog.

The extensive term S_0, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the N dependence of the partition function

Z(\bar{\alpha}; N) = \int d^K\alpha\, \mathcal{P}(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha} \| \alpha) \right],   (4.34)

and S_1^{(a)}(N) = -⟨log_2 Z(ᾱ; N)⟩_ᾱ is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are KL divergence D away from the target ᾱ,

\rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha)\, \delta\!\left[ D - D_{\rm KL}(\bar{\alpha} \| \alpha) \right].   (4.35)


This density could be very different for different targets.^6 The density of divergences is normalized because the original distribution over parameter space, P(α), is normalized,

\int dD\, \rho(D; \bar{\alpha}) = \int d^K\alpha\, \mathcal{P}(\alpha) = 1.   (4.36)

Finally, the partition function takes the simple form

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-N D].   (4.37)

We recall that in statistical mechanics, the partition function is given by

Z(\beta) = \int dE\, \rho(E) \exp[-\beta E],   (4.38)

where ρ(E) is the density of states that have energy E, and β is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length N of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of ρ(D; ᾱ) as D → 0.^7

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small D, and the density term, which favors values of D that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with D → 0. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states ρ(D; ᾱ) at small D is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

D_{\rm KL}(\bar{\alpha} \| \alpha) \approx \frac{1}{2} \sum_{\mu\nu} \left( \bar{\alpha}_\mu - \alpha_\mu \right) \mathcal{F}_{\mu\nu} \left( \bar{\alpha}_\nu - \alpha_\nu \right) + \cdots,   (4.39)

6 If parameter space is compact, then a related description of the space of targets, based on metric entropy, also called Kolmogorov's ε-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of size not greater than ε into which the target space can be partitioned.

7 It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.


where NF is the Fisher information matrix Intuitively if we have a reason-able parameterization of the distributions then similar distributions will benearby in parameter space and more important points that are far apartin parameter space will never correspond to similar distributions Clarkeand Barron (1990) refer to this condition as the parameterization forminga ldquosoundrdquo family of distributions If this condition is obeyed then we canapproximate the low D limit of the density r (DI Nreg)

\rho(D; \bar\alpha) = \int d^K\alpha\, P(\alpha)\, \delta[D - D_{KL}(\bar\alpha\|\alpha)]
  \approx \int d^K\alpha\, P(\alpha)\, \delta\Big[ D - \frac{1}{2} \sum_{\mu\nu} (\bar\alpha_\mu - \alpha_\mu) F_{\mu\nu} (\bar\alpha_\nu - \alpha_\nu) \Big]
  = \int d^K\xi\, P(\bar\alpha + U \cdot \xi)\, \delta\Big[ D - \frac{1}{2} \sum_\mu \Lambda_\mu \xi_\mu^2 \Big],   (4.40)

where U is a matrix that diagonalizes F,

(U^T \cdot F \cdot U)_{\mu\nu} = \Lambda_\mu \delta_{\mu\nu}.   (4.41)

The delta function restricts the components of ξ in equation 4.40 to be of order \sqrt{D} or less, and so if P(α) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

\rho(D \to 0; \bar\alpha) \approx P(\bar\alpha)\, \frac{(2\pi)^{K/2}}{\Gamma(K/2)}\, (\det F)^{-1/2}\, D^{(K-2)/2}.   (4.42)

Here, as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

Z(\bar\alpha; N \to \infty) \approx f(\bar\alpha) \cdot \frac{\Gamma(K/2)}{N^{K/2}},   (4.43)

where f(ᾱ) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

S_1^{(a)}(N) = -\int d^K\bar\alpha\, P(\bar\alpha)\, \log_2 Z(\bar\alpha; N)   (4.44)
  \to \frac{K}{2}\, \log_2 N + \cdots \ \text{(bits)},   (4.45)

where ··· are finite as N → ∞. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even of the prior distribution P(α) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
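
As a concrete check of this logarithmic behavior, here is a minimal sketch (ours; it assumes a one-parameter gaussian family with unit variance and unknown mean, and a gaussian prior of width σ) that evaluates the partition function of equation 4.34 by quadrature; −log₂ Z(ᾱ; N) then grows with the (K/2) log₂ N slope of equation 4.45.

    # Sketch: K = 1 gaussian family Q(x|a) with unit variance and mean a, so that
    # D_KL(abar||a) = (abar - a)^2 / 2, with a gaussian prior on a.
    import numpy as np

    sigma, abar = 2.0, 0.7                   # assumed prior width and target
    a = np.linspace(-30, 30, 600_001)
    prior = np.exp(-a**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

    for N in [10, 100, 1_000, 10_000]:
        Z = np.trapz(prior * np.exp(-N * 0.5 * (a - abar)**2), a)
        print(N, -np.log2(Z), 0.5 * np.log2(N))   # same slope; here K = 1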

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension, d_VC), which measures capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of F is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < ∞ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small D_KL behavior of the model density,

\rho(D \to 0; \bar\alpha) \propto D^{(d-2)/2},   (4.46)

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density ρ(D; ᾱ) accumulates a macroscopic weight at some nonzero value of D, so that the small D behavior is irrelevant for the large N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of Ipred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N ≫ d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.
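
The exponent in equation 4.46 can be read off numerically as well. The following sketch (our construction; the gaussian family, flat prior, and binning are arbitrary choices) samples parameters from the prior, histograms the KL divergence to a fixed target, and fits the small-D slope of log ρ(D); the fit is close to (d − 2)/2 with d equal to the number of parameters.

    # Sketch: for a K-parameter gaussian family with unknown mean vector and
    # unit covariance, D_KL(abar||a) = |abar - a|^2 / 2. Sampling a from a flat
    # prior and histogramming D near zero should give rho(D) ~ D^((d-2)/2), d = K.
    import numpy as np

    rng = np.random.default_rng(0)
    for K in [1, 4]:
        a = rng.uniform(-1, 1, size=(2_000_000, K))   # flat prior over a box
        D = 0.5 * np.sum(a**2, axis=1)                # target model at the origin
        hist, edges = np.histogram(D, bins=np.linspace(1e-3, 0.2, 21), density=True)
        centers = 0.5 * (edges[1:] + edges[:-1])
        slope = np.polyfit(np.log(centers), np.log(hist), 1)[0]
        print(K, round(slope, 2), (K - 2) / 2.0)      # fitted vs. predicted exponent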

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution Q(\vec{x}_1, \ldots, \vec{x}_N | \alpha). Again, α is an unknown parameter vector chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log_2 N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S[\{\vec{x}_i\} \,|\, \alpha] \equiv -\int d^N\vec{x}\; Q(\{\vec{x}_i\}|\alpha)\, \log_2 Q(\{\vec{x}_i\}|\alpha) \to N S_0 + S_0^*, \qquad S_0^* = O(1);   (4.47)

D_{KL}\big[ Q(\{\vec{x}_i\}|\bar\alpha) \,\|\, Q(\{\vec{x}_i\}|\alpha) \big] \to N D_{KL}(\bar\alpha\|\alpha) + o(N).   (4.48)

Here the quantities S_0, S_0^*, and D_KL(ᾱ‖α) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data; all of the long-range predictability is associated with learning the parameters.^8 The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result:

S(N) = S_0 \cdot N + S_1^{(a)}(N),   (4.49)

S_1^{(a)}(N) = \frac{K}{2}\, \log_2 N + \cdots \ \text{(bits)},   (4.50)

where ··· stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(\vec{x}_1, \ldots, \vec{x}_N | \alpha). Suppose, for example, that S_0^* = (K^*/2) \log_2 N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2}\, \log_2 N + \cdots \ \text{(bits)}.   (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters there are no long-range correlations (K ≠ 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.^9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \big[ d\vec{x}_j\, Q(\vec{x}_j|\bar\alpha) \big]
  \times \log_2 \Bigg[ \prod_{i=1}^{N} Q(\vec{x}_i|\bar\alpha) \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})} \Bigg].   (4.52)

This expression can be decomposed into the terms identified above plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)} = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \big[ d\vec{x}_j\, Q(\vec{x}_j|\bar\alpha) \big]
  \times \log_2 \Bigg[ \int \frac{d^K\alpha\, P(\alpha)}{Z(\bar\alpha; N)}\, e^{-N D_{KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})} \Bigg],   (4.54)

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) \le N \int d^K\alpha \int d^K\bar\alpha\, P(\alpha)\, P(\bar\alpha)\, D_{KL}(\bar\alpha\|\alpha) \equiv N\, \langle D_{KL}(\bar\alpha\|\alpha) \rangle_{\bar\alpha\alpha},   (4.55)

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence or of similar quantities is a usual constraint in learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that \langle D_{KL}(\bar\alpha\|\alpha) \rangle_{\bar\alpha\alpha} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(\vec{x} | α)} is finite. Then for any reasonable function F(\vec{x}; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{ \sup_\beta \frac{\Big| \frac{1}{N}\sum_j F(\vec{x}_j;\beta) - \int d\vec{x}\, Q(\vec{x}|\bar\alpha)\, F(\vec{x};\beta) \Big|}{L[F]} > \epsilon \right\} < M(\epsilon, N, d_{VC})\, e^{-c N \epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality, in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review, see Vapnik, 1998, and the references therein).
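
To make the character of equation 4.56 concrete, here is a small simulation (ours, not from the paper): for the VC-dimension-one family of indicator functions F(x; β) = θ(β − x) with Q uniform on [0, 1], the supremum over β of the deviation between empirical and true means is the Kolmogorov-Smirnov statistic, and the probability that it exceeds ε falls off essentially as exp(−2Nε²), the Dvoretzky-Kiefer-Wolfowitz form of such bounds.

    # Sketch: empirical tail probability of the uniform deviation for a d_VC = 1
    # family, compared with the exp(-2 N eps^2) shape of the DKW/Chernoff bound.
    import numpy as np

    rng = np.random.default_rng(1)
    eps, trials = 0.05, 2_000

    def tail_probability(N):
        hits = 0
        for _ in range(trials):
            x = np.sort(rng.uniform(0, 1, N))
            i = np.arange(1, N + 1)
            dev = np.max(np.maximum(i / N - x, x - (i - 1) / N))   # KS statistic
            hits += dev > eps
        return hits / trials

    for N in [100, 400, 1_600]:
        print(N, tail_probability(N), np.exp(-2 * N * eps**2))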

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {\vec{x}_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/\sqrt{N} (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {\vec{x}_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus the worst-case contribution of these atypical sequences is

S_1^{(f),\,\mathrm{atypical}} \sim \int_\delta^\infty d\epsilon\, N^2 \epsilon^2\, M(\epsilon)\, e^{-cN\epsilon^2} \sim e^{-cN\delta^2} \ll 1 \quad \text{for large } N.   (4.57)


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and, throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\,\mathrm{typical}} \sim \int d^K\bar\alpha\, P(\bar\alpha)\, d^N\vec{x} \prod_j Q(\vec{x}_j|\bar\alpha)\; N\, (B \cdot A^{-1} \cdot B);
B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i|\bar\alpha)}{\partial \bar\alpha_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0;
(A)_{\mu\nu} = -\frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar\alpha)}{\partial \bar\alpha_\mu\, \partial \bar\alpha_\nu}, \qquad \langle A \rangle_{\vec{x}} = F.   (4.58)

Here ⟨···⟩_{\vec{x}} means an averaging with respect to all \vec{x}_i's keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i|\bar\alpha)}{\partial \bar\alpha_\mu}\; (F^{-1})_{\mu\nu}\; \frac{\partial \log Q(\vec{x}_j|\bar\alpha)}{\partial \bar\alpha_\nu}   (4.59)

over all \vec{x}_i's. Only the terms with i = j survive the averaging. There are K^2 N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ ··· can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x - \alpha)^2 \right], \quad x \in (-\infty; +\infty);   (4.60)

Q(x|\alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \quad x \in [0; 2\pi).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < α_max and zero elsewhere, with α_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin αx. Adding a new data point would not help, until that best but wrong parameter estimate is less than α_max.^10 So the fluctuations are large, and the predictive information is small in this case.

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for α_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
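
The contrast between these two examples is easy to see in a simulation. The sketch below is our illustration (the true parameter, box size, grids, and sample sizes are arbitrary): it fits the d_VC = ∞ model of equation 4.61 by maximum likelihood over a wide box 0 < α ≤ α_max. For small N the best-fit α typically lands far from the true value, exploiting high-frequency "crests," and only for larger N does the estimate lock onto the truth.

    # Sketch: maximum likelihood for Q(x|a) = exp(-sin(a x)) / Z(a) on [0, 2*pi).
    import numpy as np

    rng = np.random.default_rng(2)
    a_true, a_max = 1.0, 200.0
    xq = np.linspace(0, 2 * np.pi, 4001)               # quadrature grid

    def logZ(a):                                       # log normalization constant
        return np.log(np.trapz(np.exp(-np.sin(a * xq)), xq))

    def sample(N):                                     # rejection sampling from Q(x|a_true)
        out = np.empty(0)
        while out.size < N:
            x = rng.uniform(0, 2 * np.pi, 4 * N)
            keep = rng.uniform(0, 1, 4 * N) < np.exp(-np.sin(a_true * x) - 1.0)
            out = np.concatenate([out, x[keep]])
        return out[:N]

    agrid = np.linspace(0.05, a_max, 8000)
    lZ = np.array([logZ(a) for a in agrid])
    for N in [5, 20, 100, 1000]:
        x = sample(N)
        loglik = np.array([-np.sin(a * x).sum() for a in agrid]) - N * lZ
        print(N, round(agrid[np.argmax(loglik)], 2))   # compare with a_true = 1.0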

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ~ D^{(d-2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right], \quad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar\alpha; N) = \int dD\, \rho(D; \bar\alpha)\, \exp[-ND]
  \approx A(\bar\alpha) \int dD\, \exp\left[ -\frac{B(\bar\alpha)}{D^\mu} - ND \right]
  \approx \tilde{A}(\bar\alpha)\, \exp\left[ -\frac{1}{2}\, \frac{\mu+2}{\mu+1}\, \ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar\alpha) = A(\bar\alpha) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} [B(\bar\alpha)]^{1/(2\mu+2)},   (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N:

S_1^{(a)}(N) \to \frac{1}{\ln 2}\, \langle C(\bar\alpha) \rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \ \text{(bits)},   (4.66)

where ⟨···⟩_ᾱ denotes an average over all the target models.
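
A quick numerical check of this saddle-point result is possible (our sketch; the values of μ and B are arbitrary, and ρ is taken without the prefactor A): compare −ln Z, computed by quadrature as in equation 4.63, with C N^{μ/(μ+1)} from equation 4.65. The two agree up to the subleading logarithmic corrections in equation 4.63.

    # Sketch: rho(D) = exp(-B / D^mu); Z(N) by quadrature vs. the saddle-point
    # prediction -ln Z ~ C * N^(mu/(mu+1)), with C from eq. 4.65.
    import numpy as np

    mu, B = 1.0, 0.5
    D = np.geomspace(1e-7, 50.0, 1_000_000)
    rho = np.exp(-B / D**mu)
    C = B**(1 / (mu + 1)) * (mu**(-mu / (mu + 1)) + mu**(1 / (mu + 1)))

    for N in [100, 1_000, 10_000, 100_000]:
        Z = np.trapz(rho * np.exp(-N * D), D)
        print(N, -np.log(Z), C * N**(mu / (mu + 1)))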


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ~ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ~ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ~ K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ~ N^{ν-1} bits. Summing this up over all samples, we find S_1^{(a)} ~ N^ν, and if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing numbers of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
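
The counting argument in the preceding paragraph can be spelled out in a few lines (a sketch under the stated assumption K(n) = n^ν for the number of bins): summing the ≈ K(n)/(2n) bits carried by the nth sample gives a total that scales as N^ν, not as N^ν log N.

    # Sketch of the counting argument: accumulate K(n)/(2n) bits per sample with
    # K(n) = n**nu bins; the running total grows as N**nu.
    import numpy as np

    nu = 0.5                                       # assumed growth exponent
    n = np.arange(1, 10**6 + 1)
    total = np.cumsum(n**nu / (2.0 * n))
    for N in [10**3, 10**4, 10**5, 10**6]:
        print(N, total[N - 1], N**nu / (2 * nu))   # both ~ N^nu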

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l_0) \exp[-\phi(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{\mathcal{Z}} \exp\left[ -\frac{l}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^2 \right] \delta\left[ \frac{1}{l_0} \int dx\, e^{-\phi(x)} - 1 \right],   (4.67)

where \mathcal{Z} is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and l sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, \ldots, x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.^11

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l_0) \exp[-\phi(x)], we can write the KL divergence as

D_{KL}[\bar\phi(x) \| \phi(x)] = \frac{1}{l_0} \int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)].   (4.68)

We want to compute the density

\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( D - D_{KL}[\bar\phi(x)\|\phi(x)] \big)   (4.69)
  = M \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( MD - M D_{KL}[\bar\phi(x)\|\phi(x)] \big),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar\phi) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{KL})   (4.71)
  = M \int \frac{dz}{2\pi}\, \exp\left( izMD + \frac{izM}{l_0} \int dx\, \bar\phi(x)\, e^{-\bar\phi(x)} \right)
  \times \int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x)\, e^{-\bar\phi(x)} \right).   (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point.

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


The result is that

\int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x)\, e^{-\bar\phi(x)} \right)
  = \exp\left( -\frac{izM}{l_0} \int dx\, \bar\phi(x)\, e^{-\bar\phi(x)} - S_{\mathrm{eff}}[\bar\phi(x); zM] \right),   (4.73)

S_{\mathrm{eff}}[\bar\phi; zM] = \frac{l}{2} \int dx \left( \frac{\partial\bar\phi}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{l\, l_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and M D^{3/2} → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2}\, \exp\left( -\frac{B[\bar\phi(x)]}{D} \right),   (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\, l\, l_0}}\, \exp\left[ -\frac{l}{2} \int dx\, (\partial_x\bar\phi)^2 \right] \int dx\, \exp[-\bar\phi(x)/2],   (4.76)

B[\bar\phi(x)] = \frac{1}{16\, l\, l_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^2.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \frac{1}{\sqrt{l\, l_0}} \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi}\, N^{1/2},   (4.78)

where ⟨···⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \sqrt{N} \left( \frac{L}{l} \right)^{1/2} \ \text{bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size \sqrt{l/(N Q(x))} (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ~\sqrt{N} bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities, n ~ \sqrt{N}, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter l describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of l (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing l to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying \sqrt{N} in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). Real calculations for these cases, however, remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.^12 Rissanen (1984) and Clarke and Barron (1990), in full generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed.

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, \ldots, x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, \ldots, x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, \ldots, x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure the complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.^13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.^14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;^15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.
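
A two-variable caricature makes this split explicit (our example, not the paper's): for a jointly gaussian "past" and "future," rescaling the variables changes each differential entropy by the logarithm of the scale factor, while the mutual information computed from the same covariance matrix is untouched.

    # Sketch: entropy is coordinate dependent, mutual information is not.
    import numpy as np

    def entropy_bits(var):                  # differential entropy of a 1-d gaussian
        return 0.5 * np.log2(2 * np.pi * np.e * var)

    def mi_bits(S):                         # MI of a bivariate gaussian, covariance S
        return 0.5 * np.log2(S[0, 0] * S[1, 1] / np.linalg.det(S))

    S = np.array([[1.0, 0.8], [0.8, 1.0]])  # "past" and "future", correlation 0.8
    T = np.diag([10.0, 0.1])                # rescale the two variables differently
    S2 = T @ S @ T
    print(entropy_bits(S[0, 0]), entropy_bits(S2[0, 0]))   # entropies shift
    print(mi_bits(S), mi_bits(S2))                         # MI is invariant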

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms?

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms In the context of learning a parameterized model most of theterms in this series depend in detail on our prior distribution in parameterspace which might seem odd for a measure of complexity More gener-ally if we consider transformations of the data stream x(t) that mix pointswithin a temporal window of size t then for T Agrave t the entropy S(T)may have subextensive terms that are constant and these are not invariantunder this class of transformations On the other hand if there are diver-gent subextensive terms these are invariant under such temporally localtransformations16 So if we insist that measures of complexity be invariantnot only under instantaneous coordinate transformations but also undertemporally local transformations then we can discard both the extensiveand the nite subextensive terms in the entropy leaving only the divergentsubextensive terms as a possible measure of complexity

An interesting example of these arguments is provided by the statisti-cal mechanics of polymers It is conventional to make models of polymersas random walks on a lattice with various interactions or self-avoidanceconstraints among different elements of the polymer chain If we count thenumber N of walks with N steps we nd that N (N) raquo ANc zN (de Gennes1979) Now the entropy is the logarithm of the number of states and so thereis an extensive entropy S0 D log2 z a constant subextensive entropy log2 Aand a divergent subextensive term S1(N) c log2 N Of these three termsonly the divergent subextensive term (related to the critical exponent c ) isuniversal that is independent of the detailed structure of the lattice Thusas in our general argument it is only the divergent subextensive terms inthe entropy that are invariant to changes in our description of the localsmall-scale dynamics

We can recast the invariance arguments in a slightly different form usingthe relative entropy We recall that entropy is dened cleanly only for dis-crete processes and that in the continuum there are ambiguities We wouldlike to write the continuum generalization of the entropy of a process x(t)distributed according to P[x(t)] as

Scont D iexclZ

dx(t) P[x(t)] log2 P[x(t)] (51)

but this is not well dened because we are taking the logarithm of a dimen-sionful quantity Shannon gave the solution to this problem we use as ameasure of information the relative entropy or KL divergence between thedistribution P[x(t)] and some reference distribution Q[x(t)]

Srel D iexclZ

dx(t) P[x(t)] log2

sup3 P[x(t)]Q[x(t)]

acute (52)

16 Throughout this discussion we assume that the signal x at one point in time is nitedimensional There are subtleties if we allow x to represent the conguration of a spatiallyinnite system

2452 W Bialek I Nemenman and N Tishby

which is invariant under changes of our coordinate system on the space ofsignals The cost of this invariance is that we have introduced an arbitrarydistribution Q[x(t)] and so really we have a family of measures We cannd a unique complexity measure within this family by imposing invari-ance principles as above but in this language we must make our measureinvariant to different choices of the reference distribution Q[x(t)]

The reference distribution Q[x(t)] embodies our expectations for the sig-nal x(t) in particular Srel measures the extra space needed to encode signalsdrawn from the distribution P[x(t)] if we use coding strategies that are op-timized for Q[x(t)] If x(t) is a written text two readers who expect differentnumbers of spelling errors will have different Qs but to the extent thatspelling errors can be corrected by reference to the immediate neighboringletters we insist that any measure of complexity be invariant to these dif-ferences in Q On the other hand readers who differ in their expectationsabout the global subject of the text may well disagree about the richness ofa newspaper article This suggests that complexity is a component of therelative entropy that is invariant under some class of local translations andmisspellings

Suppose that we leave aside global expectations and construct our ref-erence distribution Q[x(t)] by allowing only for short-ranged interactionsmdashcertain letters tend to followone another letters form words and so onmdashbutwe bound the range over which these rules are applied Models of this classcannot embody the full structure of most interesting time series (includ-ing language) but in the present context we are not asking for this On thecontrary we are looking for a measure that is invariant to differences inthis short-ranged structure In the terminology of eld theory or statisticalmechanics we are constructing our reference distribution Q[x(t)] from localoperators Because we are considering a one-dimensional signal (the one di-mension being time) distributions constructed from local operators cannothave any phase transitions as a function of parameters Again it is importantthat the signal x at one point in time is nite dimensional The absence ofcritical points means that the entropy of these distributions (or their contri-bution to the relative entropy) consists of an extensive term (proportionalto the time window T) plus a constant subextensive term plus terms thatvanish as T becomes large Thus if we choose different reference distribu-tions within the class constructible from local operators we can change theextensive component of the relative entropy and we can change constantsubextensive terms but the divergent subextensive terms are invariant

To summarize the usual constraints on information measures in thecontinuum produce a family of allowable complexity measures the relativeentropy to an arbitrary reference distribution If we insist that all observerswho choose reference distributions constructed from local operators arriveat the same measure of complexity or if we follow the rst line of argumentspresented above then this measure must be the divergent subextensive com-ponent of the entropy or equivalently the predictive information We have

Predictability Complexity and Learning 2453

seen that this component is connected to learning in a straightforward wayquantifying the amount that can be learned about dynamics that generatethe signal and to measures of complexity that have arisen in statistics anddynamical systems theory

6 Discussion

We have presented predictive information as a characterization of variousdata streams In the context of learning predictive information is relateddirectly to generalization More generally the structure or order in a timeseries or a sequence is related almost by denition to the fact that there ispredictability along the sequence The predictive information measures theamount of such structure but does not exhibit the structure in a concreteform Having collected a data stream of duration T what are the features ofthese data that carry the predictive information Ipred(T) From equation 37we know that most of what we have seen over the time T must be irrelevantto the problem of prediction so that the predictive information is a smallfraction of the total information Can we separate these predictive bits fromthe vast amount of nonpredictive data

The problem of separating predictive from nonpredictive information isa special case of the problem discussed recently (Tishby Pereira amp Bialek1999 Bialek amp Tishby in preparation) given some data x how do we com-press our description of x while preserving as much information as pos-sible about some other variable y Here we identify x D xpast as the pastdata and y D xfuture as the future When we compress xpast into some re-duced description Oxpast we keep a certain amount of information aboutthe past I( OxpastI xpast) and we also preserve a certain amount of informa-tion about the future I( OxpastI xfuture) There is no single correct compressionxpast Oxpast instead there is a one-parameter family of strategies that traceout an optimal curve in the plane dened by these two mutual informationsI( OxpastI xfuture) versus I( OxpastI xpast)

The predictive information preserved by compression must be less thanthe total so that I( OxpastI xfuture) middot Ipred(T) Generically no compression canpreserve all of the predictive information so that the inequality will be strictbut there are interesting special cases where equality can be achieved Ifprediction proceeds by learning a model with a nite number of parameterswe might have a regression formula that species the best estimate of theparameters given the past data Using the regression formula compressesthe data but preserves all of the predictive power In cases like this (moregenerally if there exist sufcient statistics for the prediction problem) wecan ask for the minimal set of Oxpast such that I( OxpastI xfuture) D Ipred(T) Theentropy of this minimal Oxpast is the true measure complexity dened byGrassberger (1986) or the statistical complexity dened by Crutcheld andYoung (1989) (in the framework of the causal states theory a very similarcomment was made recently by Shalizi amp Crutcheld 2000)

2454 W Bialek I Nemenman and N Tishby

In the context of statistical mechanics long-range correlations are charac-terized by computing the correlation functions of order parameters whichare coarse-grained functions of the systemrsquos microscopic variables Whenwe know something about the nature of the orderparameter (eg whether itis a vector or a scalar) then general principles allow a fairly complete classi-cation and description of long-range ordering and the nature of the criticalpoints at which this order can appear or change On the other hand deningthe order parameter itself remains something of an art For a ferromagnetthe order parameter is obtained by local averaging of the microscopic spinswhile for an antiferromagnet one must average the staggered magnetiza-tion to capture the fact that the ordering involves an alternation from site tosite and so on Since the order parameter carries all the information that con-tributes to long-range correlations in space and time it might be possible todene order parameters more generally as those variables that provide themost efcient compression of the predictive information and this should beespecially interesting for complex or disordered systems where the natureof the order is not obvious intuitively a rst try in this direction was madeby Bruder (1998) At critical points the predictive information will divergewith the size of the system and the coefcients of these divergences shouldbe related to the standard scaling dimensions of the order parameters butthe details of this connection need to be worked out

If we compress or extract the predictive information from a time serieswe are in effect discovering ldquofeaturesrdquo that capture the nature of the or-dering in time Learning itself can be seen as an example of this wherewe discover the parameters of an underlying model by trying to compressthe information that one sample of N points provides about the next andin this way we address directly the problem of generalization (Bialek andTishby in preparation) The fact that nonpredictive information is uselessto the organism suggests that one crucial goal of neural information pro-cessing is to separate predictive information from the background Perhapsrather than providing an efcient representation of the current state of theworldmdashas suggested by Attneave (1954) Barlow (1959 1961) and others(Atick 1992)mdashthe nervous system provides an efcient representation ofthe predictive information17 It should be possible to test this directly by

17 If as seems likely the stream of data reaching our senses has diverging predictiveinformation then the space required to write down our description grows and growsas we observe the world for longer periods of time In particular if we can observe fora very long time then the amount that we know about the future will exceed by anarbitrarily large factor the amount that we know about the present Thus representingthe predictive information may require many more neurons than would be required torepresent the current data If we imagine that the goal of primary sensory cortex is torepresent the current state of the sensory world then it is difcult to understand whythese cortices have so many more neurons than they have sensory inputs In the extremecase the region of primary visual cortex devoted to inputs from the fovea has nearly 30000neurons for each photoreceptor cell in the retina (Hawken amp Parker 1991) although much

Predictability Complexity and Learning 2455

studying the encoding of reasonably natural signals and asking if the infor-mation that neural responses provide about the future of the input is closeto the limit set by the statistics of the input itself given that the neuron onlycaptures a certain number of bits about the past Thus we might ask if un-der natural stimulus conditions a motion-sensitive visual neuron capturesfeatures of the motion trajectory that allow for optimal prediction or extrap-olation of that trajectory by using information-theoretic measures we bothtest the ldquoefcient representationrdquo hypothesis directly and avoid arbitraryassumptions about the metric for errors in prediction For more complexsignals such as communication sounds even identifying the features thatcapture the predictive information is an interesting problem

It is natural to ask if these ideas about predictive information could beused to analyze experiments on learning in animals or humans We have em-phasized the problem of learning probability distributions or probabilisticmodels rather than learning deterministic functions associations or rulesIt is known that the nervous system adapts to the statistics of its inputs andthis adaptation is evident in the responses of single neurons (Smirnakis et al1997 Brenner Bialek amp de Ruyter van Steveninck 2000) these experimentsprovide a simple example of the system learning a parameterized distri-bution When making saccadic eye movements human subjects alter theirdistribution of reaction times in relation to the relative probabilities of differ-ent targets as if they had learned an estimate of the relevant likelihoodratios(Carpenter amp Williams 1995) Humans also can learn to discriminate almostoptimally between random sequences (fair coin tosses) and sequences thatare correlated or anticorrelated according to a Markov process this learningcan be accomplished from examples alone with no other feedback (Lopes ampOden 1987) Acquisition of language may require learning the jointdistribu-tion of successive phonemes syllables orwords and there is direct evidencefor learning of conditional probabilities from articial sound sequences byboth infants and adults (Saffran Aslin amp Newport 1996Saffran et al 1999)These examples which are not exhaustive indicate that the nervous systemcan learn an appropriate probabilistic model18 and this offers the oppor-tunity to analyze the dynamics of this learning using information-theoreticmethods What is the entropy of N successive reaction times following aswitch to a new set of relative probabilities in the saccade experiment How

remains to be learned about these cells it is difcult to imagine that the activity of so manyneurons constitutes an efcient representation of the current sensory inputs But if we livein a world where the predictive information in the movies reaching our retina diverges itis perfectly possible that an efcient representation of the predictive information availableto us at one instant requires thousands of times more space than an efcient representationof the image currently falling on our retina

18 As emphasized above many other learning problems including learning a functionfrom noisy examples can be seen as the learning of a probabilistic model Thus we expectthat this description applies to a much wider range of biological learning tasks

2456 W Bialek I Nemenman and N Tishby

much information does a single reaction time provide about the relevantprobabilities Following the arguments above such analysis could lead toa measurement of the universal learning curve L (N)

The learning curve L (N) exhibited by a human observer is limited bythe predictive information in the time series of stimulus trials itself Com-paring L (N) to this limit denes an efciency of learning in the spirit ofthe discussion by Barlow (1983) While it is known that the nervous systemcan make efcient use of available information in signal processing tasks(cf Chapter 4 of Rieke et al 1997) and that it can represent this informationefciently in the spike trains of individual neurons (cf Chapter 3 of Riekeet al 1997 as well as Berry Warland amp Meister 1997 Strong et al 1998and Reinagel amp Reid 2000) it is not known whether the brain is an efcientlearning machine in the analogous sense Given our classication of learn-ing tasks by their complexity it would be natural to ask if the efciency oflearning were a critical function of task complexity Perhaps we can evenidentify a limitbeyond which efcient learning fails indicating a limit to thecomplexity of the internal model used by the brain during a class of learningtasks We believe that our theoretical discussion here at least frames a clearquestion about the complexity of internal models even if for the present wecan only speculate about the outcome of such experiments

An importantresult of our analysis is the characterization of time series orlearning problems beyond the class of nitely parameterizable models thatis the class with power-law-divergent predictive information Qualitativelythis class is more complex than any parametric model no matter how manyparameters there may be because of the more rapid asymptotic growth ofIpred(N) On the other hand with a nite number of observations N theactual amount of predictive information in such a nonparametric problemmay be smaller than in a model with a large but nite number of parametersSpecically if we have two models one with Ipred(N) raquo ANordm and one withK parameters so that Ipred(N) raquo (K2) log2 N the innite parameter modelhas less predictive information for all N smaller than some critical value

Nc raquomicro K

2Aordmlog2

sup3 K2A

acutepara1ordm (61)

In the regime N iquest Nc it is possible to achieve more efcient predictionby trying to learn the (asymptotically) more complex model as illustratedconcretely in simulations of the density estimation problem (Nemenman ampBialek 2001) Even if there are a nite number of parameters such as thenite number of synapses in a small volume of the brain this number maybe so large that we always have N iquest Nc so that it may be more effectiveto think of the many-parameter model as approximating a continuous ornonparametric one

It is tempting to suggest that the regime N iquest Nc is the relevant one formuch of biology If we consider for example 10 mm2 of inferotemporal cor-

Predictability Complexity and Learning 2457

tex devoted to object recognition(Logothetis amp Sheinberg 1996) the numberof synapses is K raquo 5pound109 On the other hand object recognition depends onfoveation and we move our eyes roughly three times per second through-out perhaps 10 years of waking life during which we master the art of objectrecognition This limits us to at most N raquo 109 examples Remembering thatwe must have ordm lt 1 even with large values of A equation 61 suggests thatwe operate with N lt Nc One can make similar arguments about very dif-ferent brains such as the mushroom bodies in insects (Capaldi Robinson ampFahrbach 1999) If this identication of biological learning with the regimeN iquest Nc is correct then the success of learning in animals must depend onstrategies that implement sensible priors over the space of possible models

There is one clear empirical hint that humans can make effective use ofmodels that are beyond nite parameterization (in the sense that predictiveinformation diverges as a power law) and this comes from studies of lan-guage Long ago Shannon (1951) used the knowledge of native speakersto place bounds on the entropy of written English and his strategy madeexplicit use of predictability Shannon showed N-letter sequences to nativespeakers (readers) asked them to guess the next letter and recorded howmany guesses were required before they got the right answer Thus eachletter in the text is turned into a number and the entropy of the distributionof these numbers is an upper bound on the conditional entropy (N) (cfequation 38) Shannon himself thought that the convergence as N becomeslarge was rather quick and quoted an estimate of the extensive entropy perletter S0 Many years later Hilberg (1990) reanalyzed Shannonrsquos data andfound that the approach to extensivity in fact was very slow certainly thereis evidence fora large component S1(N) N12 and this may even dominatethe extensive component for accessible N Ebeling and Poschel (1994 seealso Poschel Ebeling amp Ros Acirce 1995) studied the statistics of letter sequencesin long texts (such as Moby Dick) and found the same strong subextensivecomponent It would be attractive to repeat Shannonrsquos experiments with adesign that emphasizes the detection of subextensive terms at large N19

In summary we believe that our analysis of predictive information solvesthe problem of measuring the complexity of time series This analysis uni-es ideas from learning theory coding theory dynamical systems and sta-tistical mechanics In particular we have focused attention on a class ofprocesses that are qualitatively more complex than those treated in conven-

19 Associated with the slow approach to extensivity is a large mutual informationbetween words or characters separated by long distances and several groups have foundthat this mutual information declines as a power law Cover and King (1978) criticize suchobservations bynoting that this behavior is impossible in Markovchains of arbitraryorderWhile it is possible that existing mutual information data have not reached asymptotia thecriticism of Cover and King misses the possibility that language is not a Markov processOf course it cannot be Markovian if it has a power-law divergence in the predictiveinformation

2458 W Bialek I Nemenman and N Tishby

tional learning theory and there are several reasons to think that this classincludes many examples of relevance to biology and cognition

Acknowledgments

We thank V Balasubramanian A Bell S Bruder C Callan A FairhallG Garcia de Polavieja Embid R Koberle A Libchaber A MelikidzeA Mikhailov O Motrunich M Opper R Rumiati R de Ruyter van Steven-inck N Slonim T Spencer S Still S Strong and A Treves for many helpfuldiscussions We also thank M Nemenman and R Rubin for help with thenumerical simulations and an anonymous referee for pointing out yet moreopportunities to connect our work with the earlier literature Our collabo-ration was aided in part by a grant from the US-Israel Binational ScienceFoundation to the Hebrew University of Jerusalem and work at Princetonwas supported in part by funds from NEC

References

Where available we give references to the Los Alamos endashprint archive Pa-pers may be retrieved online from the web site httpxxxlanlgovabswhere is the reference for example Adami and Cerf (2000) is found athttpxxxlanlgovabsadap-org9605002For preprints this is a primary ref-erence for published papers there may be differences between the publishedand electronic versions

Abarbanel H D I Brown R Sidorowich J J amp Tsimring L S (1993) Theanalysis of observed chaotic data in physical systems Revs Mod Phys 651331ndash1392

Adami C amp Cerf N J (2000) Physical complexity of symbolic sequencesPhysica D 137 62ndash69 See also adap-org9605002

Aida T (1999) Field theoretical analysis of onndashline learning of probability dis-tributions Phys Rev Lett 83 3554ndash3557 See also cond-mat9911474

Akaike H (1974a) Information theory and an extension of the maximum likeli-hood principle In B Petrov amp F Csaki (Eds) Second International Symposiumof Information Theory Budapest Akademia Kiedo

Akaike H (1974b)A new look at the statistical model identication IEEE TransAutomatic Control 19 716ndash723

Atick J J (1992) Could information theory provide an ecological theory ofsensory processing InW Bialek (Ed)Princeton lecturesonbiophysics(pp223ndash289) Singapore World Scientic

Attneave F (1954)Some informational aspects of visual perception Psych Rev61 183ndash193

Balasubramanian V (1997) Statistical inference Occamrsquos razor and statisticalmechanics on the space of probability distributions Neural Comp 9 349ndash368See also cond-mat9601030

Barlow H B (1959) Sensory mechanisms the reduction of redundancy andintelligence In D V Blake amp A M Uttley (Eds) Proceedings of the Sympo-

Predictability Complexity and Learning 2459

sium on the Mechanization of Thought Processes (Vol 2 pp 537ndash574) LondonH M Stationery Ofce

Barlow H B (1961) Possible principles underlying the transformation of sen-sory messages In W Rosenblith (Ed) Sensory Communication (pp 217ndash234)Cambridge MA MIT Press

Barlow H B (1983) Intelligence guesswork language Nature 304 207ndash209Barron A amp Cover T (1991) Minimum complexity density estimation IEEE

Trans Inf Thy 37 1034ndash1054Bennett C H (1985) Dissipation information computational complexity and

the denition of organization In D Pines (Ed)Emerging syntheses in science(pp 215ndash233) Reading MA AddisonndashWesley

Bennett C H (1990) How to dene complexity in physics and why InW H Zurek (Ed) Complexity entropy and the physics of information (pp 137ndash148) Redwood City CA AddisonndashWesley

Berry II M J Warland D K amp Meister M (1997) The structure and precisionof retinal spike trains Proc Natl Acad Sci (USA) 94 5411ndash5416

Bialek W (1995) Predictive information and the complexity of time series (TechNote) Princeton NJ NEC Research Institute

Bialek W Callan C G amp Strong S P (1996) Field theories for learningprobability distributions Phys Rev Lett 77 4693ndash4697 See also cond-mat9607180

Bialek W amp Tishby N (1999)Predictive information Preprint Available at cond-mat9902341

Bialek W amp Tishby N (in preparation) Extracting relevant informationBrenner N Bialek W amp de Ruyter vanSteveninck R (2000)Adaptive rescaling

optimizes information transmission Neuron 26 695ndash702Bruder S D (1998)Predictiveinformationin spiketrains fromtheblowy and monkey

visual systems Unpublished doctoral dissertation Princeton UniversityBrunel N amp Nadal P (1998) Mutual information Fisher information and

population coding Neural Comp 10 1731ndash1757Capaldi E A Robinson G E amp Fahrbach S E (1999) Neuroethology

of spatial learning The birds and the bees Annu Rev Psychol 50 651ndash682

Carpenter R H S amp Williams M L L (1995) Neural computation of loglikelihood in control of saccadic eye movements Nature 377 59ndash62

Cesa-Bianchi N amp Lugosi G (in press) Worst-case bounds for the logarithmicloss of predictors Machine Learning

Chaitin G J (1975) A theory of program size formally identical to informationtheory J Assoc Comp Mach 22 329ndash340

Clarke B S amp Barron A R (1990) Information-theoretic asymptotics of Bayesmethods IEEE Trans Inf Thy 36 453ndash471

Cover T M amp King R C (1978)A convergent gambling estimate of the entropyof English IEEE Trans Inf Thy 24 413ndash421

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Crutcheld J P amp Feldman D P (1997) Statistical complexity of simple 1-Dspin systems Phys Rev E 55 1239ndash1243R See also cond-mat9702191

2460 W Bialek I Nemenman and N Tishby

Crutcheld J P Feldman D P amp Shalizi C R (2000) Comment I onldquoSimple measure for complexityrdquo Phys Rev E 62 2996ndash2997 See also chao-dyn9907001

Crutcheld J P amp Shalizi C R (1999)Thermodynamic depth of causal statesObjective complexity via minimal representation Phys Rev E 59 275ndash283See also cond-mat9808147

Crutcheld J P amp Young K (1989) Inferring statistical complexity Phys RevLett 63 105ndash108

Eagleman D M amp Sejnowski T J (2000) Motion integration and postdictionin visual awareness Science 287 2036ndash2038

Ebeling W amp Poschel T (1994)Entropy and long-range correlations in literaryEnglish Europhys Lett 26 241ndash246

Feldman D P amp Crutcheld J P (1998) Measures of statistical complexityWhy Phys Lett A 238 244ndash252 See also cond-mat9708186

Gell-Mann M amp Lloyd S (1996) Information measures effective complexityand total information Complexity 2 44ndash52

de Gennes PndashG (1979) Scaling concepts in polymer physics Ithaca NY CornellUniversity Press

Grassberger P (1986) Toward a quantitative theory of self-generated complex-ity Int J Theor Phys 25 907ndash938

Grassberger P (1991) Information and complexity measures in dynamical sys-tems In H Atmanspacher amp H Schreingraber (Eds) Information dynamics(pp 15ndash33) New York Plenum Press

Hall P amp Hannan E (1988)On stochastic complexity and nonparametric den-sity estimation Biometrica 75 705ndash714

Haussler D Kearns M amp Schapire R (1994)Bounds on the sample complex-ity of Bayesian learning using information theory and the VC dimensionMachine Learning 14 83ndash114

Haussler D Kearns M Seung S amp Tishby N (1996)Rigorous learning curvebounds from statistical mechanics Machine Learning 25 195ndash236

Haussler D amp Opper M (1995) General bounds on the mutual informationbetween a parameter and n conditionally independent events In Proceedingsof the Eighth Annual Conference on ComputationalLearning Theory (pp 402ndash411)New York ACM Press

Haussler D amp Opper M (1997) Mutual information metric entropy and cu-mulative relative entropy risk Ann Statist 25 2451ndash2492

Haussler D amp Opper M (1998) Worst case prediction over sequences un-der log loss In G Cybenko D OrsquoLeary amp J Rissanen (Eds) The mathe-matics of information coding extraction and distribution New York Springer-Verlag

Hawken M J amp Parker A J (1991)Spatial receptive eld organization in mon-key V1 and its relationship to the cone mosaic InM S Landy amp J A Movshon(Eds) Computational models of visual processing (pp 83ndash93) Cambridge MAMIT Press

Herschkowitz D amp Nadal JndashP (1999)Unsupervised and supervised learningMutual information between parameters and observations Phys Rev E 593344ndash3360

Predictability Complexity and Learning 2461

Hilberg W (1990) The well-known lower bound of information in written lan-guage Is it a misinterpretation of Shannon experiments Frequenz 44 243ndash248 (in German)

Holy T E (1997) Analysis of data from continuous probability distributionsPhys Rev Lett 79 3545ndash3548 See also physics9706015

Ibragimov I amp Hasminskii R (1972) On an information in a sample about aparameter In Second International Symposium on Information Theory (pp 295ndash309) New York IEEE Press

Kang K amp Sompolinsky H (2001) Mutual information of population codes anddistancemeasures in probabilityspace Preprint Available at cond-mat0101161

Kemeney J G (1953)The use of simplicity in induction Philos Rev 62 391ndash315Kolmogoroff A N (1939) Sur lrsquointerpolation et extrapolations des suites sta-

tionnaires C R Acad Sci Paris 208 2043ndash2045Kolmogorov A N (1941) Interpolation and extrapolation of stationary random

sequences Izv Akad Nauk SSSR Ser Mat 5 3ndash14 (in Russian) Translation inA N Shiryagev (Ed) Selectedworks of A N Kolmogorov (Vol 23 pp 272ndash280)Norwell MA Kluwer

Kolmogorov A N (1965) Three approaches to the quantitative denition ofinformation Prob Inf Trans 1 4ndash7

Li M amp Vit Acircanyi P (1993) An Introduction to Kolmogorov complexity and its appli-cations New York Springer-Verlag

Lloyd S amp Pagels H (1988)Complexity as thermodynamic depth Ann Phys188 186ndash213

Logothetis N K amp Sheinberg D L (1996) Visual object recognition AnnuRev Neurosci 19 577ndash621

Lopes L L amp Oden G C (1987) Distinguishing between random and non-random events J Exp Psych Learning Memory and Cognition 13 392ndash400

Lopez-Ruiz R Mancini H L amp Calbet X (1995) A statistical measure ofcomplexity Phys Lett A 209 321ndash326

MacKay D J C (1992) Bayesian interpolation Neural Comp 4 415ndash447Nemenman I (2000) Informationtheory and learningA physicalapproach Unpub-

lished doctoral dissertation Princeton University See also physics0009032Nemenman I amp Bialek W (2001) Learning continuous distributions Simu-

lations with eld theoretic priors In T K Leen T G Dietterich amp V Tresp(Eds) Advances in neural information processing systems 13 (pp 287ndash295)MITPress Cambridge MA MIT Press See also cond-mat0009165

Opper M (1994) Learning and generalization in a two-layer neural networkThe role of the VapnikndashChervonenkis dimension Phys Rev Lett 72 2113ndash2116

Opper M amp Haussler D (1995) Bounds for predictive errors in the statisticalmechanics of supervised learning Phys Rev Lett 75 3772ndash3775

Periwal V (1997)Reparametrization invariant statistical inference and gravityPhys Rev Lett 78 4671ndash4674 See also hep-th9703135

Periwal V (1998) Geometrical statistical inference Nucl Phys B 554[FS] 719ndash730 See also adap-org9801001

Poschel T Ebeling W amp Ros Acirce H (1995) Guessing probability distributionsfrom small samples J Stat Phys 80 1443ndash1452

2462 W Bialek I Nemenman and N Tishby

Reinagel P amp Reid R C (2000) Temporal coding of visual information in thethalamus J Neurosci 20 5392ndash5400

Renyi A (1964) On the amount of information concerning an unknown pa-rameter in a sequence of observations Publ Math Inst Hungar Acad Sci 9617ndash625

Rieke F Warland D de Ruyter van Steveninck R amp Bialek W (1997) SpikesExploring the Neural Code (MIT Press Cambridge)

Rissanen J (1978) Modeling by shortest data description Automatica 14 465ndash471

Rissanen J (1984) Universal coding information prediction and estimationIEEE Trans Inf Thy 30 629ndash636

Rissanen J (1986) Stochastic complexity and modeling Ann Statist 14 1080ndash1100

Rissanen J (1987)Stochastic complexity J Roy Stat Soc B 49 223ndash239253ndash265Rissanen J (1989) Stochastic complexity and statistical inquiry Singapore World

ScienticRissanen J (1996) Fisher information and stochastic complexity IEEE Trans

Inf Thy 42 40ndash47Rissanen J Speed T amp Yu B (1992) Density estimation by stochastic com-

plexity IEEE Trans Inf Thy 38 315ndash323Saffran J R Aslin R N amp Newport E L (1996) Statistical learning by 8-

month-old infants Science 274 1926ndash1928Saffran J R Johnson E K Aslin R H amp Newport E L (1999) Statistical

learning of tone sequences by human infants and adults Cognition 70 27ndash52

Schmidt D M (2000) Continuous probability distributions from nite dataPhys Rev E 61 1052ndash1055

Seung H S Sompolinsky H amp Tishby N (1992)Statistical mechanics of learn-ing from examples Phys Rev A 45 6056ndash6091

Shalizi C R amp Crutcheld J P (1999) Computational mechanics Pattern andprediction structure and simplicity Preprint Available at cond-mat9907176

Shalizi C R amp Crutcheld J P (2000) Information bottleneck causal states andstatistical relevance bases How to represent relevant information in memorylesstransduction Available at nlinAO0006025

Shannon C E (1948) A mathematical theory of communication Bell Sys TechJ 27 379ndash423 623ndash656 Reprinted in C E Shannon amp W Weaver The mathe-matical theory of communication Urbana University of Illinois Press 1949

Shannon C E (1951) Prediction and entropy of printed English Bell Sys TechJ 30 50ndash64 Reprinted in N J A Sloane amp A D Wyner (Eds) Claude ElwoodShannon Collected papers New York IEEE Press 1993

Shiner J Davison M amp Landsberger P (1999) Simple measure for complexityPhys Rev E 59 1459ndash1464

Smirnakis S Berry III M J Warland D K Bialek W amp Meister M (1997)Adaptation of retinal processing to image contrast and spatial scale Nature386 69ndash73

Sole R V amp Luque B (1999) Statistical measures of complexity for strongly inter-acting systems Preprint Available at adap-org9909002

Predictability Complexity and Learning 2463

Solomonoff R J (1964) A formal theory of inductive inference Inform andControl 7 1ndash22 224ndash254

Strong S P Koberle R de Ruyter van Steveninck R amp Bialek W (1998)Entropy and information in neural spike trains Phys Rev Lett 80 197ndash200

Tishby N Pereira F amp Bialek W (1999) The information bottleneck methodIn B Hajek amp R S Sreenivas (Eds) Proceedings of the 37th Annual AllertonConference on Communication Control and Computing (pp 368ndash377) UrbanaUniversity of Illinois See also physics0004057

Vapnik V (1998) Statistical learning theory New York WileyVit Acircanyi P amp Li M (2000)Minimum description length induction Bayesianism

and Kolmogorov complexity IEEE Trans Inf Thy 46 446ndash464 See alsocsLG9901014

WallaceC Samp Boulton D M (1968)An information measure forclassicationComp J 11 185ndash195

Weigend A Samp Gershenfeld NA eds (1994)Time seriespredictionForecastingthe future and understanding the past Reading MA AddisonndashWesley

Wiener N (1949) Extrapolation interpolation and smoothing of time series NewYork Wiley

Wolfram S (1984) Computation theory of cellular automata Commun MathPhysics 96 15ndash57

Wong W amp Shen X (1995) Probability inequalities for likelihood ratios andconvergence rates of sieve MLErsquos Ann Statist 23 339ndash362

Received October 11 2000 accepted January 31 2001

Page 13: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as

Predictability Complexity and Learning 2421

the joint distribution of the N data points as a product

P(x1 x2 xN) D P(x1)P(x2 |x1)P(x3 |x2 x1) cent cent cent (313)

For Markov processes what we observe at tn depends only on events at theprevious time step tniexcl1 so that

P(xn |fx1middotimiddotniexcl1g) D P(xn |xniexcl1) (314)

and hence the predictive information reduces to

Ipred Dfrac12log2

microP(xn |xniexcl1)P(xn)

parafrac34 (315)

The maximum possible predictive information in this case is the entropy ofthe distribution of states at one time step which is bounded by the loga-rithm of the number of accessible states To approach this bound the systemmust maintain memory for a long time since the predictive information isreduced by the entropy of the transition probabilities Thus systems withmore states and longer memories have larger values of Ipred

More interesting are those cases in which Ipred(T) diverges at large T Inphysical systems we know that there are critical points where correlationtimes become innite so that optimal predictions will be inuenced byevents in the arbitrarily distant past Under these conditions the predictiveinformation can grow without bound as T becomes large for many systemsthe divergence is logarithmic Ipred(T 1) ln T as for the variable Jijshort-range Ising model of Figures 2 and 3 Long-range correlations alsoare important in a time series where we can learn some underlying rulesIt will turn out that when the set of possible rules can be described bya nite number of parameters the predictive information again divergeslogarithmically and the coefcient of this divergence counts the number ofparameters Finally a faster growth is also possible so that Ipred(T 1) Ta as for the variable Jij long-range Ising model and we shall see that thisbehavior emerges from for example nonparametric learning problems

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations orassociations persist over long periods of time In the usual theoretical mod-els there is some rule underlying the observable data and this rule is validforever examples seen at one time inform us about the rule and this infor-mation can be used to make predictions or generalizations The predictiveinformation quanties the average generalization power of examples andwe shall see that there is a direct connection between the predictive infor-mation and the complexity of the possible underlying rules

2422 W Bialek I Nemenman and N Tishby

41 A Test Case Let us begin with a simple example already referred toabove We observe two streams of data x and y or equivalently a stream ofpairs (x1 y1) (x2 y2) (xN yN ) Assume that we know in advance thatthe xrsquos are drawn independently and at random from a distribution P(x)while the yrsquos are noisy versions of some function acting on x

yn D f (xnI reg) C gn (41)

where f (xI reg) is a class of functions parameterized by reg and gn is noisewhich for simplicity we will assume is gaussian with known standard devi-ation s We can even start with a very simple case where the function classis just a linear combination of basis functions so that

f (xI reg) DKX

m D1am wm (x) (42)

The usual problem is to estimate from N pairs fxi yig the values of theparameters reg In favorable cases such as this we might even be able tond an effective regression formula We are interested in evaluating thepredictive information which means that we need to know the entropyS(N) We go through the calculation in some detail because it provides amodel for the more general case

To evaluate the entropy S(N) we rst construct the probability distribu-tion P(x1 y1 x2 y2 xN yN) The same set of rules applies to the wholedata stream which here means that the same parameters reg apply for allpairs fxi yig but these parameters are chosen at random from a distributionP (reg) at the start of the stream Thus we write

P(x1 y1 x2 y2 xN yN)

DZ

dKaP(x1 y1 x2 y2 xN yN |reg)P (reg) (43)

and now we need to construct the conditional distributions for xed reg Byhypothesis each x is chosen independently and once we x reg each yi iscorrelated only with the corresponding xi so that we have

P(x1 y1 x2 y2 xN yN |reg) DNY

iD1

poundP(xi) P(yi | xiI reg)

curren (44)

Furthermore with the simple assumptions above about the class of func-tions and gaussian noise the conditional distribution of yi has the form

P(yi | xiI reg) D1

p2p s2

exp

2

4iexcl1

2s2

sup3yi iexcl

KX

m D1am wm (xi)

23

5 (45)

Predictability Complexity and Learning 2423

Putting all these factors together

P(x1 y1 x2 y2 xN yN )

D

NY

iD1P(xi)

sup3 1p

2p s2

acuteN ZdK

a P (reg) exp

iexcl

12s2

NX

iD1y2

i

pound exp

iexclN

2

KX

m ordmD1Amordm(fxig)am aordm C N

KX

m D1Bm (fxi yig)am

(46)

where

Amordm(fxig) D1

s2N

NX

iD1wm (xi)wordm(xi) and (47)

Bm (fxi yig) D1

s2N

NX

iD1yiwm (xi) (48)

Our placement of the factors of N means that both Amordm and Bm are of orderunity as N 1 These quantities are empirical averages over the samplesfxi yig and if the wm are well behaved we expect that these empirical meansconverge to expectation values for most realizations of the series fxig

limN1

Amordm(fxig) D A1mordm D

1

s2

ZdxP(x)wm (x)wordm(x) (49)

limN1

Bm (fxi yig) D B1m D

KX

ordmD1A1

mordm Naordm (410)

where Nreg are the parameters that actually gave rise to the data stream fxi yigIn fact we can make the same argument about the terms in P

y2i

limN1

NX

iD1y2

i D Ns2

KX

m ordmD1Nam A1

mordm Naordm C 1

(411)

Conditions for this convergence of empirical means to expectation valuesare at the heart of learning theory Our approach here is rst to assume thatthis convergence works then to examine the consequences for the predictiveinformation and nally to address the conditions for and implications ofthis convergence breaking down

Putting the different factors together we obtain

P(x1 y1 x2 y2 xN yN ) f NY

iD1P(xi)

sup3 1p

2p s2

acuteN ZdK

aP (reg)

pound exppoundiexclNEN(regI fxi yig)

curren (412)

2424 W Bialek I Nemenman and N Tishby

where the effective ldquoenergyrdquo per sample is given by

EN(regI fxi yig) D12

C12

KX

m ordmD1

(am iexcl Nam )A1mordm

(aordm iexcl Naordm) (413)

Here we use the symbol f to indicate that we not only take the limit oflarge N but also neglect the uctuations Note that in this approximationthe dependence on the sample points themselves is hidden in the denitionof Nreg as being the parameters that generated the samples

The integral that we need to do in equation 412 involves an exponentialwith a large factor N in the exponent the energy EN is of order unity asN 1 This suggests that we evaluate the integral by a saddle-pointor steepest-descent approximation (similar analyses were performed byClarke amp Barron 1990 MacKay 1992 and Balasubramanian 1997)

ZdK

aP (reg) exppoundiexclNEN (regI fxi yig)

curren

frac14 P (regcl) expmicro

iexclNEN(regclI fxi yig) iexclK2

lnN2p

iexcl12

ln det FN C cent cent centpara

(414)

where regcl is the ldquoclassicalrdquo value of reg determined by the extremal conditions

EN(regI fxi yig)am

shyshyshyshyaDacl

D 0 (415)

the matrix FN consists of the second derivatives of EN

FN D 2EN (regI fxi yig)

am aordm

shyshyshyshyaDacl

(416)

and cent cent cent denotes terms that vanish as N 1 If we formulate the problemof estimating the parameters reg from the samples fxi yig then as N 1the matrix NFN is the Fisher information matrix (Cover amp Thomas 1991)the eigenvectors of this matrix give the principal axes for the error ellipsoidin parameter space and the (inverse) eigenvalues give the variances ofparameter estimates along each of these directions The classical regcl differsfrom Nreg only in terms of order 1N we neglect this difference and furthersimplify the calculation of leading terms as N becomes large After a littlemore algebra we nd the probability distribution we have been lookingfor

P(x1 y1 x2 y2 xN yN)

f NY

iD1P(xi)

1

ZAP ( Nreg) exp

microiexcl

N2

ln(2p es2) iexcl

K2

ln N C cent cent centpara

(417)

Predictability Complexity and Learning 2425

where the normalization constant

ZA Dp

(2p )K det A1 (418)

Again we note that the sample points fxi yig are hidden in the value of Nregthat gave rise to these points5

To evaluate the entropy S(N) we need to compute the expectation valueof the (negative) logarithm of the probability distribution in equation 417there are three terms One is constant so averaging is trivial The secondterm depends only on the xi and because these are chosen independentlyfrom the distribution P(x) the average again is easy to evaluate The thirdterm involves Nreg and we need to average this over the joint distributionP(x1 y1 x2 y2 xN yN) As above we can evaluate this average in stepsFirst we choose a value of the parameters Nreg then we average over thesamples given these parameters and nally we average over parametersBut because Nreg is dened as the parameters that generate the samples thisstepwise procedure simplies enormously The end result is that

S(N) D NmicroSx C

12

log2

plusmn2p es

2sup2para

CK2

log2 N

C Sa Claquolog2 ZA

nota

C cent cent cent (419)

where hcent cent centia means averaging over parameters Sx is the entropy of thedistribution of x

Sx D iexclZ

dx P(x) log2 P(x) (420)

and similarly for the entropy of the distribution of parameters

Sa D iexclZ

dKaP (reg) log2 P (reg) (421)

5 We emphasize once more that there are two approximations leading to equation 417First we have replaced empirical means by expectation values neglecting uctuations as-sociated with the particular set of sample points fxi yig Second we have evaluated theaverage over parameters in a saddle-point approximation At least under some condi-tions both of these approximations become increasingly accurate as N 1 so thatthis approach should yield the asymptotic behavior of the distribution and hence thesubextensive entropy at large N Although we give a more detailed analysis below it isworth noting here how things can go wrong The two approximations are independentand we could imagine that uctuations are important but saddle-point integration stillworks for example Controlling the uctuations turns out to be exactly the question ofwhether our nite parameterization captures the true dimensionality of the class of mod-els as discussed in the classic work of Vapnik Chervonenkis and others (see Vapnik1998 for a review) The saddle-point approximation can break down because the saddlepoint becomes unstable or because multiple saddle points become important It will turnout that instability is exponentially improbable as N 1 while multiple saddle pointsare a real problem in certain classes of models again when counting parameters does notreally measure the complexity of the model class

2426 W Bialek I Nemenman and N Tishby

The different terms in the entropy equation 419 have a straightforwardinterpretation First we see that the extensive term in the entropy

S0 D Sx C12

log2(2p es2) (422)

reects contributions from the random choice of x and from the gaussiannoise in y these extensive terms are independent of the variations in pa-rameters reg and these would be the only terms if the parameters were notvarying (that is if there were nothing to learn) There also is a term thatreects the entropy of variations in the parameters themselves Sa Thisentropy is not invariant with respect to coordinate transformations in theparameter space but the term hlog2 ZAia compensates for this noninvari-ance Finally and most interesting for our purposes the subextensive pieceof the entropy is dominated by a logarithmic divergence

S1(N) K2

log2 N (bits) (423)

The coefcient of this divergence counts the number of parameters indepen-dent of the coordinate system that we choose in the parameter space Fur-thermore this result does not depend on the set of basis functions fwm (x)gThis is a hint that the result in equation 423 is more universal than oursimple example

42 Learning a Parameterized Distribution The problem discussedabove is an example of supervised learning We are given examples of howthe points xn map into yn and from these examples we are to induce theassociation or functional relation between x and y An alternative view isthat the pair of points (x y) should be viewed as a vector Ex and what we arelearning is the distribution of this vector The problem of learning a distri-bution usually is called unsupervised learning but in this case supervisedlearning formally is a special case of unsupervised learning if we admit thatall the functional relations or associations that we are trying to learn have anelement of noise or stochasticity then this connection between supervisedand unsupervised problems is quite general

Suppose a series of random vector variables fExig is drawn independentlyfrom the same probability distribution Q(Ex|reg) and this distribution de-pends on a (potentially innite dimensional) vector of parameters reg Asabove the parameters are unknown and before the series starts they arechosen randomly from a distribution P (reg) With no constraints on the den-sities P (reg) or Q(Ex|reg) it is impossible to derive any regression formulas forparameter estimation but one can still say something about the entropy ofthe data series and thus the predictive information For a nite-dimensionalvector of parameters reg the literature on bounding similar quantities isespecially rich (Haussler Kearns amp Schapire 1994 Wong amp Shen 1995Haussler amp Opper 1995 1997 and references therein) and related asymp-totic calculations have been done (Clarke amp Barron 1990 MacKay 1992Balasubramanian 1997)

Predictability Complexity and Learning 2427

We begin with the denition of entropy

S(N) acute S[fExig]

D iexclZ

dEx1 cent cent cent dExNP(Ex1 Ex2 ExN) log2 P(Ex1 Ex2 ExN) (424)

By analogy with equation 43 we then write

P(Ex1 Ex2 ExN) DZ

dKaP (reg)

NY

iD1Q(Exi |reg) (425)

Next combining the equations 424 and 425 and rearranging the order ofintegration we can rewrite S(N) as

S(N) D iexclZ

dK NregP ( Nreg)

pound

8lt

ZdEx1 cent cent cent dExN

NY

jD1Q(Exj | Nreg) log2 P(fExig)

9=

(426)

Equation 426 allows an easy interpretation There is the ldquotruerdquo set of pa-rameters Nreg that gave rise to the data sequence Ex1 ExN with the probabilityQN

jD1 Q(Exj | Nreg) We need to average log2 P(Ex1 ExN) rst over all possiblerealizations of the data keeping the true parameters xed and then over theparameters Nreg themselves With this interpretation in mind the joint prob-ability density the logarithm of which is being averaged can be rewrittenin the following useful way

P(Ex1 ExN) DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg)NY

iD1

microQ(Exi |reg)

Q(Exi | Nreg)

para

DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNEN(regI fExig)] (427)

EN (regI fExig) D iexcl1N

NX

iD1ln

microQ(Exi |reg)

Q(Exi | Nreg)

para (428)

Since by our interpretation Nreg are the true parameters that gave rise tothe particular data fExig we may expect that the empirical average in equa-tion 428 will converge to the corresponding expectation value so that

EN(regI fExig) D iexclZ

dDxQ(x | Nreg) lnmicroQ(Ex|reg)

Q(Ex| Nreg)

paraiexcl y (reg NregI fxig) (429)

where y 0 as N 1 here we neglect y and return to this term below

2428 W Bialek I Nemenman and N Tishby

The rst term on the right-hand side of equation 429 is the Kullback-Leibler (KL) divergence DKL( Nregkreg) between the true distribution charac-terized by parameters Nreg and the possible distribution characterized by regThus at large N we have

P(Ex1 Ex2 ExN)

fNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNDKL( Nregkreg)] (430)

where again the notation f reminds us that we are not only taking the limitof largeN but also making another approximationinneglecting uctuationsBy the same arguments as above we can proceed (formally) to compute theentropy of this distribution We nd

S(N) frac14 S0 cent N C S(a)1 (N) (431)

S0 DZ

dKaP (reg)

microiexcl

ZdDxQ(Ex|reg) log2 Q(Ex|reg)

para and (432)

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2

microZdK

aP(reg)eiexclNDKL ( Naka)para

(433)

Here S(a)1 is an approximation to S1 that neglects uctuations y This is the

same as the annealed approximation in the statistical mechanics of disor-dered systems as has been used widely in the study of supervised learn-ing problems (Seung Sompolinsky amp Tishby 1992) Thus we can identifythe particular data sequence Ex1 ExN with the disorder EN(regI fExig) withthe energy of the quenched system and DKL( Nregkreg) with its annealed ana-log

The extensive term S0 equation 432 is the average entropy of a dis-tribution in our family of possible distributions generalizing the result ofequation 422 The subextensive terms in the entropy are controlled by theN dependence of the partition function

Z( NregI N) DZ

dKaP (reg) exp [iexclNDKL( Nregkreg)] (434)

and Sa1(N) D iexclhlog2 Z( NregI N)i Na is analogous to the free energy Since what is

important in this integral is the KL divergence between different distribu-tions it is natural to ask about the density of models that are KL divergenceD away from the target Nreg

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)] (435)

Predictability Complexity and Learning 2429

This density could be very different for different targets6 The density ofdivergences is normalized because the original distribution over parameterspace P(reg) is normalized

ZdDr (DI Nreg) D

ZdK

aP (reg) D 1 (436)

Finally the partition function takes the simple form

Z( NregI N) DZ

dDr (DI Nreg) exp[iexclND] (437)

We recall that in statistical mechanics the partition function is given by

Z(b ) DZ

dEr (E) exp[iexclbE] (438)

where r (E) is the density of states that have energy E and b is the inversetemperature Thus the subextensive entropy in our learning problem isanalogous to a system in which energy corresponds to the KL divergencerelative to the target model and temperature is inverse to the number ofexamples As we increase the length N of the time series we have observedwe ldquocoolrdquo the system and hence probe models that approach the targetthe dynamics of this approach is determined by the density of low-energystates that is the behavior of r (DI Nreg) as D 07

The structure of the partition function is determined by a competitionbetween the (Boltzmann) exponential term which favors models with smallD and the density term which favorsvalues of D that can be achieved by thelargest possible number of models Because there (typically) are many pa-rameters there are very few models with D 0 This picture of competitionbetween the Boltzmann factor and a density of states has been emphasizedin previous work on supervised learning (Haussler et al 1996)

The behavior of the density of states r (DI Nreg) at small D is related tothe more intuitive notion of dimensionality In a parameterized family ofdistributions the KL divergence between two distributions with nearbyparameters is approximately a quadratic form

DKL ( Nregkreg) frac1412

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm) C cent cent cent (439)

6 If parameter space is compact then a related description of the space of targets basedon metric entropy also called Kolmogorovrsquos 2 -entropy is used in the literature (see forexample Haussler amp Opper 1997) Metric entropy is the logarithm of the smallest numberof disjoint parts of the size not greater than 2 into which the target space can be partitioned

7 It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.


where \mathcal{F} is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low D limit of the density \rho(D; \bar{\alpha}):

\rho(D; \bar{\alpha}) = \int d^K\alpha\, P(\alpha)\, \delta\left[ D - D_{KL}(\bar{\alpha}\|\alpha) \right]

\approx \int d^K\alpha\, P(\alpha)\, \delta\left[ D - \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu) \mathcal{F}_{\mu\nu} (\bar{\alpha}_\nu - \alpha_\nu) \right]

= \int d^K\xi\, P(\bar{\alpha} + U\cdot\xi)\, \delta\left[ D - \frac{1}{2} \sum_\mu \Lambda_\mu \xi_\mu^2 \right],   (4.40)

where U is a matrix that diagonalizes \mathcal{F},

(U^T \cdot \mathcal{F} \cdot U)_{\mu\nu} = \Lambda_\mu \delta_{\mu\nu}.   (4.41)

The delta function restricts the components of \xi in equation 4.40 to be of order \sqrt{D} or less, and so if P(\alpha) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

\rho(D \to 0; \bar{\alpha}) \approx P(\bar{\alpha})\, \frac{(2\pi)^{K/2}}{\Gamma(K/2)}\, (\det \mathcal{F})^{-1/2}\, D^{(K-2)/2}.   (4.42)

Here, as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

Z(\bar{\alpha}; N \to \infty) \approx f(\bar{\alpha}) \cdot \frac{\Gamma(K/2)}{N^{K/2}},   (4.43)

where f(\bar{\alpha}) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

S_1^{(a)}(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \log_2 Z(\bar{\alpha}; N)   (4.44)

\to \frac{K}{2} \log_2 N + \cdots \quad \text{(bits)},   (4.45)

where \cdots are finite as N \to \infty. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even


of the prior distribution P(\alpha) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
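To make the (K/2) \log_2 N behavior concrete, here is a small numerical sketch (our own illustration, with an arbitrary gaussian family and prior) that evaluates S_1^{(a)}(N) = -\langle \log_2 Z(\bar{\alpha}; N)\rangle_{\bar{\alpha}} by quadrature and checks that its slope against \log_2 N is close to K/2.

    import numpy as np

    rng = np.random.default_rng(1)

    def S1a(K, N, grid=np.linspace(-6.0, 6.0, 481), n_targets=200):
        """-<log2 Z(abar; N)>_abar for Q(x|alpha) = Normal(alpha, I_K), prior P(alpha) = Normal(0, I_K)."""
        d_alpha = grid[1] - grid[0]
        mesh = np.stack(np.meshgrid(*([grid] * K), indexing="ij"), axis=-1).reshape(-1, K)
        weights = np.exp(-0.5 * np.sum(mesh ** 2, axis=1)) * (d_alpha / np.sqrt(2 * np.pi)) ** K
        abars = rng.normal(size=(n_targets, K))          # targets drawn from the same prior
        vals = []
        for ab in abars:
            D = 0.5 * np.sum((mesh - ab) ** 2, axis=1)   # KL divergence to the target, in nats
            vals.append(-np.log2(np.sum(weights * np.exp(-N * D))))   # equation 4.34
        return np.mean(vals)                             # equation 4.44

    for K in (1, 2):
        slope = (S1a(K, 1024) - S1a(K, 64)) / (np.log2(1024) - np.log2(64))
        print(K, slope)                                  # close to K / 2, as in equation 4.45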

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension, d_VC), which measures capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of \mathcal{F} is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < \infty also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small D_KL behavior of the model density,

\rho(D \to 0; \bar{\alpha}) \propto D^{(d-2)/2},   (4.46)

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density \rho(D; \bar{\alpha}) accumulates a macroscopic weight at some nonzero value of D, so that the small D behavior is irrelevant for the large N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of Ipred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N \gg d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.
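The phase-space dimension d of equation 4.46 can also be read off numerically from the small-D behavior of the density of divergences. The sketch below (an illustration under our own toy assumptions) compares a nondegenerate two-parameter gaussian model with a redundant parameterization in which only the sum of the two parameters matters, so that \det \mathcal{F} = 0 and d = 1 < K = 2.

    import numpy as np

    rng = np.random.default_rng(2)

    def implied_dimension(D, Dmin=1e-4, Dmax=1e-2, bins=30):
        """Fit rho(D) ~ D^((d-2)/2) at small D and return the implied dimension d."""
        counts, edges = np.histogram(D, bins=bins, range=(Dmin, Dmax))
        mid = 0.5 * (edges[1:] + edges[:-1])
        keep = counts > 0
        slope = np.polyfit(np.log(mid[keep]), np.log(counts[keep]), 1)[0]
        return 2 * slope + 2

    a = rng.normal(size=(1_000_000, 2))       # prior draws of (alpha_1, alpha_2)
    abar = np.array([0.1, -0.2])              # hypothetical target

    # nondegenerate model: x ~ Normal((alpha_1, alpha_2), I); D_KL = |alpha - abar|^2 / 2
    D_full = 0.5 * np.sum((a - abar) ** 2, axis=1)
    # redundant model: x ~ Normal(alpha_1 + alpha_2, 1); only the sum is identifiable
    D_red = 0.5 * (a.sum(axis=1) - abar.sum()) ** 2

    print(implied_dimension(D_full))          # roughly d = K = 2
    print(implied_dimension(D_red))           # roughly d = 1 < K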

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution,


Q(\vec{x}_1, \ldots, \vec{x}_N | \alpha). Again, \alpha is an unknown parameter vector chosen randomly at the beginning of the series. If \alpha is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) \log_2 N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong; for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S[\{\vec{x}_i\}\,|\,\alpha] \equiv -\int d^N\vec{x}\, Q(\{\vec{x}_i\}|\alpha) \log_2 Q(\{\vec{x}_i\}|\alpha) \to N S_0 + S_0^*, \qquad S_0^* = O(1);   (4.47)

D_{KL}\left[ Q(\{\vec{x}_i\}|\bar{\alpha}) \,\|\, Q(\{\vec{x}_i\}|\alpha) \right] \to N D_{KL}(\bar{\alpha}\|\alpha) + o(N).   (4.48)

Here the quantities S_0, S_0^*, and D_{KL}(\bar{\alpha}\|\alpha) are defined by taking limits N \to \infty in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if \alpha is known, there are no long-range correlations in the data; all of the long-range predictability is associated with learning the parameters.^8 The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.
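As a sanity check of the first condition, equation 4.47, consider a gaussian AR(1) process with a known coefficient (a toy example of our own choosing): the entropy of N samples should be extensive up to an O(1) correction. The sketch below computes the exact gaussian entropy from the covariance matrix and shows that the deviation from N S_0 stays constant.

    import numpy as np

    # Gaussian AR(1) process x_t = a x_{t-1} + noise, with the parameter a known.
    a, sigma2 = 0.8, 1.0

    def entropy_bits(N):
        # covariance C_ij = sigma2 / (1 - a^2) * a^|i - j|; differential entropy in bits
        idx = np.arange(N)
        C = sigma2 / (1 - a ** 2) * a ** np.abs(idx[:, None] - idx[None, :])
        return 0.5 * (N * np.log2(2 * np.pi * np.e) + np.linalg.slogdet(C)[1] / np.log(2))

    S0 = 0.5 * np.log2(2 * np.pi * np.e * sigma2)      # entropy rate once a is known
    for N in (10, 100, 1000):
        print(N, entropy_bits(N) - N * S0)             # deviation from extensivity stays O(1)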

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result

S(N) = S_0 \cdot N + S_1^{(a)}(N),   (4.49)

S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \quad \text{(bits)},   (4.50)

where \cdots stands for terms of order one as N \to \infty. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(\vec{x}_1, \ldots, \vec{x}_N|\alpha). Suppose, for example, that S_0^* = (K^*/2) \log_2 N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \quad \text{(bits)}.   (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to \alpha, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* \neq 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K \neq 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.^9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations \psi are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar{\alpha}) \right]
\times \log_2 \left[ \prod_{i=1}^{N} Q(\vec{x}_i|\bar{\alpha}) \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})} \right].   (4.52)

This expression can be decomposed into the terms identified above plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)}(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar{\alpha}) \right]
\times \log_2 \left[ \int \frac{d^K\alpha\, P(\alpha)}{Z(\bar{\alpha}; N)}\, e^{-N D_{KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})} \right],   (4.54)

where \psi is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that, for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) \leq N \int d^K\alpha \int d^K\bar{\alpha}\, P(\alpha)\, P(\bar{\alpha})\, D_{KL}(\bar{\alpha}\|\alpha) \equiv N \langle D_{KL}(\bar{\alpha}\|\alpha) \rangle_{\bar{\alpha}\alpha},   (4.55)

so that if we have a class of models (and a prior P(\alpha)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average


KL divergence, or of similar quantities, is a usual constraint in learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that \langle D_{KL}(\bar{\alpha}\|\alpha)\rangle_{\bar{\alpha}\alpha} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if \psi were zero, and \psi is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities \{Q(\vec{x}\,|\,\alpha)\} is finite. Then for any reasonable function F(\vec{x}; \beta), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{ \sup_\beta \frac{\left| \frac{1}{N} \sum_j F(\vec{x}_j; \beta) - \int d\vec{x}\, Q(\vec{x}|\bar{\alpha})\, F(\vec{x}; \beta) \right|}{L[F]} > \epsilon \right\} < M(\epsilon, N, d_{VC})\, e^{-c N \epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all \beta without loss of generality, in view of equation 4.55. In addition, M(\epsilon, N, d_{VC}) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region \alpha \approx \bar{\alpha} is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of \alpha - \bar{\alpha}, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of \{\vec{x}_i\} with atypically large fluctuations is negligible (atypicality is defined as \psi \geq \delta, where \delta is some small constant independent of N). Since the fluctuations decrease as 1/\sqrt{N} (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of \psi. Then we realize that the averaging over \bar{\alpha} and \{\vec{x}_i\} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to \epsilon (this brings down an extra factor of N\epsilon). Thus the worst-case contribution of these atypical sequences is

S_1^{(f),\mathrm{atypical}} \sim \int_\delta^\infty d\epsilon\, N^2 \epsilon^2\, M(\epsilon)\, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \quad \text{for large } N.   (4.57)


This bound lets us focus our attention on the region \alpha \approx \bar{\alpha}. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and, throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\mathrm{typical}} \sim \int d^K\bar{\alpha}\, P(\bar{\alpha})\, d^N\vec{x} \prod_j Q(\vec{x}_j|\bar{\alpha})\; N\, (B \cdot A^{-1} \cdot B);

B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0;

(A)_{\mu\nu} = \frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu\, \partial \bar{\alpha}_\nu}, \qquad \langle A \rangle_{\vec{x}} = \mathcal{F}.   (4.58)

Here \langle\cdots\rangle_{\vec{x}} means an averaging with respect to all \vec{x}_i's, keeping \bar{\alpha} constant. One immediately recognizes that B and A are, respectively, the first and the second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from \mathcal{F} are not allowed, and we may bound equation 4.58 by replacing A^{-1} with \mathcal{F}^{-1}(1 + \delta), where \delta again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu}\, \left(\mathcal{F}^{-1}\right)_{\mu\nu}\, \frac{\partial \log Q(\vec{x}_j|\bar{\alpha})}{\partial \bar{\alpha}_\nu}   (4.59)

over all \vec{x}_i's. Only the terms with i = j survive the averaging. There are K^2 N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < \infty. One may notice that we never used the specific form of M(\epsilon, N, d_{VC}), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of \alpha and \bar{\alpha} that lead to violation of the bound. That is, if the prior P(\alpha) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite


structure C of nested subsets C_1 \subset C_2 \subset C_3 \subset \cdots can be imposed onto the set C of all admissible solutions of a learning problem, such that C = \cup_n C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < \infty, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(\alpha) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x - \alpha)^2 \right], \qquad x \in (-\infty, +\infty);   (4.60)

Q(x|\alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \qquad x \in [0, 2\pi).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = \infty. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(\alpha). The second one does not allow this. Suppose that the prior is uniform in a box 0 < \alpha < \alpha_{max} and zero elsewhere, with \alpha_{max} rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of -\sin \alpha x. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than \alpha_{max}.^{10} So the fluctuations are large, and the predictive information is

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for \alpha_{max} \to \infty the weight in \rho(D; \bar{\alpha}) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of \alpha would swiftly approach the actual value. At this point, the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) \log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N \gg d_VC, which is never true if d_VC = \infty. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
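The contrast between the two examples can be seen in a small simulation (ours, with the arbitrary choices \bar{\alpha} = 2 and \alpha_{max} = 50): for the model of equation 4.61 with a wide uniform prior, the maximum likelihood estimate of \alpha computed on a grid often sits far above the true value when N is small, and only locks onto the true value once N is large enough.

    import numpy as np

    rng = np.random.default_rng(3)

    abar, amax = 2.0, 50.0                     # true parameter and a deliberately wide prior box
    xg = np.linspace(0.0, 2 * np.pi, 2000, endpoint=False)
    dx = xg[1] - xg[0]

    def sample(n, a):
        """Rejection sampling from Q(x|a) = exp(-sin(a x)) / Z(a) on [0, 2 pi), equation 4.61."""
        out = np.empty(0)
        while out.size < n:
            x = rng.uniform(0.0, 2 * np.pi, size=4 * n)
            keep = rng.uniform(size=x.size) < np.exp(-np.sin(a * x) - 1.0)   # bound: exp(-sin) <= e
            out = np.concatenate([out, x[keep]])
        return out[:n]

    agrid = np.linspace(0.05, amax, 2000)
    logZ = np.log(np.exp(-np.sin(np.outer(agrid, xg))).sum(axis=1) * dx)     # normalizers Z(alpha)

    for n in (10, 100, 1000):
        x = sample(n, abar)
        loglik = -np.sin(np.outer(agrid, x)).sum(axis=1) - n * logZ          # uniform prior on (0, amax]
        print(n, agrid[np.argmax(loglik)])     # for small n, the best fit often lies far above abar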

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly, in the present context, there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and, correspondingly, the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which


the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as \rho \sim D^{(d-2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that, beyond the class of finitely parameterizable problems, there is a class of regularized infinite-dimensional problems in which the density \rho(D \to 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right], \qquad \mu > 0;   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (\mu) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-N D]

\approx A(\bar{\alpha}) \int dD\, \exp\left[ -\frac{B(\bar{\alpha})}{D^\mu} - N D \right]

\approx \tilde{A}(\bar{\alpha}) \exp\left[ -\frac{1}{2}\, \frac{\mu + 2}{\mu + 1}\, \ln N - C(\bar{\alpha})\, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu + 1} \right]^{1/2} \cdot [B(\bar{\alpha})]^{1/(2\mu+2)},   (4.64)

C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N:

S_1^{(a)}(N) \approx \frac{1}{\ln 2}\, \langle C(\bar{\alpha}) \rangle_{\bar{\alpha}}\, N^{\mu/(\mu+1)} \quad \text{(bits)},   (4.66)

where \langle\cdots\rangle_{\bar{\alpha}} denotes an average over all the target models.
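The saddle-point result of equations 4.63 through 4.66 can be checked directly by numerical integration. The sketch below uses the arbitrary choices \mu = 1 and A = B = 1, ignores the prefactor A(\bar{\alpha}), and compares -\ln Z(N) with the predicted leading term C N^{\mu/(\mu+1)}; the remaining discrepancy is the subleading logarithmic correction in equation 4.63.

    import numpy as np

    mu, B = 1.0, 1.0                                   # essential singularity rho(D) ~ exp(-B / D^mu)
    C = B ** (1 / (mu + 1)) * (mu ** (-mu / (mu + 1)) + mu ** (1 / (mu + 1)))   # equation 4.65

    D = np.linspace(1e-4, 5.0, 2_000_000)              # integration grid for equation 4.63
    dD = D[1] - D[0]
    for N in (1e2, 1e3, 1e4):
        logf = -B / D ** mu - N * D
        m = logf.max()
        lnZ = m + np.log(np.sum(np.exp(logf - m)) * dD)
        print(N, -lnZ, C * N ** (mu / (mu + 1)))       # the leading terms agree better as N grows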


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is, intuitively and quantitatively, much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K \sim N^\nu. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) \sim N^\nu \log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes \sim K/2N bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries \sim N^{\nu-1} bits. Summing this up over all samples, we find S_1^{(a)} \sim N^\nu, and if we let \nu = \mu/(\mu+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (\nu = \mu/(\mu+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing numbers of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
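The counting argument in the preceding paragraph is easy to check numerically: if the nth sample carries roughly K(n)/2n \sim n^{\nu-1} bits, the accumulated subextensive entropy should track N^\nu/(2\nu). A minimal sketch, with \nu = 1/2 chosen arbitrarily:

    import numpy as np

    nu = 0.5                                    # number of bins K(n) ~ n^nu after n samples
    n = np.arange(1, 1_000_001)
    S1 = np.cumsum(n ** nu / (2 * n))           # the n-th sample adds ~ K(n)/2n bits
    for N in (10 ** 4, 10 ** 5, 10 ** 6):
        print(N, S1[N - 1], N ** nu / (2 * nu)) # the cumulative sum tracks N^nu / (2 nu)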

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As \mu increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with \mu. However, if \mu \to \infty, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar


to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l_0) \exp[-\phi(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function \phi(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{\mathcal{Z}} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^2 \right] \delta\left[ \frac{1}{l_0} \int dx\, e^{-\phi(x)} - 1 \right],   (4.67)

where \mathcal{Z} is the normalization constant for P[\phi], the delta function ensures that each distribution Q(x) is normalized, and \ell sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, \ldots, x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the


development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[\phi(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.^{11}

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate \rho(D; \bar{\phi}) in the model defined by equation 4.67.

With Q(x) = (1/l_0) \exp[-\phi(x)], we can write the KL divergence as

D_{KL}[\bar{\phi}(x)\|\phi(x)] = \frac{1}{l_0} \int dx\, \exp[-\bar{\phi}(x)]\, [\phi(x) - \bar{\phi}(x)].   (4.68)

We want to compute the density

\rho(D; \bar{\phi}) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\left( D - D_{KL}[\bar{\phi}(x)\|\phi(x)] \right)   (4.69)

= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\left( MD - M D_{KL}[\bar{\phi}(x)\|\phi(x)] \right),   (4.70)

where we introduce a factor M, which we will allow to become large, so that we can focus our attention on the interesting limit D \to 0. To compute this integral over all functions \phi(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar{\phi}) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)] \exp(-izM D_{KL})   (4.71)

= M \int \frac{dz}{2\pi} \exp\left( izMD + \frac{izM}{l_0} \int dx\, \bar{\phi}(x) \exp[-\bar{\phi}(x)] \right)
\times \int [d\phi(x)]\, P[\phi(x)] \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar{\phi}(x)] \right).   (4.72)

The inner integral over the functions \phi(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)] \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar{\phi}(x)] \right)

= \exp\left( -\frac{izM}{l_0} \int dx\, \bar{\phi}(x) \exp[-\bar{\phi}(x)] - S_{\mathrm{eff}}[\bar{\phi}(x); zM] \right),   (4.73)

S_{\mathrm{eff}}[\bar{\phi}; zM] = \frac{\ell}{2} \int dx \left( \frac{\partial\bar{\phi}}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{\ell l_0} \right)^{1/2} \int dx\, \exp[-\bar{\phi}(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D \to 0 and MD^{3/2} \gg 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for \rho(D \to 0; \bar{\phi}). This results in

\rho(D \to 0; \bar{\phi}) = A[\bar{\phi}(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar{\phi}(x)]}{D} \right),   (4.75)

A[\bar{\phi}(x)] = \frac{1}{\sqrt{16\pi \ell l_0}} \exp\left[ -\frac{\ell}{2} \int dx\, \left( \partial_x\bar{\phi} \right)^2 \right] \times \int dx\, \exp[-\bar{\phi}(x)/2],   (4.76)

B[\bar{\phi}(x)] = \frac{1}{16 \ell l_0} \left( \int dx\, \exp[-\bar{\phi}(x)/2] \right)^2.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with \mu = 1. The D^{-3/2} prefactor does not change the leading large N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \frac{1}{\sqrt{\ell l_0}}\, \left\langle \int dx\, \exp[-\bar{\phi}(x)/2] \right\rangle_{\bar{\phi}} N^{1/2},   (4.78)

where \langle\cdots\rangle_{\bar{\phi}} denotes an average over the target distributions \bar{\phi}(x), weighted once again by P[\bar{\phi}(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \sqrt{N}\, \left( \frac{L}{\ell} \right)^{1/2} \quad \text{bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size \sqrt{\ell/(N Q(x))} (Bialek,


Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of the predictive information arises from the number of parameters in the problem depending on the number of samples.
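The bin-counting interpretation can be made explicit: for a smooth density Q(x) on [0, L], the number of adaptive bins of local size \sqrt{\ell/(N Q(x))} is \sqrt{N/\ell} \int dx \sqrt{Q(x)}, which reproduces the \sqrt{N} scaling of equations 4.78 and 4.79 up to constant factors. A small sketch, with an arbitrary Q(x), L, and \ell:

    import numpy as np

    L, ell = 10.0, 0.5                          # box size and smoothness scale (arbitrary)
    x = np.linspace(0.0, L, 10_000)
    dx = x[1] - x[0]
    Q = np.exp(-0.5 * (x - 5.0) ** 2)           # a smooth, unnormalized target density
    Q /= np.sum(Q) * dx                         # normalize so that Q integrates to 1

    for N in (10 ** 2, 10 ** 4, 10 ** 6):
        # bins of local size sqrt(ell / (N Q(x))); count how many fit in [0, L]
        n_bins = np.sum(np.sqrt(N * Q / ell)) * dx
        print(N, n_bins)                        # grows as sqrt(N), cf. equations 4.78 and 4.79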

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step \sim \sqrt{N} bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities, n \sim \sqrt{N}, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter \ell describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of \ell (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing \ell to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying \sqrt{N} in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information (all have power-law growth; smoother functions have smaller powers, less to learn, and higher-dimensional problems have larger powers, more to learn), but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.^{12} Rissanen (1984) and Clarke and Barron (1990), in full

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimation with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, \ldots, x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, \ldots, x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, \ldots, x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) \log_2 N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our


approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on


a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T \to \infty) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure the complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.^{13} There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.^{14} Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;^{15} it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms?

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number N(N) of walks with N steps, we find that N(N) ∼ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S₀ = log₂ z, a constant subextensive entropy log₂ A, and a divergent subextensive term S₁(N) → γ log₂ N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
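A minimal numerical sketch of this decomposition (the constants A and z below stand in for lattice-dependent quantities and are purely illustrative; γ = 43/32 is the known exponent for two-dimensional self-avoiding walks): subtracting the extensive part of S(N) = log₂[A N^γ z^N] leaves γ log₂ N plus a constant, so a simple fit recovers the universal coefficient.

```python
import numpy as np

# Illustrative lattice constants: A and z depend on the lattice, gamma is universal.
A, z, gamma = 1.3, 4.68, 43.0 / 32.0

N = np.logspace(1, 6, 20)
S = np.log2(A) + gamma * np.log2(N) + N * np.log2(z)   # S(N) = log2[A N^gamma z^N]

S_sub = S - N * np.log2(z)                              # discard the extensive term
slope = np.polyfit(np.log2(N), S_sub, 1)[0]
print(f"fitted coefficient of log2 N: {slope:.3f}  (should equal gamma = {gamma:.3f})")
```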

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],  (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2\!\left( \frac{P[x(t)]}{Q[x(t)]} \right),  (5.2)


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].
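The coordinate invariance of the relative entropy is easy to check numerically. The sketch below is only an illustration, with one-dimensional gaussians chosen arbitrarily for P and Q and an arbitrary monotone change of variable z = x³: the naive continuum entropy of equation 5.1 shifts by the mean log Jacobian, while the relative entropy of equation 5.2 does not change.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / np.sqrt(2 * np.pi * s**2)

# Two densities for the signal x: P (the true one) and a reference Q.
x = rng.normal(0.0, 1.0, size=500_000)            # samples drawn from P
p, q = gauss(x, 0.0, 1.0), gauss(x, 0.5, 1.5)

# Monte Carlo estimates under P, in the original coordinate.
S_cont_x = -np.mean(np.log2(p))                   # "entropy" of equation 5.1
S_rel_x = -np.mean(np.log2(p / q))                # relative entropy of equation 5.2

# Change coordinates z = x**3; densities pick up the Jacobian |dz/dx| = 3 x**2.
jac = 3.0 * x**2
S_cont_z = -np.mean(np.log2(p / jac))             # shifts by <log2 |dz/dx|>: not invariant
S_rel_z = -np.mean(np.log2((p / jac) / (q / jac)))  # Jacobian cancels: invariant

print(f"S_cont: {S_cont_x:.3f} (x) vs {S_cont_z:.3f} (z)")
print(f"S_rel : {S_rel_x:.3f} (x) vs {S_rel_z:.3f} (z)")
```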

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.



6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ Ipred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data; using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = Ipred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
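A minimal numerical illustration of such a lossless compression, assuming (for concreteness, not from the text) a two-state Markov source with an arbitrarily chosen transition matrix: because the future is conditionally independent of the earlier past given the most recent symbol, compressing the past window down to its last symbol preserves all of the predictive information. The sketch below checks this by exact enumeration of short blocks.

```python
import itertools
import numpy as np

# Hypothetical two-state Markov source; T[a, b] = P(next = b | current = a).
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])            # stationary distribution of T

L = 4                                 # past and future windows of L symbols each

def seq_prob(seq):
    p = pi[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= T[a, b]
    return p

# Joint distribution over (past block, future block).
joint = {}
for seq in itertools.product([0, 1], repeat=2 * L):
    joint[(seq[:L], seq[L:])] = seq_prob(seq)

def mutual_info(joint):
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * np.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

I_full = mutual_info(joint)                       # I(past; future)
compressed = {}
for (x, y), p in joint.items():                   # keep only the last past symbol
    key = (x[-1:], y)
    compressed[key] = compressed.get(key, 0) + p
I_compressed = mutual_info(compressed)

print(f"I(past; future)        = {I_full:.6f} bits")
print(f"I(last symbol; future) = {I_compressed:.6f} bits")   # equal: the last symbol is sufficient
```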


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How


18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ∼ AN^ν and one with K parameters so that Ipred(N) ∼ (K/2) log₂ N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \sim \left[ \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}.  (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
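As a quick numerical illustration of equation 6.1 (with arbitrary, purely illustrative values of K, A, and ν), the sketch below computes N_c and confirms that the power-law model carries less predictive information well below the crossover and more well above it; equation 6.1 locates the crossover only up to logarithmic accuracy.

```python
import numpy as np

# Illustrative values only: a K-parameter model versus a power-law model I_pred ~ A N^nu.
K, A, nu = 100.0, 1.0, 0.5

I_param = lambda N: 0.5 * K * np.log2(N)    # I_pred(N) ~ (K/2) log2 N
I_power = lambda N: A * N**nu               # I_pred(N) ~ A N^nu

Nc = (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1 / nu)   # equation 6.1
print(f"crossover N_c ~ {Nc:.3g}")

for N in [Nc / 100, Nc / 10, Nc * 10, Nc * 100]:
    side = "power-law < parametric" if I_power(N) < I_param(N) else "power-law > parametric"
    print(f"N = {N:9.3g}:  {side}")
```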

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal


cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy ℓ(N) (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S₀. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S₁(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
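A rough way to look for the subextensive component in text is to estimate block entropies directly from letter statistics, in the spirit of Ebeling and Pöschel (1994). The sketch below is only a minimal version: "corpus.txt" is a placeholder for a long text, the range of N is kept small, and no correction is made for the systematic underestimation of entropy at large N, so it only shows the bookkeeping of separating S(N) into extensive and subextensive parts.

```python
import math
from collections import Counter

# Placeholder corpus; in practice one would load a long text such as Moby Dick.
text = open("corpus.txt").read().lower()

def block_entropy(text, N):
    counts = Counter(text[i:i + N] for i in range(len(text) - N + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

S = {N: block_entropy(text, N) for N in range(1, 9)}
s0 = S[8] - S[7]                       # crude estimate of the extensive entropy per letter
for N in range(1, 9):
    print(N, round(S[N], 3), round(S[N] - N * s0, 3))   # S(N) and its subextensive part
```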

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process; of course it cannot be Markovian if it has a power-law divergence in the predictive information.


conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations, and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.
Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.


4.1 A Test Case. Let us begin with a simple example, already referred to above. We observe two streams of data x and y, or equivalently a stream of pairs (x₁, y₁), (x₂, y₂), ..., (x_N, y_N). Assume that we know in advance that the x's are drawn independently and at random from a distribution P(x), while the y's are noisy versions of some function acting on x,

y_n = f(x_n; \alpha) + \eta_n,  (4.1)

where f(x; α) is a class of functions parameterized by α, and η_n is noise, which for simplicity we will assume is gaussian with known standard deviation σ. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

f(x; \alpha) = \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x).  (4.2)

The usual problem is to estimate, from N pairs {x_i, y_i}, the values of the parameters α. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy S(N). We go through the calculation in some detail because it provides a model for the more general case.

To evaluate the entropy S(N), we first construct the probability distribution P(x₁, y₁, x₂, y₂, ..., x_N, y_N). The same set of rules applies to the whole data stream, which here means that the same parameters α apply for all pairs {x_i, y_i}, but these parameters are chosen at random from a distribution P(α) at the start of the stream. Thus we write

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \int d^K\alpha\, P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N | \alpha)\, P(\alpha),  (4.3)

and now we need to construct the conditional distributions for fixed α. By hypothesis, each x is chosen independently, and once we fix α, each y_i is correlated only with the corresponding x_i, so that we have

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N | \alpha) = \prod_{i=1}^{N} \left[ P(x_i)\, P(y_i | x_i; \alpha) \right].  (4.4)

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of y_i has the form

P(y_i | x_i; \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} \left( y_i - \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x_i) \right)^{\!2} \right].  (4.5)


Putting all these factors together,

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{\!N} \int d^K\alpha\, P(\alpha)\, \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} y_i^2 \right] \times \exp\left[ -\frac{N}{2} \sum_{\mu,\nu=1}^{K} A_{\mu\nu}(\{x_i\})\, \alpha_\mu \alpha_\nu + N \sum_{\mu=1}^{K} B_\mu(\{x_i, y_i\})\, \alpha_\mu \right],  (4.6)

where

A_{\mu\nu}(\{x_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} \phi_\mu(x_i)\, \phi_\nu(x_i), \quad \text{and}  (4.7)

B_\mu(\{x_i, y_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} y_i\, \phi_\mu(x_i).  (4.8)

Our placement of the factors of N means that both A_{μν} and B_μ are of order unity as N → ∞. These quantities are empirical averages over the samples {x_i, y_i}, and if the φ_μ are well behaved, we expect that these empirical means converge to expectation values for most realizations of the series {x_i},

\lim_{N\to\infty} A_{\mu\nu}(\{x_i\}) = A^{\infty}_{\mu\nu} = \frac{1}{\sigma^2} \int dx\, P(x)\, \phi_\mu(x)\, \phi_\nu(x), \quad \text{and}  (4.9)

\lim_{N\to\infty} B_\mu(\{x_i, y_i\}) = B^{\infty}_{\mu} = \sum_{\nu=1}^{K} A^{\infty}_{\mu\nu} \bar{\alpha}_\nu,  (4.10)

where ᾱ are the parameters that actually gave rise to the data stream {x_i, y_i}. In fact, we can make the same argument about the terms in Σ y_i²,

\lim_{N\to\infty} \sum_{i=1}^{N} y_i^2 = N \sigma^2 \left[ \sum_{\mu,\nu=1}^{K} \bar{\alpha}_\mu A^{\infty}_{\mu\nu} \bar{\alpha}_\nu + 1 \right].  (4.11)

Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for, and implications of, this convergence breaking down.

Putting the different factors together, we obtain

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) \cong \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{\!N} \int d^K\alpha\, P(\alpha)\, \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right],  (4.12)


where the effective "energy" per sample is given by

E_N(\alpha; \{x_i, y_i\}) = \frac{1}{2} + \frac{1}{2} \sum_{\mu,\nu=1}^{K} (\alpha_\mu - \bar{\alpha}_\mu)\, A^{\infty}_{\mu\nu}\, (\alpha_\nu - \bar{\alpha}_\nu).  (4.13)

Here we use the symbol ≅ to indicate that we not only take the limit of large N but also neglect the fluctuations. Note that in this approximation, the dependence on the sample points themselves is hidden in the definition of ᾱ as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor N in the exponent; the energy E_N is of order unity as N → ∞. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

\int d^K\alpha\, P(\alpha) \exp\left[ -N E_N(\alpha; \{x_i, y_i\}) \right] \approx P(\alpha_{\rm cl}) \exp\left[ -N E_N(\alpha_{\rm cl}; \{x_i, y_i\}) - \frac{K}{2} \ln \frac{N}{2\pi} - \frac{1}{2} \ln \det F_N + \cdots \right],  (4.14)

where α_cl is the "classical" value of α determined by the extremal conditions

\left. \frac{\partial E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu} \right|_{\alpha = \alpha_{\rm cl}} = 0,  (4.15)

the matrix F_N consists of the second derivatives of E_N,

F_N = \left. \frac{\partial^2 E_N(\alpha; \{x_i, y_i\})}{\partial \alpha_\mu\, \partial \alpha_\nu} \right|_{\alpha = \alpha_{\rm cl}},  (4.16)

and ··· denotes terms that vanish as N → ∞. If we formulate the problem of estimating the parameters α from the samples {x_i, y_i}, then as N → ∞ the matrix N F_N is the Fisher information matrix (Cover & Thomas, 1991); the eigenvectors of this matrix give the principal axes for the error ellipsoid in parameter space, and the (inverse) eigenvalues give the variances of parameter estimates along each of these directions. The classical α_cl differs from ᾱ only in terms of order 1/N; we neglect this difference and further simplify the calculation of leading terms as N becomes large. After a little more algebra, we find the probability distribution we have been looking for:

P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) \cong \left[ \prod_{i=1}^{N} P(x_i) \right] \frac{1}{Z_A}\, P(\bar{\alpha})\, \exp\left[ -\frac{N}{2} \ln(2\pi e \sigma^2) - \frac{K}{2} \ln N + \cdots \right],  (4.17)


where the normalization constant

Z_A = \sqrt{(2\pi)^K \det A^{\infty}}.  (4.18)

Again we note that the sample points {x_i, y_i} are hidden in the value of ᾱ that gave rise to these points.5

To evaluate the entropy S(N), we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the x_i, and because these are chosen independently from the distribution P(x), the average again is easy to evaluate. The third term involves ᾱ, and we need to average this over the joint distribution P(x₁, y₁, x₂, y₂, ..., x_N, y_N). As above, we can evaluate this average in steps: first, we choose a value of the parameters ᾱ, then we average over the samples given these parameters, and finally we average over parameters. But because ᾱ is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

S(N) = N \left[ S_x + \frac{1}{2} \log_2\!\left( 2\pi e \sigma^2 \right) \right] + \frac{K}{2} \log_2 N + S_\alpha + \left\langle \log_2 Z_A \right\rangle_\alpha + \cdots,  (4.19)

where ⟨···⟩_α means averaging over parameters, S_x is the entropy of the distribution of x,

S_x = -\int dx\, P(x) \log_2 P(x),  (4.20)

and similarly for the entropy of the distribution of parameters,

S_\alpha = -\int d^K\alpha\, P(\alpha) \log_2 P(\alpha).  (4.21)

5 We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points {x_i, y_i}. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as N → ∞, so that this approach should yield the asymptotic behavior of the distribution and hence the subextensive entropy at large N. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as N → ∞, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.


The different terms in the entropy equation 4.19 have a straightforward interpretation. First, we see that the extensive term in the entropy,

S_0 = S_x + \frac{1}{2} \log_2\!\left( 2\pi e \sigma^2 \right),  (4.22)

reflects contributions from the random choice of x and from the gaussian noise in y; these extensive terms are independent of the variations in parameters α, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, S_α. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term ⟨log₂ Z_A⟩_α compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

S_1(N) \to \frac{K}{2} \log_2 N \ \ \text{(bits)}.  (4.23)

The coefficient of this divergence counts the number of parameters independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions {φ_μ(x)}. This is a hint that the result in equation 4.23 is more universal than our simple example.

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning: we are given examples of how the points x_n map into y_n, and from these examples we are to induce the association or functional relation between x and y. An alternative view is that the pair of points (x, y) should be viewed as a vector x⃗, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables {x⃗_i} is drawn independently from the same probability distribution Q(x⃗|α), and this distribution depends on a (potentially infinite dimensional) vector of parameters α. As above, the parameters are unknown, and before the series starts, they are chosen randomly from a distribution P(α). With no constraints on the densities P(α) or Q(x⃗|α), it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters α, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997, and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).


We begin with the definition of entropy,

S(N) \equiv S[\{\vec{x}_i\}] = -\int d\vec{x}_1 \cdots d\vec{x}_N\, P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \log_2 P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N).  (4.24)

By analogy with equation 4.3, we then write

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) = \int d^K\alpha\, P(\alpha) \prod_{i=1}^{N} Q(\vec{x}_i | \alpha).  (4.25)

Next, combining the equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite S(N) as

S(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \left\{ \int d\vec{x}_1 \cdots d\vec{x}_N \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \log_2 P(\{\vec{x}_i\}) \right\}.  (4.26)

Equation 4.26 allows an easy interpretation. There is the "true" set of parameters ᾱ that gave rise to the data sequence x⃗₁, ..., x⃗_N with the probability ∏_{j=1}^N Q(x⃗_j|ᾱ). We need to average log₂ P(x⃗₁, ..., x⃗_N) first over all possible realizations of the data, keeping the true parameters fixed, and then over the parameters ᾱ themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

P(\vec{x}_1, \ldots, \vec{x}_N) = \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, P(\alpha) \prod_{i=1}^{N} \left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right] = \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, P(\alpha) \exp\left[ -N E_N(\alpha; \{\vec{x}_i\}) \right],  (4.27)

E_N(\alpha; \{\vec{x}_i\}) = -\frac{1}{N} \sum_{i=1}^{N} \ln\left[ \frac{Q(\vec{x}_i | \alpha)}{Q(\vec{x}_i | \bar{\alpha})} \right].  (4.28)

Since by our interpretation ᾱ are the true parameters that gave rise to the particular data {x⃗_i}, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

E_N(\alpha; \{\vec{x}_i\}) = -\int d^D x\, Q(\vec{x} | \bar{\alpha}) \ln\left[ \frac{Q(\vec{x} | \alpha)}{Q(\vec{x} | \bar{\alpha})} \right] - \psi(\alpha, \bar{\alpha}; \{\vec{x}_i\}),  (4.29)

where ψ → 0 as N → ∞; here we neglect ψ and return to this term below.


The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, D_KL(ᾱ‖α), between the true distribution characterized by parameters ᾱ and the possible distribution characterized by α. Thus, at large N, we have

P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \cong \prod_{j=1}^{N} Q(\vec{x}_j | \bar{\alpha}) \int d^K\alpha\, P(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha} \| \alpha) \right],  (4.30)

where again the notation ≅ reminds us that we are not only taking the limit of large N but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find

S(N) \approx S_0 \cdot N + S^{(a)}_1(N),  (4.31)

S_0 = \int d^K\alpha\, P(\alpha) \left[ -\int d^D x\, Q(\vec{x} | \alpha) \log_2 Q(\vec{x} | \alpha) \right], \quad \text{and}  (4.32)

S^{(a)}_1(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \log_2\left[ \int d^K\alpha\, P(\alpha)\, e^{-N D_{\rm KL}(\bar{\alpha} \| \alpha)} \right].  (4.33)

Here S₁^(a) is an approximation to S₁ that neglects fluctuations ψ. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence x⃗₁, ..., x⃗_N with the disorder, E_N(α; {x⃗_i}) with the energy of the quenched system, and D_KL(ᾱ‖α) with its annealed analog.

The extensive term S₀, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the N dependence of the partition function

Z(\bar{\alpha}; N) = \int d^K\alpha\, P(\alpha) \exp\left[ -N D_{\rm KL}(\bar{\alpha} \| \alpha) \right],  (4.34)

and S₁^(a)(N) = −⟨log₂ Z(ᾱ; N)⟩_ᾱ is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are KL divergence D away from the target ᾱ,

\rho(D; \bar{\alpha}) = \int d^K\alpha\, P(\alpha)\, \delta\!\left[ D - D_{\rm KL}(\bar{\alpha} \| \alpha) \right].  (4.35)


This density could be very different for different targets.6 The density of divergences is normalized because the original distribution over parameter space, P(α), is normalized,

\int dD\, \rho(D; \bar{\alpha}) = \int d^K\alpha\, P(\alpha) = 1.  (4.36)

Finally, the partition function takes the simple form

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp\left[ -N D \right].  (4.37)

We recall that in statistical mechanics, the partition function is given by

Z(\beta) = \int dE\, \rho(E) \exp\left[ -\beta E \right],  (4.38)

where ρ(E) is the density of states that have energy E, and β is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length N of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of ρ(D; ᾱ) as D → 0.7

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small D, and the density term, which favors values of D that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with D → 0. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states ρ(D; ᾱ) at small D is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

D_{\rm KL}(\bar{\alpha} \| \alpha) \approx \frac{1}{2} \sum_{\mu\nu} \left( \bar{\alpha}_\mu - \alpha_\mu \right) F_{\mu\nu} \left( \bar{\alpha}_\nu - \alpha_\nu \right) + \cdots,  (4.39)

6 If parameter space is compact, then a related description of the space of targets, based on metric entropy, also called Kolmogorov's ε-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts, of size not greater than ε, into which the target space can be partitioned.

7 It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.


where NF is the Fisher information matrix Intuitively if we have a reason-able parameterization of the distributions then similar distributions will benearby in parameter space and more important points that are far apartin parameter space will never correspond to similar distributions Clarkeand Barron (1990) refer to this condition as the parameterization forminga ldquosoundrdquo family of distributions If this condition is obeyed then we canapproximate the low D limit of the density r (DI Nreg)

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)]

frac14Z

dKaP (reg)d

D iexcl

12

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm)

DZ

dKaP ( Nreg C U cent raquo)d

D iexcl

12

X

m

Lmj2m

(440)

where U is a matrix that diagonalizes F

( U T cent F cent U )mordm D Lmdmordm (441)

The delta function restricts the components of raquo in equation 440 to be oforder

pD or less and so if P(reg) is smooth we can make a perturbation

expansion After some algebra the leading term becomes

r (D 0I Nreg) frac14 P ( Nreg)2p

K2

C (K2)(det F )iexcl12 D(Kiexcl2)2 (442)

Here as before K is the dimensionality of the parameter vector Computingthe partition function from equation 437 we nd

Z( NregI N 1) frac14 f ( Nreg) cent C (K2)

NK2 (443)

where f ( Nreg) is some function of the target parameter values Finally thisallows us to evaluate the subextensive entropy from equations 433 and 434

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2 Z( NregI N) (444)

K2

log2 N C cent cent cent (bits) (445)

where ··· are finite as N → ∞. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even of the prior distribution P(α) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
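A quick numerical check of this parameter counting (again our own illustration, with arbitrary constants): insert the small-D density of equation 4.42, ρ(D) ∝ D^{(K−2)/2}, into the partition function Z(N) = ∫ dD ρ(D) e^{−ND}, integrate on a grid, and fit the slope of −log₂ Z against log₂ N; the slope should approach K/2.

import numpy as np

def log2_Z(N, K, u_max=3.0, n_grid=200001):
    # Z(N) = int_0^{D_max} D^{(K-2)/2} exp(-N D) dD, rewritten with D = u^2 so that the K = 1
    # case has no integrable singularity at the origin; overall constants drop out of the slope
    u = np.linspace(0.0, u_max, n_grid)
    integrand = 2.0 * u ** (K - 1) * np.exp(-N * u * u)
    return np.log2(np.trapz(integrand, u))

Ns = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
for K in (1, 2, 4, 8):
    S1 = np.array([-log2_Z(N, K) for N in Ns])   # subextensive entropy up to an N-independent constant
    slope = np.polyfit(np.log2(Ns), S1, 1)[0]
    print(K, slope)                              # the fitted slope should be close to K / 2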

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension d_VC), which measures the capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of F is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < ∞ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small-D_KL behavior of the model density,

\rho(D \to 0; \bar{\alpha}) \propto D^{(d-2)/2},   (4.46)

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density ρ(D; ᾱ) accumulates a macroscopic weight at some nonzero value of D, so that the small-D behavior is irrelevant for the large-N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small-D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_pred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N ≫ d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution Q(x⃗_1, x⃗_2, …, x⃗_N | α). Again, α is an unknown parameter vector, chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log₂ N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S[\{\vec{x}_i\} \,|\, \alpha] \equiv -\int d^N\vec{x} \; Q(\{\vec{x}_i\} \,|\, \alpha) \log_2 Q(\{\vec{x}_i\} \,|\, \alpha) \to N S_0 + S^*_0, \qquad S^*_0 = O(1);   (4.47)

D_{KL}\left[ Q(\{\vec{x}_i\} \,|\, \bar{\alpha}) \,\|\, Q(\{\vec{x}_i\} \,|\, \alpha) \right] \to N D_{KL}(\bar{\alpha}\|\alpha) + o(N).   (4.48)

Here the quantities S_0, S^*_0, and D_KL(ᾱ‖α) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data; all of the long-range predictability is associated with learning the parameters.⁸ The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result

S(N) = S_0 \cdot N + S^{(a)}_1(N),   (4.49)

S^{(a)}_1(N) = \frac{K}{2} \log_2 N + \cdots \ \mathrm{(bits)},   (4.50)

where ··· stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(x⃗_1, …, x⃗_N | α). Suppose, for example, that S^*_0 = (K^*/2) log₂ N. Then the subextensive entropy becomes

S^{(a)}_1(N) = \frac{K + K^*}{2} \log_2 N + \cdots \ \mathrm{(bits)}.   (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K ≠ 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.⁹

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar{\alpha} \, P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j \, Q(\vec{x}_j | \bar{\alpha}) \right]
\times \log_2 \left[ \prod_{i=1}^{N} Q(\vec{x}_i | \bar{\alpha}) \int d^K\alpha \, P(\alpha) \, e^{-N D_{KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha, \bar{\alpha}; \{\vec{x}_i\})} \right].   (4.52)

This expression can be decomposed into the terms identified above plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S^{(f)}_1(N):

S(N) = S_0 \cdot N + S^{(a)}_1(N) + S^{(f)}_1(N),   (4.53)

S^{(f)}_1(N) = -\int d^K\bar{\alpha} \, P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j \, Q(\vec{x}_j | \bar{\alpha}) \right]
\times \log_2 \left[ \int \frac{d^K\alpha \, P(\alpha)}{Z(\bar{\alpha}; N)} \, e^{-N D_{KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha, \bar{\alpha}; \{\vec{x}_i\})} \right],   (4.54)

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of −S^{(a)}_1 as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded:

S_1(N) \le N \int d^K\alpha \int d^K\bar{\alpha} \, P(\alpha) P(\bar{\alpha}) \, D_{KL}(\bar{\alpha}\|\alpha) \equiv \langle N D_{KL}(\bar{\alpha}\|\alpha) \rangle_{\bar{\alpha}\alpha},   (4.55)

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱα} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(x⃗ | α)} is finite. Then for any reasonable function F(x⃗; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P \left\{ \sup_{\beta} \frac{ \left| \frac{1}{N}\sum_j F(\vec{x}_j; \beta) - \int d\vec{x} \, Q(\vec{x}|\bar{\alpha}) \, F(\vec{x}; \beta) \right| }{ L[F] } > \epsilon \right\} < M(\epsilon, N, d_{VC}) \, e^{-c N \epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality, in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review, see Vapnik, 1998, and the references therein).
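To make the flavor of such bounds concrete, here is a small simulation of ours (not taken from the text): for the gaussian location family Q(x|α) = N(α, 1) with target ᾱ = 0, the function F(x; β) = log[Q(x|ᾱ)/Q(x|β)] = −βx + β²/2 has expectation β²/2 under the target, and the worst-case (over a grid of β) deviation of its empirical mean shrinks roughly as N^{−1/2}, consistent with bounds of the form of equation 4.56.

import numpy as np

rng = np.random.default_rng(1)
betas = np.linspace(-1.0, 1.0, 201)      # parameter grid over which the sup is taken
n_trials = 100

for N in (10, 100, 1000, 10000):
    devs = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.standard_normal(N)                                 # samples from Q(x | alpha_bar = 0)
        emp = -np.outer(betas, x).mean(axis=1) + 0.5 * betas ** 2  # empirical mean of F(x; beta)
        devs[t] = np.max(np.abs(emp - 0.5 * betas ** 2))           # worst case over the grid
    print(N, devs.mean(), devs.mean() * np.sqrt(N))
    # the rescaled deviation (last column) stays roughly constant, i.e., the sup decays as 1/sqrt(N)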

To start the proof of finiteness of S^{(f)}_1 in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {x⃗_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x⃗_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus the worst-case contribution of these atypical sequences is

S^{(f),\,\mathrm{atypical}}_1 \sim \int_{\delta}^{\infty} d\epsilon \, N^2 \epsilon^2 \, M(\epsilon) \, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \quad \text{for large } N.   (4.57)


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S^{(f),\,\mathrm{typical}}_1 \sim \int d^K\bar{\alpha} \, P(\bar{\alpha}) \, d^N\vec{x} \prod_j Q(\vec{x}_j|\bar{\alpha}) \; N \left( B \cdot A^{-1} \cdot B \right);

B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0;

(A)_{\mu\nu} = \frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu \, \partial \bar{\alpha}_\nu}, \qquad \langle A \rangle_{\vec{x}} = F.   (4.58)

Here ⟨···⟩_x⃗ means an averaging with respect to all the x⃗_i's keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu} \, (F^{-1})_{\mu\nu} \, \frac{\partial \log Q(\vec{x}_j|\bar{\alpha})}{\partial \bar{\alpha}_\nu}   (4.59)

over all the x⃗_i's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.
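For the gaussian location family, this boundedness can be checked in a few lines (an illustration of ours, not a calculation from the text): there B = (1/N)Σᵢ xᵢ and F = 1, so the fluctuation contribution N·B F^{−1} B = N x̄² is a chi-squared variable of order one at every N.

import numpy as np

rng = np.random.default_rng(2)
n_trials = 2000

for N in (10, 100, 1000, 10000):
    x = rng.standard_normal((n_trials, N))   # samples from Q(x | alpha_bar) with alpha_bar = 0
    B = x.mean(axis=1)                       # B = (1/N) sum_i d log Q(x_i|alpha)/d alpha at alpha_bar
    print(N, (N * B ** 2).mean())            # N * B F^{-1} B with F = 1; stays of order one for all N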

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ ··· can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|a) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x - a)^2 \right], \qquad x \in (-\infty; +\infty);   (4.60)

Q(x|a) = \frac{\exp(-\sin ax)}{\int_0^{2\pi} dx \, \exp(-\sin ax)}, \qquad x \in [0; 2\pi).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < a < a_max and zero elsewhere, with a_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin ax. Adding a new data point would not help until that best but wrong parameter estimate is less than a_max.¹⁰ So the fluctuations are large, and the predictive information is small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of a would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for a_max → ∞ the weight in ρ(D; ā) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.
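The contrast between the two families above can be played with numerically. The sketch below is our own construction; the box size a_max, the parameter grid, and the rejection sampler are all illustrative assumptions. It draws samples from Q(x|a) ∝ exp(−sin ax) at a = 1 and maximizes the log-likelihood over a grid of a; for small N the maximum often sits at a spuriously large frequency, and only with more data does the estimate settle near the true value, whereas the gaussian family of equation 4.60 poses no such difficulty.

import numpy as np

rng = np.random.default_rng(3)
a_true, a_max = 1.0, 50.0
a_grid = np.linspace(0.05, a_max, 10001)                    # the "box" 0 < a < a_max, on a grid
x_grid = np.linspace(0.0, 2 * np.pi, 2048, endpoint=False)

# normalization Z(a) = int_0^{2 pi} exp(-sin(a x)) dx, by crude numerical quadrature
logZ = np.array([np.log(2 * np.pi * np.mean(np.exp(-np.sin(a * x_grid)))) for a in a_grid])

def sample(n):
    # rejection sampling from Q(x | a_true) on [0, 2 pi); exp(-sin) <= e bounds the density
    out = np.empty(0)
    while out.size < n:
        x = rng.uniform(0.0, 2 * np.pi, size=4 * n)
        keep = rng.uniform(0.0, 1.0, size=x.size) < np.exp(-np.sin(a_true * x)) / np.e
        out = np.concatenate([out, x[keep]])
    return out[:n]

for N in (4, 16, 64, 256):
    x = sample(N)
    loglik = -np.sin(np.outer(a_grid, x)).sum(axis=1) - N * logZ
    print(N, a_grid[np.argmax(loglik)])
    # for small N the maximizer often lands far from a_true = 1; with more data it moves toward it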

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens: eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right], \qquad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar{\alpha}; N) = \int dD \, \rho(D; \bar{\alpha}) \exp[-ND]

\approx A(\bar{\alpha}) \int dD \exp\left[ -\frac{B(\bar{\alpha})}{D^\mu} - ND \right]

\approx \tilde{A}(\bar{\alpha}) \exp\left[ -\frac{1}{2}\,\frac{\mu+2}{\mu+1} \ln N - C(\bar{\alpha}) \, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[ \frac{2\pi \mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} [B(\bar{\alpha})]^{1/(2\mu+2)},   (4.64)

C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S^{(a)}_1(N) \to \frac{1}{\ln 2} \, \langle C(\bar{\alpha}) \rangle_{\bar{\alpha}} \, N^{\mu/(\mu+1)} \ \mathrm{(bits)},   (4.66)

where ⟨···⟩_ᾱ denotes an average over all the target models.
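The saddle-point result can be checked numerically. In the sketch below (arbitrarily chosen μ and B, purely for illustration), we integrate Z(N) = ∫ dD exp(−B/D^μ − ND) directly and compare −ln Z with the leading term C N^{μ/(μ+1)} of equations 4.63 and 4.65; the residual grows only logarithmically, as the subleading term predicts.

import numpy as np

mu, B = 0.5, 1.0                    # arbitrary illustrative choices
C = B ** (1 / (mu + 1)) * (mu ** (-mu / (mu + 1)) + mu ** (1 / (mu + 1)))   # equation 4.65

def minus_lnZ(N, n_grid=400000):
    # direct numerical integration of Z(N) = int dD exp(-B/D^mu - N D), shifted to avoid underflow
    D = np.linspace(1e-8, 50.0 / N ** (1 / (mu + 1)), n_grid)   # grid covering the saddle point
    f = B / D ** mu + N * D
    fmin = f.min()
    return fmin - np.log(np.sum(np.exp(-(f - fmin))) * (D[1] - D[0]))

for N in (1e2, 1e3, 1e4, 1e5):
    lead = C * N ** (mu / (mu + 1))
    print(int(N), minus_lnZ(N), lead, minus_lnZ(N) - lead)
    # the difference in the last column grows only logarithmically, as in equation 4.63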


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively, and quantitatively, much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/2N bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^{ν−1} bits. Summing this up over all samples, we find S^{(a)}_1 ∼ N^ν, and if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with an increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
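The bookkeeping behind this argument is easy to reproduce (a schematic sum of ours, not a real learning calculation): if the nth sample carries roughly K(n)/2n bits with K(n) = n^ν, the accumulated total indeed grows as N^ν.

import numpy as np

nu = 0.5                                       # growth exponent K(n) = n^nu (illustrative)
Ns = np.array([1e2, 1e3, 1e4, 1e5, 1e6])

totals = []
for N in Ns:
    n = np.arange(1, int(N) + 1, dtype=float)
    totals.append(np.sum(n ** nu / (2 * n)))   # accumulate the ~K(n)/2n bits carried by the n-th sample
totals = np.array(totals)

print(np.polyfit(np.log(Ns), np.log(totals), 1)[0])   # fitted exponent close to nu: S_1 grows as N^nu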

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description (effectively increasing the number of parameters) as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/ℓ_0) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^2 \right] \delta\left[ \frac{1}{\ell_0} \int dx \, e^{-\phi(x)} - 1 \right],   (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, …, x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/ℓ_0) exp[−φ(x)], we can write the KL divergence as

D_{KL}[\bar{\phi}(x) \| \phi(x)] = \frac{1}{\ell_0} \int dx \, \exp[-\bar{\phi}(x)] \, [\phi(x) - \bar{\phi}(x)].   (4.68)

We want to compute the density

\rho(D; \bar{\phi}) = \int [d\phi(x)] \, P[\phi(x)] \, \delta\left( D - D_{KL}[\bar{\phi}(x)\|\phi(x)] \right)   (4.69)

= M \int [d\phi(x)] \, P[\phi(x)] \, \delta\left( MD - M D_{KL}[\bar{\phi}(x)\|\phi(x)] \right),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar{\phi}) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)] \, P[\phi(x)] \exp(-izM D_{KL})   (4.71)

= M \int \frac{dz}{2\pi} \exp\left( izMD + \frac{izM}{\ell_0} \int dx \, \bar{\phi}(x) \exp[-\bar{\phi}(x)] \right)
\times \int [d\phi(x)] \, P[\phi(x)] \exp\left( -\frac{izM}{\ell_0} \int dx \, \phi(x) \exp[-\bar{\phi}(x)] \right).   (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

\int [d\phi(x)] \, P[\phi(x)] \exp\left( -\frac{izM}{\ell_0} \int dx \, \phi(x) \exp[-\bar{\phi}(x)] \right)

= \exp\left( -\frac{izM}{\ell_0} \int dx \, \bar{\phi}(x) \exp[-\bar{\phi}(x)] - S_{\mathrm{eff}}[\bar{\phi}(x); zM] \right),   (4.73)

S_{\mathrm{eff}}[\bar{\phi}; zM] = \frac{\ell}{2} \int dx \left( \frac{\partial\bar{\phi}}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{\ell \ell_0} \right)^{1/2} \int dx \, \exp[-\bar{\phi}(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and M D^{3/2} ≫ 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar{\phi}) = A[\bar{\phi}(x)] \, D^{-3/2} \exp\left( -\frac{B[\bar{\phi}(x)]}{D} \right),   (4.75)

A[\bar{\phi}(x)] = \frac{1}{\sqrt{16\pi \ell \ell_0}} \exp\left[ -\frac{\ell}{2} \int dx \, (\partial_x \bar{\phi})^2 \right] \times \int dx \, \exp[-\bar{\phi}(x)/2],   (4.76)

B[\bar{\phi}(x)] = \frac{1}{16 \ell \ell_0} \left( \int dx \, \exp[-\bar{\phi}(x)/2] \right)^2.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S^{(a)}_1(N) \sim \frac{1}{2\ln 2} \sqrt{\frac{\pi}{\ell\ell_0}} \left\langle \int dx \, \exp[-\bar{\phi}(x)/2] \right\rangle_{\bar{\phi}} N^{1/2},   (4.78)

where ⟨···⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S^{(a)}_1(N) \sim \frac{1}{2\ln 2} \sqrt{\pi N} \left( \frac{L}{\ell} \right)^{1/2} \ \mathrm{bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(ℓ/(N Q(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
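The bin counting is easy to make quantitative (toy numbers of our own choosing): with local bin size √(ℓ/(N Q(x))), the number of bins is ∫ dx √(N Q(x)/ℓ), which equals √(NL/ℓ) for the uniform distribution on [0, L] and grows as √N for any fixed smooth Q.

import numpy as np

ell, L = 0.1, 1.0                      # smoothness scale and box size (illustrative values)
x = np.linspace(0.0, L, 10001)

def n_bins(Q, N):
    # number of adaptive bins: integral of sqrt(N Q(x) / ell), the inverse of the local bin size
    return np.trapz(np.sqrt(N * Q / ell), x)

Q_uniform = np.ones_like(x) / L
Q_bump = np.exp(-0.5 * ((x - 0.5 * L) / (0.1 * L)) ** 2)
Q_bump /= np.trapz(Q_bump, x)          # normalize the nonuniform example

for N in (100, 10000, 1000000):
    print(N, n_bins(Q_uniform, N), np.sqrt(N * L / ell), n_bins(Q_bump, N))
    # the uniform case reproduces sqrt(N L / ell) exactly; both cases grow as sqrt(N)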

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼√N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities, n ∼ √N, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). Real calculations for these cases, however, remain challenging.

5 I_pred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹² Rissanen (1984) and Clarke and Barron (1990), in full generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimation with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, …, x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, …, x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, …, x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log₂ N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or of not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.¹³ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.¹⁴ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;¹⁵ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ, the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations.¹⁶ On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations. So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number 𝒩(N) of walks with N steps, we find that 𝒩(N) ∼ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log₂ z, a constant subextensive entropy log₂ A, and a divergent subextensive term S_1(N) → γ log₂ N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
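This decomposition can be seen in a back-of-the-envelope enumeration (our own, and necessarily crude at the short walk lengths accessible to brute force): count self-avoiding walks on the square lattice and fit log₂ 𝒩(N) to the form N log₂ z + γ log₂ N + const, which separates the extensive, divergent subextensive, and constant pieces.

import numpy as np

def count_saws(n, pos=(0, 0), visited=None):
    # exhaustive count of n-step self-avoiding walks on the square lattice
    if visited is None:
        visited = {pos}
    if n == 0:
        return 1
    total = 0
    x, y = pos
    for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if nxt not in visited:
            visited.add(nxt)
            total += count_saws(n - 1, nxt, visited)
            visited.remove(nxt)
    return total

lengths = np.arange(1, 11)
counts = np.array([count_saws(n) for n in lengths], dtype=float)   # 4, 12, 36, 100, ...

# fit log2 N(N) = N log2(z) + gamma log2(N) + log2(A)
X = np.column_stack([lengths, np.log2(lengths), np.ones_like(lengths, dtype=float)])
coef, *_ = np.linalg.lstsq(X, np.log2(counts), rcond=None)
print("extensive part   log2(z):", coef[0])
print("divergent part   gamma  :", coef[1])
print("constant part    log2(A):", coef[2])
# the estimates are rough at such short walk lengths, but the three pieces separate cleanly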

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{cont} = -\int dx(t) \, P[x(t)] \log_2 P[x(t)] ,    (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{rel} = -\int dx(t) \, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right) ,    (5.2)

16. Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].
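As a quick check of this invariance (our addition, not part of the original argument), consider an invertible, instantaneous change of variables y(t) = f(x(t)). Both densities transform with the same Jacobian factor, which cancels inside the logarithm:

P[y] = P[x] \left| \frac{dx}{dy} \right| , \qquad Q[y] = Q[x] \left| \frac{dx}{dy} \right| \quad \Longrightarrow \quad \frac{P[y]}{Q[y]} = \frac{P[x]}{Q[x]} ,

so S_rel is unchanged, while the entropy of equation 5.1 shifts by an average of \log_2 |dx/dy| and is not invariant.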

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information.


We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
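The case in which equality can be achieved is easy to see in a toy example. The sketch below (ours, not from the original; it assumes a stationary, symmetric two-state Markov chain with flip probability p, and all names are ours) computes I(past; future) exactly from short windows and shows that compressing the past down to its most recent symbol, which is a sufficient statistic for a Markov chain, preserves all of the predictive information:

import itertools
from math import log2

def window_prob(seq, p):
    # Probability of a binary window under a stationary, symmetric
    # two-state Markov chain with flip probability p.
    prob = 0.5  # uniform stationary distribution
    for a, b in zip(seq, seq[1:]):
        prob *= (1 - p) if a == b else p
    return prob

def mutual_info(joint):
    # Mutual information in bits from a dict {(x, y): probability}.
    px, py = {}, {}
    for (x, y), pr in joint.items():
        px[x] = px.get(x, 0.0) + pr
        py[y] = py.get(y, 0.0) + pr
    return sum(pr * log2(pr / (px[x] * py[y]))
               for (x, y), pr in joint.items() if pr > 0)

p, n_past, n_future = 0.2, 3, 3

# I(past; future) computed from the full joint distribution of windows.
joint_full = {}
for w in itertools.product((0, 1), repeat=n_past + n_future):
    joint_full[(w[:n_past], w[n_past:])] = window_prob(w, p)

# Compress the past to its last symbol and recompute the information.
joint_last = {}
for (past, future), pr in joint_full.items():
    key = (past[-1], future)
    joint_last[key] = joint_last.get(key, 0.0) + pr

print("I(past; future)        =", round(mutual_info(joint_full), 6), "bits")
print("I(last symbol; future) =", round(mutual_info(joint_last), 6), "bits")
# The two numbers coincide: for a Markov chain, nothing predictive is lost.

For data streams without such a sufficient statistic, the compressed value would fall strictly below the full value, tracing out the trade-off curve described above.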


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet, one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information (see footnote 17). It should be possible to test this directly by

17. If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model (see footnote 18), and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment?


18. As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.


How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ≈ A N^ν and one with K parameters so that I_pred(N) ≈ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2 A \nu} \log_2 \left( \frac{K}{2A} \right) \right]^{1/\nu} .    (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
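A small numerical sketch of this crossover (ours, with illustrative parameter values that are not taken from the paper) compares the estimate of equation 6.1 with the point at which the two predictive-information curves actually cross:

from math import log2

def ipred_param(N, K):
    # Finite-parameter class: I_pred(N) ~ (K/2) log2 N.
    return 0.5 * K * log2(N)

def ipred_nonparam(N, A, nu):
    # Power-law (nonparametric) class: I_pred(N) ~ A * N**nu.
    return A * N ** nu

def nc_estimate(K, A, nu):
    # Leading-order estimate of the crossover, equation 6.1.
    return (K / (2 * A * nu) * log2(K / (2 * A))) ** (1.0 / nu)

def nc_numerical(K, A, nu, hi=1e12):
    # Locate the actual crossing of the two curves by bisection.
    lo = 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if ipred_nonparam(mid, A, nu) < ipred_param(mid, K):
            lo = mid
        else:
            hi = mid
    return lo

K, A, nu = 100, 1.0, 0.5  # illustrative values only
print("equation 6.1 estimate of N_c:", f"{nc_estimate(K, A, nu):.2e}")
print("numerical crossing:          ", f"{nc_numerical(K, A, nu):.2e}")
# Below the crossing, the power-law class carries less predictive
# information than the K-parameter class, as stated in the text; the
# estimate and the exact crossing agree to leading logarithmic order.

Because N_c grows roughly as (K \log_2 K)^{1/\nu} with \nu < 1, the crossover moves out extremely quickly as the number of parameters grows, which is what makes the regime N ≪ N_c plausible for systems with very many parameters.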

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of


inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ≈ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ≈ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy ℓ(N) (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N (see footnote 19).
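The guessing-game bound is easy to simulate by replacing the human reader with any sequential predictor. The sketch below (ours; it uses a simple bigram-frequency guesser and a placeholder file name) records the number of guesses needed for each letter and reports the entropy of that distribution, which upper-bounds the conditional entropy of the next letter given the preceding text:

from collections import Counter, defaultdict
from math import log2

def guess_ranks(text):
    # Replay Shannon's guessing game with a bigram-frequency "reader":
    # guess candidate letters in order of how often they have followed
    # the previous letter so far, and record how many guesses are needed.
    follow = defaultdict(Counter)
    alphabet = sorted(set(text))
    ranks = []
    for prev, nxt in zip(text, text[1:]):
        order = [c for c, _ in follow[prev].most_common()]
        order += [c for c in alphabet if c not in order]
        ranks.append(order.index(nxt) + 1)
        follow[prev][nxt] += 1
    return ranks

def entropy_bits(samples):
    counts = Counter(samples)
    n = len(samples)
    return -sum((k / n) * log2(k / n) for k in counts.values())

text = open("corpus.txt").read().lower()  # any long English text; file name is a placeholder
ranks = guess_ranks(text)
print("entropy of the guess-number distribution:",
      round(entropy_bits(ranks), 3), "bits per letter")
# The entropy of these guess counts is an upper bound on the conditional
# entropy of the next letter given the context used by the predictor.

Replacing the bigram guesser with a human reader, or with a stronger model, tightens the bound; Shannon's experiment exploits exactly this freedom.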

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in

19. Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations, and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.
Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Rényi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

Page 15: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as

Predictability Complexity and Learning 2423

Putting all these factors together

P(x1 y1 x2 y2 xN yN )

D

NY

iD1P(xi)

sup3 1p

2p s2

acuteN ZdK

a P (reg) exp

iexcl

12s2

NX

iD1y2

i

pound exp

iexclN

2

KX

m ordmD1Amordm(fxig)am aordm C N

KX

m D1Bm (fxi yig)am

(46)

where

Amordm(fxig) D1

s2N

NX

iD1wm (xi)wordm(xi) and (47)

Bm (fxi yig) D1

s2N

NX

iD1yiwm (xi) (48)

Our placement of the factors of N means that both Amordm and Bm are of orderunity as N 1 These quantities are empirical averages over the samplesfxi yig and if the wm are well behaved we expect that these empirical meansconverge to expectation values for most realizations of the series fxig

limN1

Amordm(fxig) D A1mordm D

1

s2

ZdxP(x)wm (x)wordm(x) (49)

limN1

Bm (fxi yig) D B1m D

KX

ordmD1A1

mordm Naordm (410)

where Nreg are the parameters that actually gave rise to the data stream fxi yigIn fact we can make the same argument about the terms in P

y2i

limN1

NX

iD1y2

i D Ns2

KX

m ordmD1Nam A1

mordm Naordm C 1

(411)

Conditions for this convergence of empirical means to expectation valuesare at the heart of learning theory Our approach here is rst to assume thatthis convergence works then to examine the consequences for the predictiveinformation and nally to address the conditions for and implications ofthis convergence breaking down

Putting the different factors together we obtain

P(x1 y1 x2 y2 xN yN ) f NY

iD1P(xi)

sup3 1p

2p s2

acuteN ZdK

aP (reg)

pound exppoundiexclNEN(regI fxi yig)

curren (412)

2424 W Bialek I Nemenman and N Tishby

where the effective ldquoenergyrdquo per sample is given by

EN(regI fxi yig) D12

C12

KX

m ordmD1

(am iexcl Nam )A1mordm

(aordm iexcl Naordm) (413)

Here we use the symbol f to indicate that we not only take the limit oflarge N but also neglect the uctuations Note that in this approximationthe dependence on the sample points themselves is hidden in the denitionof Nreg as being the parameters that generated the samples

The integral that we need to do in equation 412 involves an exponentialwith a large factor N in the exponent the energy EN is of order unity asN 1 This suggests that we evaluate the integral by a saddle-pointor steepest-descent approximation (similar analyses were performed byClarke amp Barron 1990 MacKay 1992 and Balasubramanian 1997)

ZdK

aP (reg) exppoundiexclNEN (regI fxi yig)

curren

frac14 P (regcl) expmicro

iexclNEN(regclI fxi yig) iexclK2

lnN2p

iexcl12

ln det FN C cent cent centpara

(414)

where regcl is the ldquoclassicalrdquo value of reg determined by the extremal conditions

EN(regI fxi yig)am

shyshyshyshyaDacl

D 0 (415)

the matrix FN consists of the second derivatives of EN

FN D 2EN (regI fxi yig)

am aordm

shyshyshyshyaDacl

(416)

and cent cent cent denotes terms that vanish as N 1 If we formulate the problemof estimating the parameters reg from the samples fxi yig then as N 1the matrix NFN is the Fisher information matrix (Cover amp Thomas 1991)the eigenvectors of this matrix give the principal axes for the error ellipsoidin parameter space and the (inverse) eigenvalues give the variances ofparameter estimates along each of these directions The classical regcl differsfrom Nreg only in terms of order 1N we neglect this difference and furthersimplify the calculation of leading terms as N becomes large After a littlemore algebra we nd the probability distribution we have been lookingfor

P(x1 y1 x2 y2 xN yN)

f NY

iD1P(xi)

1

ZAP ( Nreg) exp

microiexcl

N2

ln(2p es2) iexcl

K2

ln N C cent cent centpara

(417)

Predictability Complexity and Learning 2425

where the normalization constant

ZA Dp

(2p )K det A1 (418)

Again we note that the sample points fxi yig are hidden in the value of Nregthat gave rise to these points5

To evaluate the entropy S(N) we need to compute the expectation valueof the (negative) logarithm of the probability distribution in equation 417there are three terms One is constant so averaging is trivial The secondterm depends only on the xi and because these are chosen independentlyfrom the distribution P(x) the average again is easy to evaluate The thirdterm involves Nreg and we need to average this over the joint distributionP(x1 y1 x2 y2 xN yN) As above we can evaluate this average in stepsFirst we choose a value of the parameters Nreg then we average over thesamples given these parameters and nally we average over parametersBut because Nreg is dened as the parameters that generate the samples thisstepwise procedure simplies enormously The end result is that

S(N) D NmicroSx C

12

log2

plusmn2p es

2sup2para

CK2

log2 N

C Sa Claquolog2 ZA

nota

C cent cent cent (419)

where hcent cent centia means averaging over parameters Sx is the entropy of thedistribution of x

Sx D iexclZ

dx P(x) log2 P(x) (420)

and similarly for the entropy of the distribution of parameters

Sa D iexclZ

dKaP (reg) log2 P (reg) (421)

5 We emphasize once more that there are two approximations leading to equation 417First we have replaced empirical means by expectation values neglecting uctuations as-sociated with the particular set of sample points fxi yig Second we have evaluated theaverage over parameters in a saddle-point approximation At least under some condi-tions both of these approximations become increasingly accurate as N 1 so thatthis approach should yield the asymptotic behavior of the distribution and hence thesubextensive entropy at large N Although we give a more detailed analysis below it isworth noting here how things can go wrong The two approximations are independentand we could imagine that uctuations are important but saddle-point integration stillworks for example Controlling the uctuations turns out to be exactly the question ofwhether our nite parameterization captures the true dimensionality of the class of mod-els as discussed in the classic work of Vapnik Chervonenkis and others (see Vapnik1998 for a review) The saddle-point approximation can break down because the saddlepoint becomes unstable or because multiple saddle points become important It will turnout that instability is exponentially improbable as N 1 while multiple saddle pointsare a real problem in certain classes of models again when counting parameters does notreally measure the complexity of the model class

2426 W Bialek I Nemenman and N Tishby

The different terms in the entropy equation 419 have a straightforwardinterpretation First we see that the extensive term in the entropy

S0 D Sx C12

log2(2p es2) (422)

reects contributions from the random choice of x and from the gaussiannoise in y these extensive terms are independent of the variations in pa-rameters reg and these would be the only terms if the parameters were notvarying (that is if there were nothing to learn) There also is a term thatreects the entropy of variations in the parameters themselves Sa Thisentropy is not invariant with respect to coordinate transformations in theparameter space but the term hlog2 ZAia compensates for this noninvari-ance Finally and most interesting for our purposes the subextensive pieceof the entropy is dominated by a logarithmic divergence

S1(N) K2

log2 N (bits) (423)

The coefcient of this divergence counts the number of parameters indepen-dent of the coordinate system that we choose in the parameter space Fur-thermore this result does not depend on the set of basis functions fwm (x)gThis is a hint that the result in equation 423 is more universal than oursimple example

42 Learning a Parameterized Distribution The problem discussedabove is an example of supervised learning We are given examples of howthe points xn map into yn and from these examples we are to induce theassociation or functional relation between x and y An alternative view isthat the pair of points (x y) should be viewed as a vector Ex and what we arelearning is the distribution of this vector The problem of learning a distri-bution usually is called unsupervised learning but in this case supervisedlearning formally is a special case of unsupervised learning if we admit thatall the functional relations or associations that we are trying to learn have anelement of noise or stochasticity then this connection between supervisedand unsupervised problems is quite general

Suppose a series of random vector variables fExig is drawn independentlyfrom the same probability distribution Q(Ex|reg) and this distribution de-pends on a (potentially innite dimensional) vector of parameters reg Asabove the parameters are unknown and before the series starts they arechosen randomly from a distribution P (reg) With no constraints on the den-sities P (reg) or Q(Ex|reg) it is impossible to derive any regression formulas forparameter estimation but one can still say something about the entropy ofthe data series and thus the predictive information For a nite-dimensionalvector of parameters reg the literature on bounding similar quantities isespecially rich (Haussler Kearns amp Schapire 1994 Wong amp Shen 1995Haussler amp Opper 1995 1997 and references therein) and related asymp-totic calculations have been done (Clarke amp Barron 1990 MacKay 1992Balasubramanian 1997)

Predictability Complexity and Learning 2427

We begin with the denition of entropy

S(N) acute S[fExig]

D iexclZ

dEx1 cent cent cent dExNP(Ex1 Ex2 ExN) log2 P(Ex1 Ex2 ExN) (424)

By analogy with equation 43 we then write

P(Ex1 Ex2 ExN) DZ

dKaP (reg)

NY

iD1Q(Exi |reg) (425)

Next combining the equations 424 and 425 and rearranging the order ofintegration we can rewrite S(N) as

S(N) D iexclZ

dK NregP ( Nreg)

pound

8lt

ZdEx1 cent cent cent dExN

NY

jD1Q(Exj | Nreg) log2 P(fExig)

9=

(426)

Equation 426 allows an easy interpretation There is the ldquotruerdquo set of pa-rameters Nreg that gave rise to the data sequence Ex1 ExN with the probabilityQN

jD1 Q(Exj | Nreg) We need to average log2 P(Ex1 ExN) rst over all possiblerealizations of the data keeping the true parameters xed and then over theparameters Nreg themselves With this interpretation in mind the joint prob-ability density the logarithm of which is being averaged can be rewrittenin the following useful way

P(Ex1 ExN) DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg)NY

iD1

microQ(Exi |reg)

Q(Exi | Nreg)

para

DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNEN(regI fExig)] (427)

EN (regI fExig) D iexcl1N

NX

iD1ln

microQ(Exi |reg)

Q(Exi | Nreg)

para (428)

Since by our interpretation Nreg are the true parameters that gave rise tothe particular data fExig we may expect that the empirical average in equa-tion 428 will converge to the corresponding expectation value so that

EN(regI fExig) D iexclZ

dDxQ(x | Nreg) lnmicroQ(Ex|reg)

Q(Ex| Nreg)

paraiexcl y (reg NregI fxig) (429)

where y 0 as N 1 here we neglect y and return to this term below

2428 W Bialek I Nemenman and N Tishby

The rst term on the right-hand side of equation 429 is the Kullback-Leibler (KL) divergence DKL( Nregkreg) between the true distribution charac-terized by parameters Nreg and the possible distribution characterized by regThus at large N we have

P(Ex1 Ex2 ExN)

fNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNDKL( Nregkreg)] (430)

where again the notation f reminds us that we are not only taking the limitof largeN but also making another approximationinneglecting uctuationsBy the same arguments as above we can proceed (formally) to compute theentropy of this distribution We nd

S(N) frac14 S0 cent N C S(a)1 (N) (431)

S0 DZ

dKaP (reg)

microiexcl

ZdDxQ(Ex|reg) log2 Q(Ex|reg)

para and (432)

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2

microZdK

aP(reg)eiexclNDKL ( Naka)para

(433)

Here S(a)1 is an approximation to S1 that neglects uctuations y This is the

same as the annealed approximation in the statistical mechanics of disor-dered systems as has been used widely in the study of supervised learn-ing problems (Seung Sompolinsky amp Tishby 1992) Thus we can identifythe particular data sequence Ex1 ExN with the disorder EN(regI fExig) withthe energy of the quenched system and DKL( Nregkreg) with its annealed ana-log

The extensive term S0 equation 432 is the average entropy of a dis-tribution in our family of possible distributions generalizing the result ofequation 422 The subextensive terms in the entropy are controlled by theN dependence of the partition function

Z( NregI N) DZ

dKaP (reg) exp [iexclNDKL( Nregkreg)] (434)

and Sa1(N) D iexclhlog2 Z( NregI N)i Na is analogous to the free energy Since what is

important in this integral is the KL divergence between different distribu-tions it is natural to ask about the density of models that are KL divergenceD away from the target Nreg

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)] (435)

Predictability Complexity and Learning 2429

This density could be very different for different targets6 The density ofdivergences is normalized because the original distribution over parameterspace P(reg) is normalized

ZdDr (DI Nreg) D

ZdK

aP (reg) D 1 (436)

Finally the partition function takes the simple form

Z( NregI N) DZ

dDr (DI Nreg) exp[iexclND] (437)

We recall that in statistical mechanics the partition function is given by

Z(b ) DZ

dEr (E) exp[iexclbE] (438)

where r (E) is the density of states that have energy E and b is the inversetemperature Thus the subextensive entropy in our learning problem isanalogous to a system in which energy corresponds to the KL divergencerelative to the target model and temperature is inverse to the number ofexamples As we increase the length N of the time series we have observedwe ldquocoolrdquo the system and hence probe models that approach the targetthe dynamics of this approach is determined by the density of low-energystates that is the behavior of r (DI Nreg) as D 07

The structure of the partition function is determined by a competitionbetween the (Boltzmann) exponential term which favors models with smallD and the density term which favorsvalues of D that can be achieved by thelargest possible number of models Because there (typically) are many pa-rameters there are very few models with D 0 This picture of competitionbetween the Boltzmann factor and a density of states has been emphasizedin previous work on supervised learning (Haussler et al 1996)

The behavior of the density of states r (DI Nreg) at small D is related tothe more intuitive notion of dimensionality In a parameterized family ofdistributions the KL divergence between two distributions with nearbyparameters is approximately a quadratic form

DKL ( Nregkreg) frac1412

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm) C cent cent cent (439)

6 If parameter space is compact then a related description of the space of targets basedon metric entropy also called Kolmogorovrsquos 2 -entropy is used in the literature (see forexample Haussler amp Opper 1997) Metric entropy is the logarithm of the smallest numberof disjoint parts of the size not greater than 2 into which the target space can be partitioned

7 It may be worth emphasizing that this analogy to statistical mechanics emerges fromthe analysis of the relevant probability distributions rather than being imposed on theproblem through some assumptions about the nature of the learning algorithm

2430 W Bialek I Nemenman and N Tishby

where NF is the Fisher information matrix Intuitively if we have a reason-able parameterization of the distributions then similar distributions will benearby in parameter space and more important points that are far apartin parameter space will never correspond to similar distributions Clarkeand Barron (1990) refer to this condition as the parameterization forminga ldquosoundrdquo family of distributions If this condition is obeyed then we canapproximate the low D limit of the density r (DI Nreg)

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)]

frac14Z

dKaP (reg)d

D iexcl

12

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm)

DZ

dKaP ( Nreg C U cent raquo)d

D iexcl

12

X

m

Lmj2m

(440)

where U is a matrix that diagonalizes F

( U T cent F cent U )mordm D Lmdmordm (441)

The delta function restricts the components of raquo in equation 440 to be oforder

pD or less and so if P(reg) is smooth we can make a perturbation

expansion After some algebra the leading term becomes

r (D 0I Nreg) frac14 P ( Nreg)2p

K2

C (K2)(det F )iexcl12 D(Kiexcl2)2 (442)

Here as before K is the dimensionality of the parameter vector Computingthe partition function from equation 437 we nd

Z( NregI N 1) frac14 f ( Nreg) cent C (K2)

NK2 (443)

where f ( Nreg) is some function of the target parameter values Finally thisallows us to evaluate the subextensive entropy from equations 433 and 434

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2 Z( NregI N) (444)

K2

log2 N C cent cent cent (bits) (445)

where cent cent cent are nite as N 1 Thus general K-parameter model classeshave the same subextensive entropy as for the simplest example consideredin the previous section To the leading order this result is independent even

Predictability Complexity and Learning 2431

of the prior distribution P (reg) on the parameter space so that the predictiveinformation seems to count the number of parameters under some verygeneral conditions (cf Figure 3 for a very different numerical example ofthe logarithmic behavior)

Although equation 445 is true under a wide range of conditions thiscannot be the whole story Much of modern learning theory is concernedwith the fact that counting parameters is not quite enough to characterize thecomplexity of a model class the naive dimension of the parameter space Kshould be viewed in conjunction with the pseudodimension (also known asthe shattering dimension or Vapnik-Chervonenkis dimension dVC) whichmeasures capacity of the model class and with the phase-space dimension dwhich accounts for volumes in the space of models (Vapnik 1998 Opper1994) Both of these dimensions can differ from the number of parametersOne possibility is that dVC is innite when the number of parameters isnitea problem discussed below Another possibility is that the determinant of Fis zero and hence both dVC and d are smaller than the number of parametersbecause we have adopted a redundant description This sort of degeneracycould occur over a nite fraction but not all of the parameter space andthis is one way to generate an effective fractional dimensionality One canimagine multifractal models such that the effective dimensionality variescontinuously over the parameter space but it is not obvious where thiswould be relevant Finally models with d lt dVC lt 1 also are possible (seefor example Opper 1994) and this list probably is not exhaustive

Equation 442 lets us actually dene the phase-space dimension throughthe exponent in the small DKL behavior of the model density

r (D 0I Nreg) D(diexcl2)2 (446)

and then d appears in place of K as the coefcient of the log divergencein S1(N) (Clarke amp Barron 1990 Opper 1994) However this simple con-clusion can fail in two ways First it can happen that the density r (DI Nreg)accumulates a macroscopic weight at some nonzero value of D so that thesmall D behavior is irrelevant for the large N asymptotics Second the uc-tuations neglected here may be uncontrollably large so that the asymptoticsare never reached Since controllability of uctuations is a function of dVC(see Vapnik 1998 Haussler Kearns amp Schapire 1994 and later in this arti-cle) we may summarize this in the following way Provided that the smallD behavior of the density function is the relevant one the coefcient ofthe logarithmic divergence of Ipred measures the phase-space or the scalingdimension d and nothing else This asymptote is valid however only forN Agrave dVC It is still an open question whether the two pathologies that canviolate this asymptotic behavior are related

43 Learning a Parameterized Process Consider a process where sam-ples are not independent and our task is to learn their joint distribution

2432 W Bialek I Nemenman and N Tishby

Q(Ex1 ExN |reg) Again reg is an unknown parameter vector chosen ran-domly at the beginning of the series If reg is a K-dimensional vector thenone still tries to learn just K numbers and there are still N examples evenif there are correlations Therefore although such problems are much moregeneral than those already considered it is reasonable to expect that thepredictive information still is measured by (K2) log2 N provided that someconditions are met

One might suppose that conditions for simple results on the predictiveinformation are very strong for example that the distribution Q is a nite-order Markov model In fact all we really need are the following two con-ditions

S [fExig | reg] acute iexclZ

dN Ex Q(fExig | reg) log2 Q(fExig | reg)

NS0 C Scurren0 I Scurren

0 D O(1) (447)

DKL [Q(fExig | Nreg)kQ(fExig|reg)] NDKL ( Nregkreg) C o(N) (448)

Here the quantities S0 Scurren0 and DKL ( Nregkreg) are dened by taking limits

N 1 in both equations The rst of the constraints limits deviations fromextensivity to be of order unity so that if reg is known there are no long-rangecorrelations in the data All of the long-range predictability is associatedwith learning the parameters8 The second constraint equation 448 is aless restrictive one and it ensures that the ldquoenergyrdquo of our statistical systemis an extensive quantity

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result:

S(N) = S_0 \cdot N + S_1^{(a)}(N),    (4.49)

S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \quad {\rm (bits)},    (4.50)

where ⋯ stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

⁸ Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(x_1, ..., x_N | α). Suppose, for example, that S_0^* = (K^*/2) log_2 N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \quad {\rm (bits)}.    (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K ≠ 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.⁹

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

⁹ There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j \mid \bar\alpha) \right]
  \times \log_2 \left[ \prod_{i=1}^{N} Q(\vec{x}_i \mid \bar\alpha) \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})} \right].    (4.52)

This expression can be decomposed into the terms identified above plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),    (4.53)

S_1^{(f)}(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j \mid \bar\alpha) \right]
  \times \log_2 \left[ \int \frac{d^K\alpha\, P(\alpha)}{Z(\bar\alpha; N)}\, e^{-N D_{KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})} \right],    (4.54)

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of −S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that, for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) \le N \int d^K\alpha \int d^K\bar\alpha\; P(\alpha)\, P(\bar\alpha)\, D_{KL}(\bar\alpha\|\alpha) \equiv \left\langle N D_{KL}(\bar\alpha\|\alpha) \right\rangle_{\bar\alpha,\alpha} ,    (4.55)

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average


KL divergence, or of similar quantities, is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱ,α} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(x | α)} is finite. Then for any reasonable function F(x; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{ \sup_\beta \left| \frac{ \frac{1}{N}\sum_j F(\vec{x}_j;\beta) - \int d\vec{x}\, Q(\vec{x}\mid\bar\alpha)\, F(\vec{x};\beta) }{ L[F] } \right| > \epsilon \right\} < M(\epsilon, N, d_{VC})\, e^{-c N\epsilon^2} ,    (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts, for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review see Vapnik, 1998, and the references therein).
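A rough numerical illustration of this kind of uniform convergence (our own sketch, not part of the original argument): for the unit-variance gaussian location family of equation 4.60 below, take F(x; β) = log[Q(x|ᾱ)/Q(x|β)] and, as one admissible choice of scale, L[F] = |ᾱ − β|. The supremum over a grid of β of the normalized deviation between empirical mean and expectation then shrinks roughly as N^{−1/2}; the target value and grid are arbitrary choices of ours.

import numpy as np

rng = np.random.default_rng(1)
alpha_bar = 0.0                                  # target parameter (our choice)
betas = np.linspace(-5.0, 5.0, 401)
betas = betas[np.abs(betas - alpha_bar) > 1e-6]  # grid over which the sup is taken

def sup_normalized_deviation(N):
    # sup over beta of |(1/N) sum_j F(x_j; beta) - E_Q[F(x; beta)]| / L[F], where
    # F(x; beta) = log[Q(x|alpha_bar)/Q(x|beta)] = (alpha_bar - beta) x + (beta^2 - alpha_bar^2)/2
    # for the unit-variance gaussian family, and L[F] = |alpha_bar - beta| sets the scale
    x = rng.normal(alpha_bar, 1.0, size=N)
    emp = (alpha_bar - betas) * x.mean()          # empirical means (constant terms cancel)
    expect = (alpha_bar - betas) * alpha_bar      # exact expectations under Q(.|alpha_bar)
    return np.max(np.abs(emp - expect) / np.abs(alpha_bar - betas))

for N in (10, 100, 1000, 10000):
    trials = [sup_normalized_deviation(N) for _ in range(200)]
    print(f"N = {N:6d}   median sup deviation = {np.median(trials):.4f}   "
          f"sqrt(N) * median = {np.sqrt(N) * np.median(trials):.3f}")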

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {x_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus, the worst-case contribution of these atypical sequences is

S_1^{(f),\,{\rm atypical}} \sim \int_\delta^{\infty} d\epsilon\; N^2 \epsilon^2\, M(\epsilon)\, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \quad {\rm for\ large}\ N.    (4.57)


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\,{\rm typical}} \sim \int d^K\bar\alpha\, P(\bar\alpha)\, d^N\vec{x} \prod_j Q(\vec{x}_j \mid \bar\alpha)\; N\, (B\, A^{-1}\, B)\,;

B_\mu = \frac{1}{N}\sum_i \frac{\partial \log Q(\vec{x}_i \mid \bar\alpha)}{\partial\bar\alpha_\mu}\,, \qquad \langle B\rangle_{\vec{x}} = 0\,;

(A)_{\mu\nu} = \frac{1}{N}\sum_i \frac{\partial^2 \log Q(\vec{x}_i \mid \bar\alpha)}{\partial\bar\alpha_\mu\, \partial\bar\alpha_\nu}\,, \qquad \langle A\rangle_{\vec{x}} = F.    (4.58)

Here ⟨⋯⟩_x means an averaging with respect to all x_i's keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and the second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{−1} with F^{−1}(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i \mid \bar\alpha)}{\partial\bar\alpha_\mu}\; (F^{-1})_{\mu\nu}\; \frac{\partial \log Q(\vec{x}_j \mid \bar\alpha)}{\partial\bar\alpha_\nu}    (4.59)

over all x_i's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N^{−1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite


structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ ⋯ can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪_n C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x \mid \alpha) = \frac{1}{\sqrt{2\pi}}\, \exp\left[ -\frac{1}{2}(x-\alpha)^2 \right], \quad x \in (-\infty, +\infty)\,;    (4.60)

Q(x \mid \alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}\,, \quad x \in [0, 2\pi).    (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < α_max and zero elsewhere, with α_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin αx. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than α_max.¹⁰ So the fluctuations are large, and the predictive information is

¹⁰ Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for α_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
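The contrast between the two examples of equations 4.60 and 4.61 is easy to see numerically. The sketch below is our own construction; the true parameter, prior box, sample sizes, and grid spacing are arbitrary illustrative choices. It draws N samples from each model at α = 1 and compares maximum-likelihood estimates: for the gaussian family the estimate settles near the truth almost immediately, while for the sin model a fine scan over 0 < α < α_max typically finds a spurious, much larger α when N is small.

import numpy as np

rng = np.random.default_rng(2)
alpha_true, alpha_max = 1.0, 100.0
grid = np.arange(0.01, alpha_max, 0.0025)                # fine scan of candidate parameters
xs = np.linspace(0.0, 2 * np.pi, 4000, endpoint=False)   # abscissas for normalization integrals

def log_Z(alphas):
    # log of Z(alpha) = integral over [0, 2*pi) of exp(-sin(alpha*x)), Riemann sum, chunked
    out = np.empty(len(alphas))
    for i in range(0, len(alphas), 1000):
        a = alphas[i:i + 1000, None]
        out[i:i + 1000] = np.log(np.exp(-np.sin(a * xs)).mean(axis=1) * 2 * np.pi)
    return out

def sample_sin_model(alpha, n):
    # rejection sampling from Q(x|alpha) = exp(-sin(alpha*x)) / Z(alpha) on [0, 2*pi)
    out = np.empty(0)
    while out.size < n:
        x = rng.uniform(0.0, 2 * np.pi, size=4 * n)
        accept = rng.uniform(0.0, np.e, size=x.size) < np.exp(-np.sin(alpha * x))
        out = np.concatenate([out, x[accept]])
    return out[:n]

lZ = log_Z(grid)                                          # normalizations, computed once
for N in (3, 10, 30, 100):
    gauss_ml = rng.normal(alpha_true, 1.0, size=N).mean() # ML estimate for equation 4.60
    x = sample_sin_model(alpha_true, N)                   # data from equation 4.61
    loglik = -np.sin(grid[:, None] * x[None, :]).sum(axis=1) - N * lZ
    sin_ml = grid[np.argmax(loglik)]
    print(f"N = {N:4d}   gaussian ML = {gauss_ml:6.2f}   sin-model ML = {sin_ml:8.2f}")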

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario in which the number of examples eventually overwhelms the number of parameters or dimensions is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned and, correspondingly, the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which


the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right], \quad \mu > 0,    (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar\alpha; N) = \int dD\, \rho(D;\bar\alpha)\, \exp[-N D]
  \approx A(\bar\alpha) \int dD\, \exp\left[ -\frac{B(\bar\alpha)}{D^\mu} - N D \right]
  \approx \tilde{A}(\bar\alpha)\, \exp\left[ -\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)} \right],    (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar\alpha) = A(\bar\alpha) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} \cdot [B(\bar\alpha)]^{1/(2\mu+2)},    (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].    (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S_1^{(a)}(N) \approx \frac{1}{\ln 2}\, \left\langle C(\bar\alpha) \right\rangle_{\bar\alpha}\; N^{\mu/(\mu+1)} \quad {\rm (bits)},    (4.66)

where ⟨⋯⟩_ᾱ denotes an average over all the target models.
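As a sanity check on the saddle-point estimate in equation 4.63, one can compare the exact integral for Z(N) with the asymptotic form; for μ = 1 the exponent is C N^{1/2} with C = 2√B (equation 4.65), plus the (3/4) ln N correction. The short numerical sketch below is ours, with the constants A = B = 1 and μ = 1 chosen only for illustration.

import numpy as np
from scipy.integrate import quad

A, B, mu = 1.0, 1.0, 1.0                                 # arbitrary illustrative constants
C = B ** (1 / (mu + 1)) * (mu ** (-mu / (mu + 1)) + mu ** (1 / (mu + 1)))

def lnZ_exact(N):
    # integrate A * exp(-B/D^mu - N*D) with the peak value factored out to avoid underflow
    Dstar = (mu * B / N) ** (1 / (mu + 1))               # saddle point
    fstar = B / Dstar ** mu + N * Dstar                  # equals C * N^(mu/(mu+1))
    integrand = lambda D: A * np.exp(-(B / D ** mu + N * D) + fstar)
    val, _ = quad(integrand, 0.0, 5.0, points=[Dstar, 10 * Dstar], limit=500)
    return np.log(val) - fstar

def lnZ_saddle(N):
    # leading terms of equation 4.63; the difference from lnZ_exact should settle
    # near ln A_tilde = 0.5 * ln(pi), about 0.57 for these constants (cf. equation 4.64)
    return -0.5 * (mu + 2) / (mu + 1) * np.log(N) - C * N ** (mu / (mu + 1))

for N in (10, 100, 1000, 10000, 100000):
    print(f"N = {N:7d}   ln Z exact = {lnZ_exact(N):10.3f}   "
          f"saddle point = {lnZ_saddle(N):10.3f}   difference = {lnZ_exact(N) - lnZ_saddle(N):6.3f}")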


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^{ν−1} bits. Summing this up over all samples, we find S_1^{(a)} ∼ N^ν, and, if we let ν = μ/(μ + 1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ + 1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
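This counting argument is easy to check numerically. The sketch below is ours: it simply sums the heuristic per-sample contribution K(n)/(2n) bits with K(n) = n^ν and confirms that the total grows as N^ν; the prefactor 1/(2ν) is specific to this toy version of the argument.

import numpy as np

def subextensive_sum(N, nu):
    # sum of the heuristic per-sample contributions K(n)/(2n) bits with K(n) = n**nu
    n = np.arange(1, N + 1, dtype=float)
    return np.sum(n ** nu / (2.0 * n))

for nu in (0.25, 0.5, 0.75):
    for N in (10**3, 10**4, 10**5, 10**6):
        total = subextensive_sum(N, nu)
        print(f"nu = {nu:4.2f}   N = {N:7d}   sum = {total:10.1f}   "
              f"sum / N**nu = {total / N ** nu:6.3f}   1/(2*nu) = {1 / (2 * nu):6.3f}")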

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar


to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/ℓ_0) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\lambda}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^2 \right] \delta\!\left[ \frac{1}{\ell_0} \int dx\, e^{-\phi(x)} - 1 \right],    (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and λ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the


development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D, φ̄) in the model defined by equation 4.67.

With Q(x) = (1/ℓ_0) exp[−φ(x)], we can write the KL divergence as

D_{KL}[\bar\phi(x)\|\phi(x)] = \frac{1}{\ell_0} \int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)].    (4.68)

We want to compute the density

\rho(D;\bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left( D - D_{KL}[\bar\phi(x)\|\phi(x)] \right)    (4.69)

  = M \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left( MD - M D_{KL}[\bar\phi(x)\|\phi(x)] \right),    (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D;\bar\phi) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{KL})    (4.71)

  = M \int \frac{dz}{2\pi}\, \exp\!\left( izMD + \frac{izM}{\ell_0} \int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] \right)
  \times \int [d\phi(x)]\, P[\phi(x)] \times \exp\!\left( -\frac{izM}{\ell_0} \int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right).    (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

¹¹ We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)]\, \exp\!\left( -\frac{izM}{\ell_0} \int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right)
  = \exp\!\left( -\frac{izM}{\ell_0} \int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM] \right),    (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{\lambda}{2} \int dx \left( \frac{\partial\bar\phi}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{\lambda \ell_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2].    (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\!\left( -\frac{B[\bar\phi(x)]}{D} \right),    (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\lambda\ell_0}}\, \exp\!\left[ -\frac{\lambda}{2} \int dx \left( \partial_x\bar\phi \right)^2 \right] \times \int dx\, \exp[-\bar\phi(x)/2],    (4.76)

B[\bar\phi(x)] = \frac{1}{16\lambda\ell_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^2.    (4.77)

Except for the factor of D^{−3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{−3/2} prefactor does not change the leading large N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\, \frac{1}{\sqrt{\lambda\ell_0}}\, \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi}\, N^{1/2},    (4.78)

where ⟨⋯⟩_φ̄ denotes an average over the target distributions φ̄(x) weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \approx \frac{1}{2\ln 2}\, \sqrt{N}\, \left( \frac{L}{\lambda} \right)^{1/2} \quad {\rm bits}.    (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(λ/[N Q(x)]) (Bialek,


Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities, n ∼ √N, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter λ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of λ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing λ to vary with x. In this case the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn), but real calculations for these cases remain challenging.

5 I_pred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹² Rissanen (1984) and Clarke and Barron (1990) in full

¹² Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.
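For the simplest case, a coin with unknown bias (K = 1), this identification can be checked directly: the entropy of the Bayesian mixture distribution over sequences exceeds the extensive part N⟨H_2(q)⟩ by a subextensive piece that approaches (1/2) log_2 N. A small numerical sketch (ours, assuming a uniform prior on the bias q):

import numpy as np
from math import lgamma, log

LOG2 = log(2.0)

def subextensive_entropy(N):
    # S_1(N) = S(N) - N * <H_2(q)> for Bernoulli(q) sequences with q uniform on [0, 1].
    # Under this prior a particular sequence with n1 ones has marginal probability
    # P(seq) = n1! (N - n1)! / (N + 1)!, and the extensive rate is <H_2(q)> = 1/(2 ln 2) bits.
    n1 = np.arange(N + 1)
    log_p_seq = np.array([lgamma(k + 1) + lgamma(N - k + 1) - lgamma(N + 2) for k in n1])
    log_count = np.array([lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1) for k in n1])
    S_N = -np.sum(np.exp(log_count + log_p_seq) * log_p_seq) / LOG2    # entropy in bits
    return S_N - N * 0.5 / LOG2

for N in (10, 100, 1000, 10000):
    S1 = subextensive_entropy(N)
    print(f"N = {N:6d}   S_1(N) = {S1:7.3f} bits   S_1(N) - (1/2) log2(N) = {S1 - 0.5 * np.log2(N):6.3f}")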

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our


approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on


a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, which is not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger power-law divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.¹³ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.¹⁴ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

¹³ The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

¹⁴ This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;¹⁵ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

¹⁵ Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.¹⁶ So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number 𝒩 of walks with N steps, we find that 𝒩(N) ∼ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) → γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
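A concrete, if crude, way to see this decomposition is to enumerate self-avoiding walks on the square lattice and examine how log_2 𝒩(N) grows. The exhaustive enumeration below is our own sketch, feasible only for short walks: successive ratios approach the connective constant z, while the slowly varying remainder carries the subextensive γ log_2 N + log_2 A piece.

import numpy as np

def count_saws(max_len):
    # exhaustively count self-avoiding walks on the square lattice by depth-first search;
    # entry [n-1] of the returned list is the number of n-step walks
    counts = [0] * (max_len + 1)
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    def extend(pos, visited, length):
        if length == max_len:
            return
        for dx, dy in steps:
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in visited:
                counts[length + 1] += 1
                visited.add(nxt)
                extend(nxt, visited, length + 1)
                visited.remove(nxt)
    extend((0, 0), {(0, 0)}, 0)
    return counts[1:]

counts = np.array(count_saws(12), dtype=float)
N = np.arange(1, len(counts) + 1)
print("walk counts:", counts.astype(int))
print("successive ratios (approach the connective constant z, about 2.638):",
      np.round(counts[1:] / counts[:-1], 3))
print("slowly varying remainder log2 N(N) - N log2(2.638):",
      np.round(np.log2(counts) - N * np.log2(2.638), 3))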

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\; P[x(t)] \log_2 P[x(t)],    (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\; P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right),    (5.2)

¹⁶ Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have


seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related almost by definition to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?
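
As a concrete, if minimal, illustration of what measuring this structure means in practice (the sketch is ours, not part of the original text, and the function name is invented for the example), the predictive information of a toy binary Markov chain can be computed by brute-force enumeration; for such a finite-order process it saturates at a finite value, the simplest of the three behaviors distinguished in this article.

import itertools
import numpy as np

def ipred_markov(T, q=0.2):
    """I(past; future) in bits for windows of T symbols on either side of the
    present, for a stationary binary Markov chain that flips with probability q."""
    n = 2 ** T
    joint = np.zeros((n, n))
    for seq in itertools.product((0, 1), repeat=2 * T):
        p = 0.5  # stationary distribution of the symmetric chain is uniform
        for a, b in zip(seq[:-1], seq[1:]):
            p *= q if a != b else 1.0 - q
        past = int("".join(map(str, seq[:T])), 2)
        future = int("".join(map(str, seq[T:])), 2)
        joint[past, future] += p
    p_past, p_future = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / np.outer(p_past, p_future)[nz])))

# For a first-order Markov chain I_pred(T) saturates at 1 - H_2(q) bits for all
# T >= 1; logarithmic or power-law growth requires qualitatively richer processes.
for T in (1, 2, 4, 6):
    print(T, round(ipred_markov(T), 4))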

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).
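
A rough sketch of the kind of self-consistent iteration described by Tishby, Pereira, and Bialek (1999), applied here to an arbitrary toy joint distribution over past and future; the code is our own illustrative reconstruction (names such as ib_point are invented), not an implementation taken from that work. Sweeping the trade-off parameter beta traces out points on the curve of I(x̂; future) versus I(x̂; past).

import numpy as np

def mutual_info_bits(p):
    """Mutual information (bits) of a 2D joint probability array."""
    px, py = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / np.outer(px, py)[nz])))

def ib_point(p_xy, n_clusters, beta, n_iter=300, seed=0):
    """One point on the trade-off curve: a soft compression q(xhat|x) of the past x
    that preserves information about the future y, at trade-off parameter beta."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_x = p_xy / p_x[:, None]                       # p(y|x)
    q = rng.dirichlet(np.ones(n_clusters), size=p_xy.shape[0])  # q[x, xhat]
    for _ in range(n_iter):
        p_xhat = p_x @ q + 1e-30                      # small floor avoids 0/0
        p_xxhat = p_x[:, None] * q                    # joint p(x, xhat)
        p_y_xhat = (p_xxhat.T @ p_y_x) / p_xhat[:, None]
        # d[x, c] = D_KL[ p(y|x) || p(y|xhat=c) ] in nats
        d = np.einsum('xy,xcy->xc', p_y_x,
                      np.log((p_y_x[:, None, :] + 1e-300) /
                             (p_y_xhat[None, :, :] + 1e-300)))
        q = p_xhat[None, :] * np.exp(-beta * d)
        q /= q.sum(axis=1, keepdims=True)
    p_xxhat = p_x[:, None] * q
    p_xhaty = p_xxhat.T @ p_y_x                       # joint p(xhat, y)
    return mutual_info_bits(p_xxhat), mutual_info_bits(p_xhaty)

# Toy joint distribution over (past, future): 8 past states, 2 future states.
rng = np.random.default_rng(1)
p_xy = rng.dirichlet(np.ones(2), size=8) / 8.0        # p(x) uniform, random p(y|x)
for beta in (0.5, 1.0, 2.0, 5.0, 10.0):
    i_past, i_future = ib_point(p_xy, n_clusters=4, beta=beta)
    print(f"beta={beta:4.1f}  I(xhat;past)={i_past:.3f}  I(xhat;future)={i_future:.3f}")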

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
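
The Bernoulli (coin-flipping) family gives a small worked check of this statement, under the assumption of a discretized uniform prior on the unknown bias; the sketch below is ours rather than the authors'. It shows numerically that compressing the past down to its sufficient statistic, the number of heads, preserves all of the predictive information.

import itertools
import numpy as np

def mutual_info_bits(p):
    px, py = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / np.outer(px, py)[nz])))

# Unknown coin bias theta drawn from a discretized uniform prior; the past is
# N flips and the future is M more flips of the same coin.
thetas = np.linspace(0.05, 0.95, 19)
prior = np.full(len(thetas), 1.0 / len(thetas))
N, M = 8, 4

pasts = list(itertools.product((0, 1), repeat=N))
futures = list(itertools.product((0, 1), repeat=M))
joint = np.zeros((len(pasts), len(futures)))
for i, xp in enumerate(pasts):
    for j, xf in enumerate(futures):
        k = sum(xp) + sum(xf)                     # total number of heads
        joint[i, j] = np.sum(prior * thetas**k * (1 - thetas)**(N + M - k))

i_full = mutual_info_bits(joint)                  # I(past; future) = I_pred

# Compress the past to its sufficient statistic, the number of heads.
counts = np.array([sum(xp) for xp in pasts])
joint_counts = np.zeros((N + 1, len(futures)))
for c in range(N + 1):
    joint_counts[c] = joint[counts == c].sum(axis=0)
i_counts = mutual_info_bits(joint_counts)

print(f"I(past; future)  = {i_full:.6f} bits")
print(f"I(count; future) = {i_counts:.6f} bits  # equal: no predictive bits are lost")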


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning, in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ~ A N^ν and one with K parameters so that I_pred(N) ~ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \sim \left[ \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}.   (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
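
A short numerical illustration of equation 6.1 (ours, with purely illustrative parameter values): below the crossover scale N_c the K-parameter curve (K/2) log_2 N lies above the power law A N^ν, and the asymptotic estimate of equation 6.1 gives the order of magnitude at which the two curves cross.

import numpy as np

def ipred_param(N, K):            # K-parameter model: (K/2) log_2 N
    return 0.5 * K * np.log2(N)

def ipred_powerlaw(N, A, nu):     # nonparametric class: A N^nu
    return A * N**nu

def n_crossover(K, A, nu):
    """Asymptotic estimate of the critical sample size N_c from equation 6.1."""
    return (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1.0 / nu)

K, A, nu = 1000, 1.0, 0.5         # illustrative values only
Nc = n_crossover(K, A, nu)
print(f"N_c ~ {Nc:.3g}")
for N in (1e3, 1e6, Nc, 100 * Nc):
    print(f"N={N:9.3g}   (K/2)log2 N = {ipred_param(N, K):9.3g} bits   "
          f"A N^nu = {ipred_powerlaw(N, A, nu):9.3g} bits")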

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ~ 5×10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ~ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the preceding N (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
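
One could mimic Shannon's design in a few lines, with a simple character n-gram model standing in for the human reader; the sketch below is our own illustration (the corpus filename is a placeholder, not a file referenced by the paper), and the entropy of the distribution of guess ranks upper-bounds the conditional entropy of the next character under that model's ordering of guesses.

from collections import Counter, defaultdict
import math

def guessing_game_bound(text, order=3):
    """Entropy (bits/character) of the guess-rank distribution, an upper bound on
    the conditional entropy of the next character, with an order-`order` character
    model standing in for Shannon's human readers."""
    alphabet = sorted(set(text))
    half = len(text) // 2
    model = defaultdict(Counter)
    for i in range(order, half):                      # train on the first half
        model[text[i - order:i]][text[i]] += 1
    ranks, total = Counter(), 0
    for i in range(half + order, len(text)):          # play the game on the rest
        ctx = text[i - order:i]
        ordering = [c for c, _ in model[ctx].most_common()]
        ordering += [c for c in alphabet if c not in ordering]
        ranks[ordering.index(text[i]) + 1] += 1       # record the guess rank
        total += 1
    return -sum((n / total) * math.log2(n / total) for n in ranks.values())

# corpus.txt is a stand-in for any long plain text (e.g., a public-domain novel).
# text = open("corpus.txt").read().lower()
# print(guessing_game_bound(text, order=3))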

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


There are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/ followed by the reference given here; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints this is a primary reference; for published papers there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.

Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An Introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.


2424 W Bialek I Nemenman and N Tishby

where the effective ldquoenergyrdquo per sample is given by

EN(regI fxi yig) D12

C12

KX

m ordmD1

(am iexcl Nam )A1mordm

(aordm iexcl Naordm) (413)

Here we use the symbol f to indicate that we not only take the limit oflarge N but also neglect the uctuations Note that in this approximationthe dependence on the sample points themselves is hidden in the denitionof Nreg as being the parameters that generated the samples

The integral that we need to do in equation 412 involves an exponentialwith a large factor N in the exponent the energy EN is of order unity asN 1 This suggests that we evaluate the integral by a saddle-pointor steepest-descent approximation (similar analyses were performed byClarke amp Barron 1990 MacKay 1992 and Balasubramanian 1997)

ZdK

aP (reg) exppoundiexclNEN (regI fxi yig)

curren

frac14 P (regcl) expmicro

iexclNEN(regclI fxi yig) iexclK2

lnN2p

iexcl12

ln det FN C cent cent centpara

(414)

where regcl is the ldquoclassicalrdquo value of reg determined by the extremal conditions

EN(regI fxi yig)am

shyshyshyshyaDacl

D 0 (415)

the matrix FN consists of the second derivatives of EN

FN D 2EN (regI fxi yig)

am aordm

shyshyshyshyaDacl

(416)

and cent cent cent denotes terms that vanish as N 1 If we formulate the problemof estimating the parameters reg from the samples fxi yig then as N 1the matrix NFN is the Fisher information matrix (Cover amp Thomas 1991)the eigenvectors of this matrix give the principal axes for the error ellipsoidin parameter space and the (inverse) eigenvalues give the variances ofparameter estimates along each of these directions The classical regcl differsfrom Nreg only in terms of order 1N we neglect this difference and furthersimplify the calculation of leading terms as N becomes large After a littlemore algebra we nd the probability distribution we have been lookingfor

P(x1 y1 x2 y2 xN yN)

f NY

iD1P(xi)

1

ZAP ( Nreg) exp

microiexcl

N2

ln(2p es2) iexcl

K2

ln N C cent cent centpara

(417)

Predictability Complexity and Learning 2425

where the normalization constant

ZA Dp

(2p )K det A1 (418)

Again we note that the sample points fxi yig are hidden in the value of Nregthat gave rise to these points5

To evaluate the entropy S(N) we need to compute the expectation valueof the (negative) logarithm of the probability distribution in equation 417there are three terms One is constant so averaging is trivial The secondterm depends only on the xi and because these are chosen independentlyfrom the distribution P(x) the average again is easy to evaluate The thirdterm involves Nreg and we need to average this over the joint distributionP(x1 y1 x2 y2 xN yN) As above we can evaluate this average in stepsFirst we choose a value of the parameters Nreg then we average over thesamples given these parameters and nally we average over parametersBut because Nreg is dened as the parameters that generate the samples thisstepwise procedure simplies enormously The end result is that

S(N) D NmicroSx C

12

log2

plusmn2p es

2sup2para

CK2

log2 N

C Sa Claquolog2 ZA

nota

C cent cent cent (419)

where hcent cent centia means averaging over parameters Sx is the entropy of thedistribution of x

Sx D iexclZ

dx P(x) log2 P(x) (420)

and similarly for the entropy of the distribution of parameters

Sa D iexclZ

dKaP (reg) log2 P (reg) (421)

5 We emphasize once more that there are two approximations leading to equation 417First we have replaced empirical means by expectation values neglecting uctuations as-sociated with the particular set of sample points fxi yig Second we have evaluated theaverage over parameters in a saddle-point approximation At least under some condi-tions both of these approximations become increasingly accurate as N 1 so thatthis approach should yield the asymptotic behavior of the distribution and hence thesubextensive entropy at large N Although we give a more detailed analysis below it isworth noting here how things can go wrong The two approximations are independentand we could imagine that uctuations are important but saddle-point integration stillworks for example Controlling the uctuations turns out to be exactly the question ofwhether our nite parameterization captures the true dimensionality of the class of mod-els as discussed in the classic work of Vapnik Chervonenkis and others (see Vapnik1998 for a review) The saddle-point approximation can break down because the saddlepoint becomes unstable or because multiple saddle points become important It will turnout that instability is exponentially improbable as N 1 while multiple saddle pointsare a real problem in certain classes of models again when counting parameters does notreally measure the complexity of the model class

2426 W Bialek I Nemenman and N Tishby

The different terms in the entropy equation 419 have a straightforwardinterpretation First we see that the extensive term in the entropy

S0 D Sx C12

log2(2p es2) (422)

reects contributions from the random choice of x and from the gaussiannoise in y these extensive terms are independent of the variations in pa-rameters reg and these would be the only terms if the parameters were notvarying (that is if there were nothing to learn) There also is a term thatreects the entropy of variations in the parameters themselves Sa Thisentropy is not invariant with respect to coordinate transformations in theparameter space but the term hlog2 ZAia compensates for this noninvari-ance Finally and most interesting for our purposes the subextensive pieceof the entropy is dominated by a logarithmic divergence

S1(N) K2

log2 N (bits) (423)

The coefcient of this divergence counts the number of parameters indepen-dent of the coordinate system that we choose in the parameter space Fur-thermore this result does not depend on the set of basis functions fwm (x)gThis is a hint that the result in equation 423 is more universal than oursimple example

42 Learning a Parameterized Distribution The problem discussedabove is an example of supervised learning We are given examples of howthe points xn map into yn and from these examples we are to induce theassociation or functional relation between x and y An alternative view isthat the pair of points (x y) should be viewed as a vector Ex and what we arelearning is the distribution of this vector The problem of learning a distri-bution usually is called unsupervised learning but in this case supervisedlearning formally is a special case of unsupervised learning if we admit thatall the functional relations or associations that we are trying to learn have anelement of noise or stochasticity then this connection between supervisedand unsupervised problems is quite general

Suppose a series of random vector variables fExig is drawn independentlyfrom the same probability distribution Q(Ex|reg) and this distribution de-pends on a (potentially innite dimensional) vector of parameters reg Asabove the parameters are unknown and before the series starts they arechosen randomly from a distribution P (reg) With no constraints on the den-sities P (reg) or Q(Ex|reg) it is impossible to derive any regression formulas forparameter estimation but one can still say something about the entropy ofthe data series and thus the predictive information For a nite-dimensionalvector of parameters reg the literature on bounding similar quantities isespecially rich (Haussler Kearns amp Schapire 1994 Wong amp Shen 1995Haussler amp Opper 1995 1997 and references therein) and related asymp-totic calculations have been done (Clarke amp Barron 1990 MacKay 1992Balasubramanian 1997)

Predictability Complexity and Learning 2427

We begin with the denition of entropy

S(N) acute S[fExig]

D iexclZ

dEx1 cent cent cent dExNP(Ex1 Ex2 ExN) log2 P(Ex1 Ex2 ExN) (424)

By analogy with equation 43 we then write

P(Ex1 Ex2 ExN) DZ

dKaP (reg)

NY

iD1Q(Exi |reg) (425)

Next combining the equations 424 and 425 and rearranging the order ofintegration we can rewrite S(N) as

S(N) D iexclZ

dK NregP ( Nreg)

pound

8lt

ZdEx1 cent cent cent dExN

NY

jD1Q(Exj | Nreg) log2 P(fExig)

9=

(426)

Equation 426 allows an easy interpretation There is the ldquotruerdquo set of pa-rameters Nreg that gave rise to the data sequence Ex1 ExN with the probabilityQN

jD1 Q(Exj | Nreg) We need to average log2 P(Ex1 ExN) rst over all possiblerealizations of the data keeping the true parameters xed and then over theparameters Nreg themselves With this interpretation in mind the joint prob-ability density the logarithm of which is being averaged can be rewrittenin the following useful way

P(Ex1 ExN) DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg)NY

iD1

microQ(Exi |reg)

Q(Exi | Nreg)

para

DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNEN(regI fExig)] (427)

EN (regI fExig) D iexcl1N

NX

iD1ln

microQ(Exi |reg)

Q(Exi | Nreg)

para (428)

Since by our interpretation Nreg are the true parameters that gave rise tothe particular data fExig we may expect that the empirical average in equa-tion 428 will converge to the corresponding expectation value so that

EN(regI fExig) D iexclZ

dDxQ(x | Nreg) lnmicroQ(Ex|reg)

Q(Ex| Nreg)

paraiexcl y (reg NregI fxig) (429)

where y 0 as N 1 here we neglect y and return to this term below

2428 W Bialek I Nemenman and N Tishby

The rst term on the right-hand side of equation 429 is the Kullback-Leibler (KL) divergence DKL( Nregkreg) between the true distribution charac-terized by parameters Nreg and the possible distribution characterized by regThus at large N we have

P(Ex1 Ex2 ExN)

fNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNDKL( Nregkreg)] (430)

where again the notation f reminds us that we are not only taking the limitof largeN but also making another approximationinneglecting uctuationsBy the same arguments as above we can proceed (formally) to compute theentropy of this distribution We nd

S(N) frac14 S0 cent N C S(a)1 (N) (431)

S0 DZ

dKaP (reg)

microiexcl

ZdDxQ(Ex|reg) log2 Q(Ex|reg)

para and (432)

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2

microZdK

aP(reg)eiexclNDKL ( Naka)para

(433)

Here S(a)1 is an approximation to S1 that neglects uctuations y This is the

same as the annealed approximation in the statistical mechanics of disor-dered systems as has been used widely in the study of supervised learn-ing problems (Seung Sompolinsky amp Tishby 1992) Thus we can identifythe particular data sequence Ex1 ExN with the disorder EN(regI fExig) withthe energy of the quenched system and DKL( Nregkreg) with its annealed ana-log

The extensive term S0 equation 432 is the average entropy of a dis-tribution in our family of possible distributions generalizing the result ofequation 422 The subextensive terms in the entropy are controlled by theN dependence of the partition function

Z( NregI N) DZ

dKaP (reg) exp [iexclNDKL( Nregkreg)] (434)

and Sa1(N) D iexclhlog2 Z( NregI N)i Na is analogous to the free energy Since what is

important in this integral is the KL divergence between different distribu-tions it is natural to ask about the density of models that are KL divergenceD away from the target Nreg

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)] (435)

Predictability Complexity and Learning 2429

This density could be very different for different targets6 The density ofdivergences is normalized because the original distribution over parameterspace P(reg) is normalized

ZdDr (DI Nreg) D

ZdK

aP (reg) D 1 (436)

Finally the partition function takes the simple form

Z( NregI N) DZ

dDr (DI Nreg) exp[iexclND] (437)

We recall that in statistical mechanics the partition function is given by

Z(b ) DZ

dEr (E) exp[iexclbE] (438)

where r (E) is the density of states that have energy E and b is the inversetemperature Thus the subextensive entropy in our learning problem isanalogous to a system in which energy corresponds to the KL divergencerelative to the target model and temperature is inverse to the number ofexamples As we increase the length N of the time series we have observedwe ldquocoolrdquo the system and hence probe models that approach the targetthe dynamics of this approach is determined by the density of low-energystates that is the behavior of r (DI Nreg) as D 07

The structure of the partition function is determined by a competitionbetween the (Boltzmann) exponential term which favors models with smallD and the density term which favorsvalues of D that can be achieved by thelargest possible number of models Because there (typically) are many pa-rameters there are very few models with D 0 This picture of competitionbetween the Boltzmann factor and a density of states has been emphasizedin previous work on supervised learning (Haussler et al 1996)

The behavior of the density of states r (DI Nreg) at small D is related tothe more intuitive notion of dimensionality In a parameterized family ofdistributions the KL divergence between two distributions with nearbyparameters is approximately a quadratic form

DKL ( Nregkreg) frac1412

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm) C cent cent cent (439)

6 If parameter space is compact then a related description of the space of targets basedon metric entropy also called Kolmogorovrsquos 2 -entropy is used in the literature (see forexample Haussler amp Opper 1997) Metric entropy is the logarithm of the smallest numberof disjoint parts of the size not greater than 2 into which the target space can be partitioned

7 It may be worth emphasizing that this analogy to statistical mechanics emerges fromthe analysis of the relevant probability distributions rather than being imposed on theproblem through some assumptions about the nature of the learning algorithm

2430 W Bialek I Nemenman and N Tishby

where NF is the Fisher information matrix Intuitively if we have a reason-able parameterization of the distributions then similar distributions will benearby in parameter space and more important points that are far apartin parameter space will never correspond to similar distributions Clarkeand Barron (1990) refer to this condition as the parameterization forminga ldquosoundrdquo family of distributions If this condition is obeyed then we canapproximate the low D limit of the density r (DI Nreg)

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)]

frac14Z

dKaP (reg)d

D iexcl

12

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm)

DZ

dKaP ( Nreg C U cent raquo)d

D iexcl

12

X

m

Lmj2m

(440)

where U is a matrix that diagonalizes F

( U T cent F cent U )mordm D Lmdmordm (441)

The delta function restricts the components of raquo in equation 440 to be oforder

pD or less and so if P(reg) is smooth we can make a perturbation

expansion After some algebra the leading term becomes

r (D 0I Nreg) frac14 P ( Nreg)2p

K2

C (K2)(det F )iexcl12 D(Kiexcl2)2 (442)

Here as before K is the dimensionality of the parameter vector Computingthe partition function from equation 437 we nd

Z( NregI N 1) frac14 f ( Nreg) cent C (K2)

NK2 (443)

where f ( Nreg) is some function of the target parameter values Finally thisallows us to evaluate the subextensive entropy from equations 433 and 434

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2 Z( NregI N) (444)

K2

log2 N C cent cent cent (bits) (445)

where cent cent cent are nite as N 1 Thus general K-parameter model classeshave the same subextensive entropy as for the simplest example consideredin the previous section To the leading order this result is independent even

Predictability Complexity and Learning 2431

of the prior distribution P (reg) on the parameter space so that the predictiveinformation seems to count the number of parameters under some verygeneral conditions (cf Figure 3 for a very different numerical example ofthe logarithmic behavior)

Although equation 445 is true under a wide range of conditions thiscannot be the whole story Much of modern learning theory is concernedwith the fact that counting parameters is not quite enough to characterize thecomplexity of a model class the naive dimension of the parameter space Kshould be viewed in conjunction with the pseudodimension (also known asthe shattering dimension or Vapnik-Chervonenkis dimension dVC) whichmeasures capacity of the model class and with the phase-space dimension dwhich accounts for volumes in the space of models (Vapnik 1998 Opper1994) Both of these dimensions can differ from the number of parametersOne possibility is that dVC is innite when the number of parameters isnitea problem discussed below Another possibility is that the determinant of Fis zero and hence both dVC and d are smaller than the number of parametersbecause we have adopted a redundant description This sort of degeneracycould occur over a nite fraction but not all of the parameter space andthis is one way to generate an effective fractional dimensionality One canimagine multifractal models such that the effective dimensionality variescontinuously over the parameter space but it is not obvious where thiswould be relevant Finally models with d lt dVC lt 1 also are possible (seefor example Opper 1994) and this list probably is not exhaustive

Equation 442 lets us actually dene the phase-space dimension throughthe exponent in the small DKL behavior of the model density

r (D 0I Nreg) D(diexcl2)2 (446)

and then d appears in place of K as the coefcient of the log divergencein S1(N) (Clarke amp Barron 1990 Opper 1994) However this simple con-clusion can fail in two ways First it can happen that the density r (DI Nreg)accumulates a macroscopic weight at some nonzero value of D so that thesmall D behavior is irrelevant for the large N asymptotics Second the uc-tuations neglected here may be uncontrollably large so that the asymptoticsare never reached Since controllability of uctuations is a function of dVC(see Vapnik 1998 Haussler Kearns amp Schapire 1994 and later in this arti-cle) we may summarize this in the following way Provided that the smallD behavior of the density function is the relevant one the coefcient ofthe logarithmic divergence of Ipred measures the phase-space or the scalingdimension d and nothing else This asymptote is valid however only forN Agrave dVC It is still an open question whether the two pathologies that canviolate this asymptotic behavior are related

43 Learning a Parameterized Process Consider a process where sam-ples are not independent and our task is to learn their joint distribution

2432 W Bialek I Nemenman and N Tishby

Q(Ex1 ExN |reg) Again reg is an unknown parameter vector chosen ran-domly at the beginning of the series If reg is a K-dimensional vector thenone still tries to learn just K numbers and there are still N examples evenif there are correlations Therefore although such problems are much moregeneral than those already considered it is reasonable to expect that thepredictive information still is measured by (K2) log2 N provided that someconditions are met

One might suppose that conditions for simple results on the predictiveinformation are very strong for example that the distribution Q is a nite-order Markov model In fact all we really need are the following two con-ditions

S [fExig | reg] acute iexclZ

dN Ex Q(fExig | reg) log2 Q(fExig | reg)

NS0 C Scurren0 I Scurren

0 D O(1) (447)

DKL [Q(fExig | Nreg)kQ(fExig|reg)] NDKL ( Nregkreg) C o(N) (448)

Here the quantities S0 Scurren0 and DKL ( Nregkreg) are dened by taking limits

N 1 in both equations The rst of the constraints limits deviations fromextensivity to be of order unity so that if reg is known there are no long-rangecorrelations in the data All of the long-range predictability is associatedwith learning the parameters8 The second constraint equation 448 is aless restrictive one and it ensures that the ldquoenergyrdquo of our statistical systemis an extensive quantity

With these conditions it is straightforward to show that the results of theprevious subsection carry over virtually unchanged With the same cautiousstatements about uctuations and the distinction between K d and dVC onearrives at the result

S(N) D S0 cent N C S(a)1 (N) (449)

S(a)1 (N) D

K2

log2 N C cent cent cent (bits) (450)

where cent cent cent stands for terms of order one as N 1 Note again that for theresult of equation 450 to be valid the process considered is not required tobe a nite-order Markov process Memory of all previous outcomes may bekept provided that the accumulated memory does not contribute a diver-gent term to the subextensive entropy

8 Suppose that we observe a gaussian stochastic process and try to learn the powerspectrum If the class of possible spectra includes ratios of polynomials in the frequency(rational spectra) then this condition is met On the other hand if the class of possiblespectra includes 1 f noise then the condition may not be met For more on long-rangecorrelations see below

Predictability Complexity and Learning 2433

It is interesting to ask what happens if the condition in equation 447 isviolated so that there are long-range correlations even in the conditional dis-tribution Q(Ex1 ExN |reg) Suppose for example that Scurren

0 D (Kcurren2) log2 NThen the subextensive entropy becomes

S(a)1 (N) D

K C Kcurren

2log2 N C cent cent cent (bits) (451)

We see that the subextensive entropy makes no distinction between pre-dictability that comes from unknown parameters and predictability thatcomes from intrinsic correlations in the data in this sense two models withthe same KC Kcurren are equivalent This actually must be so As an example con-sider a chain of Ising spins with long-range interactions in one dimensionThis system can order (magnetize) and exhibit long-range correlations andso the predictive information will diverge at the transition to ordering Inone view there is no global parameter analogous to regmdashjust the long-rangeinteractions On the other hand there are regimes in which we can approxi-mate the effect of these interactions by saying that all the spins experience amean eld that is constant across the whole length of the system and thenformally we can think of the predictive information as being carried by themean eld itself In fact there are situations in which this is not just anapproximation but an exact statement Thus we can trade a description interms of long-range interactions (Kcurren 6D 0 but K D 0) for one in which thereare unknown parameters describing the system but given these parametersthere are no long-range correlations (K 6D 0 Kcurren D 0) The two descriptionsare equivalent and this is captured by the subextensive entropy9

44 Taming the Fluctuations The preceding calculations of the subex-tensive entropy S1 are worthless unless we prove that the uctuations y arecontrollable We are going to discuss when and if this indeed happens Welimit the discussion to the case of nding a probability density (section 42)the case of learning a process (section 43) is very similar

Clarke and Barron (1990) solved essentially the same problem They didnot make a separation into the annealed and the uctuation term and thequantity they were interested in was a bit different from ours but inter-preting loosely they proved that modulo some reasonable technical as-sumptions on differentiability of functions in question the uctuation termalways approaches zero However they did not investigate the speed ofthis approach and we believe that they therefore missed some importantqualitative distinctions between different problems that can arise due to adifference between d and dVC In order to illuminate these distinctions wehere go through the trouble of analyzing uctuations all over again

9 There are a number of interesting questions about how the coefcients in the diverg-ing predictive information relate to the usual critical exponents and we hope to return tothis problem in a later article

2434 W Bialek I Nemenman and N Tishby

Returning to equations 427 429 and the denition of entropy we canwrite the entropy S(N) exactly as

S(N) D iexclZ

dK NaP ( Nreg)Z NY

jD1

pounddExj Q(Exj | Nreg)curren

pound log2

NY

iD1Q(Exi | Nreg)

ZdK

aP (reg)

pound eiexclNDKL ( Naka)CNy (a NaIfExig)

(452)

This expression can be decomposed into the terms identied above plus anew contribution to the subextensive entropy that comes from the uctua-tions alone S

(f)1 (N)

S(N) D S0 cent N C S(a)1 (N) C S(f)

1 (N) (453)

S(f)1 D iexcl

ZdK NaP ( Nreg)

NY

jD1

pounddExj Q(Exj | Nreg)

curren

pound log2

microZ dKaP (reg)

Z( NregI N)eiexclNDKL ( Naka)CNy (a NaIfExig)

para (454)

where y is dened as in equation 429 and Z as in equation 434Some loose but useful bounds can be established First the predictive in-

formation is a positive (semidenite) quantity and so the uctuation termmay not be smaller (more negative) than the value of iexclS

(a)1 as calculated

in equations 445 and 450 Second since uctuations make it more dif-cult to generalize from samples the predictive information should alwaysbe reduced by uctuations so that S(f) is negative This last statement cor-responds to the fact that for the statistical mechanics of disordered systemsthe annealed free energy always is less than the average quenched free en-ergy and may be proven rigorously by applying Jensenrsquos inequality to the(concave) logarithm function in equation 454 essentially the same argu-ment was given by Haussler and Opper (1997) A related Jensenrsquos inequalityargument allows us to show that the total S1(N) is bounded

$$S_1(N) \leq N \int d^K\alpha \int d^K\bar{\alpha}\; P(\alpha)\, P(\bar{\alpha})\, D_{\rm KL}(\bar{\alpha}\|\alpha) \equiv \left\langle N D_{\rm KL}(\bar{\alpha}\|\alpha) \right\rangle_{\bar{\alpha},\alpha}, \tag{4.55}$$

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence, or of similar quantities, is a usual constraint in learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨DKL(ᾱ‖α)⟩ includes S0 as one of its terms, so that usually S0 and S1 are well or ill defined together.
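For concreteness, the bound of equation 4.55 is manifestly well defined in the simplest settings; the following is a toy Monte Carlo check of our own (Gaussian location family with a Gaussian prior, both assumed for illustration), where the pairwise KL divergence has a closed form.

import numpy as np

# Q(x|alpha) = N(alpha, 1) with prior alpha ~ N(0, sigma^2); then
# D_KL(alpha_bar || alpha) = (alpha_bar - alpha)^2 / 2 nats, and its average over
# independent alpha, alpha_bar drawn from the prior is sigma^2 (finite).
rng = np.random.default_rng(3)
sigma = 2.0
alpha = rng.normal(0.0, sigma, 10**6)
alpha_bar = rng.normal(0.0, sigma, 10**6)
D = 0.5 * (alpha_bar - alpha) ** 2
print(D.mean() / np.log(2), sigma**2 / np.log(2))   # Monte Carlo vs exact, in bits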

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (dVC) of the set of probability densities {Q(x⃗ | α)} is finite. Then for any reasonable function F(x⃗; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

$$P\left\{ \sup_{\beta} \frac{\left| \frac{1}{N}\sum_j F(\vec{x}_j;\beta) - \int d\vec{x}\, Q(\vec{x}|\bar{\alpha})\, F(\vec{x};\beta)\right|}{L[F]} > \epsilon \right\} < M(\epsilon, N, d_{\rm VC})\, e^{-cN\epsilon^2}, \tag{4.56}$$

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality, in view of equation 4.55. In addition, M(ε, N, dVC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on dVC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review, see Vapnik, 1998, and the references therein).
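As a rough numerical illustration of such bounds (our own sketch, not part of the original argument), one can watch the uniform deviation shrink roughly as 1/√N in the simplest dVC = 1 setting, the Gaussian location family of equation 4.60, with F(x⃗; β) taken to be a log-likelihood ratio over a finite grid of β standing in for the supremum.

import numpy as np

# Gaussian location family Q(x|alpha) = N(alpha, 1); the target is alpha_bar = 0.
# F(x; beta) = log Q(x|0) - log Q(x|beta) = beta**2/2 - beta*x, whose expectation
# under Q(x|0) is the KL divergence beta**2/2 (in nats).
rng = np.random.default_rng(0)
betas = np.linspace(-2.0, 2.0, 41)              # finite grid standing in for sup over beta

def sup_deviation(N, trials=200):
    devs = []
    for _ in range(trials):
        x = rng.normal(0.0, 1.0, size=N)
        emp = betas**2 / 2.0 - betas * x.mean()  # empirical mean of F over the sample
        exact = betas**2 / 2.0                   # exact expectation of F
        devs.append(np.max(np.abs(emp - exact)))
    return float(np.median(devs))

for N in [100, 400, 1600, 6400]:
    print(N, sup_deviation(N), sup_deviation(N) * np.sqrt(N))   # last column roughly constant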

To start the proof of finiteness of $S_1^{(f)}$ in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {x⃗ᵢ} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and DKL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x⃗ᵢ} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus, the worst-case contribution of these atypical sequences is

$$S_1^{(f),\,{\rm atypical}} \sim \int_{\delta}^{\infty} d\epsilon\; N^2 \epsilon^2\, M(\epsilon)\, e^{-cN\epsilon^2} \sim e^{-cN\delta^2} \ll 1 \quad \text{for large } N. \tag{4.57}$$


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple Gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and, throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

$$S_1^{(f),\,{\rm typical}} \sim \int d^K\bar{\alpha}\, P(\bar{\alpha})\, d^N\vec{x} \prod_j Q(\vec{x}_j|\bar{\alpha})\; N\,(B\,A^{-1}B);$$
$$B_\mu = \frac{1}{N}\sum_i \frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0;$$
$$(A)_{\mu\nu} = \frac{1}{N}\sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu\, \partial \bar{\alpha}_\nu}, \qquad \langle A \rangle_{\vec{x}} = F. \tag{4.58}$$

Here ⟨···⟩ₓ means an averaging with respect to all x⃗ᵢ's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and the second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing $A^{-1}$ with $F^{-1}(1+\delta)$, where δ again is independent of N. Now we have to average a bunch of products like

$$\frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu}\; (F^{-1})_{\mu\nu}\; \frac{\partial \log Q(\vec{x}_j|\bar{\alpha})}{\partial \bar{\alpha}_\nu} \tag{4.59}$$

over all x⃗ᵢ's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N⁻¹. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.
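A quick Monte Carlo check of this order-of-magnitude statement (our own toy, the K = 1 Gaussian location family, for which B is just the sample mean and the Fisher information is F = 1) shows that N(B F⁻¹ B) indeed stays of order K as N grows.

import numpy as np

# For Q(x|alpha) = N(alpha, 1) with alpha_bar = 0: B = (1/N) sum_i d log Q / d alpha = mean(x),
# and F = 1, so N * (B F^{-1} B) = N * mean(x)^2, which has expectation 1 (= K) for every N.
rng = np.random.default_rng(4)
for N in [10, 100, 1000, 10000]:
    B = rng.normal(0.0, 1.0, size=(1000, N)).mean(axis=1)   # B for 1000 simulated data sets
    print(N, float((N * B**2).mean()))                       # stays near K = 1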

This concludes the proof of controllability of fluctuations for dVC < ∞. One may notice that we never used the specific form of M(ε, N, dVC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite


structure C of nested subsets C1 ⊂ C2 ⊂ C3 ⊂ ··· can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ₙ Cₙ. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element Cₙ, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If dVC for learning in any Cₙ, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite dVC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, Cₙ, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

$$Q(x|\alpha) = \frac{1}{\sqrt{2\pi}}\, \exp\!\left[-\frac{1}{2}(x-\alpha)^2\right], \quad x \in (-\infty; +\infty); \tag{4.60}$$

$$Q(x|\alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \quad x \in [0; 2\pi). \tag{4.61}$$

Learning the parameter in the first case is a dVC = 1 problem; in the second case, dVC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < α_max and zero elsewhere, with α_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin αx. Adding a new data point would not help until that best, but wrong, parameter estimate is less than α_max.10 So the fluctuations are large, and the predictive information is small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ dVC, which is never true if dVC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to dVC.

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for α_max → ∞ the weight in ρ(D; ᾱ) at small DKL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.
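The qualitative scenario for the dVC = ∞ example can be seen in a small simulation (our own construction, with illustrative choices such as a true α of 1 and a prior box of width 100): at small N the maximum-likelihood α within the box frequently lands far from the true value, at one of the spurious large-α fits, and only at larger N does it settle down.

import numpy as np

rng = np.random.default_rng(1)
alpha_true, alpha_max = 1.0, 100.0
xg = np.linspace(0.0, 2 * np.pi, 2001)                      # grid for the normalization integral
alphas = np.linspace(0.05, alpha_max, 10000)                # search grid standing in for the prior box
logZ = np.array([np.log(np.trapz(np.exp(-np.sin(a * xg)), xg)) for a in alphas])

def sample(N):
    out = np.empty(0)
    while out.size < N:                                     # rejection sampling; exp(-sin) <= e
        x = rng.uniform(0.0, 2 * np.pi, 4 * N)
        u = rng.uniform(0.0, 1.0, 4 * N)
        out = np.concatenate([out, x[u < np.exp(-np.sin(alpha_true * x) - 1.0)]])
    return out[:N]

for N in [10, 100, 1000]:
    far = 0
    for _ in range(20):
        x = sample(N)
        loglik = -np.sin(np.outer(alphas, x)).sum(axis=1) - N * logZ
        far += abs(alphas[np.argmax(loglik)] - alpha_true) > 1.0
    print(f"N={N}: best-fit alpha far from the true value in {far} of 20 runs")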

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite dVC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^((d−2)/2). Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that, beyond the class of finitely parameterizable problems, there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

$$\rho(D \to 0) \approx A \exp\!\left[-\frac{B}{D^{\mu}}\right], \quad \mu > 0, \tag{4.62}$$

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

$$Z(\bar{\alpha}; N) = \int dD\, \rho(D;\bar{\alpha})\, \exp[-ND] \approx A(\bar{\alpha}) \int dD\, \exp\!\left[-\frac{B(\bar{\alpha})}{D^{\mu}} - ND\right] \approx \tilde{A}(\bar{\alpha})\, \exp\!\left[-\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N - C(\bar{\alpha})\, N^{\mu/(\mu+1)}\right], \tag{4.63}$$

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

$$\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[\frac{2\pi\, \mu^{1/(\mu+1)}}{\mu+1}\right]^{1/2} \cdot [B(\bar{\alpha})]^{1/(2\mu+2)}, \tag{4.64}$$

$$C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[\frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)}\right]. \tag{4.65}$$

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N:

$$S_1^{(a)}(N) \to \frac{1}{\ln 2}\, \langle C(\bar{\alpha})\rangle_{\bar{\alpha}}\, N^{\mu/(\mu+1)} \quad {\rm (bits)}, \tag{4.66}$$

where ⟨···⟩ denotes an average over all the target models.
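As a sanity check (our own numerical sketch, with A = B = 1 and μ = 1 chosen arbitrarily and a single target model, so the average over ᾱ is trivial), one can integrate the partition function of equation 4.63 directly and watch −log₂ Z(N) approach the leading term of equation 4.66 with the coefficient C of equation 4.65.

import numpy as np

A, B, mu = 1.0, 1.0, 1.0

def log2_Z(N):
    # log2 of Z(N) = A * int dD exp(-B/D**mu - N*D), computed with the peak factored out
    Dstar = (mu * B / N) ** (1.0 / (mu + 1))
    shift = -B / Dstar**mu - N * Dstar
    D = np.linspace(Dstar / 100.0, 100.0 * Dstar, 400001)
    f = -B / D**mu - N * D
    return (shift + np.log(np.trapz(A * np.exp(f - shift), D))) / np.log(2.0)

C = B ** (1.0 / (mu + 1)) * (mu ** (-mu / (mu + 1)) + mu ** (1.0 / (mu + 1)))   # equation 4.65
for N in [10**2, 10**3, 10**4, 10**5]:
    print(N, -log2_Z(N), C * N ** (mu / (mu + 1)) / np.log(2))
    # the ratio of the two columns approaches 1; the residual difference is
    # (roughly) the slower ln N correction in equation 4.63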


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively, and quantitatively, much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S1(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^(ν−1) bits. Summing this up over all samples, we find $S_1^{(a)} \sim N^{\nu}$, and if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with an increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
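The bookkeeping behind this parameter-counting argument can be checked in a few lines (an illustration only, with ν = 1/2 chosen arbitrarily): summing the per-sample contributions K(n)/(2n) with K(n) = n^ν gives a total that scales as N^ν, not N^ν log N.

import numpy as np

nu = 0.5                                    # growth exponent of the number of bins
n = np.arange(1, 10**6 + 1, dtype=float)
contrib = n**nu / (2.0 * n)                 # ~ n**(nu - 1) / 2 bits from the n-th sample
S1 = np.cumsum(contrib)
for N in [10**3, 10**4, 10**5, 10**6]:
    print(N, S1[N - 1], S1[N - 1] / N**nu)  # last column settles near 1/(2*nu)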

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/ℓ₀) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

$$P[\phi(x)] = \frac{1}{\mathcal{Z}} \exp\!\left[-\frac{\ell}{2}\int dx \left(\frac{\partial\phi}{\partial x}\right)^2\right] \delta\!\left[\frac{1}{\ell_0}\int dx\, e^{-\phi(x)} - 1\right], \tag{4.67}$$

where $\mathcal{Z}$ is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x₁, x₂, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the locations of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.11

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D, φ̄) in the model defined by equation 4.67.

With Q(x) = (1/ℓ₀) exp[−φ(x)], we can write the KL divergence as

$$D_{\rm KL}[\bar{\phi}(x)\|\phi(x)] = \frac{1}{\ell_0}\int dx\, \exp[-\bar{\phi}(x)]\, [\phi(x) - \bar{\phi}(x)]. \tag{4.68}$$

We want to compute the density

$$\rho(D;\bar{\phi}) = \int [d\phi(x)]\, P[\phi(x)]\; \delta\!\left(D - D_{\rm KL}[\bar{\phi}(x)\|\phi(x)]\right) \tag{4.69}$$

$$= M \int [d\phi(x)]\, P[\phi(x)]\; \delta\!\left(MD - M D_{\rm KL}[\bar{\phi}(x)\|\phi(x)]\right), \tag{4.70}$$

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

$$\rho(D;\bar{\phi}) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{\rm KL}) \tag{4.71}$$

$$= M \int \frac{dz}{2\pi}\, \exp\!\left(izMD + \frac{izM}{\ell_0}\int dx\, \bar{\phi}(x)\, \exp[-\bar{\phi}(x)]\right) \times \int [d\phi(x)]\, P[\phi(x)]\, \exp\!\left(-\frac{izM}{\ell_0}\int dx\, \phi(x)\, \exp[-\bar{\phi}(x)]\right). \tag{4.72}$$

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

$$\int [d\phi(x)]\, P[\phi(x)]\, \exp\!\left(-\frac{izM}{\ell_0}\int dx\, \phi(x)\, \exp[-\bar{\phi}(x)]\right) = \exp\!\left(-\frac{izM}{\ell_0}\int dx\, \bar{\phi}(x)\, \exp[-\bar{\phi}(x)] - S_{\rm eff}[\bar{\phi}(x); zM]\right), \tag{4.73}$$

$$S_{\rm eff}[\bar{\phi}; zM] = \frac{\ell}{2}\int dx \left(\frac{\partial\bar{\phi}}{\partial x}\right)^2 + \frac{1}{2}\left(\frac{izM}{\ell\ell_0}\right)^{1/2} \int dx\, \exp[-\bar{\phi}(x)/2]. \tag{4.74}$$

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^(3/2) → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

$$\rho(D \to 0;\bar{\phi}) = A[\bar{\phi}(x)]\, D^{-3/2} \exp\!\left(-\frac{B[\bar{\phi}(x)]}{D}\right), \tag{4.75}$$

$$A[\bar{\phi}(x)] = \frac{1}{\sqrt{16\pi\,\ell\ell_0}}\, \exp\!\left[-\frac{\ell}{2}\int dx\, \left(\partial_x\bar{\phi}\right)^2\right] \times \int dx\, \exp[-\bar{\phi}(x)/2], \tag{4.76}$$

$$B[\bar{\phi}(x)] = \frac{1}{16\,\ell\ell_0}\left(\int dx\, \exp[-\bar{\phi}(x)/2]\right)^2. \tag{4.77}$$

Except for the factor of D^(−3/2), this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^(−3/2) prefactor does not change the leading large-N behavior of the predictive information, and we find that

$$S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\, \frac{1}{\sqrt{\ell\ell_0}} \left\langle \int dx\, \exp[-\bar{\phi}(x)/2] \right\rangle_{\bar{\phi}} N^{1/2}, \tag{4.78}$$

where ⟨···⟩ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

$$S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\, \sqrt{N}\, \left(\frac{L}{\ell}\right)^{1/2} \ {\rm bits}. \tag{4.79}$$

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size $\sqrt{\ell/(N Q(x))}$ (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
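In code, this bin counting looks as follows (a sketch with arbitrary choices of L, ℓ, and a second, nonuniform test density of our own): the number of adaptive bins, and hence the predictive information, grows as √N for any smooth Q(x).

import numpy as np

L, ell = 1.0, 1e-3
x = np.linspace(0.0, L, 10001)

def n_bins(Q, N):
    # number of bins = integral of dx / (local bin size), with local size sqrt(ell / (N * Q(x)))
    return np.trapz(np.sqrt(N * Q / ell), x)

Q_uniform = np.ones_like(x) / L
Q_bump = np.exp(-0.5 * ((x - 0.5 * L) / (0.1 * L)) ** 2)
Q_bump /= np.trapz(Q_bump, x)               # normalize the nonuniform test density

for N in [10**2, 10**4, 10**6]:
    print(N, n_bins(Q_uniform, N) / np.sqrt(N), n_bins(Q_bump, N) / np.sqrt(N))
    # both columns are independent of N: the bin count scales as sqrt(N)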

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼√N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set Cₙ is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a dVC = n problem. Now, as N grows, the elements with higher capacities, n ∼ √N, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information (all have power-law growth, smoother functions have smaller powers, there being less to learn, and higher-dimensional problems have larger powers, there being more to learn), but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.12 Rissanen (1984) and Clarke and Barron (1990), in full generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimation with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x₁, x₂, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x₁, x₂, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x₁, x₂, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log2 N bits to the leading order.
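This statement is easy to check numerically in the simplest case (our own example, not taken from the text: coin flips with an unknown bias and a uniform prior, so K = 1). The subextensive part of the entropy of N flips tracks (1/2) log2 N up to an additive constant.

import numpy as np
from scipy.special import gammaln

def S_total(N):
    # With a uniform prior on the bias, every head count n is equally likely, and every
    # ordering is equally likely given n, so P(sequence) = 1 / ((N+1) * C(N, n)).
    n = np.arange(N + 1)
    log2_binom = (gammaln(N + 1) - gammaln(n + 1) - gammaln(N - n + 1)) / np.log(2.0)
    return np.log2(N + 1) + log2_binom.mean()

S0 = 1.0 / (2.0 * np.log(2.0))              # extensive rate: the prior average of the binary entropy
for N in [10**2, 10**3, 10**4, 10**5]:
    S1 = S_total(N) - N * S0
    print(N, S1, 0.5 * np.log2(N))          # S1 follows (1/2) log2 N up to an O(1) constant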

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles, in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations.16 On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations. So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number $\mathcal{N}$ of walks with N steps, we find that $\mathcal{N}(N) \sim A N^{\gamma} z^N$ (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S0 = log2 z, a constant subextensive entropy log2 A, and a divergent subextensive term S1(N) → γ log2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

$$S_{\rm cont} = -\int dx(t)\, P[x(t)]\, \log_2 P[x(t)], \tag{5.1}$$

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

$$S_{\rm rel} = -\int dx(t)\, P[x(t)]\, \log_2\!\left(\frac{P[x(t)]}{Q[x(t)]}\right), \tag{5.2}$$

which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language, we must make our measure invariant to different choices of the reference distribution Q[x(t)].

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.
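A short numerical aside (our own, with arbitrarily chosen Gaussian distributions and the linear change of variables y = 2x + 1) makes the invariance behind equation 5.2 concrete: the differential entropy shifts under the coordinate change, while the relative entropy (equivalently, the KL divergence appearing in equation 5.2) does not.

import numpy as np

rng = np.random.default_rng(2)

def log_gauss(v, m, s):                      # log density of N(m, s^2)
    return -0.5 * ((v - m) / s) ** 2 - np.log(s * np.sqrt(2.0 * np.pi))

x = rng.normal(0.0, 1.0, 10**6)              # samples from P = N(0, 1); take Q = N(1, 1)
y = 2.0 * x + 1.0                            # change of coordinates; P -> N(1, 2^2), Q -> N(3, 2^2)

H_x = 0.5 * np.log2(2 * np.pi * np.e * 1.0**2)   # differential entropy of P in x coordinates
H_y = 0.5 * np.log2(2 * np.pi * np.e * 2.0**2)   # ... and in y coordinates: differs by 1 bit
KL_x = np.mean(log_gauss(x, 0, 1) - log_gauss(x, 1, 1)) / np.log(2)
KL_y = np.mean(log_gauss(y, 1, 2) - log_gauss(y, 3, 2)) / np.log(2)
print(H_x, H_y)                              # not equal
print(KL_x, KL_y)                            # equal, up to Monte Carlo error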

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, Srel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures: the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description $\hat{x}_{\rm past}$, we keep a certain amount of information about the past, $I(\hat{x}_{\rm past}; x_{\rm past})$, and we also preserve a certain amount of information about the future, $I(\hat{x}_{\rm past}; x_{\rm future})$. There is no single correct compression $x_{\rm past} \to \hat{x}_{\rm past}$; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, $I(\hat{x}_{\rm past}; x_{\rm future})$ versus $I(\hat{x}_{\rm past}; x_{\rm past})$.

The predictive information preserved by compression must be less than the total, so that $I(\hat{x}_{\rm past}; x_{\rm future}) \leq I_{\rm pred}(T)$. Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of $\hat{x}_{\rm past}$ such that $I(\hat{x}_{\rm past}; x_{\rm future}) = I_{\rm pred}(T)$. The entropy of this minimal $\hat{x}_{\rm past}$ is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
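As a toy version of this statement (our own construction, not an example from the text), consider a binary symmetric Markov chain with flip probability p. There Ipred saturates at I(s_t; s_{t+1}) = 1 − H2(p) bits, and any noisy one-bit summary of the past preserves no more than that, by data processing.

import numpy as np

def H2(p):                                   # binary entropy in bits
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = 0.1                                      # probability that the chain flips at each step
I_pred = 1.0 - H2(p)                         # I(past; future) for this chain

for q in [0.0, 0.05, 0.2, 0.5]:              # summary = last bit of the past, flipped with prob q
    flip = q * (1 - p) + (1 - q) * p         # summary -> next symbol: a cascade of two binary symmetric channels
    I_summary = 1.0 - H2(flip)
    print(q, round(I_summary, 4), round(I_pred, 4), I_summary <= I_pred + 1e-12)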


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet, one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991), although much

It is natural to ask if these ideas about predictive information could beused to analyze experiments on learning in animals or humans We have em-phasized the problem of learning probability distributions or probabilisticmodels rather than learning deterministic functions associations or rulesIt is known that the nervous system adapts to the statistics of its inputs andthis adaptation is evident in the responses of single neurons (Smirnakis et al1997 Brenner Bialek amp de Ruyter van Steveninck 2000) these experimentsprovide a simple example of the system learning a parameterized distri-bution When making saccadic eye movements human subjects alter theirdistribution of reaction times in relation to the relative probabilities of differ-ent targets as if they had learned an estimate of the relevant likelihoodratios(Carpenter amp Williams 1995) Humans also can learn to discriminate almostoptimally between random sequences (fair coin tosses) and sequences thatare correlated or anticorrelated according to a Markov process this learningcan be accomplished from examples alone with no other feedback (Lopes ampOden 1987) Acquisition of language may require learning the jointdistribu-tion of successive phonemes syllables orwords and there is direct evidencefor learning of conditional probabilities from articial sound sequences byboth infants and adults (Saffran Aslin amp Newport 1996Saffran et al 1999)These examples which are not exhaustive indicate that the nervous systemcan learn an appropriate probabilistic model18 and this offers the oppor-tunity to analyze the dynamics of this learning using information-theoreticmethods What is the entropy of N successive reaction times following aswitch to a new set of relative probabilities in the saccade experiment How

remains to be learned about these cells it is difcult to imagine that the activity of so manyneurons constitutes an efcient representation of the current sensory inputs But if we livein a world where the predictive information in the movies reaching our retina diverges itis perfectly possible that an efcient representation of the predictive information availableto us at one instant requires thousands of times more space than an efcient representationof the image currently falling on our retina

18 As emphasized above many other learning problems including learning a functionfrom noisy examples can be seen as the learning of a probabilistic model Thus we expectthat this description applies to a much wider range of biological learning tasks

2456 W Bialek I Nemenman and N Tishby

much information does a single reaction time provide about the relevantprobabilities Following the arguments above such analysis could lead toa measurement of the universal learning curve L (N)

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ∼ A N^ν and one with K parameters so that I_pred(N) ∼ (K/2) log₂ N, the infinite parameter model has less predictive information for all N smaller than some critical value,

$$N_c \sim \left[ \frac{K}{2A\nu}\, \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}. \qquad (6.1)$$

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
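As a purely illustrative aside (not part of the original analysis), the crossover in equation 6.1 is easy to evaluate numerically. The short Python sketch below compares the two asymptotic forms of I_pred(N) and computes N_c; the values of K, A, and ν are arbitrary choices for the example.

```python
import numpy as np

def ipred_parametric(N, K):
    """Leading asymptotics for a K-parameter model: (K/2) log2 N."""
    return 0.5 * K * np.log2(N)

def ipred_nonparametric(N, A, nu):
    """Power-law growth A * N**nu for a nonparametric model class."""
    return A * N ** nu

def n_critical(K, A, nu):
    """Crossover scale of equation 6.1; assumes K/(2A) > 1 so the log is positive."""
    return (K / (2.0 * A * nu) * np.log2(K / (2.0 * A))) ** (1.0 / nu)

K, A, nu = 1e6, 1.0, 0.5          # illustrative values only
Nc = n_critical(K, A, nu)
print(f"Nc ~ {Nc:.3g}")
for N in [1e3, Nc, 100 * Nc]:
    print(f"N={N:.3g}  parametric={ipred_parametric(N, K):.3g}  "
          f"power law={ipred_nonparametric(N, A, nu):.3g}")
```

Below N_c the power-law (nonparametric) curve lies below the parametric one, consistent with the argument in the text.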

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5×10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability: Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy S(N) (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S₀. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S₁(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.^19
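To make the guessing-game construction concrete, here is a minimal Python sketch (our illustration, not Shannon's original protocol); the function name guess_rank_entropy and the order-3 letter model are our own choices. An N-gram model stands in for the native speaker, and the entropy of the distribution of guess ranks is the quantity that upper-bounds the conditional entropy of the source; a real experiment would use human readers and held-out text.

```python
import numpy as np
from collections import Counter, defaultdict

def guess_rank_entropy(text, N=3):
    """Shannon-style guessing game played by an order-N letter model.

    Candidate next letters are guessed in order of their conditional probability
    given the preceding N letters; the entropy (bits) of the distribution of
    guess ranks upper-bounds the conditional entropy per letter of the source.
    """
    counts = defaultdict(Counter)
    for i in range(len(text) - N):
        counts[text[i:i + N]][text[i + N]] += 1

    ranks = []
    for i in range(len(text) - N):
        context, true_next = text[i:i + N], text[i + N]
        ordered = [c for c, _ in counts[context].most_common()]
        ranks.append(ordered.index(true_next) + 1)

    p = np.bincount(ranks)[1:].astype(float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

# Toy demonstration; training and testing on the same repetitive string makes
# the bound nearly zero -- the point here is only the structure of the estimator.
print(guess_rank_entropy("the quick brown fox jumps over the lazy dog " * 50, N=3))
```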

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

^19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.
Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



where the normalization constant

$$Z_A = \sqrt{(2\pi)^K \det A^{-1}}. \qquad (4.18)$$

Again we note that the sample points {x_i, y_i} are hidden in the value of ᾱ that gave rise to these points.^5

To evaluate the entropy S(N), we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the x_i, and because these are chosen independently from the distribution P(x), the average again is easy to evaluate. The third term involves ᾱ, and we need to average this over the joint distribution P(x_1, y_1, x_2, y_2, …, x_N, y_N). As above, we can evaluate this average in steps: first we choose a value of the parameters ᾱ, then we average over the samples given these parameters, and finally we average over parameters. But because ᾱ is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

$$S(N) = N\left[ S_x + \frac{1}{2}\log_2\left(2\pi e\sigma^2\right)\right] + \frac{K}{2}\log_2 N + S_\alpha + \left\langle \log_2 Z_A \right\rangle_\alpha + \cdots, \qquad (4.19)$$

where ⟨···⟩_α means averaging over parameters, S_x is the entropy of the distribution of x,

$$S_x = -\int dx\, P(x)\log_2 P(x), \qquad (4.20)$$

and similarly for the entropy of the distribution of parameters,

$$S_\alpha = -\int d^K\alpha\, \mathcal{P}(\alpha)\log_2 \mathcal{P}(\alpha). \qquad (4.21)$$

^5 We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points {x_i, y_i}. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as N → ∞, so that this approach should yield the asymptotic behavior of the distribution and, hence, the subextensive entropy at large N. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as N → ∞, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.


The different terms in the entropy, equation 4.19, have a straightforward interpretation. First, we see that the extensive term in the entropy,

$$S_0 = S_x + \frac{1}{2}\log_2\left(2\pi e\sigma^2\right), \qquad (4.22)$$

reflects contributions from the random choice of x and from the gaussian noise in y; these extensive terms are independent of the variations in parameters α, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, S_α. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term ⟨log₂ Z_A⟩_α compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

$$S_1(N) \to \frac{K}{2}\log_2 N \ \ \text{(bits)}. \qquad (4.23)$$

The coefficient of this divergence counts the number of parameters independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions {φ_μ(x)}. This is a hint that the result in equation 4.23 is more universal than our simple example.

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning: we are given examples of how the points x_n map into y_n, and from these examples we are to induce the association or functional relation between x and y. An alternative view is that the pair of points (x, y) should be viewed as a vector $\vec x$, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables $\{\vec x_i\}$ is drawn independently from the same probability distribution $Q(\vec x|\alpha)$, and this distribution depends on a (potentially infinite dimensional) vector of parameters α. As above, the parameters are unknown, and before the series starts they are chosen randomly from a distribution P(α). With no constraints on the densities P(α) or $Q(\vec x|\alpha)$, it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters α, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997, and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).
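As a concrete and purely illustrative instance of this setting (our own example, not one singled out in the text), the following Python sketch draws a parameter vector α once from a prior and then draws N conditionally independent samples; the Gaussian family with unknown means and the function name draw_dataset are assumptions of the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_dataset(N, K, prior_sigma=1.0, noise_sigma=1.0):
    """Draw alpha ~ P(alpha) once, then N samples x ~ Q(x|alpha) = Normal(alpha, noise_sigma^2 I)."""
    alpha = rng.normal(0.0, prior_sigma, size=K)      # hidden "true" parameters
    x = rng.normal(alpha, noise_sigma, size=(N, K))   # conditionally independent samples
    return alpha, x

alpha_true, data = draw_dataset(N=1000, K=3)
print(alpha_true)
print(data.mean(axis=0))   # the sample mean estimates the hidden parameters
```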


We begin with the definition of entropy,

$$S(N) \equiv S[\{\vec x_i\}] = -\int d\vec x_1 \cdots d\vec x_N\, P(\vec x_1, \vec x_2, \ldots, \vec x_N)\log_2 P(\vec x_1, \vec x_2, \ldots, \vec x_N). \qquad (4.24)$$

By analogy with equation 4.3, we then write

$$P(\vec x_1, \vec x_2, \ldots, \vec x_N) = \int d^K\alpha\, \mathcal{P}(\alpha)\prod_{i=1}^{N} Q(\vec x_i|\alpha). \qquad (4.25)$$

Next, combining equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite S(N) as

$$S(N) = -\int d^K\bar\alpha\, \mathcal{P}(\bar\alpha) \times \left\{ \int d\vec x_1 \cdots d\vec x_N \prod_{j=1}^{N} Q(\vec x_j|\bar\alpha)\,\log_2 P(\{\vec x_i\}) \right\}. \qquad (4.26)$$

Equation 4.26 allows an easy interpretation. There is the "true" set of parameters ᾱ that gave rise to the data sequence $\vec x_1, \ldots, \vec x_N$ with the probability $\prod_{j=1}^N Q(\vec x_j|\bar\alpha)$. We need to average $\log_2 P(\vec x_1, \ldots, \vec x_N)$ first over all possible realizations of the data, keeping the true parameters fixed, and then over the parameters ᾱ themselves. With this interpretation in mind, the joint probability density, the logarithm of which is being averaged, can be rewritten in the following useful way:

$$P(\vec x_1, \ldots, \vec x_N) = \prod_{j=1}^{N} Q(\vec x_j|\bar\alpha)\int d^K\alpha\, \mathcal{P}(\alpha)\prod_{i=1}^{N}\left[\frac{Q(\vec x_i|\alpha)}{Q(\vec x_i|\bar\alpha)}\right]
= \prod_{j=1}^{N} Q(\vec x_j|\bar\alpha)\int d^K\alpha\, \mathcal{P}(\alpha)\exp\left[-N\mathcal{E}_N(\alpha;\{\vec x_i\})\right], \qquad (4.27)$$

$$\mathcal{E}_N(\alpha;\{\vec x_i\}) = -\frac{1}{N}\sum_{i=1}^{N}\ln\left[\frac{Q(\vec x_i|\alpha)}{Q(\vec x_i|\bar\alpha)}\right]. \qquad (4.28)$$

Since, by our interpretation, ᾱ are the true parameters that gave rise to the particular data $\{\vec x_i\}$, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

$$\mathcal{E}_N(\alpha;\{\vec x_i\}) = -\int d^D x\, Q(\vec x|\bar\alpha)\ln\left[\frac{Q(\vec x|\alpha)}{Q(\vec x|\bar\alpha)}\right] - \psi(\alpha,\bar\alpha;\{\vec x_i\}), \qquad (4.29)$$

where ψ → 0 as N → ∞; here we neglect ψ and return to this term below.


The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, D_KL(ᾱ‖α), between the true distribution characterized by parameters ᾱ and the possible distribution characterized by α. Thus, at large N we have

$$P(\vec x_1, \vec x_2, \ldots, \vec x_N) \to \prod_{j=1}^{N} Q(\vec x_j|\bar\alpha)\int d^K\alpha\, \mathcal{P}(\alpha)\exp\left[-N D_{KL}(\bar\alpha\|\alpha)\right], \qquad (4.30)$$

where again the notation → reminds us that we are not only taking the limit of large N but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find

$$S(N) \approx S_0\cdot N + S_1^{(a)}(N), \qquad (4.31)$$

$$S_0 = \int d^K\alpha\, \mathcal{P}(\alpha)\left[-\int d^D x\, Q(\vec x|\alpha)\log_2 Q(\vec x|\alpha)\right], \ \text{and} \qquad (4.32)$$

$$S_1^{(a)}(N) = -\int d^K\bar\alpha\, \mathcal{P}(\bar\alpha)\log_2\left[\int d^K\alpha\, \mathcal{P}(\alpha)\, e^{-N D_{KL}(\bar\alpha\|\alpha)}\right]. \qquad (4.33)$$

Here S₁^(a) is an approximation to S₁ that neglects fluctuations ψ. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence $\vec x_1, \ldots, \vec x_N$ with the disorder, $\mathcal{E}_N(\alpha;\{\vec x_i\})$ with the energy of the quenched system, and D_KL(ᾱ‖α) with its annealed analog.

The extensive term S₀, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the N dependence of the partition function

$$Z(\bar\alpha; N) = \int d^K\alpha\, \mathcal{P}(\alpha)\exp\left[-N D_{KL}(\bar\alpha\|\alpha)\right], \qquad (4.34)$$

and $S_1^{(a)}(N) = -\langle\log_2 Z(\bar\alpha;N)\rangle_{\bar\alpha}$ is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are KL divergence D away from the target ᾱ,

$$\rho(D;\bar\alpha) = \int d^K\alpha\, \mathcal{P}(\alpha)\,\delta\left[D - D_{KL}(\bar\alpha\|\alpha)\right]. \qquad (4.35)$$


This density could be very different for different targets.^6 The density of divergences is normalized because the original distribution over parameter space, P(α), is normalized,

$$\int dD\, \rho(D;\bar\alpha) = \int d^K\alpha\, \mathcal{P}(\alpha) = 1. \qquad (4.36)$$

Finally, the partition function takes the simple form

$$Z(\bar\alpha; N) = \int dD\, \rho(D;\bar\alpha)\exp[-ND]. \qquad (4.37)$$

We recall that in statistical mechanics, the partition function is given by

$$Z(\beta) = \int dE\, \rho(E)\exp[-\beta E], \qquad (4.38)$$

where ρ(E) is the density of states that have energy E and β is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length N of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of ρ(D;ᾱ) as D → 0.^7
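The density-of-states picture lends itself to a quick numerical check. The sketch below is our own illustration, using the same assumed Gaussian family with unknown means and a unit Gaussian prior as above: it samples parameters from the prior, histograms the divergences D_KL(ᾱ‖α) to estimate ρ(D;ᾱ), and evaluates Z(ᾱ;N) as a Monte Carlo average, showing how increasing N ("cooling") concentrates the weight on models close to the target.

```python
import numpy as np

rng = np.random.default_rng(1)

K, M = 4, 1_000_000                    # parameter-space dimension, Monte Carlo samples
alpha_bar = np.zeros(K)                # target parameters
alphas = rng.normal(0.0, 1.0, size=(M, K))        # draws from the prior P(alpha)

# For unit-variance Gaussians with means alpha_bar and alpha,
# D_KL(alpha_bar || alpha) = |alpha_bar - alpha|^2 / 2 (in nats).
D = 0.5 * np.sum((alphas - alpha_bar) ** 2, axis=1)

# Density of divergences rho(D; alpha_bar), estimated as a normalized histogram;
# its behavior near D -> 0 is what controls the large-N asymptotics.
rho, edges = np.histogram(D, bins=200, range=(0.0, float(D.max())), density=True)

# Z(alpha_bar; N) = <exp(-N D)>_prior and the typical divergence probed at
# "temperature" 1/N; for this family Z = (1 + N)**(-K/2) exactly, which is a check.
for N in [1, 10, 100]:
    w = np.exp(-N * D)
    print(f"N={N:4d}  Z={w.mean():.3e}  <D>_N={np.average(D, weights=w):.4f}  "
          f"P(D < K/2N)={np.mean(D < K / (2 * N)):.5f}")
```

For a plain prior-sampling estimate like this, the small-D region becomes poorly sampled at very large N, which is one way of seeing that very few models sit close to the target.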

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small D, and the density term, which favors values of D that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with D → 0. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states, ρ(D;ᾱ), at small D is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

$$D_{KL}(\bar\alpha\|\alpha) \approx \frac{1}{2}\sum_{\mu\nu}\left(\bar\alpha_\mu - \alpha_\mu\right)\mathcal{F}_{\mu\nu}\left(\bar\alpha_\nu - \alpha_\nu\right) + \cdots, \qquad (4.39)$$

^6 If parameter space is compact, then a related description of the space of targets based on metric entropy, also called Kolmogorov's ε-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of the size not greater than ε into which the target space can be partitioned.

^7 It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.


where F is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low D limit of the density ρ(D;ᾱ):

$$\rho(D;\bar\alpha) = \int d^K\alpha\, \mathcal{P}(\alpha)\,\delta\left[D - D_{KL}(\bar\alpha\|\alpha)\right]$$
$$\approx \int d^K\alpha\, \mathcal{P}(\alpha)\,\delta\!\left[D - \frac{1}{2}\sum_{\mu\nu}\left(\bar\alpha_\mu - \alpha_\mu\right)\mathcal{F}_{\mu\nu}\left(\bar\alpha_\nu - \alpha_\nu\right)\right]$$
$$= \int d^K\xi\, \mathcal{P}(\bar\alpha + U\cdot\xi)\,\delta\!\left[D - \frac{1}{2}\sum_{\mu}\Lambda_\mu\,\xi_\mu^2\right], \qquad (4.40)$$

where U is a matrix that diagonalizes F,

$$\left(U^T\cdot\mathcal{F}\cdot U\right)_{\mu\nu} = \Lambda_\mu\,\delta_{\mu\nu}. \qquad (4.41)$$

The delta function restricts the components of ξ in equation 4.40 to be of order √D or less, and so, if P(α) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

$$\rho(D\to 0;\bar\alpha) \approx \mathcal{P}(\bar\alpha)\,\frac{2\pi^{K/2}}{\Gamma(K/2)}\,\left(\det\mathcal{F}\right)^{-1/2}\, D^{(K-2)/2}. \qquad (4.42)$$

Here, as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

$$Z(\bar\alpha; N\to\infty) \approx f(\bar\alpha)\cdot\frac{\Gamma(K/2)}{N^{K/2}}, \qquad (4.43)$$

where f(ᾱ) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34,

$$S_1^{(a)}(N) = -\int d^K\bar\alpha\, \mathcal{P}(\bar\alpha)\log_2 Z(\bar\alpha;N) \qquad (4.44)$$
$$\to \frac{K}{2}\log_2 N + \cdots \ \ \text{(bits)}, \qquad (4.45)$$

where ⋯ are finite as N → ∞. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even of the prior distribution P(α) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
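Continuing the illustrative Gaussian-mean example above (our own numerical aside, not from the original article), one can estimate S₁^(a)(N) = −⟨log₂ Z(ᾱ;N)⟩_ᾱ by Monte Carlo and check that its slope against log₂ N approaches K/2, as in equation 4.45. The sample sizes and the number of targets averaged over are arbitrary choices of the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
K, M = 2, 400_000                      # parameter dimension, Monte Carlo samples per estimate

def minus_log2_Z(alpha_bar, N):
    """-log2 Z(alpha_bar; N) for the Gaussian-mean family with a unit Gaussian prior.

    For alpha_bar = 0 the exact answer is (K/2) * log2(1 + N), a useful check.
    """
    alphas = rng.normal(0.0, 1.0, size=(M, K))
    D = 0.5 * np.sum((alphas - alpha_bar) ** 2, axis=1)   # KL divergence in nats
    return -np.log2(np.mean(np.exp(-N * D)))

Ns = np.array([2, 4, 8, 16, 32, 64])
S1 = []
for N in Ns:
    targets = rng.normal(0.0, 1.0, size=(20, K))          # average over targets, as in eq. 4.44
    S1.append(np.mean([minus_log2_Z(t, N) for t in targets]))

slope = np.polyfit(np.log2(Ns), S1, 1)[0]
print(f"fitted slope of S1 vs log2 N: {slope:.2f}   (K/2 = {K / 2})")
```

At these moderate values of N the fitted slope sits somewhat below K/2 and approaches it as N grows, consistent with the constant corrections hidden in the ⋯ of equation 4.45.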

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension, d_VC), which measures capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of F is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < ∞ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small D_KL behavior of the model density,

$$\rho(D\to 0;\bar\alpha) \propto D^{(d-2)/2}, \qquad (4.46)$$

and then d appears in place of K as the coefficient of the log divergence in S₁(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density ρ(D;ᾱ) accumulates a macroscopic weight at some nonzero value of D, so that the small D behavior is irrelevant for the large N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_pred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N ≫ d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution


$Q(\vec x_1, \ldots, \vec x_N|\alpha)$. Again, α is an unknown parameter vector chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log₂ N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

$$S\left[\{\vec x_i\}\,|\,\alpha\right] \equiv -\int d^N\vec x\; Q(\{\vec x_i\}|\alpha)\log_2 Q(\{\vec x_i\}|\alpha) \to N S_0 + S_0^*;\quad S_0^* = O(1), \qquad (4.47)$$

$$D_{KL}\left[Q(\{\vec x_i\}|\bar\alpha)\,\big\|\,Q(\{\vec x_i\}|\alpha)\right] \to N D_{KL}(\bar\alpha\|\alpha) + o(N). \qquad (4.48)$$

Here the quantities S₀, S₀*, and D_KL(ᾱ‖α) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data. All of the long-range predictability is associated with learning the parameters.^8 The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result

$$S(N) = S_0\cdot N + S_1^{(a)}(N), \qquad (4.49)$$
$$S_1^{(a)}(N) = \frac{K}{2}\log_2 N + \cdots \ \ \text{(bits)}, \qquad (4.50)$$

where ⋯ stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

^8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution $Q(\vec x_1, \ldots, \vec x_N|\alpha)$. Suppose, for example, that S₀* = (K*/2) log₂ N. Then the subextensive entropy becomes

$$S_1^{(a)}(N) = \frac{K + K^*}{2}\log_2 N + \cdots \ \ \text{(bits)}. \qquad (4.51)$$

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K ≠ 0, K* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.^9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S₁ are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

^9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

$$S(N) = -\int d^K\bar\alpha\, \mathcal{P}(\bar\alpha)\int \prod_{j=1}^{N}\left[d\vec x_j\, Q(\vec x_j|\bar\alpha)\right] \times \log_2\left\{ \prod_{i=1}^{N} Q(\vec x_i|\bar\alpha)\int d^K\alpha\,\mathcal{P}(\alpha)\, e^{-N D_{KL}(\bar\alpha\|\alpha)+N\psi(\alpha,\bar\alpha;\{\vec x_i\})} \right\}. \qquad (4.52)$$

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S₁^(f)(N):

$$S(N) = S_0\cdot N + S_1^{(a)}(N) + S_1^{(f)}(N), \qquad (4.53)$$

$$S_1^{(f)} = -\int d^K\bar\alpha\,\mathcal{P}(\bar\alpha)\int \prod_{j=1}^{N}\left[d\vec x_j\, Q(\vec x_j|\bar\alpha)\right] \times \log_2\left[\int \frac{d^K\alpha\,\mathcal{P}(\alpha)}{Z(\bar\alpha;N)}\, e^{-N D_{KL}(\bar\alpha\|\alpha)+N\psi(\alpha,\bar\alpha;\{\vec x_i\})}\right], \qquad (4.54)$$

where ψ is defined as in equation 4.29, and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of −S₁^(a) as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^(f) is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S₁(N) is bounded,

$$S_1(N) \le N\int d^K\alpha\int d^K\bar\alpha\,\mathcal{P}(\alpha)\,\mathcal{P}(\bar\alpha)\,D_{KL}(\bar\alpha\|\alpha) \equiv \left\langle N D_{KL}(\bar\alpha\|\alpha)\right\rangle_{\bar\alpha,\alpha}, \qquad (4.55)$$

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱ,α} includes S₀ as one of its terms, so that usually S₀ and S₁ are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities $\{Q(\vec x\,|\,\alpha)\}$ is finite. Then for any reasonable function $F(\vec x;\beta)$, deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

$$P\left\{ \sup_\beta\left| \frac{\frac{1}{N}\sum_j F(\vec x_j;\beta) - \int d\vec x\, Q(\vec x|\bar\alpha)\,F(\vec x;\beta)}{L[F]} \right| > \epsilon \right\} < M(\epsilon, N, d_{VC})\, e^{-cN\epsilon^2}, \qquad (4.56)$$

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality, in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S₁^(f) in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term, that is, the contribution of sequences of $\{\vec x_i\}$ with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and $\{\vec x_i\}$ is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε²). Thus the worst-case contribution of these atypical sequences is

$$S_1^{(f),\,\mathrm{atypical}} \sim \int_\delta^\infty d\epsilon\; N^2\epsilon^2\, M(\epsilon)\, e^{-cN\epsilon^2} \sim e^{-cN\delta^2} \ll 1 \quad \text{for large } N. \qquad (4.57)$$


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

$$S_1^{(f),\,\mathrm{typical}} \sim \int d^K\bar\alpha\,\mathcal{P}(\bar\alpha)\, d^N\vec x \prod_j Q(\vec x_j|\bar\alpha)\; N\left(B\cdot A^{-1}\cdot B\right);$$
$$B_\mu = \frac{1}{N}\sum_i \frac{\partial\log Q(\vec x_i|\bar\alpha)}{\partial\bar\alpha_\mu}, \qquad \langle B\rangle_{\vec x} = 0;$$
$$(A)_{\mu\nu} = \frac{1}{N}\sum_i \frac{\partial^2\log Q(\vec x_i|\bar\alpha)}{\partial\bar\alpha_\mu\,\partial\bar\alpha_\nu}, \qquad \langle A\rangle_{\vec x} = \mathcal{F}. \qquad (4.58)$$

Here $\langle\cdots\rangle_{\vec x}$ means an averaging with respect to all $\vec x_i$'s, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A⁻¹ with F⁻¹(1+δ), where δ again is independent of N. Now we have to average a bunch of products like

$$\frac{\partial\log Q(\vec x_i|\bar\alpha)}{\partial\bar\alpha_\mu}\,\left(\mathcal{F}^{-1}\right)_{\mu\nu}\,\frac{\partial\log Q(\vec x_j|\bar\alpha)}{\partial\bar\alpha_\nu} \qquad (4.59)$$

over all $\vec x_i$'s. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N⁻¹. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C₁ ⊂ C₂ ⊂ C₃ ⊂ ⋯ can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ Cₙ. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element Cₙ, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any Cₙ, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, Cₙ, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

$$Q(x|\alpha) = \frac{1}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(x-\alpha\right)^2\right], \quad x\in(-\infty;+\infty); \qquad (4.60)$$

$$Q(x|\alpha) = \frac{\exp\left(-\sin\alpha x\right)}{\int_0^{2\pi} dx\,\exp\left(-\sin\alpha x\right)}, \quad x\in[0;2\pi). \qquad (4.61)$$

Learning the parameter in the first case is a $d_{\rm VC} = 1$ problem; in the second case, $d_{\rm VC} = \infty$. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior $\mathcal{P}(\alpha)$. The second one does not allow this. Suppose that the prior is uniform in a box $0 < a < a_{\max}$ and zero elsewhere, with $a_{\max}$ rather large. Then, for not too many sample points $N$, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of $-\sin ax$. Adding a new data point would not help until that best, but wrong, parameter estimate is less than $a_{\max}$.¹⁰ So the fluctuations are large, and the predictive information is

¹⁰ Interestingly, since for the model equation 4.61 the KL divergence is bounded from below and from above, for $a_{\max} \to \infty$ the weight in $\rho(D; \bar{a})$ at small $D_{\rm KL}$ vanishes, and a finite weight accumulates at some nonzero value of $D$. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of $a$ would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of $(1/2)\log N$. On the other hand, there is no uniform bound on the value of $N$ for which this convergence will occur; it is guaranteed only for $N \gg d_{\rm VC}$, which is never true if $d_{\rm VC} = \infty$. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, & Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to $d_{\rm VC}$.
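The pathology of the $d_{\rm VC} = \infty$ example in equation 4.61 is easy to see numerically. The following sketch is ours, not part of the original analysis: it draws a few samples from $Q(x|a)$ at a true value $a_{\rm true}$, scans the likelihood over a broad range $0 < a < a_{\max}$, and typically finds that a spurious large $a$, which parks the few observed points near the crests of $-\sin ax$, beats the true parameter; only with many more samples does the true value win. All constants (seed, $a_{\rm true}$, $a_{\max}$, grid sizes) are arbitrary choices for illustration.

```python
# Illustration (ours) of the d_VC = infinity example, equation 4.61: with a broad
# prior 0 < a < a_max and only a few samples, the likelihood is typically maximized
# by a spurious large a that places the observed points near the crests of -sin(ax).
import numpy as np

rng = np.random.default_rng(1)
a_true, a_max = 1.0, 100.0
m = 2048
t = (np.arange(m) + 0.5) * 2.0 * np.pi / m            # midpoint grid on [0, 2*pi)

def sample(n):
    """Rejection-sample n points from Q(x|a_true) = exp(-sin(a_true*x)) / Z."""
    out = []
    while len(out) < n:
        x = rng.uniform(0.0, 2.0 * np.pi)
        if rng.uniform() < np.exp(-np.sin(a_true * x)) / np.e:
            out.append(x)
    return np.array(out)

def log_likelihood(x, a_grid):
    """sum_i log Q(x_i|a) for each a on the grid; Z(a) from the midpoint rule."""
    log_z = np.array([np.log(np.sum(np.exp(-np.sin(a * t))) * 2.0 * np.pi / m)
                      for a in a_grid])
    return -np.sin(np.outer(a_grid, x)).sum(axis=1) - len(x) * log_z

a_grid = np.arange(0.01, a_max, 0.005)
for n in (5, 200):
    x = sample(n)
    ll = log_likelihood(x, a_grid)
    i_best, i_true = np.argmax(ll), np.argmin(np.abs(a_grid - a_true))
    print(f"N={n:4d}: best-fit a = {a_grid[i_best]:7.3f},  "
          f"logL(best) - logL(a_true) = {ll[i_best] - ll[i_true]:.2f}")
```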

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly, in the present context, there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite $d_{\rm VC}$ fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with $N$ as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which


the predictive information grows with $N$ more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as $\rho \sim D^{(d-2)/2}$. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density $\rho(D \to 0)$ vanishes more rapidly than any power of $D$. As an example, we could have

$$
\rho(D \to 0) \approx A \exp\left[-\frac{B}{D^{\mu}}\right], \qquad \mu > 0;  \qquad (4.62)
$$

that is, an essential singularity at $D = 0$. For simplicity, we assume that the constants $A$ and $B$ can depend on the target model, but that the nature of the essential singularity ($\mu$) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

$$
\begin{aligned}
Z(\bar{\alpha}; N) &= \int dD\, \rho(D;\bar{\alpha})\, \exp[-ND] \\
&\approx A(\bar{\alpha}) \int dD\, \exp\left[-\frac{B(\bar{\alpha})}{D^{\mu}} - ND\right] \\
&\approx \tilde{A}(\bar{\alpha}) \exp\left[-\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N - C(\bar{\alpha})\, N^{\mu/(\mu+1)}\right],
\end{aligned}
\qquad (4.63)
$$

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large $N$, and the coefficients are

$$
\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[\frac{2\pi\,\mu^{1/(\mu+1)}}{\mu+1}\right]^{1/2} [B(\bar{\alpha})]^{1/(2\mu+2)},  \qquad (4.64)
$$
$$
C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[\frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)}\right].  \qquad (4.65)
$$

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large $N$:

$$
S_1^{(a)}(N) \to \frac{1}{\ln 2}\, \langle C(\bar{\alpha}) \rangle_{\bar{\alpha}}\, N^{\mu/(\mu+1)} \quad \mbox{(bits)},  \qquad (4.66)
$$

where $\langle\cdots\rangle_{\bar{\alpha}}$ denotes an average over all the target models.
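As a sanity check on the steepest-descent computation, one can evaluate the integral defining $Z$ numerically. The short sketch below is ours and not part of the paper's argument; it takes $\rho(D) = A\exp(-B/D^{\mu})$ with arbitrary constants, computes $\ln Z(N)$ on a grid, and compares it with the saddle-point expression whose coefficient $C$ is given by equation 4.65 (the difference should settle to the constant $\ln \tilde{A}$ of equation 4.64).

```python
# Numerical check (ours) of equations 4.63-4.65 for arbitrary constants A, B, mu.
import numpy as np

A, B, mu = 1.0, 1.0, 1.0
C = B**(1 / (mu + 1)) * (mu**(-mu / (mu + 1)) + mu**(1 / (mu + 1)))   # equation 4.65

D = np.logspace(-6, 2, 200001)                                        # grid spanning the saddle point

def ln_Z(N):
    f = np.log(A) - B / D**mu - N * D                                 # log of the integrand
    fmax = f.max()
    w = np.exp(f - fmax)
    return fmax + np.log(np.sum(0.5 * (w[1:] + w[:-1]) * np.diff(D))) # trapezoid rule

for N in (1e2, 1e3, 1e4):
    saddle = -C * N**(mu / (mu + 1)) - 0.5 * (mu + 2) / (mu + 1) * np.log(N)
    print(f"N={N:8.0f}   ln Z = {ln_Z(N):10.3f}   saddle-point estimate = {saddle:10.3f}")
    # the difference between the two columns approaches the constant ln(A~) of eq. 4.64
```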


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with $K$ bins, and we let $K$ grow with the quantity of available samples as $K \sim N^{\nu}$. A straightforward generalization of equation 4.45 seems to imply then that $S_1(N) \sim N^{\nu} \log N$ (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large $N$ than about those that have existed since the very first measurements. Indeed, in a $K$-parameter system, the $N$th sample point contributes $\sim K/2N$ bits to the subextensive entropy (cf. equation 4.45). If $K$ changes as mentioned, the $N$th example then carries $\sim N^{\nu-1}$ bits. Summing this up over all samples, we find $S_1^{(a)} \sim N^{\nu}$, and if we let $\nu = \mu/(\mu+1)$, we obtain equation 4.66; note that the growth of the number of parameters is slower than $N$ ($\nu = \mu/(\mu+1) < 1$), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular $N$ (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
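The counting argument in the preceding paragraph can be checked with one line of arithmetic: summing per-sample contributions $K(n)/(2n)$ with $K(n) = n^{\nu}$ gives a pure power law. The snippet below is a trivial illustration of ours, not a calculation from the paper.

```python
# Tiny check (ours): if the n-th sample contributes ~ K(n)/(2n) bits with K(n) = n^nu,
# the accumulated subextensive entropy grows as ~ N^nu (here N^nu / (2 nu)),
# not as N^nu log N.
import numpy as np

nu = 0.5
n = np.arange(1, 10**6 + 1)
S1 = np.cumsum(n**nu / (2.0 * n))             # sum of per-sample contributions
for N in (10**2, 10**4, 10**6):
    print(N, S1[N - 1], N**nu / (2 * nu))     # the two columns agree at large N
```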

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As $\mu$ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large $N$, the predictive information increases with $\mu$. However, if $\mu \to \infty$, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar


to, but sometimes very nontrivially related to, $S_1$ are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution $Q(x)$ for a continuous variable $x$, but rather than writing a parametric form of $Q(x)$, we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write $Q(x) = (1/l_0) \exp[-\phi(x)]$, so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function $\phi(x)$ is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

$$
\mathcal{P}[\phi(x)] = \frac{1}{\mathcal{Z}} \exp\left[-\frac{\lambda}{2}\int dx \left(\frac{\partial\phi}{\partial x}\right)^2\right]
\delta\left[\frac{1}{l_0}\int dx\, e^{-\phi(x)} - 1\right],  \qquad (4.67)
$$

where $\mathcal{Z}$ is the normalization constant for $\mathcal{P}[\phi]$, the delta function ensures that each distribution $Q(x)$ is normalized, and $\lambda$ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of $Q(x)$ from a set of samples $x_1, x_2, \ldots, x_N$ converges to the correct answer, and the distribution at finite $N$ is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development


of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of $\mathcal{P}[\phi(x)]$, have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate $\rho(D; \bar{\phi})$ in the model defined by equation 4.67.

With $Q(x) = (1/l_0)\exp[-\phi(x)]$, we can write the KL divergence as

$$
D_{\rm KL}[\bar{\phi}(x)\|\phi(x)] = \frac{1}{l_0}\int dx\, \exp[-\bar{\phi}(x)]\,[\phi(x) - \bar{\phi}(x)].  \qquad (4.68)
$$

We want to compute the density

$$
\begin{aligned}
\rho(D;\bar{\phi}) &= \int [d\phi(x)]\, \mathcal{P}[\phi(x)]\, \delta\bigl(D - D_{\rm KL}[\bar{\phi}(x)\|\phi(x)]\bigr) & (4.69)\\
&= M \int [d\phi(x)]\, \mathcal{P}[\phi(x)]\, \delta\bigl(MD - M D_{\rm KL}[\bar{\phi}(x)\|\phi(x)]\bigr), & (4.70)
\end{aligned}
$$

where we introduce a factor $M$, which we will allow to become large so that we can focus our attention on the interesting limit $D \to 0$. To compute this integral over all functions $\phi(x)$, we introduce a Fourier representation for the delta function and then rearrange the terms:

$$
\begin{aligned}
\rho(D;\bar{\phi}) &= M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, \mathcal{P}[\phi(x)]\, \exp(-izM D_{\rm KL}) & (4.71)\\
&= M \int \frac{dz}{2\pi}\, \exp\left(izMD + \frac{izM}{l_0}\int dx\, \bar{\phi}(x) \exp[-\bar{\phi}(x)]\right) \\
&\quad\times \int [d\phi(x)]\, \mathcal{P}[\phi(x)]\,
\exp\left(-\frac{izM}{l_0}\int dx\, \phi(x) \exp[-\bar{\phi}(x)]\right), & (4.72)
\end{aligned}
$$

The inner integral over the functions $\phi(x)$ is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that $zM$ is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

¹¹ We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

$$
\int [d\phi(x)]\,\mathcal{P}[\phi(x)]\,
\exp\left(-\frac{izM}{l_0}\int dx\, \phi(x)\exp[-\bar{\phi}(x)]\right)
= \exp\left(-\frac{izM}{l_0}\int dx\, \bar{\phi}(x)\exp[-\bar{\phi}(x)]
- S_{\rm eff}[\bar{\phi}(x); zM]\right),  \qquad (4.73)
$$
$$
S_{\rm eff}[\bar{\phi}; zM] = \frac{\lambda}{2}\int dx \left(\frac{\partial\bar{\phi}}{\partial x}\right)^2
+ \frac{1}{2}\left(\frac{izM}{\lambda l_0}\right)^{1/2} \int dx\, \exp[-\bar{\phi}(x)/2].  \qquad (4.74)
$$

Now we can do the integral over $z$, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that $D \to 0$ and $MD^{3/2} \to \infty$; we are interested precisely in the first limit, and we are free to set $M$ as we wish, so this gives us a good approximation for $\rho(D \to 0; \bar{\phi})$. This results in

$$
\rho(D \to 0; \bar{\phi}) = A[\bar{\phi}(x)]\, D^{-3/2} \exp\left(-\frac{B[\bar{\phi}(x)]}{D}\right),  \qquad (4.75)
$$
$$
A[\bar{\phi}(x)] = \frac{1}{\sqrt{16\pi\lambda l_0}}\,
\exp\left[-\frac{\lambda}{2}\int dx\, \bigl(\partial_x\bar{\phi}\bigr)^2\right]
\times \int dx\, \exp[-\bar{\phi}(x)/2],  \qquad (4.76)
$$
$$
B[\bar{\phi}(x)] = \frac{1}{16\lambda l_0}\left(\int dx\, \exp[-\bar{\phi}(x)/2]\right)^2.  \qquad (4.77)
$$

Except for the factor of $D^{-3/2}$, this is exactly the sort of essential singularity that we considered in the previous section, with $\mu = 1$. The $D^{-3/2}$ prefactor does not change the leading large-$N$ behavior of the predictive information, and we find that

$$
S_1^{(a)}(N) \sim \frac{1}{2 \ln 2\,\sqrt{\lambda l_0}}
\left\langle \int dx\, \exp[-\bar{\phi}(x)/2] \right\rangle_{\bar{\phi}}\, N^{1/2},  \qquad (4.78)
$$

where $\langle\cdots\rangle_{\bar{\phi}}$ denotes an average over the target distributions $\bar{\phi}(x)$ weighted once again by $\mathcal{P}[\bar{\phi}(x)]$. Notice that if $x$ is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable $x$ range from 0 to $L$, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

$$
S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\sqrt{N}\,\left(\frac{L}{\lambda}\right)^{1/2} \quad \mbox{bits}.  \qquad (4.79)
$$

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of $x$ into bins of local size $\sqrt{\lambda/(N Q(x))}$ (Bialek,


Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of $x$. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
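A rough numerical version of this bin-counting argument (ours, with arbitrary parameter values) is given below: it integrates the local bin density for a uniform and for a strongly peaked target distribution and compares the uniform case with $\sqrt{NL/\lambda}$, the combination appearing in equation 4.79.

```python
# Back-of-the-envelope check (ours): with adaptive bins of local size
# sqrt(lambda / (N Q(x))), the number of bins is \int dx sqrt(N Q(x) / lambda);
# for the uniform Q = 1/L on [0, L] this is sqrt(N L / lambda).
import numpy as np

lam, L, N = 1.0, 10.0, 10**6
x = np.linspace(0.0, L, 100001)
dx = x[1] - x[0]

def n_bins(Q):
    return np.sum(np.sqrt(N * Q / lam)) * dx

Q_uniform = np.full_like(x, 1.0 / L)
Q_peaked = np.exp(-0.5 * (x - L / 2)**2)
Q_peaked /= np.sum(Q_peaked) * dx                 # normalize on [0, L]

print("uniform:", n_bins(Q_uniform), " vs sqrt(N L / lambda) =", np.sqrt(N * L / lam))
print("peaked :", n_bins(Q_peaked))               # fewer bins in total than the uniform case
```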

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step $\sim\sqrt{N}$ bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure $C$ on the set of all possible densities, so that the set $C_n$ is formed of all continuous piecewise linear functions that have not more than $n$ kinks. Learning such functions for finite $n$ is a $d_{\rm VC} = n$ problem. Now, as $N$ grows, the elements with higher capacities $n \sim \sqrt{N}$ are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter $\lambda$ describing the scale of smoothness in the distribution. In the original work it was proposed that one should integrate over possible values of $\lambda$ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing $\lambda$ to vary with $x$. In this case the whole theory has no length scale, and there also is no need to confine the variable $x$ to a box (here of size $L$). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying $\sqrt{N}$ in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information (all have power-law growth; smoother functions have smaller powers, since there is less to learn, and higher-dimensional problems have larger powers, since there is more to learn), but real calculations for these cases remain challenging.

5 $I_{\rm pred}$ as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹² Rissanen (1984) and Clarke and Barron (1990), in full

¹² Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution $P$ is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream $x_1, x_2, \ldots, x_N$ will require an amount of space equal to the entropy of the joint distribution $P(x_1, x_2, \ldots, x_N)$. The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have $N$ symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution $P(x_1, x_2, \ldots, x_N)$.

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of $N$ points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a $K$-parameter model, both quantities are equal to $(K/2)\log_2 N$ bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that $I_{\rm pred}$ provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our


approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on


a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density $S_0$ and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) $I_{\rm pred}(T \to \infty) = {\rm constant}$.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal $x(0 < t < T)$ drawn from a distribution $P[x(t)]$. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.¹³ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal $x(t)$ or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble $P[x(t)]$.

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986) in particular has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.¹⁴ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution $P[x(t)]$.

¹³ The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration $T$ of our observations. We leave these algorithmic issues aside for this discussion.

¹⁴ This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are $N$ equally likely signals, then the measure should be monotonic in $N$; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, $S(T)$. We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal $x$ is continuous, then the entropy is not invariant under transformations of $x$. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function $S(T)$ that depends on the coordinate system for $x$;¹⁵ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large $T$. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

¹⁵ Here we consider instantaneous transformations of $x$, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream $x(t)$ that mix points within a temporal window of size $\tau$, then for $T \gg \tau$ the entropy $S(T)$ may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.¹⁶ So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number $\mathcal{N}$ of walks with $N$ steps, we find that $\mathcal{N}(N) \sim A N^{\gamma} z^N$ (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy $S_0 = \log_2 z$, a constant subextensive entropy $\log_2 A$, and a divergent subextensive term $S_1(N) \to \gamma \log_2 N$. Of these three terms, only the divergent subextensive term (related to the critical exponent $\gamma$) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process $x(t)$ distributed according to $P[x(t)]$ as

$$
S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],  \qquad (5.1)
$$

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy, or KL divergence, between the distribution $P[x(t)]$ and some reference distribution $Q[x(t)]$,

$$
S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2\left(\frac{P[x(t)]}{Q[x(t)]}\right),  \qquad (5.2)
$$

¹⁶ Throughout this discussion, we assume that the signal $x$ at one point in time is finite dimensional. There are subtleties if we allow $x$ to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution $Q[x(t)]$, and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution $Q[x(t)]$.

The reference distribution $Q[x(t)]$ embodies our expectations for the signal $x(t)$; in particular, $S_{\rm rel}$ measures the extra space needed to encode signals drawn from the distribution $P[x(t)]$ if we use coding strategies that are optimized for $Q[x(t)]$. If $x(t)$ is a written text, two readers who expect different numbers of spelling errors will have different $Q$s, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in $Q$. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution $Q[x(t)]$ by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution $Q[x(t)]$ from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again, it is important that the signal $x$ at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window $T$), plus a constant subextensive term, plus terms that vanish as $T$ becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have


seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration $T$, what are the features of these data that carry the predictive information $I_{\rm pred}(T)$? From equation 3.7, we know that most of what we have seen over the time $T$ must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data $x$, how do we compress our description of $x$ while preserving as much information as possible about some other variable $y$? Here we identify $x = x_{\rm past}$ as the past data and $y = x_{\rm future}$ as the future. When we compress $x_{\rm past}$ into some reduced description $\hat{x}_{\rm past}$, we keep a certain amount of information about the past, $I(\hat{x}_{\rm past}; x_{\rm past})$, and we also preserve a certain amount of information about the future, $I(\hat{x}_{\rm past}; x_{\rm future})$. There is no single correct compression $x_{\rm past} \to \hat{x}_{\rm past}$; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, $I(\hat{x}_{\rm past}; x_{\rm future})$ versus $I(\hat{x}_{\rm past}; x_{\rm past})$.

The predictive information preserved by compression must be less than the total, so that $I(\hat{x}_{\rm past}; x_{\rm future}) \le I_{\rm pred}(T)$. Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of $\hat{x}_{\rm past}$ such that $I(\hat{x}_{\rm past}; x_{\rm future}) = I_{\rm pred}(T)$. The entropy of this minimal $\hat{x}_{\rm past}$ is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
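For readers who want to see this tradeoff concretely, the following sketch traces the optimal curve for a small discrete toy problem using the iterative algorithm of Tishby, Pereira, and Bialek (1999); it is our illustration, not a construction from this paper, and the toy joint distribution, alphabet sizes, and values of the tradeoff parameter $\beta$ are arbitrary.

```python
# Sketch (ours) of the iterative information bottleneck: compress "past" x into
# xhat while preserving information about the "future" y, for a toy joint p(x, y).
# Sweeping beta traces the curve I(xhat; future) versus I(xhat; past).
import numpy as np

rng = np.random.default_rng(0)
nx, ny, nxh = 8, 4, 4
p_xy = rng.dirichlet(np.ones(ny), size=nx) / nx           # toy joint with uniform p(x)
p_x = p_xy.sum(1)
p_y_x = p_xy / p_xy.sum(1, keepdims=True)                 # conditionals p(y|x)

def mi(p_joint):
    """Mutual information (bits) of a 2D joint distribution."""
    p_a, p_b = p_joint.sum(1, keepdims=True), p_joint.sum(0, keepdims=True)
    nz = p_joint > 0
    return float((p_joint[nz] * np.log2(p_joint[nz] / (p_a * p_b)[nz])).sum())

def ib_point(beta, iters=300):
    q = rng.dirichlet(np.ones(nxh), size=nx)              # p(xhat|x), random start
    for _ in range(iters):
        p_xh = p_x @ q
        p_y_xh = (q * p_x[:, None]).T @ p_y_x / (p_xh[:, None] + 1e-30)
        d = (p_y_x[:, None, :] * (np.log(p_y_x[:, None, :] + 1e-30)
                                  - np.log(p_y_xh[None, :, :] + 1e-30))).sum(-1)
        logits = np.log(p_xh + 1e-30)[None, :] - beta * d # self-consistent IB update
        q = np.exp(logits - logits.max(1, keepdims=True))
        q /= q.sum(1, keepdims=True)
    joint_xxh = q * p_x[:, None]                          # p(x, xhat)
    joint_yxh = joint_xxh.T @ p_y_x                       # p(xhat, y)
    return mi(joint_xxh), mi(joint_yxh)

for beta in (0.5, 1.0, 2.0, 5.0, 20.0):
    i_past, i_future = ib_point(beta)
    print(f"beta={beta:5.1f}  I(xhat;past)={i_past:.3f}  I(xhat;future)={i_future:.3f} bits")
```

Sweeping $\beta$ from small to large values moves along the curve from compressions that keep little information about either past or future toward compressions that approach the full predictive information available in this toy problem.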


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet, one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of $N$ points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.¹⁷ It should be possible to test this directly by

¹⁷ If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,¹⁸ and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of $N$ successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

¹⁸ As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve $\Lambda(N)$.

The learning curve $\Lambda(N)$ exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing $\Lambda(N)$ to this limit defines an efficiency of learning, in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of $I_{\rm pred}(N)$. On the other hand, with a finite number of observations $N$, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with $I_{\rm pred}(N) \sim A N^{\nu}$ and one with $K$ parameters so that $I_{\rm pred}(N) \sim (K/2)\log_2 N$, the infinite parameter model has less predictive information for all $N$ smaller than some critical value,

$$
N_c \sim \left[\frac{K}{2A\nu}\,\log_2\left(\frac{K}{2A}\right)\right]^{1/\nu}.  \qquad (6.1)
$$

In the regime $N \ll N_c$, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have $N \ll N_c$, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
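A concrete feel for the crossover can be had by plugging in numbers. The sketch below (ours; the constants $A$, $\nu$, and $K$ are arbitrary) evaluates equation 6.1 and compares it with the exact crossing of the two expressions for $I_{\rm pred}(N)$; the two agree to within the leading-order accuracy of equation 6.1.

```python
# Illustration (ours) of the crossover in equation 6.1: below N_c the power-law
# class carries less predictive information than the K-parameter model.
import numpy as np
from scipy.optimize import brentq

A, nu, K = 1.0, 0.5, 1000.0
N_c = (K / (2 * A * nu) * np.log2(K / (2 * A)))**(1.0 / nu)            # equation 6.1
N_exact = brentq(lambda n: A * n**nu - 0.5 * K * np.log2(n), 2.0, 1e30)
print(f"N_c (eq. 6.1) ~ {N_c:.2e},  exact crossing ~ {N_exact:.2e}  (same order)")

for N in (1e3, 1e6, 1e9, 1e12):
    print(f"N={N:8.0e}   A N^nu = {A * N**nu:12.1f}   (K/2) log2 N = {0.5 * K * np.log2(N):10.1f}")
```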

It is tempting to suggest that the regime $N \ll N_c$ is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal


cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is $K \sim 5\times 10^9$. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most $N \sim 10^9$ examples. Remembering that we must have $\nu < 1$, even with large values of $A$, equation 6.1 suggests that we operate with $N < N_c$. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime $N \ll N_c$ is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use ofmodels that are beyond nite parameterization (in the sense that predictiveinformation diverges as a power law) and this comes from studies of lan-guage Long ago Shannon (1951) used the knowledge of native speakersto place bounds on the entropy of written English and his strategy madeexplicit use of predictability Shannon showed N-letter sequences to nativespeakers (readers) asked them to guess the next letter and recorded howmany guesses were required before they got the right answer Thus eachletter in the text is turned into a number and the entropy of the distributionof these numbers is an upper bound on the conditional entropy (N) (cfequation 38) Shannon himself thought that the convergence as N becomeslarge was rather quick and quoted an estimate of the extensive entropy perletter S0 Many years later Hilberg (1990) reanalyzed Shannonrsquos data andfound that the approach to extensivity in fact was very slow certainly thereis evidence fora large component S1(N) N12 and this may even dominatethe extensive component for accessible N Ebeling and Poschel (1994 seealso Poschel Ebeling amp Ros Acirce 1995) studied the statistics of letter sequencesin long texts (such as Moby Dick) and found the same strong subextensivecomponent It would be attractive to repeat Shannonrsquos experiments with adesign that emphasizes the detection of subextensive terms at large N19

In summary we believe that our analysis of predictive information solvesthe problem of measuring the complexity of time series This analysis uni-es ideas from learning theory coding theory dynamical systems and sta-tistical mechanics In particular we have focused attention on a class ofprocesses that are qualitatively more complex than those treated in conven-

19 Associated with the slow approach to extensivity is a large mutual informationbetween words or characters separated by long distances and several groups have foundthat this mutual information declines as a power law Cover and King (1978) criticize suchobservations bynoting that this behavior is impossible in Markovchains of arbitraryorderWhile it is possible that existing mutual information data have not reached asymptotia thecriticism of Cover and King misses the possibility that language is not a Markov processOf course it cannot be Markovian if it has a power-law divergence in the predictiveinformation

2458 W Bialek I Nemenman and N Tishby

tional learning theory and there are several reasons to think that this classincludes many examples of relevance to biology and cognition

Acknowledgments

We thank V Balasubramanian A Bell S Bruder C Callan A FairhallG Garcia de Polavieja Embid R Koberle A Libchaber A MelikidzeA Mikhailov O Motrunich M Opper R Rumiati R de Ruyter van Steven-inck N Slonim T Spencer S Still S Strong and A Treves for many helpfuldiscussions We also thank M Nemenman and R Rubin for help with thenumerical simulations and an anonymous referee for pointing out yet moreopportunities to connect our work with the earlier literature Our collabo-ration was aided in part by a grant from the US-Israel Binational ScienceFoundation to the Hebrew University of Jerusalem and work at Princetonwas supported in part by funds from NEC

References

Where available we give references to the Los Alamos endashprint archive Pa-pers may be retrieved online from the web site httpxxxlanlgovabswhere is the reference for example Adami and Cerf (2000) is found athttpxxxlanlgovabsadap-org9605002For preprints this is a primary ref-erence for published papers there may be differences between the publishedand electronic versions

Abarbanel H D I Brown R Sidorowich J J amp Tsimring L S (1993) Theanalysis of observed chaotic data in physical systems Revs Mod Phys 651331ndash1392

Adami C amp Cerf N J (2000) Physical complexity of symbolic sequencesPhysica D 137 62ndash69 See also adap-org9605002

Aida T (1999) Field theoretical analysis of onndashline learning of probability dis-tributions Phys Rev Lett 83 3554ndash3557 See also cond-mat9911474

Akaike H (1974a) Information theory and an extension of the maximum likeli-hood principle In B Petrov amp F Csaki (Eds) Second International Symposiumof Information Theory Budapest Akademia Kiedo

Akaike H (1974b)A new look at the statistical model identication IEEE TransAutomatic Control 19 716ndash723

Atick J J (1992) Could information theory provide an ecological theory ofsensory processing InW Bialek (Ed)Princeton lecturesonbiophysics(pp223ndash289) Singapore World Scientic

Attneave F (1954)Some informational aspects of visual perception Psych Rev61 183ndash193

Balasubramanian V (1997) Statistical inference Occamrsquos razor and statisticalmechanics on the space of probability distributions Neural Comp 9 349ndash368See also cond-mat9601030

Barlow H B (1959) Sensory mechanisms the reduction of redundancy andintelligence In D V Blake amp A M Uttley (Eds) Proceedings of the Sympo-

Predictability Complexity and Learning 2459

sium on the Mechanization of Thought Processes (Vol 2 pp 537ndash574) LondonH M Stationery Ofce

Barlow H B (1961) Possible principles underlying the transformation of sen-sory messages In W Rosenblith (Ed) Sensory Communication (pp 217ndash234)Cambridge MA MIT Press

Barlow H B (1983) Intelligence guesswork language Nature 304 207ndash209Barron A amp Cover T (1991) Minimum complexity density estimation IEEE

Trans Inf Thy 37 1034ndash1054Bennett C H (1985) Dissipation information computational complexity and

the denition of organization In D Pines (Ed)Emerging syntheses in science(pp 215ndash233) Reading MA AddisonndashWesley

Bennett C H (1990) How to dene complexity in physics and why InW H Zurek (Ed) Complexity entropy and the physics of information (pp 137ndash148) Redwood City CA AddisonndashWesley

Berry II M J Warland D K amp Meister M (1997) The structure and precisionof retinal spike trains Proc Natl Acad Sci (USA) 94 5411ndash5416

Bialek W (1995) Predictive information and the complexity of time series (TechNote) Princeton NJ NEC Research Institute

Bialek W Callan C G amp Strong S P (1996) Field theories for learningprobability distributions Phys Rev Lett 77 4693ndash4697 See also cond-mat9607180

Bialek W amp Tishby N (1999)Predictive information Preprint Available at cond-mat9902341

Bialek W amp Tishby N (in preparation) Extracting relevant informationBrenner N Bialek W amp de Ruyter vanSteveninck R (2000)Adaptive rescaling

optimizes information transmission Neuron 26 695ndash702Bruder S D (1998)Predictiveinformationin spiketrains fromtheblowy and monkey

visual systems Unpublished doctoral dissertation Princeton UniversityBrunel N amp Nadal P (1998) Mutual information Fisher information and

population coding Neural Comp 10 1731ndash1757Capaldi E A Robinson G E amp Fahrbach S E (1999) Neuroethology

of spatial learning The birds and the bees Annu Rev Psychol 50 651ndash682

Carpenter R H S amp Williams M L L (1995) Neural computation of loglikelihood in control of saccadic eye movements Nature 377 59ndash62

Cesa-Bianchi N amp Lugosi G (in press) Worst-case bounds for the logarithmicloss of predictors Machine Learning

Chaitin G J (1975) A theory of program size formally identical to informationtheory J Assoc Comp Mach 22 329ndash340

Clarke B S amp Barron A R (1990) Information-theoretic asymptotics of Bayesmethods IEEE Trans Inf Thy 36 453ndash471

Cover T M amp King R C (1978)A convergent gambling estimate of the entropyof English IEEE Trans Inf Thy 24 413ndash421

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Crutcheld J P amp Feldman D P (1997) Statistical complexity of simple 1-Dspin systems Phys Rev E 55 1239ndash1243R See also cond-mat9702191

2460 W Bialek I Nemenman and N Tishby

Crutcheld J P Feldman D P amp Shalizi C R (2000) Comment I onldquoSimple measure for complexityrdquo Phys Rev E 62 2996ndash2997 See also chao-dyn9907001

Crutcheld J P amp Shalizi C R (1999)Thermodynamic depth of causal statesObjective complexity via minimal representation Phys Rev E 59 275ndash283See also cond-mat9808147

Crutcheld J P amp Young K (1989) Inferring statistical complexity Phys RevLett 63 105ndash108

Eagleman D M amp Sejnowski T J (2000) Motion integration and postdictionin visual awareness Science 287 2036ndash2038

Ebeling W amp Poschel T (1994)Entropy and long-range correlations in literaryEnglish Europhys Lett 26 241ndash246

Feldman D P amp Crutcheld J P (1998) Measures of statistical complexityWhy Phys Lett A 238 244ndash252 See also cond-mat9708186

Gell-Mann M amp Lloyd S (1996) Information measures effective complexityand total information Complexity 2 44ndash52

de Gennes PndashG (1979) Scaling concepts in polymer physics Ithaca NY CornellUniversity Press

Grassberger P (1986) Toward a quantitative theory of self-generated complex-ity Int J Theor Phys 25 907ndash938

Grassberger P (1991) Information and complexity measures in dynamical sys-tems In H Atmanspacher amp H Schreingraber (Eds) Information dynamics(pp 15ndash33) New York Plenum Press

Hall P amp Hannan E (1988)On stochastic complexity and nonparametric den-sity estimation Biometrica 75 705ndash714

Haussler D Kearns M amp Schapire R (1994)Bounds on the sample complex-ity of Bayesian learning using information theory and the VC dimensionMachine Learning 14 83ndash114

Haussler D Kearns M Seung S amp Tishby N (1996)Rigorous learning curvebounds from statistical mechanics Machine Learning 25 195ndash236

Haussler D amp Opper M (1995) General bounds on the mutual informationbetween a parameter and n conditionally independent events In Proceedingsof the Eighth Annual Conference on ComputationalLearning Theory (pp 402ndash411)New York ACM Press

Haussler D amp Opper M (1997) Mutual information metric entropy and cu-mulative relative entropy risk Ann Statist 25 2451ndash2492

Haussler D amp Opper M (1998) Worst case prediction over sequences un-der log loss In G Cybenko D OrsquoLeary amp J Rissanen (Eds) The mathe-matics of information coding extraction and distribution New York Springer-Verlag

Hawken M J amp Parker A J (1991)Spatial receptive eld organization in mon-key V1 and its relationship to the cone mosaic InM S Landy amp J A Movshon(Eds) Computational models of visual processing (pp 83ndash93) Cambridge MAMIT Press

Herschkowitz D amp Nadal JndashP (1999)Unsupervised and supervised learningMutual information between parameters and observations Phys Rev E 593344ndash3360

Predictability Complexity and Learning 2461

Hilberg W (1990) The well-known lower bound of information in written lan-guage Is it a misinterpretation of Shannon experiments Frequenz 44 243ndash248 (in German)

Holy T E (1997) Analysis of data from continuous probability distributionsPhys Rev Lett 79 3545ndash3548 See also physics9706015

Ibragimov I amp Hasminskii R (1972) On an information in a sample about aparameter In Second International Symposium on Information Theory (pp 295ndash309) New York IEEE Press

Kang K amp Sompolinsky H (2001) Mutual information of population codes anddistancemeasures in probabilityspace Preprint Available at cond-mat0101161

Kemeney J G (1953)The use of simplicity in induction Philos Rev 62 391ndash315Kolmogoroff A N (1939) Sur lrsquointerpolation et extrapolations des suites sta-

tionnaires C R Acad Sci Paris 208 2043ndash2045Kolmogorov A N (1941) Interpolation and extrapolation of stationary random

sequences Izv Akad Nauk SSSR Ser Mat 5 3ndash14 (in Russian) Translation inA N Shiryagev (Ed) Selectedworks of A N Kolmogorov (Vol 23 pp 272ndash280)Norwell MA Kluwer

Kolmogorov A N (1965) Three approaches to the quantitative denition ofinformation Prob Inf Trans 1 4ndash7

Li M amp Vit Acircanyi P (1993) An Introduction to Kolmogorov complexity and its appli-cations New York Springer-Verlag

Lloyd S amp Pagels H (1988)Complexity as thermodynamic depth Ann Phys188 186ndash213

Logothetis N K amp Sheinberg D L (1996) Visual object recognition AnnuRev Neurosci 19 577ndash621

Lopes L L amp Oden G C (1987) Distinguishing between random and non-random events J Exp Psych Learning Memory and Cognition 13 392ndash400

Lopez-Ruiz R Mancini H L amp Calbet X (1995) A statistical measure ofcomplexity Phys Lett A 209 321ndash326

MacKay D J C (1992) Bayesian interpolation Neural Comp 4 415ndash447Nemenman I (2000) Informationtheory and learningA physicalapproach Unpub-

lished doctoral dissertation Princeton University See also physics0009032Nemenman I amp Bialek W (2001) Learning continuous distributions Simu-

lations with eld theoretic priors In T K Leen T G Dietterich amp V Tresp(Eds) Advances in neural information processing systems 13 (pp 287ndash295)MITPress Cambridge MA MIT Press See also cond-mat0009165

Opper M (1994) Learning and generalization in a two-layer neural networkThe role of the VapnikndashChervonenkis dimension Phys Rev Lett 72 2113ndash2116

Opper M amp Haussler D (1995) Bounds for predictive errors in the statisticalmechanics of supervised learning Phys Rev Lett 75 3772ndash3775

Periwal V (1997)Reparametrization invariant statistical inference and gravityPhys Rev Lett 78 4671ndash4674 See also hep-th9703135

Periwal V (1998) Geometrical statistical inference Nucl Phys B 554[FS] 719ndash730 See also adap-org9801001

Poschel T Ebeling W amp Ros Acirce H (1995) Guessing probability distributionsfrom small samples J Stat Phys 80 1443ndash1452

2462 W Bialek I Nemenman and N Tishby

Reinagel P amp Reid R C (2000) Temporal coding of visual information in thethalamus J Neurosci 20 5392ndash5400

Renyi A (1964) On the amount of information concerning an unknown pa-rameter in a sequence of observations Publ Math Inst Hungar Acad Sci 9617ndash625

Rieke F Warland D de Ruyter van Steveninck R amp Bialek W (1997) SpikesExploring the Neural Code (MIT Press Cambridge)

Rissanen J (1978) Modeling by shortest data description Automatica 14 465ndash471

Rissanen J (1984) Universal coding information prediction and estimationIEEE Trans Inf Thy 30 629ndash636

Rissanen J (1986) Stochastic complexity and modeling Ann Statist 14 1080ndash1100

Rissanen J (1987)Stochastic complexity J Roy Stat Soc B 49 223ndash239253ndash265Rissanen J (1989) Stochastic complexity and statistical inquiry Singapore World

ScienticRissanen J (1996) Fisher information and stochastic complexity IEEE Trans

Inf Thy 42 40ndash47Rissanen J Speed T amp Yu B (1992) Density estimation by stochastic com-

plexity IEEE Trans Inf Thy 38 315ndash323Saffran J R Aslin R N amp Newport E L (1996) Statistical learning by 8-

month-old infants Science 274 1926ndash1928Saffran J R Johnson E K Aslin R H amp Newport E L (1999) Statistical

learning of tone sequences by human infants and adults Cognition 70 27ndash52

Schmidt D M (2000) Continuous probability distributions from nite dataPhys Rev E 61 1052ndash1055

Seung H S Sompolinsky H amp Tishby N (1992)Statistical mechanics of learn-ing from examples Phys Rev A 45 6056ndash6091

Shalizi C R amp Crutcheld J P (1999) Computational mechanics Pattern andprediction structure and simplicity Preprint Available at cond-mat9907176

Shalizi C R amp Crutcheld J P (2000) Information bottleneck causal states andstatistical relevance bases How to represent relevant information in memorylesstransduction Available at nlinAO0006025

Shannon C E (1948) A mathematical theory of communication Bell Sys TechJ 27 379ndash423 623ndash656 Reprinted in C E Shannon amp W Weaver The mathe-matical theory of communication Urbana University of Illinois Press 1949

Shannon C E (1951) Prediction and entropy of printed English Bell Sys TechJ 30 50ndash64 Reprinted in N J A Sloane amp A D Wyner (Eds) Claude ElwoodShannon Collected papers New York IEEE Press 1993

Shiner J Davison M amp Landsberger P (1999) Simple measure for complexityPhys Rev E 59 1459ndash1464

Smirnakis S Berry III M J Warland D K Bialek W amp Meister M (1997)Adaptation of retinal processing to image contrast and spatial scale Nature386 69ndash73

Sole R V amp Luque B (1999) Statistical measures of complexity for strongly inter-acting systems Preprint Available at adap-org9909002

Predictability Complexity and Learning 2463

Solomonoff R J (1964) A formal theory of inductive inference Inform andControl 7 1ndash22 224ndash254

Strong S P Koberle R de Ruyter van Steveninck R amp Bialek W (1998)Entropy and information in neural spike trains Phys Rev Lett 80 197ndash200

Tishby N Pereira F amp Bialek W (1999) The information bottleneck methodIn B Hajek amp R S Sreenivas (Eds) Proceedings of the 37th Annual AllertonConference on Communication Control and Computing (pp 368ndash377) UrbanaUniversity of Illinois See also physics0004057

Vapnik V (1998) Statistical learning theory New York WileyVit Acircanyi P amp Li M (2000)Minimum description length induction Bayesianism

and Kolmogorov complexity IEEE Trans Inf Thy 46 446ndash464 See alsocsLG9901014

WallaceC Samp Boulton D M (1968)An information measure forclassicationComp J 11 185ndash195

Weigend A Samp Gershenfeld NA eds (1994)Time seriespredictionForecastingthe future and understanding the past Reading MA AddisonndashWesley

Wiener N (1949) Extrapolation interpolation and smoothing of time series NewYork Wiley

Wolfram S (1984) Computation theory of cellular automata Commun MathPhysics 96 15ndash57

Wong W amp Shen X (1995) Probability inequalities for likelihood ratios andconvergence rates of sieve MLErsquos Ann Statist 23 339ndash362

Received October 11 2000 accepted January 31 2001

Page 18: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as

2426 W Bialek I Nemenman and N Tishby

The different terms in the entropy equation 419 have a straightforwardinterpretation First we see that the extensive term in the entropy

S0 D Sx C12

log2(2p es2) (422)

reects contributions from the random choice of x and from the gaussiannoise in y these extensive terms are independent of the variations in pa-rameters reg and these would be the only terms if the parameters were notvarying (that is if there were nothing to learn) There also is a term thatreects the entropy of variations in the parameters themselves Sa Thisentropy is not invariant with respect to coordinate transformations in theparameter space but the term hlog2 ZAia compensates for this noninvari-ance Finally and most interesting for our purposes the subextensive pieceof the entropy is dominated by a logarithmic divergence

S1(N) K2

log2 N (bits) (423)

The coefcient of this divergence counts the number of parameters indepen-dent of the coordinate system that we choose in the parameter space Fur-thermore this result does not depend on the set of basis functions fwm (x)gThis is a hint that the result in equation 423 is more universal than oursimple example

42 Learning a Parameterized Distribution The problem discussedabove is an example of supervised learning We are given examples of howthe points xn map into yn and from these examples we are to induce theassociation or functional relation between x and y An alternative view isthat the pair of points (x y) should be viewed as a vector Ex and what we arelearning is the distribution of this vector The problem of learning a distri-bution usually is called unsupervised learning but in this case supervisedlearning formally is a special case of unsupervised learning if we admit thatall the functional relations or associations that we are trying to learn have anelement of noise or stochasticity then this connection between supervisedand unsupervised problems is quite general

Suppose a series of random vector variables fExig is drawn independentlyfrom the same probability distribution Q(Ex|reg) and this distribution de-pends on a (potentially innite dimensional) vector of parameters reg Asabove the parameters are unknown and before the series starts they arechosen randomly from a distribution P (reg) With no constraints on the den-sities P (reg) or Q(Ex|reg) it is impossible to derive any regression formulas forparameter estimation but one can still say something about the entropy ofthe data series and thus the predictive information For a nite-dimensionalvector of parameters reg the literature on bounding similar quantities isespecially rich (Haussler Kearns amp Schapire 1994 Wong amp Shen 1995Haussler amp Opper 1995 1997 and references therein) and related asymp-totic calculations have been done (Clarke amp Barron 1990 MacKay 1992Balasubramanian 1997)

Predictability Complexity and Learning 2427

We begin with the denition of entropy

S(N) acute S[fExig]

D iexclZ

dEx1 cent cent cent dExNP(Ex1 Ex2 ExN) log2 P(Ex1 Ex2 ExN) (424)

By analogy with equation 43 we then write

P(Ex1 Ex2 ExN) DZ

dKaP (reg)

NY

iD1Q(Exi |reg) (425)

Next combining the equations 424 and 425 and rearranging the order ofintegration we can rewrite S(N) as

S(N) D iexclZ

dK NregP ( Nreg)

pound

8lt

ZdEx1 cent cent cent dExN

NY

jD1Q(Exj | Nreg) log2 P(fExig)

9=

(426)

Equation 426 allows an easy interpretation There is the ldquotruerdquo set of pa-rameters Nreg that gave rise to the data sequence Ex1 ExN with the probabilityQN

jD1 Q(Exj | Nreg) We need to average log2 P(Ex1 ExN) rst over all possiblerealizations of the data keeping the true parameters xed and then over theparameters Nreg themselves With this interpretation in mind the joint prob-ability density the logarithm of which is being averaged can be rewrittenin the following useful way

P(Ex1 ExN) DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg)NY

iD1

microQ(Exi |reg)

Q(Exi | Nreg)

para

DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNEN(regI fExig)] (427)

EN (regI fExig) D iexcl1N

NX

iD1ln

microQ(Exi |reg)

Q(Exi | Nreg)

para (428)

Since by our interpretation Nreg are the true parameters that gave rise tothe particular data fExig we may expect that the empirical average in equa-tion 428 will converge to the corresponding expectation value so that

EN(regI fExig) D iexclZ

dDxQ(x | Nreg) lnmicroQ(Ex|reg)

Q(Ex| Nreg)

paraiexcl y (reg NregI fxig) (429)

where y 0 as N 1 here we neglect y and return to this term below

2428 W Bialek I Nemenman and N Tishby

The rst term on the right-hand side of equation 429 is the Kullback-Leibler (KL) divergence DKL( Nregkreg) between the true distribution charac-terized by parameters Nreg and the possible distribution characterized by regThus at large N we have

P(Ex1 Ex2 ExN)

fNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNDKL( Nregkreg)] (430)

where again the notation f reminds us that we are not only taking the limitof largeN but also making another approximationinneglecting uctuationsBy the same arguments as above we can proceed (formally) to compute theentropy of this distribution We nd

S(N) frac14 S0 cent N C S(a)1 (N) (431)

S0 DZ

dKaP (reg)

microiexcl

ZdDxQ(Ex|reg) log2 Q(Ex|reg)

para and (432)

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2

microZdK

aP(reg)eiexclNDKL ( Naka)para

(433)

Here S(a)1 is an approximation to S1 that neglects uctuations y This is the

same as the annealed approximation in the statistical mechanics of disor-dered systems as has been used widely in the study of supervised learn-ing problems (Seung Sompolinsky amp Tishby 1992) Thus we can identifythe particular data sequence Ex1 ExN with the disorder EN(regI fExig) withthe energy of the quenched system and DKL( Nregkreg) with its annealed ana-log

The extensive term S0 equation 432 is the average entropy of a dis-tribution in our family of possible distributions generalizing the result ofequation 422 The subextensive terms in the entropy are controlled by theN dependence of the partition function

Z( NregI N) DZ

dKaP (reg) exp [iexclNDKL( Nregkreg)] (434)

and Sa1(N) D iexclhlog2 Z( NregI N)i Na is analogous to the free energy Since what is

important in this integral is the KL divergence between different distribu-tions it is natural to ask about the density of models that are KL divergenceD away from the target Nreg

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)] (435)

Predictability Complexity and Learning 2429

This density could be very different for different targets6 The density ofdivergences is normalized because the original distribution over parameterspace P(reg) is normalized

ZdDr (DI Nreg) D

ZdK

aP (reg) D 1 (436)

Finally the partition function takes the simple form

Z( NregI N) DZ

dDr (DI Nreg) exp[iexclND] (437)

We recall that in statistical mechanics the partition function is given by

Z(b ) DZ

dEr (E) exp[iexclbE] (438)

where r (E) is the density of states that have energy E and b is the inversetemperature Thus the subextensive entropy in our learning problem isanalogous to a system in which energy corresponds to the KL divergencerelative to the target model and temperature is inverse to the number ofexamples As we increase the length N of the time series we have observedwe ldquocoolrdquo the system and hence probe models that approach the targetthe dynamics of this approach is determined by the density of low-energystates that is the behavior of r (DI Nreg) as D 07

The structure of the partition function is determined by a competitionbetween the (Boltzmann) exponential term which favors models with smallD and the density term which favorsvalues of D that can be achieved by thelargest possible number of models Because there (typically) are many pa-rameters there are very few models with D 0 This picture of competitionbetween the Boltzmann factor and a density of states has been emphasizedin previous work on supervised learning (Haussler et al 1996)

The behavior of the density of states r (DI Nreg) at small D is related tothe more intuitive notion of dimensionality In a parameterized family ofdistributions the KL divergence between two distributions with nearbyparameters is approximately a quadratic form

DKL ( Nregkreg) frac1412

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm) C cent cent cent (439)

6 If parameter space is compact then a related description of the space of targets basedon metric entropy also called Kolmogorovrsquos 2 -entropy is used in the literature (see forexample Haussler amp Opper 1997) Metric entropy is the logarithm of the smallest numberof disjoint parts of the size not greater than 2 into which the target space can be partitioned

7 It may be worth emphasizing that this analogy to statistical mechanics emerges fromthe analysis of the relevant probability distributions rather than being imposed on theproblem through some assumptions about the nature of the learning algorithm

2430 W Bialek I Nemenman and N Tishby

where NF is the Fisher information matrix Intuitively if we have a reason-able parameterization of the distributions then similar distributions will benearby in parameter space and more important points that are far apartin parameter space will never correspond to similar distributions Clarkeand Barron (1990) refer to this condition as the parameterization forminga ldquosoundrdquo family of distributions If this condition is obeyed then we canapproximate the low D limit of the density r (DI Nreg)

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)]

frac14Z

dKaP (reg)d

D iexcl

12

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm)

DZ

dKaP ( Nreg C U cent raquo)d

D iexcl

12

X

m

Lmj2m

(440)

where U is a matrix that diagonalizes F

( U T cent F cent U )mordm D Lmdmordm (441)

The delta function restricts the components of raquo in equation 440 to be oforder

pD or less and so if P(reg) is smooth we can make a perturbation

expansion After some algebra the leading term becomes

r (D 0I Nreg) frac14 P ( Nreg)2p

K2

C (K2)(det F )iexcl12 D(Kiexcl2)2 (442)

Here as before K is the dimensionality of the parameter vector Computingthe partition function from equation 437 we nd

Z( NregI N 1) frac14 f ( Nreg) cent C (K2)

NK2 (443)

where f ( Nreg) is some function of the target parameter values Finally thisallows us to evaluate the subextensive entropy from equations 433 and 434

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2 Z( NregI N) (444)

K2

log2 N C cent cent cent (bits) (445)

where cent cent cent are nite as N 1 Thus general K-parameter model classeshave the same subextensive entropy as for the simplest example consideredin the previous section To the leading order this result is independent even

Predictability Complexity and Learning 2431

of the prior distribution P (reg) on the parameter space so that the predictiveinformation seems to count the number of parameters under some verygeneral conditions (cf Figure 3 for a very different numerical example ofthe logarithmic behavior)

Although equation 445 is true under a wide range of conditions thiscannot be the whole story Much of modern learning theory is concernedwith the fact that counting parameters is not quite enough to characterize thecomplexity of a model class the naive dimension of the parameter space Kshould be viewed in conjunction with the pseudodimension (also known asthe shattering dimension or Vapnik-Chervonenkis dimension dVC) whichmeasures capacity of the model class and with the phase-space dimension dwhich accounts for volumes in the space of models (Vapnik 1998 Opper1994) Both of these dimensions can differ from the number of parametersOne possibility is that dVC is innite when the number of parameters isnitea problem discussed below Another possibility is that the determinant of Fis zero and hence both dVC and d are smaller than the number of parametersbecause we have adopted a redundant description This sort of degeneracycould occur over a nite fraction but not all of the parameter space andthis is one way to generate an effective fractional dimensionality One canimagine multifractal models such that the effective dimensionality variescontinuously over the parameter space but it is not obvious where thiswould be relevant Finally models with d lt dVC lt 1 also are possible (seefor example Opper 1994) and this list probably is not exhaustive

Equation 442 lets us actually dene the phase-space dimension throughthe exponent in the small DKL behavior of the model density

r (D 0I Nreg) D(diexcl2)2 (446)

and then d appears in place of K as the coefcient of the log divergencein S1(N) (Clarke amp Barron 1990 Opper 1994) However this simple con-clusion can fail in two ways First it can happen that the density r (DI Nreg)accumulates a macroscopic weight at some nonzero value of D so that thesmall D behavior is irrelevant for the large N asymptotics Second the uc-tuations neglected here may be uncontrollably large so that the asymptoticsare never reached Since controllability of uctuations is a function of dVC(see Vapnik 1998 Haussler Kearns amp Schapire 1994 and later in this arti-cle) we may summarize this in the following way Provided that the smallD behavior of the density function is the relevant one the coefcient ofthe logarithmic divergence of Ipred measures the phase-space or the scalingdimension d and nothing else This asymptote is valid however only forN Agrave dVC It is still an open question whether the two pathologies that canviolate this asymptotic behavior are related

43 Learning a Parameterized Process Consider a process where sam-ples are not independent and our task is to learn their joint distribution

2432 W Bialek I Nemenman and N Tishby

Q(Ex1 ExN |reg) Again reg is an unknown parameter vector chosen ran-domly at the beginning of the series If reg is a K-dimensional vector thenone still tries to learn just K numbers and there are still N examples evenif there are correlations Therefore although such problems are much moregeneral than those already considered it is reasonable to expect that thepredictive information still is measured by (K2) log2 N provided that someconditions are met

One might suppose that conditions for simple results on the predictiveinformation are very strong for example that the distribution Q is a nite-order Markov model In fact all we really need are the following two con-ditions

S [fExig | reg] acute iexclZ

dN Ex Q(fExig | reg) log2 Q(fExig | reg)

NS0 C Scurren0 I Scurren

0 D O(1) (447)

DKL [Q(fExig | Nreg)kQ(fExig|reg)] NDKL ( Nregkreg) C o(N) (448)

Here the quantities S0 Scurren0 and DKL ( Nregkreg) are dened by taking limits

N 1 in both equations The rst of the constraints limits deviations fromextensivity to be of order unity so that if reg is known there are no long-rangecorrelations in the data All of the long-range predictability is associatedwith learning the parameters8 The second constraint equation 448 is aless restrictive one and it ensures that the ldquoenergyrdquo of our statistical systemis an extensive quantity

With these conditions it is straightforward to show that the results of theprevious subsection carry over virtually unchanged With the same cautiousstatements about uctuations and the distinction between K d and dVC onearrives at the result

S(N) D S0 cent N C S(a)1 (N) (449)

S(a)1 (N) D

K2

log2 N C cent cent cent (bits) (450)

where cent cent cent stands for terms of order one as N 1 Note again that for theresult of equation 450 to be valid the process considered is not required tobe a nite-order Markov process Memory of all previous outcomes may bekept provided that the accumulated memory does not contribute a diver-gent term to the subextensive entropy

8 Suppose that we observe a gaussian stochastic process and try to learn the powerspectrum If the class of possible spectra includes ratios of polynomials in the frequency(rational spectra) then this condition is met On the other hand if the class of possiblespectra includes 1 f noise then the condition may not be met For more on long-rangecorrelations see below

Predictability Complexity and Learning 2433

It is interesting to ask what happens if the condition in equation 447 isviolated so that there are long-range correlations even in the conditional dis-tribution Q(Ex1 ExN |reg) Suppose for example that Scurren

0 D (Kcurren2) log2 NThen the subextensive entropy becomes

S(a)1 (N) D

K C Kcurren

2log2 N C cent cent cent (bits) (451)

We see that the subextensive entropy makes no distinction between pre-dictability that comes from unknown parameters and predictability thatcomes from intrinsic correlations in the data in this sense two models withthe same KC Kcurren are equivalent This actually must be so As an example con-sider a chain of Ising spins with long-range interactions in one dimensionThis system can order (magnetize) and exhibit long-range correlations andso the predictive information will diverge at the transition to ordering Inone view there is no global parameter analogous to regmdashjust the long-rangeinteractions On the other hand there are regimes in which we can approxi-mate the effect of these interactions by saying that all the spins experience amean eld that is constant across the whole length of the system and thenformally we can think of the predictive information as being carried by themean eld itself In fact there are situations in which this is not just anapproximation but an exact statement Thus we can trade a description interms of long-range interactions (Kcurren 6D 0 but K D 0) for one in which thereare unknown parameters describing the system but given these parametersthere are no long-range correlations (K 6D 0 Kcurren D 0) The two descriptionsare equivalent and this is captured by the subextensive entropy9

44 Taming the Fluctuations The preceding calculations of the subex-tensive entropy S1 are worthless unless we prove that the uctuations y arecontrollable We are going to discuss when and if this indeed happens Welimit the discussion to the case of nding a probability density (section 42)the case of learning a process (section 43) is very similar

Clarke and Barron (1990) solved essentially the same problem They didnot make a separation into the annealed and the uctuation term and thequantity they were interested in was a bit different from ours but inter-preting loosely they proved that modulo some reasonable technical as-sumptions on differentiability of functions in question the uctuation termalways approaches zero However they did not investigate the speed ofthis approach and we believe that they therefore missed some importantqualitative distinctions between different problems that can arise due to adifference between d and dVC In order to illuminate these distinctions wehere go through the trouble of analyzing uctuations all over again

9 There are a number of interesting questions about how the coefcients in the diverg-ing predictive information relate to the usual critical exponents and we hope to return tothis problem in a later article

2434 W Bialek I Nemenman and N Tishby

Returning to equations 427 429 and the denition of entropy we canwrite the entropy S(N) exactly as

S(N) D iexclZ

dK NaP ( Nreg)Z NY

jD1

pounddExj Q(Exj | Nreg)curren

pound log2

NY

iD1Q(Exi | Nreg)

ZdK

aP (reg)

pound eiexclNDKL ( Naka)CNy (a NaIfExig)

(452)

This expression can be decomposed into the terms identied above plus anew contribution to the subextensive entropy that comes from the uctua-tions alone S

(f)1 (N)

S(N) D S0 cent N C S(a)1 (N) C S(f)

1 (N) (453)

S(f)1 D iexcl

ZdK NaP ( Nreg)

NY

jD1

pounddExj Q(Exj | Nreg)

curren

pound log2

microZ dKaP (reg)

Z( NregI N)eiexclNDKL ( Naka)CNy (a NaIfExig)

para (454)

where y is dened as in equation 429 and Z as in equation 434Some loose but useful bounds can be established First the predictive in-

formation is a positive (semidenite) quantity and so the uctuation termmay not be smaller (more negative) than the value of iexclS

(a)1 as calculated

in equations 445 and 450 Second since uctuations make it more dif-cult to generalize from samples the predictive information should alwaysbe reduced by uctuations so that S(f) is negative This last statement cor-responds to the fact that for the statistical mechanics of disordered systemsthe annealed free energy always is less than the average quenched free en-ergy and may be proven rigorously by applying Jensenrsquos inequality to the(concave) logarithm function in equation 454 essentially the same argu-ment was given by Haussler and Opper (1997) A related Jensenrsquos inequalityargument allows us to show that the total S1(N) is bounded

S1(N) middot NZ

dKa

ZdK NaP (reg)P ( Nreg)DKL( Nregkreg)

acute hNDKL( Nregkreg)iNaa (455)

so that if we have a class of models (and a prior P (reg)) such that the aver-age KL divergence among pairs of models is nite then the subextensiveentropy necessarily is properly denedIn its turn niteness of the average

Predictability Complexity and Learning 2435

KL divergence or of similar quantities is a usual constraint in the learningproblems of this type (see for example Haussler amp Opper 1997) Note alsothat hDKL( Nregkreg)i Naa includes S0 as one of its terms so that usually S0 andS1 are well or ill dened together

Tighter bounds require nontrivial assumptions about the class of distri-butions considered The uctuation term would be zero if y were zero andy is the difference between an expectation value (KL divergence) and thecorresponding empirical mean There is a broad literature that deals withthis type of difference (see for example Vapnik 1998)

We start with the case when the pseudodimension (dVC) of the set ofprobability densities fQ(Ex | reg)g is nite Then for any reasonable functionF(ExI b ) deviations of the empirical mean from the expectation value can bebounded by probabilistic bounds of the form

P

(sup

b

shyshyshyshyshy

1N

Pj F(ExjI b ) iexcl R

dEx Q(Ex | Nreg) F(ExI b )

L[F]

shyshyshyshyshygt 2

)

lt M(2 N dVC)eiexclcN2 2 (456)

where c and L[F] depend on the details of the particular bound used Typi-cally c is a constant of order one and L[F] either is some moment of F or therange of its variation In our case F is the log ratio of two densities so thatL[F] may be assumed bounded for almost all b without loss of generalityin view of equation 455 In addition M(2 N dVC) is nite at zero growsat most subexponentially in its rst two arguments and depends exponen-tially on dVC Bounds of this form may have different names in differentcontextsmdashfor example Glivenko-Cantelli Vapnik-Chervonenkis Hoeffd-ing Chernoff (for review see Vapnik 1998 and the references therein)

To start the proof of niteness of S(f)1 in this case we rst show that

only the region reg frac14 Nreg is important when calculating the inner integral inequation 454 This statement is equivalent to saying that at large values ofreg iexcl Nreg the KL divergence almost always dominates the uctuation termthat is the contribution of sequences of fExig with atypically large uctua-tions is negligible (atypicality is dened as y cedil d where d is some smallconstant independent of N) Since the uctuations decrease as 1

pN (see

equation 456) and DKL is of order one this is plausible To show this webound the logarithm in equation 454 by N times the supremum value ofy Then we realize that the averaging over Nreg and fExig is equivalent to inte-gration over all possible values of the uctuations The worst-case densityof the uctuations may be estimated by differentiating equation 456 withrespect to 2 (this brings down an extra factor of N2 ) Thus the worst-casecontribution of these atypical sequences is

S(f)atypical1 raquo

Z 1

d

d2 N22 2M(2 )eiexclcN2 2raquo eiexclcNd2

iquest 1 for large N (457)

2436 W Bialek I Nemenman and N Tishby

This bound lets us focus our attention on the region reg frac14 Nreg We expandthe exponent of the integrand of equation 454 around this point and per-form a simple gaussian integration In principle large uctuations mightlead to an instability (positive or zero curvature) at the saddle point butthis is atypical and therefore is accounted for already Curvatures at thesaddle points of both numerator and denominator are of the same orderand throwing away unimportant additive and multiplicative constants oforder unity we obtain the following result for the contribution of typicalsequences

S(f)typical1 raquo

ZdK NaP ( Nreg) dN Ex

Y

jQ(Exj | Nreg) N (B Aiexcl1 B)I

Bm D1N

X

i

log Q(Exi | Nreg)

Nam

hBiEx D 0I

(A)mordm D1N

X

i

2 log Q(Exi | Nreg)

Nam Naordm

hAiEx D F (458)

Here hcent cent centiEx means an averaging with respect to all Exirsquos keeping Nreg constantOne immediately recognizes that B and A are respectively rst and secondderivatives of the empirical KL divergence that was in the exponent of theinner integral in equation 454

We are dealing now with typical cases Therefore large deviations of Afrom F are not allowed and we may bound equation 458 by replacing Aiexcl1

with F iexcl1(1Cd) whered again is independent of N Now we have to averagea bunch of products like

log Q(Exi | Nreg)

Nam

(F iexcl1)mordm log Q(Exj | Nreg)

Naordm

(459)

over all Exirsquos Only the terms with i D j survive the averaging There areK2N such terms each contributing of order Niexcl1 This means that the totalcontribution of the typical uctuations is bounded by a number of orderone and does not grow with N

This concludes the proof of controllability of uctuations for dVC lt 1One may notice that we never used the specic form of M(2 N dVC) whichis the only thing dependent on the precise value of the dimension Actuallya more thorough look at the proof shows that we do not even need thestrict uniform convergence enforced by the Glivenko-Cantelli bound Withsome modications the proof should still hold if there exist some a prioriimprobable values of reg and Nreg that lead to violation of the bound That isif the prior P (reg) has sufciently narrow support then we may still expectuctuations to be unimportant even for VC-innite problems

A proof of this can be found in the realm of the structural risk mini-mization (SRM) theory (Vapnik 1998) SRM theory assumes that an innite

Predictability Complexity and Learning 2437

structure C of nested subsets C1 frac12 C2 frac12 C3 frac12 cent cent cent can be imposed onto theset C of all admissible solutions of a learning problem such that C D

SCn

The idea is that having a nite number of observations N one is connedto the choices within some particular structure element Cn n D n(N) whenlooking for an approximation to the true solution this prevents overttingand poor generalization As the number of samples increases and one isable to distinguish within more and more complicated subsets n grows IfdVC for learning in any Cn n lt 1 is nite then one can show convergenceof the estimate to the true value as well as the absolute smallness of uc-tuations (Vapnik 1998) It is remarkable that this result holds even if thecapacity of the whole set C is not described by a nite dVC

In the context of SRM the role of the prior P(reg) is to induce a structureon the set of all admissible densities and the ght between the numberof samples N and the narrowness of the prior is precisely what determineshow the capacity of the current element of the structure Cn n D n(N) growswith N A rigorous proof of smallness of the uctuations can be constructedbased on well-known results as detailed elsewhere (Nemenman 2000)Here we focus on the question of how narrow the prior should be so thatevery structure element is of nite VC dimension and one can guaranteeeventual convergence of uctuations to zero

Consider two examples A variable x is distributed according to the fol-lowing probability density functions

Q(x|a) D1

p2p

expmicro

iexcl12

(x iexcl a)2para

x 2 (iexcl1I C1)I (460)

Q(x|a) Dexp (iexcl sin ax)

R 2p0 dx exp (iexcl sin ax)

x 2 [0I 2p ) (461)

Learning the parameter in the rst case is a dVC D 1 problem in the secondcase dVC D 1 In the rst example as we have shown one may constructa uniform bound on uctuations irrespective of the prior P(reg) The sec-ond one does not allow this Suppose that the prior is uniform in a box0 lt a lt amax and zero elsewhere with amax rather large Then for not toomany sample points N the data would be better tted not by some valuein the vicinity of the actual parameter but by some much larger value forwhich almost all data points are at the crests of iexcl sin ax Adding a new datapoint would not help until that best but wrong parameter estimate is lessthan amax10 So the uctuations are large and the predictive information is

10 Interestingly since for the model equation 461 KL divergence is bounded frombelow and from above for amax 1 the weight in r (DI Na) at small DKL vanishes and anite weight accumulates at some nonzero value of D Thus even putting the uctuationsaside the asymptotic behavior based on the phase-space dimension is invalidated asmentioned above

2438 W Bialek I Nemenman and N Tishby

small in this case Eventually however data points would overwhelm thebox size and the best estimate of a would swiftly approach the actual valueAt this point the argument of Clarke and Barron would become applicableand the leading behavior of the subextensive entropy would converge toits asymptotic value of (12) log N On the other hand there is no uniformbound on the value of N for which this convergence will occur it is guaran-teed only for N Agrave dVC which is never true if dVC D 1 For some sufcientlywide priors this asymptotically correct behavior would never be reachedin practice Furthermore if we imagine a thermodynamic limit where thebox size and the number of samples both become large then by anal-ogy with problems in supervised learning (Seung Sompolinsky amp Tishby1992 Haussler et al 1996) we expect that there can be sudden changesin performance as a function of the number of examples The argumentsof Clarke and Barron cannot encompass these phase transitions or ldquoAhardquophenomena A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler Kearns amp Schapire(1994) where the authors bounded predictive information-like quantitieswith loose but useful bounds explicitly proportional to dVC

While much of learning theory has focused on problems with nite VCdimension itmight be that the conventional scenario in which the number ofexamples eventually overwhelms the number of parameters or dimensionsis too weak to deal with many real-world problems Certainly in the presentcontext there is not only a quantitative but also a qualitative differencebetween reaching the asymptotic regime in just a few measurements orin many millions of them Finitely parameterizable models with nite orinnite dVC fall in essentially different universality classes with respect tothe predictive information

45 Beyond Finite ParameterizationGeneral Considerations The pre-vious sections have considered learning from time series where the underly-ing class of possible models is described with a nite number of parametersIf the number of parameters is not nite then in principle it is impossible tolearn anything unless there is some appropriate regularization of the prob-lem If we let the number of parameters stay nite but become large thenthere is more to be learned and correspondingly the predictive informationgrows in proportion to this number as in equation 445 On the other handif the number of parameters becomes innite without regularization thenthe predictive information should go to zero since nothing can be learnedWe should be able to see this happen in a regularized problem as the regu-larization weakens Eventually the regularization would be insufcient andthe predictive information would vanish The only way this can happen isif the subextensive term in the entropy grows more and more rapidly withN as we weaken the regularization until nally it becomes extensive at thepoint where learning becomes impossible More precisely if this scenariofor the breakdown of learning is to work there must be situations in which

Predictability Complexity and Learning 2439

the predictive information grows with N more rapidly than the logarithmicbehavior found in the case of nite parameterization

Subextensive terms in the entropy are controlled by the density of modelsas a function of their KL divergence to the target model If the models havenite VC and phase-space dimensions then this density vanishes for smalldivergences as r raquo D(diexcl2)2 Phenomenologically if we let the number ofparameters increase the density vanishes more and more rapidly We canimagine that beyond the class of nitely parameterizable problems thereis a class of regularized innite dimensional problems in which the densityr (D 0) vanishes more rapidly than any power of D As an example wecould have

r (D 0) frac14 A expmicro

iexclB

Dm

para m gt 0 (462)

that is an essential singularity at D D 0 For simplicity we assume that theconstants A and B can depend on the target model but that the nature ofthe essential singularity (m ) is the same everywhere Before providing anexplicit example let us explore the consequences of this behavior

From equation 437 we can write the partition function as

Z( NregI N) DZ

dDr (DI Nreg) exp[iexclND]

frac14 A( Nreg)Z

dD expmicro

iexclB( Nreg)

Dmiexcl ND

para

frac14 QA( Nreg) expmicro

iexcl12

m C 2

m C 1ln N iexcl C( Nreg)Nm (m C1)

para (463)

where in the last step we use a saddle-point or steepest-descent approxima-tion that is accurate at large N and the coefcients are

QA( Nreg) D A( Nreg)micro 2p m

1(m C1)

m C 1

para12

cent [B( Nreg)]1(2m C2) (464)

C( Nreg) D [B( Nreg)]1 (m C1)micro 1

mm (m C1) C m1 (m C1)

para (465)

Finally we can use equations 444 and 463 to compute the subextensiveterm in the entropy keeping only the dominant term at large N

S(a)1 (N)

1ln 2

hC( Nreg)iNaNm (m C1) (bits) (466)

where hcent cent centiNa denotes an average over all the target models

2440 W Bialek I Nemenman and N Tishby

This behavior of the rst subextensive term is qualitatively differentfrom everything we have observed so far A power-law divergence is muchstronger than a logarithmic one Therefore a lot more predictive informa-tion is accumulated in an ldquoinnite parameterrdquo (or nonparametric) systemthe system is intuitively and quantitatively much richer and more complex

Subextensive entropy also grows as a power law in a nitely parameter-izable system with a growing number of parameters For example supposethat we approximate the distribution of a random variable by a histogramwith K bins and we let K grow with the quantity of available samples asK raquo Nordm A straightforward generalization of equation 445 seems to implythen that S1(N) raquo Nordm log N (Hall amp Hannan 1988 Barron amp Cover 1991)While not necessarily wrong analogies of this type are dangerous If thenumber of parameters grows constantly then the scenario where the dataoverwhelm all the unknowns is far from certain Observations may providemuch less information about features that were introduced into the model atsome large N than about those that have existed since the very rst measure-ments Indeed in a K-parameter system the Nth sample point contributesraquo K2N bits to the subextensive entropy (cf equation 445) If K changesas mentioned the Nth example then carries raquo Nordmiexcl1 bits Summing this upover all samples we nd S

(a)1 raquo Nordm and if we let ordm D m (m C 1) we obtain

equation 466 note that the growth of the number of parameters is slowerthan N (ordm D m (m C 1) lt 1) which makes sense Rissanen Speed and Yu(1992) made a similar observation According to them for models with in-creasing number of parameters predictive codes which are optimal at eachparticular N (cf equation 38) provide a strictly shorter coding length thannonpredictive codes optimized for all data simultaneously This has to becontrasted with the nite-parameter model classes for which these codesare asymptotically equal

Power-law growth of the predictive information illustrates the pointmade earlier about the transition from learning more to nally learningnothing as the class of investigated models becomes more complex Asm increases the problem becomes richer and more complex and this isexpressed in the stronger divergence of the rst subextensive term of theentropy for xed large N the predictive information increases with m How-ever if m 1 the problem is too complex for learning in our model ex-ample the number of bins grows in proportion to the number of sampleswhich means that we are trying to nd too much detail in the underlyingdistribution As a result the subextensive term becomes extensive and stopscontributing to predictive information Thus at least to the leading orderpredictability is lost as promised

46 Beyond Finite Parameterization Example While literature onproblems in the logarithmic class is reasonably rich the research on es-tablishing the power-law behavior seems to be in its early stages Someauthors have found specic learning problems for which quantities similar

Predictability Complexity and Learning 2441

to but sometimes very nontrivially related to S1 are bounded by powerndashlaw functions (Haussler amp Opper 1997 1998 Cesa-Bianchi amp Lugosi inpress) Others have chosen to study nite-dimensional models for whichthe optimal number of parameters (usually determined by the MDL crite-rion of Rissanen 1989) grows as a power law (Hall amp Hannan 1988 Rissa-nen et al 1992) In addition to the questions raised earlier about the dangerof copying nite-dimensional intuition to the innite-dimensional setupthese are not examples of truly nonparametric Bayesian learning Insteadthese authors make use of a priori constraints to restrict learning to codesof particular structure (histogram codes) while a non-Bayesian inference isconducted within the class Without Bayesian averaging and with restric-tions on the coding strategy it may happen that a realization of the codelength is substantially different from the predictive information Similarconceptual problems plague even true nonparametric examples as consid-ered for example by Barron and Cover (1991) In summary we do notknow of a complete calculation in which a family of power-law scalings ofthe predictive information is derived from a Bayesian formulation

The discussion in the previous section suggests that we should lookfor power-law behavior of the predictive information in learning problemswhere rather than learning ever more precise values for a xed set of pa-rameters we learn a progressively more detailed descriptionmdasheffectivelyincreasing the number of parametersmdashas we collectmoredata One exampleof such a problem is learning the distribution Q(x) for a continuous variablex but rather than writing a parametric form of Q(x) we assume only thatthis function itself is chosen from some distribution that enforces a degreeof smoothness There are some natural connections of this problem to themethods of quantum eld theory (Bialek Callan amp Strong 1996) which wecan exploit to give a complete calculation of the predictive information atleast for a class of smoothness constraints

We write Q(x) = (1/l₀) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial \phi}{\partial x} \right)^{2} \right] \delta\left[ \frac{1}{l_0} \int dx\, e^{-\phi(x)} - 1 \right],    (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x₁, x₂, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.11
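To make this prior concrete, here is a minimal numerical sketch of drawing sample densities from equation 4.67. The grid, the box size L, and the particular values of ℓ and l₀ are illustrative assumptions rather than anything fixed by the text; the key point is that, under this discretization, the action depends only on derivatives of φ, so the increments of φ are independent Gaussians, and the delta function simply fixes the additive constant of φ.

import numpy as np

def sample_smooth_density(ell, l0=1.0, L=1.0, n_grid=1000, rng=None):
    """Draw one density Q(x) on [0, L] from the smoothness prior of equation 4.67.

    The "stretched string" action (ell/2) * int dx (dphi/dx)^2 makes the
    increments of phi independent Gaussians; the delta function fixes the
    additive constant (zero mode) of phi so that (1/l0) int dx exp(-phi) = 1.
    """
    rng = rng or np.random.default_rng()
    dx = L / (n_grid - 1)
    dphi = rng.normal(0.0, np.sqrt(dx / ell), size=n_grid - 1)
    phi = np.concatenate(([0.0], np.cumsum(dphi)))
    phi += np.log(np.sum(np.exp(-phi)) * dx / l0)   # enforce the normalization constraint
    x = np.linspace(0.0, L, n_grid)
    return x, np.exp(-phi) / l0

x, Q = sample_smooth_density(ell=0.05)              # smaller ell -> rougher Q(x)
print("normalization:", np.sum(Q) * (x[1] - x[0]))  # should be close to 1

Smaller ℓ allows rougher sample densities; this is the sense in which ℓ sets the smoothness scale that appears in the results below.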

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l₀) exp[−φ(x)], we can write the KL divergence as

D_{KL}[\bar\phi(x) \| \phi(x)] = \frac{1}{l_0} \int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)].    (4.68)

We want to compute the density

\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\bigl( D - D_{KL}[\bar\phi(x) \| \phi(x)] \bigr)    (4.69)

             = M \int [d\phi(x)]\, P[\phi(x)]\, \delta\bigl( MD - M D_{KL}[\bar\phi(x) \| \phi(x)] \bigr),    (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar\phi) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{KL})    (4.71)

             = M \int \frac{dz}{2\pi}\, \exp\left( izMD + \frac{izM}{l_0} \int dx\, \bar\phi(x) \exp[-\bar\phi(x)] \right)
               \times \int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar\phi(x)] \right).    (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)] \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar\phi(x)] \right)
 = \exp\left( -\frac{izM}{l_0} \int dx\, \bar\phi(x) \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM] \right),    (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{\ell}{2} \int dx \left( \frac{\partial \bar\phi}{\partial x} \right)^{2} + \frac{1}{2} \left( \frac{izM}{\ell l_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2].    (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} ≫ 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar\phi(x)]}{D} \right),    (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi \ell l_0}} \exp\left[ -\frac{\ell}{2} \int dx\, (\partial_x \bar\phi)^{2} \right] \times \int dx\, \exp[-\bar\phi(x)/2],    (4.76)

B[\bar\phi(x)] = \frac{1}{16\, \ell l_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^{2}.    (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \approx \frac{1}{2 \ln 2} \sqrt{\frac{\pi}{\ell l_0}}\, \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} N^{1/2},    (4.78)

where ⟨· · ·⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution, for which exp[−φ̄(x)] = l₀/L and hence ∫ dx exp[−φ̄(x)/2] = (l₀L)^{1/2}. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \approx \frac{1}{2 \ln 2} \sqrt{\pi N}\, \left( \frac{L}{\ell} \right)^{1/2} \ \mbox{bits}.    (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(ℓ/(NQ(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
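A quick numerical check of this counting argument is sketched below; the target density, the box size L, and the value of ℓ are illustrative assumptions. The number of bins is the integral of dx over the local bin size, which for a uniform Q(x) = 1/L reduces to √(NL/ℓ), the same N and L dependence as equation 4.79.

import numpy as np

def adaptive_bin_count(N, ell, Q, x):
    """Count adaptive bins when the local bin size is sqrt(ell / (N Q(x))):
    n_bins = int dx / size(x) = sqrt(N / ell) * int dx sqrt(Q(x))."""
    dx = x[1] - x[0]
    return np.sqrt(N / ell) * np.sum(np.sqrt(Q)) * dx

L, ell = 1.0, 0.05
x = np.linspace(0.0, L, 1000)
Q = np.ones_like(x) / L                        # uniform target density
for N in (10**2, 10**4, 10**6):
    print(N, adaptive_bin_count(N, ell, Q, x), np.sqrt(N * L / ell))  # the two columns agree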

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ~√N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities n ~ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information (all have power-law growth; smoother functions have smaller powers, with less to learn, and higher-dimensional problems have larger powers, with more to learn), but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.12 Rissanen (1984) and Clarke and Barron (1990) in full

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x₁, x₂, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x₁, x₂, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x₁, x₂, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log₂ N bits to the leading order.
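This (K/2) log₂ N behavior can be checked numerically in the simplest case. The sketch below is a minimal illustration, not taken from the text: it assumes a coin (K = 1 parameter) whose bias is drawn from a uniform prior, computes the exact entropy S(N) of the resulting joint distribution of N flips, and subtracts the extensive part N S₀, with S₀ = 1/(2 ln 2) bits the prior-averaged entropy per flip.

import math

def subextensive_entropy_bits(N):
    """S(N) - N*S0 for N coin flips whose bias has a uniform prior.

    A specific sequence with n1 heads has probability n1!(N - n1)!/(N + 1)!,
    and S0 = 1/(2 ln 2) bits is the per-flip entropy averaged over the prior.
    """
    s = 0.0
    for n1 in range(N + 1):
        log_p = math.lgamma(n1 + 1) + math.lgamma(N - n1 + 1) - math.lgamma(N + 2)
        log_mult = math.lgamma(N + 1) - math.lgamma(n1 + 1) - math.lgamma(N - n1 + 1)
        s -= math.exp(log_mult + log_p) * log_p / math.log(2)
    return s - N / (2 * math.log(2))

for N in (10, 100, 1000, 10000):
    print(N, round(subextensive_entropy_bits(N), 3), round(0.5 * math.log2(N), 3))

The difference between the two printed columns approaches a constant as N grows, which is the (K/2) log₂ N scaling with K = 1.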

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S₀ and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number N(N) of walks with N steps, we find that N(N) ~ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S₀ = log₂ z, a constant subextensive entropy log₂ A, and a divergent subextensive term S₁(N) → γ log₂ N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],    (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right),    (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters; again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T), plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures: the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).
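The iterative algorithm of Tishby, Pereira, and Bialek (1999) for tracing out this curve can be sketched in a few lines for a small discrete joint distribution of past and future. The sketch below is a minimal, self-contained version only: the toy joint distribution, the number of clusters, and the values of the trade-off parameter β are illustrative assumptions, not taken from the text.

import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=300, seed=0, eps=1e-12):
    """Self-consistent information bottleneck iterations: compress x (the past)
    into xhat while preserving information about y (the future).
    Returns I(xhat; x) and I(xhat; y) in bits."""
    rng = np.random.default_rng(seed)
    p_xy = p_xy / p_xy.sum()
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    p_y_given_x = p_xy / (p_x[:, None] + eps)

    q = rng.random((p_x.size, n_clusters))          # q(xhat | x), random start
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        q_xhat = p_x @ q                                         # q(xhat)
        joint = (q * p_x[:, None]).T @ p_y_given_x               # q(xhat, y)
        q_y_given_xhat = joint / (q_xhat[:, None] + eps)
        # D_KL[p(y|x) || q(y|xhat)] in nats; the overall scale of beta is arbitrary
        dkl = (p_y_given_x[:, None, :] *
               (np.log(p_y_given_x[:, None, :] + eps) -
                np.log(q_y_given_xhat[None, :, :] + eps))).sum(axis=2)
        q = q_xhat[None, :] * np.exp(-beta * dkl) + eps
        q /= q.sum(axis=1, keepdims=True)

    q_xhat = p_x @ q
    i_past = (p_x[:, None] * q *
              (np.log2(q + eps) - np.log2(q_xhat[None, :] + eps))).sum()
    joint = (q * p_x[:, None]).T @ p_y_given_x
    i_future = (joint * (np.log2(joint + eps) -
                         np.log2(q_xhat[:, None] * p_y[None, :] + eps))).sum()
    return i_past, i_future

# toy joint distribution for (past, future): the future is a noisy copy of the past
nx = 4
p_y_given_x = 0.7 * np.eye(nx) + 0.3 / nx
p_xy = p_y_given_x / nx
for beta in (1.0, 3.0, 10.0):
    print(beta, information_bottleneck(p_xy, n_clusters=2, beta=beta))

Sweeping β moves along the optimal curve: small β keeps few bits about the past (and hence about the future), while large β saturates what two clusters can retain.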

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ Ipred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = Ipred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ~ A N^ν and one with K parameters so that Ipred(N) ~ (K/2) log₂ N, the infinite-parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2A\nu} \log_2 \left( \frac{K}{2A} \right) \right]^{1/\nu}.    (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ~ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ~ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
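The arithmetic behind this estimate can be laid out in a few lines. This is an illustrative sketch only: the values of A and ν below are arbitrary assumptions (the text fixes neither), and the "16 waking hours per day" used to count fixations is likewise an assumption; K and the rate of fixations follow the rough numbers quoted above.

import math

def n_crossover(K, A, nu):
    """Equation 6.1: below N_c the power-law (nonparametric) model carries
    less predictive information than the K-parameter model."""
    return (K / (2 * A * nu) * math.log2(K / (2 * A))) ** (1.0 / nu)

K = 5e9                                   # synapses in ~10 mm^2 of inferotemporal cortex
N = 3 * 3600 * 16 * 365 * 10              # ~3 fixations/s, ~16 waking h/day, 10 years -> ~1e9
for A, nu in ((1.0, 0.5), (10.0, 0.5), (1.0, 0.9)):
    Nc = n_crossover(K, A, nu)
    print(f"A={A}, nu={nu}: N={N:.1e}, Nc={Nc:.1e}, N < Nc: {N < Nc}")

For any reasonable choice of A and ν < 1, N_c comes out many orders of magnitude above the ~10⁹ available examples, which is the sense of the claim that biological learning operates with N ≪ N_c.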

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the preceding N (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S₀. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S₁(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
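Shannon's guessing-game bound is easy to emulate with an artificial "reader." The sketch below is a rough illustration rather than a reconstruction of Shannon's protocol: the human reader is replaced by a simple character-frequency model, the file name corpus.txt is a placeholder for any long text, and the chosen context lengths are arbitrary. The entropy of the recorded guess numbers upper-bounds the conditional entropy of the next letter given the context.

import math
from collections import Counter, defaultdict

def guess_ranks(text, order=4):
    """Emulate Shannon's guessing game with an order-`order` character model:
    candidates are ranked by how often they followed the preceding `order`
    characters earlier in the text, and we record the rank of the true next
    character (1 = first guess correct)."""
    context_counts = defaultdict(Counter)
    ranks = []
    for i in range(order, len(text)):
        ctx, nxt = text[i - order:i], text[i]
        counts = context_counts[ctx]
        ordered = [c for c, _ in counts.most_common()]
        ranks.append(ordered.index(nxt) + 1 if nxt in counts else len(ordered) + 1)
        counts[nxt] += 1        # online update, like a reader accumulating context
    return ranks

def rank_entropy_bits(ranks):
    """Entropy of the guess-number distribution: an upper bound on the
    conditional entropy of the next letter given the preceding context."""
    n = len(ranks)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ranks).values())

text = open("corpus.txt").read().lower()   # placeholder file name for any long text
for order in (1, 2, 4, 8):
    print(order, round(rank_entropy_bits(guess_ranks(text, order)), 3), "bits/char")

Watching how slowly this bound decreases as the context length grows is a crude probe of the subextensive component discussed above.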

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the U.S.-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/, with the e-print reference appended; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.

Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.

Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.

Predictability Complexity and Learning 2461

Hilberg W (1990) The well-known lower bound of information in written lan-guage Is it a misinterpretation of Shannon experiments Frequenz 44 243ndash248 (in German)

Holy T E (1997) Analysis of data from continuous probability distributionsPhys Rev Lett 79 3545ndash3548 See also physics9706015

Ibragimov I amp Hasminskii R (1972) On an information in a sample about aparameter In Second International Symposium on Information Theory (pp 295ndash309) New York IEEE Press

Kang K amp Sompolinsky H (2001) Mutual information of population codes anddistancemeasures in probabilityspace Preprint Available at cond-mat0101161

Kemeney J G (1953)The use of simplicity in induction Philos Rev 62 391ndash315Kolmogoroff A N (1939) Sur lrsquointerpolation et extrapolations des suites sta-

tionnaires C R Acad Sci Paris 208 2043ndash2045Kolmogorov A N (1941) Interpolation and extrapolation of stationary random

sequences Izv Akad Nauk SSSR Ser Mat 5 3ndash14 (in Russian) Translation inA N Shiryagev (Ed) Selectedworks of A N Kolmogorov (Vol 23 pp 272ndash280)Norwell MA Kluwer

Kolmogorov A N (1965) Three approaches to the quantitative denition ofinformation Prob Inf Trans 1 4ndash7

Li M amp Vit Acircanyi P (1993) An Introduction to Kolmogorov complexity and its appli-cations New York Springer-Verlag

Lloyd S amp Pagels H (1988)Complexity as thermodynamic depth Ann Phys188 186ndash213

Logothetis N K amp Sheinberg D L (1996) Visual object recognition AnnuRev Neurosci 19 577ndash621

Lopes L L amp Oden G C (1987) Distinguishing between random and non-random events J Exp Psych Learning Memory and Cognition 13 392ndash400

Lopez-Ruiz R Mancini H L amp Calbet X (1995) A statistical measure ofcomplexity Phys Lett A 209 321ndash326

MacKay D J C (1992) Bayesian interpolation Neural Comp 4 415ndash447Nemenman I (2000) Informationtheory and learningA physicalapproach Unpub-

lished doctoral dissertation Princeton University See also physics0009032Nemenman I amp Bialek W (2001) Learning continuous distributions Simu-

lations with eld theoretic priors In T K Leen T G Dietterich amp V Tresp(Eds) Advances in neural information processing systems 13 (pp 287ndash295)MITPress Cambridge MA MIT Press See also cond-mat0009165

Opper M (1994) Learning and generalization in a two-layer neural networkThe role of the VapnikndashChervonenkis dimension Phys Rev Lett 72 2113ndash2116

Opper M amp Haussler D (1995) Bounds for predictive errors in the statisticalmechanics of supervised learning Phys Rev Lett 75 3772ndash3775

Periwal V (1997)Reparametrization invariant statistical inference and gravityPhys Rev Lett 78 4671ndash4674 See also hep-th9703135

Periwal V (1998) Geometrical statistical inference Nucl Phys B 554[FS] 719ndash730 See also adap-org9801001

Poschel T Ebeling W amp Ros Acirce H (1995) Guessing probability distributionsfrom small samples J Stat Phys 80 1443ndash1452

2462 W Bialek I Nemenman and N Tishby

Reinagel P amp Reid R C (2000) Temporal coding of visual information in thethalamus J Neurosci 20 5392ndash5400

Renyi A (1964) On the amount of information concerning an unknown pa-rameter in a sequence of observations Publ Math Inst Hungar Acad Sci 9617ndash625

Rieke F Warland D de Ruyter van Steveninck R amp Bialek W (1997) SpikesExploring the Neural Code (MIT Press Cambridge)

Rissanen J (1978) Modeling by shortest data description Automatica 14 465ndash471

Rissanen J (1984) Universal coding information prediction and estimationIEEE Trans Inf Thy 30 629ndash636

Rissanen J (1986) Stochastic complexity and modeling Ann Statist 14 1080ndash1100

Rissanen J (1987)Stochastic complexity J Roy Stat Soc B 49 223ndash239253ndash265Rissanen J (1989) Stochastic complexity and statistical inquiry Singapore World

ScienticRissanen J (1996) Fisher information and stochastic complexity IEEE Trans

Inf Thy 42 40ndash47Rissanen J Speed T amp Yu B (1992) Density estimation by stochastic com-

plexity IEEE Trans Inf Thy 38 315ndash323Saffran J R Aslin R N amp Newport E L (1996) Statistical learning by 8-

month-old infants Science 274 1926ndash1928Saffran J R Johnson E K Aslin R H amp Newport E L (1999) Statistical

learning of tone sequences by human infants and adults Cognition 70 27ndash52

Schmidt D M (2000) Continuous probability distributions from nite dataPhys Rev E 61 1052ndash1055

Seung H S Sompolinsky H amp Tishby N (1992)Statistical mechanics of learn-ing from examples Phys Rev A 45 6056ndash6091

Shalizi C R amp Crutcheld J P (1999) Computational mechanics Pattern andprediction structure and simplicity Preprint Available at cond-mat9907176

Shalizi C R amp Crutcheld J P (2000) Information bottleneck causal states andstatistical relevance bases How to represent relevant information in memorylesstransduction Available at nlinAO0006025

Shannon C E (1948) A mathematical theory of communication Bell Sys TechJ 27 379ndash423 623ndash656 Reprinted in C E Shannon amp W Weaver The mathe-matical theory of communication Urbana University of Illinois Press 1949

Shannon C E (1951) Prediction and entropy of printed English Bell Sys TechJ 30 50ndash64 Reprinted in N J A Sloane amp A D Wyner (Eds) Claude ElwoodShannon Collected papers New York IEEE Press 1993

Shiner J Davison M amp Landsberger P (1999) Simple measure for complexityPhys Rev E 59 1459ndash1464

Smirnakis S Berry III M J Warland D K Bialek W amp Meister M (1997)Adaptation of retinal processing to image contrast and spatial scale Nature386 69ndash73

Sole R V amp Luque B (1999) Statistical measures of complexity for strongly inter-acting systems Preprint Available at adap-org9909002

Predictability Complexity and Learning 2463

Solomonoff R J (1964) A formal theory of inductive inference Inform andControl 7 1ndash22 224ndash254

Strong S P Koberle R de Ruyter van Steveninck R amp Bialek W (1998)Entropy and information in neural spike trains Phys Rev Lett 80 197ndash200

Tishby N Pereira F amp Bialek W (1999) The information bottleneck methodIn B Hajek amp R S Sreenivas (Eds) Proceedings of the 37th Annual AllertonConference on Communication Control and Computing (pp 368ndash377) UrbanaUniversity of Illinois See also physics0004057

Vapnik V (1998) Statistical learning theory New York WileyVit Acircanyi P amp Li M (2000)Minimum description length induction Bayesianism

and Kolmogorov complexity IEEE Trans Inf Thy 46 446ndash464 See alsocsLG9901014

WallaceC Samp Boulton D M (1968)An information measure forclassicationComp J 11 185ndash195

Weigend A Samp Gershenfeld NA eds (1994)Time seriespredictionForecastingthe future and understanding the past Reading MA AddisonndashWesley

Wiener N (1949) Extrapolation interpolation and smoothing of time series NewYork Wiley

Wolfram S (1984) Computation theory of cellular automata Commun MathPhysics 96 15ndash57

Wong W amp Shen X (1995) Probability inequalities for likelihood ratios andconvergence rates of sieve MLErsquos Ann Statist 23 339ndash362

Received October 11 2000 accepted January 31 2001

Page 19: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as

Predictability Complexity and Learning 2427

We begin with the denition of entropy

S(N) acute S[fExig]

D iexclZ

dEx1 cent cent cent dExNP(Ex1 Ex2 ExN) log2 P(Ex1 Ex2 ExN) (424)

By analogy with equation 43 we then write

P(Ex1 Ex2 ExN) DZ

dKaP (reg)

NY

iD1Q(Exi |reg) (425)

Next combining the equations 424 and 425 and rearranging the order ofintegration we can rewrite S(N) as

S(N) D iexclZ

dK NregP ( Nreg)

pound

8lt

ZdEx1 cent cent cent dExN

NY

jD1Q(Exj | Nreg) log2 P(fExig)

9=

(426)

Equation 426 allows an easy interpretation There is the ldquotruerdquo set of pa-rameters Nreg that gave rise to the data sequence Ex1 ExN with the probabilityQN

jD1 Q(Exj | Nreg) We need to average log2 P(Ex1 ExN) rst over all possiblerealizations of the data keeping the true parameters xed and then over theparameters Nreg themselves With this interpretation in mind the joint prob-ability density the logarithm of which is being averaged can be rewrittenin the following useful way

P(Ex1 ExN) DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg)NY

iD1

microQ(Exi |reg)

Q(Exi | Nreg)

para

DNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNEN(regI fExig)] (427)

EN (regI fExig) D iexcl1N

NX

iD1ln

microQ(Exi |reg)

Q(Exi | Nreg)

para (428)

Since by our interpretation Nreg are the true parameters that gave rise tothe particular data fExig we may expect that the empirical average in equa-tion 428 will converge to the corresponding expectation value so that

EN(regI fExig) D iexclZ

dDxQ(x | Nreg) lnmicroQ(Ex|reg)

Q(Ex| Nreg)

paraiexcl y (reg NregI fxig) (429)

where y 0 as N 1 here we neglect y and return to this term below

2428 W Bialek I Nemenman and N Tishby

The rst term on the right-hand side of equation 429 is the Kullback-Leibler (KL) divergence DKL( Nregkreg) between the true distribution charac-terized by parameters Nreg and the possible distribution characterized by regThus at large N we have

P(Ex1 Ex2 ExN)

fNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNDKL( Nregkreg)] (430)

where again the notation f reminds us that we are not only taking the limitof largeN but also making another approximationinneglecting uctuationsBy the same arguments as above we can proceed (formally) to compute theentropy of this distribution We nd

S(N) frac14 S0 cent N C S(a)1 (N) (431)

S0 DZ

dKaP (reg)

microiexcl

ZdDxQ(Ex|reg) log2 Q(Ex|reg)

para and (432)

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2

microZdK

aP(reg)eiexclNDKL ( Naka)para

(433)

Here S(a)1 is an approximation to S1 that neglects uctuations y This is the

same as the annealed approximation in the statistical mechanics of disor-dered systems as has been used widely in the study of supervised learn-ing problems (Seung Sompolinsky amp Tishby 1992) Thus we can identifythe particular data sequence Ex1 ExN with the disorder EN(regI fExig) withthe energy of the quenched system and DKL( Nregkreg) with its annealed ana-log

The extensive term S0 equation 432 is the average entropy of a dis-tribution in our family of possible distributions generalizing the result ofequation 422 The subextensive terms in the entropy are controlled by theN dependence of the partition function

Z( NregI N) DZ

dKaP (reg) exp [iexclNDKL( Nregkreg)] (434)

and Sa1(N) D iexclhlog2 Z( NregI N)i Na is analogous to the free energy Since what is

important in this integral is the KL divergence between different distribu-tions it is natural to ask about the density of models that are KL divergenceD away from the target Nreg

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)] (435)

Predictability Complexity and Learning 2429

This density could be very different for different targets6 The density ofdivergences is normalized because the original distribution over parameterspace P(reg) is normalized

ZdDr (DI Nreg) D

ZdK

aP (reg) D 1 (436)

Finally the partition function takes the simple form

Z( NregI N) DZ

dDr (DI Nreg) exp[iexclND] (437)

We recall that in statistical mechanics the partition function is given by

Z(b ) DZ

dEr (E) exp[iexclbE] (438)

where r (E) is the density of states that have energy E and b is the inversetemperature Thus the subextensive entropy in our learning problem isanalogous to a system in which energy corresponds to the KL divergencerelative to the target model and temperature is inverse to the number ofexamples As we increase the length N of the time series we have observedwe ldquocoolrdquo the system and hence probe models that approach the targetthe dynamics of this approach is determined by the density of low-energystates that is the behavior of r (DI Nreg) as D 07

The structure of the partition function is determined by a competitionbetween the (Boltzmann) exponential term which favors models with smallD and the density term which favorsvalues of D that can be achieved by thelargest possible number of models Because there (typically) are many pa-rameters there are very few models with D 0 This picture of competitionbetween the Boltzmann factor and a density of states has been emphasizedin previous work on supervised learning (Haussler et al 1996)

The behavior of the density of states r (DI Nreg) at small D is related tothe more intuitive notion of dimensionality In a parameterized family ofdistributions the KL divergence between two distributions with nearbyparameters is approximately a quadratic form

DKL ( Nregkreg) frac1412

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm) C cent cent cent (439)

6 If parameter space is compact then a related description of the space of targets basedon metric entropy also called Kolmogorovrsquos 2 -entropy is used in the literature (see forexample Haussler amp Opper 1997) Metric entropy is the logarithm of the smallest numberof disjoint parts of the size not greater than 2 into which the target space can be partitioned

7 It may be worth emphasizing that this analogy to statistical mechanics emerges fromthe analysis of the relevant probability distributions rather than being imposed on theproblem through some assumptions about the nature of the learning algorithm

2430 W Bialek I Nemenman and N Tishby

where NF is the Fisher information matrix Intuitively if we have a reason-able parameterization of the distributions then similar distributions will benearby in parameter space and more important points that are far apartin parameter space will never correspond to similar distributions Clarkeand Barron (1990) refer to this condition as the parameterization forminga ldquosoundrdquo family of distributions If this condition is obeyed then we canapproximate the low D limit of the density r (DI Nreg)

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)]

frac14Z

dKaP (reg)d

D iexcl

12

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm)

DZ

dKaP ( Nreg C U cent raquo)d

D iexcl

12

X

m

Lmj2m

(440)

where U is a matrix that diagonalizes F

( U T cent F cent U )mordm D Lmdmordm (441)

The delta function restricts the components of raquo in equation 440 to be oforder

pD or less and so if P(reg) is smooth we can make a perturbation

expansion After some algebra the leading term becomes

r (D 0I Nreg) frac14 P ( Nreg)2p

K2

C (K2)(det F )iexcl12 D(Kiexcl2)2 (442)

Here as before K is the dimensionality of the parameter vector Computingthe partition function from equation 437 we nd

Z( NregI N 1) frac14 f ( Nreg) cent C (K2)

NK2 (443)

where f ( Nreg) is some function of the target parameter values Finally thisallows us to evaluate the subextensive entropy from equations 433 and 434

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2 Z( NregI N) (444)

K2

log2 N C cent cent cent (bits) (445)

where cent cent cent are nite as N 1 Thus general K-parameter model classeshave the same subextensive entropy as for the simplest example consideredin the previous section To the leading order this result is independent even

Predictability Complexity and Learning 2431

of the prior distribution P (reg) on the parameter space so that the predictiveinformation seems to count the number of parameters under some verygeneral conditions (cf Figure 3 for a very different numerical example ofthe logarithmic behavior)

Although equation 445 is true under a wide range of conditions thiscannot be the whole story Much of modern learning theory is concernedwith the fact that counting parameters is not quite enough to characterize thecomplexity of a model class the naive dimension of the parameter space Kshould be viewed in conjunction with the pseudodimension (also known asthe shattering dimension or Vapnik-Chervonenkis dimension dVC) whichmeasures capacity of the model class and with the phase-space dimension dwhich accounts for volumes in the space of models (Vapnik 1998 Opper1994) Both of these dimensions can differ from the number of parametersOne possibility is that dVC is innite when the number of parameters isnitea problem discussed below Another possibility is that the determinant of Fis zero and hence both dVC and d are smaller than the number of parametersbecause we have adopted a redundant description This sort of degeneracycould occur over a nite fraction but not all of the parameter space andthis is one way to generate an effective fractional dimensionality One canimagine multifractal models such that the effective dimensionality variescontinuously over the parameter space but it is not obvious where thiswould be relevant Finally models with d lt dVC lt 1 also are possible (seefor example Opper 1994) and this list probably is not exhaustive

Equation 442 lets us actually dene the phase-space dimension throughthe exponent in the small DKL behavior of the model density

r (D 0I Nreg) D(diexcl2)2 (446)

and then d appears in place of K as the coefcient of the log divergencein S1(N) (Clarke amp Barron 1990 Opper 1994) However this simple con-clusion can fail in two ways First it can happen that the density r (DI Nreg)accumulates a macroscopic weight at some nonzero value of D so that thesmall D behavior is irrelevant for the large N asymptotics Second the uc-tuations neglected here may be uncontrollably large so that the asymptoticsare never reached Since controllability of uctuations is a function of dVC(see Vapnik 1998 Haussler Kearns amp Schapire 1994 and later in this arti-cle) we may summarize this in the following way Provided that the smallD behavior of the density function is the relevant one the coefcient ofthe logarithmic divergence of Ipred measures the phase-space or the scalingdimension d and nothing else This asymptote is valid however only forN Agrave dVC It is still an open question whether the two pathologies that canviolate this asymptotic behavior are related

43 Learning a Parameterized Process Consider a process where sam-ples are not independent and our task is to learn their joint distribution

2432 W Bialek I Nemenman and N Tishby

Q(Ex1 ExN |reg) Again reg is an unknown parameter vector chosen ran-domly at the beginning of the series If reg is a K-dimensional vector thenone still tries to learn just K numbers and there are still N examples evenif there are correlations Therefore although such problems are much moregeneral than those already considered it is reasonable to expect that thepredictive information still is measured by (K2) log2 N provided that someconditions are met

One might suppose that conditions for simple results on the predictiveinformation are very strong for example that the distribution Q is a nite-order Markov model In fact all we really need are the following two con-ditions

S [fExig | reg] acute iexclZ

dN Ex Q(fExig | reg) log2 Q(fExig | reg)

NS0 C Scurren0 I Scurren

0 D O(1) (447)

DKL [Q(fExig | Nreg)kQ(fExig|reg)] NDKL ( Nregkreg) C o(N) (448)

Here the quantities S0 Scurren0 and DKL ( Nregkreg) are dened by taking limits

N 1 in both equations The rst of the constraints limits deviations fromextensivity to be of order unity so that if reg is known there are no long-rangecorrelations in the data All of the long-range predictability is associatedwith learning the parameters8 The second constraint equation 448 is aless restrictive one and it ensures that the ldquoenergyrdquo of our statistical systemis an extensive quantity

With these conditions it is straightforward to show that the results of theprevious subsection carry over virtually unchanged With the same cautiousstatements about uctuations and the distinction between K d and dVC onearrives at the result

S(N) D S0 cent N C S(a)1 (N) (449)

S(a)1 (N) D

K2

log2 N C cent cent cent (bits) (450)

where cent cent cent stands for terms of order one as N 1 Note again that for theresult of equation 450 to be valid the process considered is not required tobe a nite-order Markov process Memory of all previous outcomes may bekept provided that the accumulated memory does not contribute a diver-gent term to the subextensive entropy

8 Suppose that we observe a gaussian stochastic process and try to learn the powerspectrum If the class of possible spectra includes ratios of polynomials in the frequency(rational spectra) then this condition is met On the other hand if the class of possiblespectra includes 1 f noise then the condition may not be met For more on long-rangecorrelations see below

Predictability Complexity and Learning 2433

It is interesting to ask what happens if the condition in equation 447 isviolated so that there are long-range correlations even in the conditional dis-tribution Q(Ex1 ExN |reg) Suppose for example that Scurren

0 D (Kcurren2) log2 NThen the subextensive entropy becomes

S(a)1 (N) D

K C Kcurren

2log2 N C cent cent cent (bits) (451)

We see that the subextensive entropy makes no distinction between pre-dictability that comes from unknown parameters and predictability thatcomes from intrinsic correlations in the data in this sense two models withthe same KC Kcurren are equivalent This actually must be so As an example con-sider a chain of Ising spins with long-range interactions in one dimensionThis system can order (magnetize) and exhibit long-range correlations andso the predictive information will diverge at the transition to ordering Inone view there is no global parameter analogous to regmdashjust the long-rangeinteractions On the other hand there are regimes in which we can approxi-mate the effect of these interactions by saying that all the spins experience amean eld that is constant across the whole length of the system and thenformally we can think of the predictive information as being carried by themean eld itself In fact there are situations in which this is not just anapproximation but an exact statement Thus we can trade a description interms of long-range interactions (Kcurren 6D 0 but K D 0) for one in which thereare unknown parameters describing the system but given these parametersthere are no long-range correlations (K 6D 0 Kcurren D 0) The two descriptionsare equivalent and this is captured by the subextensive entropy9

44 Taming the Fluctuations The preceding calculations of the subex-tensive entropy S1 are worthless unless we prove that the uctuations y arecontrollable We are going to discuss when and if this indeed happens Welimit the discussion to the case of nding a probability density (section 42)the case of learning a process (section 43) is very similar

Clarke and Barron (1990) solved essentially the same problem They didnot make a separation into the annealed and the uctuation term and thequantity they were interested in was a bit different from ours but inter-preting loosely they proved that modulo some reasonable technical as-sumptions on differentiability of functions in question the uctuation termalways approaches zero However they did not investigate the speed ofthis approach and we believe that they therefore missed some importantqualitative distinctions between different problems that can arise due to adifference between d and dVC In order to illuminate these distinctions wehere go through the trouble of analyzing uctuations all over again

9 There are a number of interesting questions about how the coefcients in the diverg-ing predictive information relate to the usual critical exponents and we hope to return tothis problem in a later article

2434 W Bialek I Nemenman and N Tishby

Returning to equations 427 429 and the denition of entropy we canwrite the entropy S(N) exactly as

S(N) D iexclZ

dK NaP ( Nreg)Z NY

jD1

pounddExj Q(Exj | Nreg)curren

pound log2

NY

iD1Q(Exi | Nreg)

ZdK

aP (reg)

pound eiexclNDKL ( Naka)CNy (a NaIfExig)

(452)

This expression can be decomposed into the terms identied above plus anew contribution to the subextensive entropy that comes from the uctua-tions alone S

(f)1 (N)

S(N) D S0 cent N C S(a)1 (N) C S(f)

1 (N) (453)

S(f)1 D iexcl

ZdK NaP ( Nreg)

NY

jD1

pounddExj Q(Exj | Nreg)

curren

pound log2

microZ dKaP (reg)

Z( NregI N)eiexclNDKL ( Naka)CNy (a NaIfExig)

para (454)

where y is dened as in equation 429 and Z as in equation 434Some loose but useful bounds can be established First the predictive in-

formation is a positive (semidenite) quantity and so the uctuation termmay not be smaller (more negative) than the value of iexclS

(a)1 as calculated

in equations 445 and 450 Second since uctuations make it more dif-cult to generalize from samples the predictive information should alwaysbe reduced by uctuations so that S(f) is negative This last statement cor-responds to the fact that for the statistical mechanics of disordered systemsthe annealed free energy always is less than the average quenched free en-ergy and may be proven rigorously by applying Jensenrsquos inequality to the(concave) logarithm function in equation 454 essentially the same argu-ment was given by Haussler and Opper (1997) A related Jensenrsquos inequalityargument allows us to show that the total S1(N) is bounded

S1(N) middot NZ

dKa

ZdK NaP (reg)P ( Nreg)DKL( Nregkreg)

acute hNDKL( Nregkreg)iNaa (455)

so that if we have a class of models (and a prior P (reg)) such that the aver-age KL divergence among pairs of models is nite then the subextensiveentropy necessarily is properly denedIn its turn niteness of the average

Predictability Complexity and Learning 2435

KL divergence or of similar quantities is a usual constraint in the learningproblems of this type (see for example Haussler amp Opper 1997) Note alsothat hDKL( Nregkreg)i Naa includes S0 as one of its terms so that usually S0 andS1 are well or ill dened together

Tighter bounds require nontrivial assumptions about the class of distri-butions considered The uctuation term would be zero if y were zero andy is the difference between an expectation value (KL divergence) and thecorresponding empirical mean There is a broad literature that deals withthis type of difference (see for example Vapnik 1998)

We start with the case when the pseudodimension (dVC) of the set ofprobability densities fQ(Ex | reg)g is nite Then for any reasonable functionF(ExI b ) deviations of the empirical mean from the expectation value can bebounded by probabilistic bounds of the form

P

(sup

b

shyshyshyshyshy

1N

Pj F(ExjI b ) iexcl R

dEx Q(Ex | Nreg) F(ExI b )

L[F]

shyshyshyshyshygt 2

)

lt M(2 N dVC)eiexclcN2 2 (456)

where c and L[F] depend on the details of the particular bound used Typi-cally c is a constant of order one and L[F] either is some moment of F or therange of its variation In our case F is the log ratio of two densities so thatL[F] may be assumed bounded for almost all b without loss of generalityin view of equation 455 In addition M(2 N dVC) is nite at zero growsat most subexponentially in its rst two arguments and depends exponen-tially on dVC Bounds of this form may have different names in differentcontextsmdashfor example Glivenko-Cantelli Vapnik-Chervonenkis Hoeffd-ing Chernoff (for review see Vapnik 1998 and the references therein)

To start the proof of niteness of S(f)1 in this case we rst show that

only the region reg frac14 Nreg is important when calculating the inner integral inequation 454 This statement is equivalent to saying that at large values ofreg iexcl Nreg the KL divergence almost always dominates the uctuation termthat is the contribution of sequences of fExig with atypically large uctua-tions is negligible (atypicality is dened as y cedil d where d is some smallconstant independent of N) Since the uctuations decrease as 1

pN (see

equation 456) and DKL is of order one this is plausible To show this webound the logarithm in equation 454 by N times the supremum value ofy Then we realize that the averaging over Nreg and fExig is equivalent to inte-gration over all possible values of the uctuations The worst-case densityof the uctuations may be estimated by differentiating equation 456 withrespect to 2 (this brings down an extra factor of N2 ) Thus the worst-casecontribution of these atypical sequences is

S(f)atypical1 raquo

Z 1

d

d2 N22 2M(2 )eiexclcN2 2raquo eiexclcNd2

iquest 1 for large N (457)

2436 W Bialek I Nemenman and N Tishby

This bound lets us focus our attention on the region reg frac14 Nreg We expandthe exponent of the integrand of equation 454 around this point and per-form a simple gaussian integration In principle large uctuations mightlead to an instability (positive or zero curvature) at the saddle point butthis is atypical and therefore is accounted for already Curvatures at thesaddle points of both numerator and denominator are of the same orderand throwing away unimportant additive and multiplicative constants oforder unity we obtain the following result for the contribution of typicalsequences

S(f)typical1 raquo

ZdK NaP ( Nreg) dN Ex

Y

jQ(Exj | Nreg) N (B Aiexcl1 B)I

Bm D1N

X

i

log Q(Exi | Nreg)

Nam

hBiEx D 0I

(A)mordm D1N

X

i

2 log Q(Exi | Nreg)

Nam Naordm

hAiEx D F (458)

Here hcent cent centiEx means an averaging with respect to all Exirsquos keeping Nreg constantOne immediately recognizes that B and A are respectively rst and secondderivatives of the empirical KL divergence that was in the exponent of theinner integral in equation 454

We are dealing now with typical cases Therefore large deviations of Afrom F are not allowed and we may bound equation 458 by replacing Aiexcl1

with F iexcl1(1Cd) whered again is independent of N Now we have to averagea bunch of products like

log Q(Exi | Nreg)

Nam

(F iexcl1)mordm log Q(Exj | Nreg)

Naordm

(459)

over all Exirsquos Only the terms with i D j survive the averaging There areK2N such terms each contributing of order Niexcl1 This means that the totalcontribution of the typical uctuations is bounded by a number of orderone and does not grow with N

This concludes the proof of controllability of uctuations for dVC lt 1One may notice that we never used the specic form of M(2 N dVC) whichis the only thing dependent on the precise value of the dimension Actuallya more thorough look at the proof shows that we do not even need thestrict uniform convergence enforced by the Glivenko-Cantelli bound Withsome modications the proof should still hold if there exist some a prioriimprobable values of reg and Nreg that lead to violation of the bound That isif the prior P (reg) has sufciently narrow support then we may still expectuctuations to be unimportant even for VC-innite problems

A proof of this can be found in the realm of the structural risk mini-mization (SRM) theory (Vapnik 1998) SRM theory assumes that an innite

Predictability Complexity and Learning 2437

structure C of nested subsets C1 frac12 C2 frac12 C3 frac12 cent cent cent can be imposed onto theset C of all admissible solutions of a learning problem such that C D

SCn

The idea is that having a nite number of observations N one is connedto the choices within some particular structure element Cn n D n(N) whenlooking for an approximation to the true solution this prevents overttingand poor generalization As the number of samples increases and one isable to distinguish within more and more complicated subsets n grows IfdVC for learning in any Cn n lt 1 is nite then one can show convergenceof the estimate to the true value as well as the absolute smallness of uc-tuations (Vapnik 1998) It is remarkable that this result holds even if thecapacity of the whole set C is not described by a nite dVC

In the context of SRM the role of the prior P(reg) is to induce a structureon the set of all admissible densities and the ght between the numberof samples N and the narrowness of the prior is precisely what determineshow the capacity of the current element of the structure Cn n D n(N) growswith N A rigorous proof of smallness of the uctuations can be constructedbased on well-known results as detailed elsewhere (Nemenman 2000)Here we focus on the question of how narrow the prior should be so thatevery structure element is of nite VC dimension and one can guaranteeeventual convergence of uctuations to zero

Consider two examples A variable x is distributed according to the fol-lowing probability density functions

Q(x|a) D1

p2p

expmicro

iexcl12

(x iexcl a)2para

x 2 (iexcl1I C1)I (460)

Q(x|a) Dexp (iexcl sin ax)

R 2p0 dx exp (iexcl sin ax)

x 2 [0I 2p ) (461)

Learning the parameter in the rst case is a dVC D 1 problem in the secondcase dVC D 1 In the rst example as we have shown one may constructa uniform bound on uctuations irrespective of the prior P(reg) The sec-ond one does not allow this Suppose that the prior is uniform in a box0 lt a lt amax and zero elsewhere with amax rather large Then for not toomany sample points N the data would be better tted not by some valuein the vicinity of the actual parameter but by some much larger value forwhich almost all data points are at the crests of iexcl sin ax Adding a new datapoint would not help until that best but wrong parameter estimate is lessthan amax10 So the uctuations are large and the predictive information is

10 Interestingly since for the model equation 461 KL divergence is bounded frombelow and from above for amax 1 the weight in r (DI Na) at small DKL vanishes and anite weight accumulates at some nonzero value of D Thus even putting the uctuationsaside the asymptotic behavior based on the phase-space dimension is invalidated asmentioned above

2438 W Bialek I Nemenman and N Tishby

small in this case Eventually however data points would overwhelm thebox size and the best estimate of a would swiftly approach the actual valueAt this point the argument of Clarke and Barron would become applicableand the leading behavior of the subextensive entropy would converge toits asymptotic value of (12) log N On the other hand there is no uniformbound on the value of N for which this convergence will occur it is guaran-teed only for N Agrave dVC which is never true if dVC D 1 For some sufcientlywide priors this asymptotically correct behavior would never be reachedin practice Furthermore if we imagine a thermodynamic limit where thebox size and the number of samples both become large then by anal-ogy with problems in supervised learning (Seung Sompolinsky amp Tishby1992 Haussler et al 1996) we expect that there can be sudden changesin performance as a function of the number of examples The argumentsof Clarke and Barron cannot encompass these phase transitions or ldquoAhardquophenomena A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler Kearns amp Schapire(1994) where the authors bounded predictive information-like quantitieswith loose but useful bounds explicitly proportional to dVC

While much of learning theory has focused on problems with nite VCdimension itmight be that the conventional scenario in which the number ofexamples eventually overwhelms the number of parameters or dimensionsis too weak to deal with many real-world problems Certainly in the presentcontext there is not only a quantitative but also a qualitative differencebetween reaching the asymptotic regime in just a few measurements orin many millions of them Finitely parameterizable models with nite orinnite dVC fall in essentially different universality classes with respect tothe predictive information

45 Beyond Finite ParameterizationGeneral Considerations The pre-vious sections have considered learning from time series where the underly-ing class of possible models is described with a nite number of parametersIf the number of parameters is not nite then in principle it is impossible tolearn anything unless there is some appropriate regularization of the prob-lem If we let the number of parameters stay nite but become large thenthere is more to be learned and correspondingly the predictive informationgrows in proportion to this number as in equation 445 On the other handif the number of parameters becomes innite without regularization thenthe predictive information should go to zero since nothing can be learnedWe should be able to see this happen in a regularized problem as the regu-larization weakens Eventually the regularization would be insufcient andthe predictive information would vanish The only way this can happen isif the subextensive term in the entropy grows more and more rapidly withN as we weaken the regularization until nally it becomes extensive at thepoint where learning becomes impossible More precisely if this scenariofor the breakdown of learning is to work there must be situations in which

Predictability Complexity and Learning 2439

the predictive information grows with N more rapidly than the logarithmicbehavior found in the case of nite parameterization

Subextensive terms in the entropy are controlled by the density of modelsas a function of their KL divergence to the target model If the models havenite VC and phase-space dimensions then this density vanishes for smalldivergences as r raquo D(diexcl2)2 Phenomenologically if we let the number ofparameters increase the density vanishes more and more rapidly We canimagine that beyond the class of nitely parameterizable problems thereis a class of regularized innite dimensional problems in which the densityr (D 0) vanishes more rapidly than any power of D As an example wecould have

r (D 0) frac14 A expmicro

iexclB

Dm

para m gt 0 (462)

that is an essential singularity at D D 0 For simplicity we assume that theconstants A and B can depend on the target model but that the nature ofthe essential singularity (m ) is the same everywhere Before providing anexplicit example let us explore the consequences of this behavior

From equation 437 we can write the partition function as

Z( NregI N) DZ

dDr (DI Nreg) exp[iexclND]

frac14 A( Nreg)Z

dD expmicro

iexclB( Nreg)

Dmiexcl ND

para

frac14 QA( Nreg) expmicro

iexcl12

m C 2

m C 1ln N iexcl C( Nreg)Nm (m C1)

para (463)

where in the last step we use a saddle-point or steepest-descent approxima-tion that is accurate at large N and the coefcients are

QA( Nreg) D A( Nreg)micro 2p m

1(m C1)

m C 1

para12

cent [B( Nreg)]1(2m C2) (464)

C( Nreg) D [B( Nreg)]1 (m C1)micro 1

mm (m C1) C m1 (m C1)

para (465)

Finally we can use equations 444 and 463 to compute the subextensiveterm in the entropy keeping only the dominant term at large N

S(a)1 (N)

1ln 2

hC( Nreg)iNaNm (m C1) (bits) (466)

where hcent cent centiNa denotes an average over all the target models

2440 W Bialek I Nemenman and N Tishby

This behavior of the rst subextensive term is qualitatively differentfrom everything we have observed so far A power-law divergence is muchstronger than a logarithmic one Therefore a lot more predictive informa-tion is accumulated in an ldquoinnite parameterrdquo (or nonparametric) systemthe system is intuitively and quantitatively much richer and more complex

Subextensive entropy also grows as a power law in a nitely parameter-izable system with a growing number of parameters For example supposethat we approximate the distribution of a random variable by a histogramwith K bins and we let K grow with the quantity of available samples asK raquo Nordm A straightforward generalization of equation 445 seems to implythen that S1(N) raquo Nordm log N (Hall amp Hannan 1988 Barron amp Cover 1991)While not necessarily wrong analogies of this type are dangerous If thenumber of parameters grows constantly then the scenario where the dataoverwhelm all the unknowns is far from certain Observations may providemuch less information about features that were introduced into the model atsome large N than about those that have existed since the very rst measure-ments Indeed in a K-parameter system the Nth sample point contributesraquo K2N bits to the subextensive entropy (cf equation 445) If K changesas mentioned the Nth example then carries raquo Nordmiexcl1 bits Summing this upover all samples we nd S

(a)1 raquo Nordm and if we let ordm D m (m C 1) we obtain

equation 466 note that the growth of the number of parameters is slowerthan N (ordm D m (m C 1) lt 1) which makes sense Rissanen Speed and Yu(1992) made a similar observation According to them for models with in-creasing number of parameters predictive codes which are optimal at eachparticular N (cf equation 38) provide a strictly shorter coding length thannonpredictive codes optimized for all data simultaneously This has to becontrasted with the nite-parameter model classes for which these codesare asymptotically equal

Power-law growth of the predictive information illustrates the pointmade earlier about the transition from learning more to nally learningnothing as the class of investigated models becomes more complex Asm increases the problem becomes richer and more complex and this isexpressed in the stronger divergence of the rst subextensive term of theentropy for xed large N the predictive information increases with m How-ever if m 1 the problem is too complex for learning in our model ex-ample the number of bins grows in proportion to the number of sampleswhich means that we are trying to nd too much detail in the underlyingdistribution As a result the subextensive term becomes extensive and stopscontributing to predictive information Thus at least to the leading orderpredictability is lost as promised

46 Beyond Finite Parameterization Example While literature onproblems in the logarithmic class is reasonably rich the research on es-tablishing the power-law behavior seems to be in its early stages Someauthors have found specic learning problems for which quantities similar

Predictability Complexity and Learning 2441

to but sometimes very nontrivially related to S1 are bounded by powerndashlaw functions (Haussler amp Opper 1997 1998 Cesa-Bianchi amp Lugosi inpress) Others have chosen to study nite-dimensional models for whichthe optimal number of parameters (usually determined by the MDL crite-rion of Rissanen 1989) grows as a power law (Hall amp Hannan 1988 Rissa-nen et al 1992) In addition to the questions raised earlier about the dangerof copying nite-dimensional intuition to the innite-dimensional setupthese are not examples of truly nonparametric Bayesian learning Insteadthese authors make use of a priori constraints to restrict learning to codesof particular structure (histogram codes) while a non-Bayesian inference isconducted within the class Without Bayesian averaging and with restric-tions on the coding strategy it may happen that a realization of the codelength is substantially different from the predictive information Similarconceptual problems plague even true nonparametric examples as consid-ered for example by Barron and Cover (1991) In summary we do notknow of a complete calculation in which a family of power-law scalings ofthe predictive information is derived from a Bayesian formulation

The discussion in the previous section suggests that we should lookfor power-law behavior of the predictive information in learning problemswhere rather than learning ever more precise values for a xed set of pa-rameters we learn a progressively more detailed descriptionmdasheffectivelyincreasing the number of parametersmdashas we collectmoredata One exampleof such a problem is learning the distribution Q(x) for a continuous variablex but rather than writing a parametric form of Q(x) we assume only thatthis function itself is chosen from some distribution that enforces a degreeof smoothness There are some natural connections of this problem to themethods of quantum eld theory (Bialek Callan amp Strong 1996) which wecan exploit to give a complete calculation of the predictive information atleast for a class of smoothness constraints

We write Q(x) D (1 l0) exp[iexclw (x)] so that the positivity of the distribu-tion is automatic and then smoothness may be expressed by saying that theldquoenergyrdquo (or action) associated with a function w (x) is related to an integralover its derivatives like the strain energy in a stretched string The simplestpossibility following this line of ideas is that the distribution of functions isgiven by

P[w (x)] D1

Zexp

iexcl l

2

Zdx

sup3w

x

acute2d

micro 1l0

Zdx eiexclw (x) iexcl 1

para (467)

where Z is the normalization constant for P[w] the delta function ensuresthat each distribution Q(x) is normalized and l sets a scale for smooth-ness If distributions are chosen from this distribution then the optimalBayesian estimate of Q(x) from a set of samples x1 x2 xN converges tothe correct answer and the distribution at nite N is nonsingular so that theregularization provided by this prior is strong enough to prevent the devel-

2442 W Bialek I Nemenman and N Tishby

opment of singular peaks at the location of observed data points (Bialek etal 1996) Further developments of the theory including alternative choicesof P[w (x)] have been given by Periwal (1997 1998) Holy (1997) Aida (1999)and Schmidt (2000) for a detailed numerical investigation of this problemsee Nemenman and Bialek (2001) Our goal here is to be illustrative ratherthan exhaustive11

From the discussion above we know that the predictive information isrelated to the density of KL divergences and that the power-law behavior weare looking for comes from an essential singularity in this density functionThus we calculate r (D Nw ) in the model dened by equation 467

With Q(x) D (1 l0) exp[iexclw (x)] we can write the KL divergence as

DKL[ Nw (x)kw (x)] D1l0

Zdx exp[iexcl Nw (x)][w (x) iexcl Nw (x)] (468)

We want to compute the density

r (DI Nw ) DZ

[dw (x)]P[w (x)]diexclD iexcl DKL[ Nw (x)kw (x)]

cent(469)

D MZ

[dw (x)]P[w (x)]diexclMD iexcl MDKL[ Nw (x)kw (x)]

cent (470)

where we introduce a factor M which we will allow to become large so thatwe can focus our attention on the interesting limit D 0 To compute thisintegral over all functions w (x) we introduce a Fourier representation forthe delta function and then rearrange the terms

r (DI Nw ) D MZ dz

2pexp(izMD)

Z[dw (x)]P [w (x)] exp(iexclizMDKL) (471)

D MZ dz

2pexp

sup3izMD C

izMl0

Zdx Nw (x) exp[iexcl Nw (x)]

acute

poundZ

[dw (x)]P[w (x)]

pound expsup3

iexclizMl0

Zdx w (x) exp[iexcl Nw (x)]

acute (472)

The inner integral over the functions w (x) is exactly the integral evaluatedin the original discussion of this problem (Bialek Callan amp Strong 1996)in the limit that zM is large we can use a saddle-point approximationand standard eld-theoreticmethods allow us to compute the uctuations

11 We caution that our discussion in this section is less self-contained than in othersections Since the crucial steps exactly parallel those in the earlier work here we just givereferences

Predictability Complexity and Learning 2443

around the saddle point The result is that

Z[dw (x)]P[w (x)] exp

sup3iexcl

izMl0

Zdx w (x) exp[iexcl Nw (x)]

acute

D expsup3

iexclizMl0

Zdx Nw (x) exp[iexcl Nw (x)] iexcl Seff[ Nw (x)I zM]

acute (473)

Seff[ NwI zM] Dl2

Zdx

sup3 Nwx

acute2

C12

sup3 izMll0

acute12Zdx exp[iexcl Nw (x)2] (474)

Now we can do the integral over z again by a saddle-point method Thetwo saddle-point approximations are both valid in the limit that D 0 andMD32 1 we are interested precisely in the rst limit and we are free toset M as we wish so this gives us a good approximation for r (D 0I Nw )This results in

r (D 0I Nw ) D A[ Nw (x)]Diexcl32 expsup3

iexclB[ Nw (x)]D

acute (475)

A[ Nw (x)] D1

p16p ll0

expmicro

iexcll2

Zdx

iexclx Nw

cent2para

poundZ

dx exp[iexcl Nw (x)2] (476)

B[ Nw (x)] D1

16ll0

sup3Zdx exp[iexcl Nw (x)2]

acute2

(477)

Except for the factor of Diexcl32 this is exactly the sort of essential singularitythat we considered in the previous section with m D 1 The Diexcl32 prefactordoes not change the leading large N behavior of the predictive informationand we nd that

S(a)1 (N) raquo

1

2 ln 2p

ll0

frac12 Zdx exp[iexcl Nw (x)2]

frac34

NwN12 (478)

where hcent cent centi Nw denotes an average over the target distributions Nw (x) weightedonce again by P[ Nw (x)] Notice that if x is unbounded then the average inequation 478 is infrared divergent if instead we let the variable x range from0 to L then this average should be dominated by the uniform distributionReplacing the average by its value at this point we obtain the approximateresult

S(a)1 (N) raquo

12 ln 2

pN

sup3 Ll

acute12

bits (479)

To understand the result in equation 479 we recall that this eld-theoreticapproach is more or less equivalent to an adaptive binning procedure inwhich we divide the range of x into bins of local size

plNQ(x) (Bialek

2444 W Bialek I Nemenman and N Tishby

Callan amp Strong 1996) From this point of view equation 479 makes perfectsense the predictive information is directly proportional to the number ofbins that can be put in the range of x This also is in direct accord witha comment in the previous section that power-law behavior of predictiveinformationarises from the number ofparameters in the problemdependingon the number of samples

This counting of parameters allows a schematic argument about thesmallness of uctuations in this particular nonparametric problem If wetake the hint that at every step raquo

pN bins are being investigated then we

can imagine that the eld-theoretic prior in equation 467 has imposed astructure C on the set of all possible densities so that the set Cn is formed ofall continuous piecewise linear functions that have not more than n kinksLearning such functions for nite n is a dVC D n problem Now as N growsthe elements with higher capacities n raquo

pN are admitted The uctuations

in such a problem are known to be controllable (Vapnik 1998) as discussedin more detail elsewhere (Nemenman 2000)

One thing that remains troubling is that the predictive information de-pends on the arbitrary parameter l describing the scale of smoothness inthe distribution In the original work it was proposed that one should in-tegrate over possible values of l (Bialek Callan amp Strong 1996) Numericalsimulations demonstrate that this parameter can be learned from the datathemselves (Nemenman amp Bialek 2001) but perhaps even more interest-ing is a formulation of the problem by Periwal (1997 1998) which recoverscomplete coordinate invariance by effectively allowing l to vary with x Inthis case the whole theory has no length scale and there also is no need toconne the variable x to a box (here of size L) We expect that this coordinateinvariant approach will lead to a universal coefcient multiplying

pN in

the analog of equation 479 but this remains to be shownIn summary the eld-theoretic approach to learning a smooth distri-

bution in one dimension provides us with a concrete calculable exampleof a learning problem with power-law growth of the predictive informa-tion The scenario is exactly as suggested in the previous section where thedensity of KL divergences develops an essential singularity Heuristic con-siderations (Bialek Callan amp Strong 1996 Aida 1999) suggest that differentsmoothness penalties and generalizations to higher-dimensional problemswill have sensible effects on the predictive informationmdashall have power-law growth smoother functions have smaller powers (less to learn) andhigher-dimensional problems have larger powers (more to learn)mdashbut realcalculations for these cases remain challenging

5 Ipred as a Measure of Complexity

The problemof quantifying complexity isvery old(see Grassberger 1991 fora short review) Solomonoff (1964) Kolmogorov (1965) and Chaitin (1975)investigated a mathematically rigorous notion of complexity that measures

Predictability Complexity and Learning 2445

(roughly) the minimum length of a computer program that simulates theobserved time series (see also Li amp Vit Acircanyi 1993) Unfortunately there isno algorithm that can calculate the Kolmogorov complexity of all data setsIn addition algorithmic or Kolmogorov complexity is closely related to theShannon entropy which means that it measures something closer to ourintuitive concept of randomness than to the intuitive concept of complexity(as discussed for example by Bennett 1990) These problems have fueledcontinued research along two different paths representing two major mo-tivations for dening complexity First one would like to make precise animpression that some systemsmdashsuch as life on earth or a turbulent uidowmdashevolve toward a state of higher complexity and one would like tobe able to classify those states Second in choosing among different modelsthat describe an experiment one wants to quantify a preference for simplerexplanations or equivalently provide a penalty for complex models thatcan be weighed against the more conventional goodness-of-t criteria Webring our readers up to date with some developments in both of these di-rections and then discuss the role of predictive information as a measure ofcomplexity This also gives us an opportunity to discuss more carefully therelation of our results to previous work

51 Complexity of Statistical Models The construction of complexitypenalties for model selection is a statistics problem As far as we know how-ever the rst discussions of complexity in this context belong to the philo-sophical literature Even leaving aside the early contributions of William ofOccam on the need for simplicity Hume on the problem of induction andPopper on falsiability Kemeney (1953) suggested explicitly that it wouldbe possible to create a model selection criterion that balances goodness oft versus complexity Wallace and Boulton (1968) hinted that this balancemay result in the model with ldquothe briefest recording of all attribute informa-tionrdquo Although he probably had a somewhat different motivation Akaike(1974a 1974b) made the rst quantitative step along these lines His ad hoccomplexity term was independent of the number of data points and wasproportional to the number of free independent parameters in the model

These ideas were rediscovered and developed systematically by Rissa-nen in a series of papers starting from 1978 He has emphasized strongly(Rissanen 1984 1986 1987 1989) that tting a model to data represents anencoding of those data or predicting future data and that in searching foran efcient code we need to measure not only the number of bits required todescribe the deviations of the data from the modelrsquos predictions (goodnessof t) but also the number of bits required to specify the parameters of themodel (complexity) This specication has to be done to a precision sup-ported by the data12 Rissanen (1984) and Clarke and Barron (1990) in full

12 Within this framework Akaikersquos suggestion can be seen as coding the model to(suboptimal) xed precision

2446 W Bialek I Nemenman and N Tishby

generality were able to prove that the optimalencoding of a model requires acode with length asymptotically proportional to the number of independentparameters and logarithmically dependent on the number of data points wehave observed The minimal amount of space one needs to encode a datastring (minimum description length or MDL) within a certain assumedmodel class was termed by Rissanen stochastic complexity and in recentwork he refers to the piece of the stochastic complexity required for cod-ing the parameters as the model complexity (Rissanen 1996) This approachwas further strengthened by a recent result (Vit Acircanyi amp Li 2000) that an es-timation of parameters using the MDL principle is equivalent to Bayesianparameter estimations with a ldquouniversalrdquo prior (Li amp Vit Acircanyi 1993)

There should be a close connection between Rissanenrsquos ideas of encodingthe data stream and the subextensive entropy We are accustomed to the ideathat the average length of a code word for symbols drawn froma distributionP isgiven by the entropyof that distribution thus it is tempting to say that anencoding of a stream x1 x2 xN will require an amount of space equal tothe entropy of the joint distribution P(x1 x2 xN ) The situation here is abit moresubtle because the usual proofsof equivalence between code lengthand entropy rely on notions of typicality and asymptotics as we try to encodesequences of many symbols Here we already have N symbols and so it doesnot really make sense to talk about a stream of streams One can argue how-ever that atypical sequences are not truly random within a considered distri-bution since their codingby the methods optimized for the distribution is notoptimal So atypical sequences are better considered as typical ones comingfrom a different distribution (a point also made by Grassberger 1986) Thisallows us to identify properties of an observed (long) string with the proper-ties of the distribution it comes fromas Vit Acircanyi and Li (2000) did Ifwe acceptthis identication of entropy with code length then Rissanenrsquos stochasticcomplexity should be the entropy of the distribution P(x1 x2 xN )

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.
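
To make the (K/2) log_2 N statement concrete, here is a small numerical check under our own choice of a one-parameter (K = 1) Bernoulli family with a uniform prior; it is a sketch, not code from the paper. The entropy of the Bayesian mixture over N symbols, minus N times the average entropy rate, grows like (1/2) log_2 N up to an N-independent constant.

```python
import numpy as np
from scipy.special import gammaln

def subextensive_entropy(N):
    """S(N) - N*S_0 (bits) for a Bayesian mixture of Bernoulli(q) with a uniform prior on q."""
    # Every sequence with k ones has probability 1 / ((N+1) * C(N, k)).
    k = np.arange(N + 1)
    log_binom = gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)
    log_p_seq = -np.log(N + 1) - log_binom                 # natural log of one sequence's probability
    S_N = -np.sum(np.exp(log_binom + log_p_seq) * log_p_seq) / np.log(2)
    S_0 = 1.0 / (2 * np.log(2))                            # prior-averaged entropy rate, bits per symbol
    return S_N - N * S_0

for N in [10, 100, 1000, 10000]:
    # grows like (1/2) log2 N, offset by a constant set by the prior
    print(N, round(subextensive_entropy(N), 3), round(0.5 * np.log2(N), 3))
```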

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our


approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or not expressing complexity as an average over a physical ensemble, the critiques often are based on


a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger power-law divergence.
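
A minimal illustration of the finite case, using a two-state Markov chain of our own choosing (not an example from the paper): the block entropy minus N times the entropy rate, which is the subextensive part at block length N, sits at a finite constant instead of diverging.

```python
import itertools
import numpy as np

# Two-state Markov chain: T[i, j] = P(next = j | current = i)
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])   # stationary distribution (solves pi T = pi)

def block_entropy(N):
    """Entropy (bits) of length-N blocks, by explicit enumeration."""
    S = 0.0
    for seq in itertools.product([0, 1], repeat=N):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= T[a, b]
        S -= p * np.log2(p)
    return S

h = -sum(pi[i] * T[i, j] * np.log2(T[i, j]) for i in range(2) for j in range(2))  # entropy rate, bits/symbol
for N in range(1, 13):
    # the same finite value (the excess entropy) for every N: no divergence for this source
    print(N, round(block_entropy(N) - N * h, 6))
```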

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986) in particular has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.
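
The branching (weighted-sum) property can be checked directly; the three-outcome example below is our own, not the paper's.

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p1, p2, p3 = 0.5, 0.3, 0.2
direct = H([p1, p2, p3])
# First decide "group {1,2} versus 3", then decide within the group, weighted by its probability.
grouped = H([p1 + p2, p3]) + (p1 + p2) * H([p1 / (p1 + p2), p2 / (p1 + p2)])
print(direct, grouped)   # equal: the grouping property that, with the other postulates, singles out entropy
```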

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number N(N) of walks with N steps, we find that N(N) ≈ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) → γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_cont = − ∫ dx(t) P[x(t)] log_2 P[x(t)],    (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_rel = − ∫ dx(t) P[x(t)] log_2 ( P[x(t)] / Q[x(t)] ),    (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].
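
A small worked example of this invariance, with Gaussians of our own choosing: under a rescaling of the coordinate x, the differential entropy shifts, while the relative entropy between two distributions transformed together does not.

```python
import numpy as np

def h_gauss(sigma):
    """Differential entropy (bits) of a Gaussian with standard deviation sigma."""
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

def dkl_gauss(mu1, s1, mu2, s2):
    """Relative entropy (bits) between P = N(mu1, s1^2) and Q = N(mu2, s2^2)."""
    return (np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5) / np.log(2)

a = 7.3   # an arbitrary rescaling of the coordinate, x -> a*x
print(h_gauss(1.0), h_gauss(a * 1.0))                 # entropy shifts by log2(a): not coordinate invariant
print(dkl_gauss(0.0, 1.0, 0.5, 2.0),
      dkl_gauss(0.0, a * 1.0, a * 0.5, a * 2.0))      # relative entropy is unchanged
```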

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have


seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
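
For a concrete case where equality is achieved, consider a first-order Markov chain (our own toy example, not one from the paper): the most recent symbol is a sufficient statistic for prediction, so compressing the past down to that one symbol loses no predictive information. The sketch below checks this by exact enumeration.

```python
import itertools
import numpy as np

T = np.array([[0.9, 0.1], [0.4, 0.6]])   # P(next | current) for a two-state Markov chain
pi = np.array([0.8, 0.2])                # stationary distribution

def mutual_info(k, m, compress=False):
    """I(past; future) in bits for a past block of length k and a future block of length m.
    If compress=True, the past is reduced to its last symbol (a sufficient statistic here)."""
    joint = {}
    for seq in itertools.product([0, 1], repeat=k + m):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= T[a, b]
        past = seq[k - 1] if compress else seq[:k]
        joint[(past, seq[k:])] = joint.get((past, seq[k:]), 0.0) + p
    p_past, p_fut = {}, {}
    for (x, y), p in joint.items():
        p_past[x] = p_past.get(x, 0.0) + p
        p_fut[y] = p_fut.get(y, 0.0) + p
    return sum(p * np.log2(p / (p_past[x] * p_fut[y])) for (x, y), p in joint.items())

# The two values agree to numerical precision: the compressed past keeps all predictive information.
print(mutual_info(6, 6), mutual_info(6, 6, compress=True))
```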


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ≈ A N^ν and one with K parameters so that I_pred(N) ≈ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c ≈ [ K/(2Aν) log_2( K/(2A) ) ]^(1/ν).    (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
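
A quick numerical reading of equation 6.1, with illustrative values of K, A, and ν chosen by us (not taken from the paper): well below N_c the logarithmic (parametric) term exceeds the power law, and well above it the ordering reverses; equation 6.1 locates the crossover only up to asymptotic accuracy.

```python
import numpy as np

def crossover_N(K, A, nu):
    """N_c from equation 6.1 (illustrative helper, not from the paper)."""
    return (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1.0 / nu)

K, A, nu = 100.0, 1.0, 0.5          # illustrative values
Nc = crossover_N(K, A, nu)
print(f"N_c ~ {Nc:.2e}")
for N in [0.01 * Nc, 100 * Nc]:     # the two growth laws cross in the vicinity of N_c
    print(f"N = {N:.1e}   (K/2) log2 N = {0.5 * K * np.log2(N):.0f}   A N^nu = {A * N**nu:.0f}")
```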

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cor-


tex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
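
The estimate N ∼ 10^9 follows from simple arithmetic; the back-of-the-envelope restatement below is ours.

```python
saccades_per_second = 3
waking_seconds_per_year = 16 * 3600 * 365                  # roughly 16 waking hours a day
N = saccades_per_second * waking_seconds_per_year * 10     # ten years of fixations
print(f"N ~ {N:.1e}")                                      # about 6e8, i.e., of order 10^9
```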

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the preceding N letters (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^(1/2), and this may even dominate the extensive component for accessible N. Ebeling and Poschel (1994; see also Poschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
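
A rough sketch of Shannon's guessing-game bound, with a character bigram model standing in for the human reader (Shannon used people, who condition on much longer contexts); the corpus file name is a placeholder of ours.

```python
from collections import Counter, defaultdict
import math

text = open("corpus.txt").read().lower()   # placeholder: any long English text stands in for the source

# A bigram "reader": ranks candidate characters by P(next char | current char).
bigrams = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    bigrams[a][b] += 1

ranks = Counter()
for a, b in zip(text, text[1:]):
    ordering = [c for c, _ in bigrams[a].most_common()]
    ranks[ordering.index(b) + 1] += 1      # how many guesses until the right letter

total = sum(ranks.values())
H_rank = -sum((n / total) * math.log2(n / total) for n in ranks.values())
# Entropy of the guess-number distribution: an upper bound on the conditional entropy
# of the next letter given one letter of context (N = 1 in Shannon's scheme).
print(H_rank)
```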

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conven-

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


tional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.

Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Poschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Poschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

Page 20: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as

2428 W Bialek I Nemenman and N Tishby

The rst term on the right-hand side of equation 429 is the Kullback-Leibler (KL) divergence DKL( Nregkreg) between the true distribution charac-terized by parameters Nreg and the possible distribution characterized by regThus at large N we have

P(Ex1 Ex2 ExN)

fNY

jD1Q(Exj | Nreg)

ZdK

aP (reg) exp [iexclNDKL( Nregkreg)] (430)

where again the notation f reminds us that we are not only taking the limitof largeN but also making another approximationinneglecting uctuationsBy the same arguments as above we can proceed (formally) to compute theentropy of this distribution We nd

S(N) frac14 S0 cent N C S(a)1 (N) (431)

S0 DZ

dKaP (reg)

microiexcl

ZdDxQ(Ex|reg) log2 Q(Ex|reg)

para and (432)

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2

microZdK

aP(reg)eiexclNDKL ( Naka)para

(433)

Here S(a)1 is an approximation to S1 that neglects uctuations y This is the

same as the annealed approximation in the statistical mechanics of disor-dered systems as has been used widely in the study of supervised learn-ing problems (Seung Sompolinsky amp Tishby 1992) Thus we can identifythe particular data sequence Ex1 ExN with the disorder EN(regI fExig) withthe energy of the quenched system and DKL( Nregkreg) with its annealed ana-log

The extensive term S0 equation 432 is the average entropy of a dis-tribution in our family of possible distributions generalizing the result ofequation 422 The subextensive terms in the entropy are controlled by theN dependence of the partition function

Z( NregI N) DZ

dKaP (reg) exp [iexclNDKL( Nregkreg)] (434)

and Sa1(N) D iexclhlog2 Z( NregI N)i Na is analogous to the free energy Since what is

important in this integral is the KL divergence between different distribu-tions it is natural to ask about the density of models that are KL divergenceD away from the target Nreg

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)] (435)

Predictability Complexity and Learning 2429

This density could be very different for different targets6 The density ofdivergences is normalized because the original distribution over parameterspace P(reg) is normalized

ZdDr (DI Nreg) D

ZdK

aP (reg) D 1 (436)

Finally the partition function takes the simple form

Z( NregI N) DZ

dDr (DI Nreg) exp[iexclND] (437)

We recall that in statistical mechanics the partition function is given by

Z(b ) DZ

dEr (E) exp[iexclbE] (438)

where r (E) is the density of states that have energy E and b is the inversetemperature Thus the subextensive entropy in our learning problem isanalogous to a system in which energy corresponds to the KL divergencerelative to the target model and temperature is inverse to the number ofexamples As we increase the length N of the time series we have observedwe ldquocoolrdquo the system and hence probe models that approach the targetthe dynamics of this approach is determined by the density of low-energystates that is the behavior of r (DI Nreg) as D 07

The structure of the partition function is determined by a competitionbetween the (Boltzmann) exponential term which favors models with smallD and the density term which favorsvalues of D that can be achieved by thelargest possible number of models Because there (typically) are many pa-rameters there are very few models with D 0 This picture of competitionbetween the Boltzmann factor and a density of states has been emphasizedin previous work on supervised learning (Haussler et al 1996)

The behavior of the density of states r (DI Nreg) at small D is related tothe more intuitive notion of dimensionality In a parameterized family ofdistributions the KL divergence between two distributions with nearbyparameters is approximately a quadratic form

DKL ( Nregkreg) frac1412

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm) C cent cent cent (439)

6 If parameter space is compact then a related description of the space of targets basedon metric entropy also called Kolmogorovrsquos 2 -entropy is used in the literature (see forexample Haussler amp Opper 1997) Metric entropy is the logarithm of the smallest numberof disjoint parts of the size not greater than 2 into which the target space can be partitioned

7 It may be worth emphasizing that this analogy to statistical mechanics emerges fromthe analysis of the relevant probability distributions rather than being imposed on theproblem through some assumptions about the nature of the learning algorithm

2430 W Bialek I Nemenman and N Tishby

where NF is the Fisher information matrix Intuitively if we have a reason-able parameterization of the distributions then similar distributions will benearby in parameter space and more important points that are far apartin parameter space will never correspond to similar distributions Clarkeand Barron (1990) refer to this condition as the parameterization forminga ldquosoundrdquo family of distributions If this condition is obeyed then we canapproximate the low D limit of the density r (DI Nreg)

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)]

frac14Z

dKaP (reg)d

D iexcl

12

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm)

DZ

dKaP ( Nreg C U cent raquo)d

D iexcl

12

X

m

Lmj2m

(440)

where U is a matrix that diagonalizes F

( U T cent F cent U )mordm D Lmdmordm (441)

The delta function restricts the components of raquo in equation 440 to be oforder

pD or less and so if P(reg) is smooth we can make a perturbation

expansion After some algebra the leading term becomes

\[
\rho(D \to 0; \bar\alpha) \approx P(\bar\alpha)\, \frac{2\pi^{K/2}}{\Gamma(K/2)}\, (\det F)^{-1/2}\, D^{(K-2)/2} .
\tag{4.42}
\]

Here, as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

\[
Z(\bar\alpha; N \to \infty) \approx f(\bar\alpha)\, \frac{\Gamma(K/2)}{N^{K/2}} ,
\tag{4.43}
\]

where f(ᾱ) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

\[
S_1^{(a)}(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \log_2 Z(\bar\alpha; N)
\tag{4.44}
\]
\[
\to \frac{K}{2} \log_2 N + \cdots \quad {\rm (bits)} ,
\tag{4.45}
\]

where · · · are finite as N → ∞. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even


of the prior distribution P(α) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
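A minimal numerical consistency check of equations 4.43 through 4.45, again with the illustrative gaussian family assumed above (for which 2D is chi-square distributed with K degrees of freedom), is sketched below; it is not part of the argument, only a sanity check.

```python
import numpy as np
from scipy import integrate, stats

# Consistency check with the illustrative gaussian family assumed above: for K
# unit-variance gaussian means and a standard normal prior, 2*D is chi-square
# with K degrees of freedom, so rho(D) = 2 * chi2.pdf(2*D, K). Equations
# 4.43-4.45 then predict  -log2 Z(abar; N) -> (K/2) log2 N + const.
K = 4
rho = lambda D: 2.0 * stats.chi2.pdf(2.0 * D, df=K)

def neg_log2_Z(N):
    # substitute u = N*D so that the integrand varies on an O(1) scale
    Z, _ = integrate.quad(lambda u: rho(u / N) * np.exp(-u) / N, 0.0, np.inf)
    return -np.log2(Z)

Ns = [10, 100, 1_000, 10_000]
vals = [neg_log2_Z(N) for N in Ns]
slope = np.polyfit(np.log2(Ns), vals, 1)[0]
print("coefficient of log2 N:", slope, "  expected K/2 =", K / 2)
```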

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension d_VC), which measures capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of F is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < ∞ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small D_KL behavior of the model density,

\[
\rho(D \to 0; \bar\alpha) \propto D^{(d-2)/2} ,
\tag{4.46}
\]

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density ρ(D; ᾱ) accumulates a macroscopic weight at some nonzero value of D, so that the small D behavior is irrelevant for the large N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_pred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N ≫ d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.
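The definition in equation 4.46 also suggests a simple numerical estimate of d: histogram the Monte Carlo divergences at small D and fit the exponent. The sketch below does this for the same illustrative gaussian family assumed earlier; all numerical choices are ours.

```python
import numpy as np

# Illustration of equation 4.46 with the toy gaussian family assumed above:
# estimate the phase-space dimension d from the small-D behavior
# rho(D) ~ D^{(d-2)/2}.
rng = np.random.default_rng(5)
K = 4                                               # number of nondegenerate parameters
D = 0.5 * np.sum(rng.normal(size=(2_000_000, K)) ** 2, axis=1)

counts, edges = np.histogram(D[D < 0.05], bins=10)  # probe the D -> 0 region
centers = 0.5 * (edges[:-1] + edges[1:])
keep = counts > 0                                   # guard against empty bins
slope = np.polyfit(np.log(centers[keep]), np.log(counts[keep]), 1)[0]
print("fitted exponent (d-2)/2:", slope, "  estimated d:", 2 * slope + 2, "  true K:", K)
```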

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution


Q(x_1, x_2, ..., x_N | α). Again, α is an unknown parameter vector chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log_2 N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

\[
S[\{x_i\} \mid \alpha] \equiv -\int d^N x\, Q(\{x_i\} \mid \alpha) \log_2 Q(\{x_i\} \mid \alpha) \to N S_0 + S_0^* ; \qquad S_0^* = O(1) ,
\tag{4.47}
\]

\[
D_{\rm KL}\left[ Q(\{x_i\} \mid \bar\alpha) \,\|\, Q(\{x_i\} \mid \alpha) \right] \to N D_{\rm KL}(\bar\alpha \| \alpha) + o(N) .
\tag{4.48}
\]

Here the quantities S_0, S_0^*, and D_KL(ᾱ‖α) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data. All of the long-range predictability is associated with learning the parameters.^8 The second constraint, equation 4.48, is a less restrictive one; it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result

\[
S(N) = S_0 \cdot N + S_1^{(a)}(N) ,
\tag{4.49}
\]
\[
S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \quad {\rm (bits)} ,
\tag{4.50}
\]

where · · · stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(x_1, x_2, ..., x_N | α). Suppose, for example, that S_0^* = (K^*/2) log_2 N. Then the subextensive entropy becomes

\[
S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \quad {\rm (bits)} .
\tag{4.51}
\]

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation, but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters there are no long-range correlations (K ≠ 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.^9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

\[
S(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ dx_j\, Q(x_j \mid \bar\alpha) \right]
\log_2 \left[ \prod_{i=1}^{N} Q(x_i \mid \bar\alpha) \int d^K\alpha\, P(\alpha)\, e^{-N D_{\rm KL}(\bar\alpha \| \alpha) + N \psi(\alpha, \bar\alpha; \{x_i\})} \right] .
\tag{4.52}
\]

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

\[
S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N) ,
\tag{4.53}
\]

\[
S_1^{(f)} = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ dx_j\, Q(x_j \mid \bar\alpha) \right]
\log_2 \left[ \int \frac{d^K\alpha\, P(\alpha)}{Z(\bar\alpha; N)}\, e^{-N D_{\rm KL}(\bar\alpha \| \alpha) + N \psi(\alpha, \bar\alpha; \{x_i\})} \right] ,
\tag{4.54}
\]

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of −S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

\[
S_1(N) \le N \int d^K\alpha \int d^K\bar\alpha\, P(\alpha) P(\bar\alpha)\, D_{\rm KL}(\bar\alpha \| \alpha)
\equiv \left\langle N D_{\rm KL}(\bar\alpha \| \alpha) \right\rangle_{\bar\alpha, \alpha} ,
\tag{4.55}
\]

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average


KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱ,α} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(x | α)} is finite. Then for any reasonable function F(x; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

\[
P\left\{ \sup_\beta \frac{ \left| \frac{1}{N} \sum_j F(x_j; \beta) - \int dx\, Q(x \mid \bar\alpha)\, F(x; \beta) \right| }{ L[F] } > \epsilon \right\}
< M(\epsilon, N, d_{\rm VC})\, e^{-c N \epsilon^2} ,
\tag{4.56}
\]

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality, in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review, see Vapnik, 1998, and the references therein).
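To make the quantity being bounded concrete, the sketch below measures, for an illustrative gaussian location family (an assumption of ours, not part of the proof), the supremum over a parameter grid of the gap between the empirical mean of the log ratio and the exact KL divergence; it shrinks roughly as 1/√N, as equation 4.56 suggests.

```python
import numpy as np

# Toy illustration of the quantity bounded by equation 4.56 (our own example, a
# gaussian location family; the grid over the parameter stands in for sup_beta).
# F(x; a) = ln[Q(x|abar)/Q(x|a)], and the fluctuation is the gap between its
# empirical mean and the exact KL divergence D_KL(abar||a) = (abar - a)^2 / 2.
rng = np.random.default_rng(2)
abar = 0.0
a_grid = np.linspace(-3.0, 3.0, 61)

def sup_gap(N):
    x = rng.normal(abar, 1.0, size=N)
    gaps = []
    for a in a_grid:
        log_ratio = 0.5 * ((x - a) ** 2 - (x - abar) ** 2)   # ln Q(x|abar) - ln Q(x|a)
        gaps.append(abs(log_ratio.mean() - 0.5 * (abar - a) ** 2))
    return max(gaps)

for N in [100, 1_000, 10_000, 100_000]:
    print(N, sup_gap(N), "  compare 1/sqrt(N) =", N ** -0.5)
```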

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term, that is, the contribution of sequences of {x_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus the worst-case contribution of these atypical sequences is

\[
S_1^{(f),\,{\rm atypical}} \sim \int_\delta^\infty d\epsilon\, N^2 \epsilon^2\, M(\epsilon)\, e^{-c N \epsilon^2}
\sim e^{-c N \delta^2} \ll 1 \quad {\rm for\ large\ } N .
\tag{4.57}
\]


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

\[
\begin{aligned}
S_1^{(f),\,{\rm typical}} &\sim \int d^K\bar\alpha\, P(\bar\alpha)\, d^N x \prod_j Q(x_j \mid \bar\alpha)\; N \left( B^T \cdot A^{-1} \cdot B \right) ; \\
B_\mu &= \frac{1}{N} \sum_i \frac{\partial \log Q(x_i \mid \bar\alpha)}{\partial \bar\alpha_\mu} , \qquad \langle B \rangle_x = 0 ; \\
(A)_{\mu\nu} &= -\frac{1}{N} \sum_i \frac{\partial^2 \log Q(x_i \mid \bar\alpha)}{\partial \bar\alpha_\mu\, \partial \bar\alpha_\nu} , \qquad \langle A \rangle_x = F .
\end{aligned}
\tag{4.58}
\]

Here ⟨· · ·⟩_x means an averaging with respect to all x_i's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{−1} with F^{−1}(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

\[
\frac{\partial \log Q(x_i \mid \bar\alpha)}{\partial \bar\alpha_\mu}\, (F^{-1})_{\mu\nu}\, \frac{\partial \log Q(x_j \mid \bar\alpha)}{\partial \bar\alpha_\nu}
\tag{4.59}
\]

over all x_i's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N^{−1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ · · · can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

\[
Q(x \mid \alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2} (x - \alpha)^2 \right] , \qquad x \in (-\infty; +\infty) ;
\tag{4.60}
\]
\[
Q(x \mid \alpha) = \frac{ \exp(-\sin \alpha x) }{ \int_0^{2\pi} dx\, \exp(-\sin \alpha x) } , \qquad x \in [0; 2\pi) .
\tag{4.61}
\]

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < α_max and zero elsewhere, with α_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin αx. Adding a new data point would not help until that best, but wrong, parameter estimate is less than α_max.^10 So the fluctuations are large, and the predictive information is

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for α_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
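The pathology of the second example is easy to see in simulation. The rough sketch below is only illustrative: the true parameter, box size α_max, grid, and rejection sampler are all choices made here, and a maximum-likelihood scan over a grid stands in for the full Bayesian treatment. For small N, the best-fitting α typically lands at a spuriously large value, as described above.

```python
import numpy as np
from scipy import integrate

# Rough illustration of the example in equation 4.61 (a_true, a_max, grid,
# and the sampler below are choices made here, not taken from the paper).
rng = np.random.default_rng(3)
a_true, a_max = 1.0, 50.0

def log_Q(x, a):
    # log of the normalized density Q(x|a) = exp(-sin(a x)) / Z(a) on [0, 2*pi)
    Z, _ = integrate.quad(lambda t: np.exp(-np.sin(a * t)), 0.0, 2 * np.pi, limit=200)
    return -np.sin(a * x) - np.log(Z)

def sample(N):
    # crude rejection sampler for Q(x|a_true); the constant envelope e bounds exp(-sin)
    out = np.empty(0)
    while out.size < N:
        x = rng.uniform(0.0, 2 * np.pi, size=4 * N)
        accept = rng.uniform(0.0, np.e, size=x.size) < np.exp(-np.sin(a_true * x))
        out = np.concatenate([out, x[accept]])
    return out[:N]

a_grid = np.linspace(0.01, a_max, 2000)
for N in [10, 100, 1000]:
    x = sample(N)
    loglik = np.array([log_Q(x, a).sum() for a in a_grid])
    print(N, "best a on the grid:", round(float(a_grid[np.argmax(loglik)]), 2))
```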

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which


the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\[
\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right] , \qquad \mu > 0 ,
\tag{4.62}
\]

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

\[
\begin{aligned}
Z(\bar\alpha; N) &= \int dD\, \rho(D; \bar\alpha) \exp[-ND] \\
&\approx A(\bar\alpha) \int dD\, \exp\left[ -\frac{B(\bar\alpha)}{D^\mu} - ND \right] \\
&\approx \tilde A(\bar\alpha) \exp\left[ -\frac{1}{2}\, \frac{\mu + 2}{\mu + 1} \ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)} \right] ,
\end{aligned}
\tag{4.63}
\]

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\[
\tilde A(\bar\alpha) = A(\bar\alpha) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu + 1} \right]^{1/2} \cdot [B(\bar\alpha)]^{1/(2\mu+2)} ,
\tag{4.64}
\]
\[
C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right] .
\tag{4.65}
\]

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

\[
S_1^{(a)}(N) \to \frac{1}{\ln 2}\, \left\langle C(\bar\alpha) \right\rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \quad {\rm (bits)} ,
\tag{4.66}
\]

where ⟨· · ·⟩_ᾱ denotes an average over all the target models.
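A quick numerical sanity check of this saddle-point evaluation is sketched below: it compares direct integration of the middle line of equation 4.63 with the asymptotic form of equations 4.63 through 4.65, for arbitrary illustrative values of μ, A, and B chosen here.

```python
import numpy as np
from scipy import integrate

# Sanity check of the saddle-point result in equations 4.63-4.65
# (the values of mu, A, B below are arbitrary illustrative choices).
mu, A, B = 1.0, 1.0, 1.0

def log_Z_saddle(N):
    A_tilde = A * (2 * np.pi * mu ** (1 / (mu + 1)) / (mu + 1)) ** 0.5 \
                * B ** (1 / (2 * mu + 2))
    C = B ** (1 / (mu + 1)) * (mu ** (-mu / (mu + 1)) + mu ** (1 / (mu + 1)))
    return (np.log(A_tilde) - 0.5 * (mu + 2) / (mu + 1) * np.log(N)
            - C * N ** (mu / (mu + 1)))

def log_Z_numeric(N):
    D_star = (mu * B / N) ** (1 / (mu + 1))       # saddle point of B/D^mu + N*D
    peak = B / D_star ** mu + N * D_star          # exponent at the saddle
    val, _ = integrate.quad(
        lambda D: A * np.exp(peak - B / D ** mu - N * D),
        D_star / 100, 100 * D_star, points=[D_star], limit=200)
    return np.log(val) - peak

for N in [10, 100, 1_000, 10_000]:
    print(N, log_Z_numeric(N), log_Z_saddle(N))   # agreement improves as N grows
```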


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/2N bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^{ν−1} bits. Summing this up over all samples, we find S_1^{(a)} ∼ N^ν, and if we let ν = μ/(μ + 1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ + 1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing numbers of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
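The counting argument in the preceding paragraph can be checked with a few lines of arithmetic; the sketch below (purely illustrative) sums the per-sample contributions K(n)/2n with K(n) ∼ n^ν and recovers the N^ν scaling.

```python
import numpy as np

# Back-of-the-envelope check: if the n-th sample contributes ~ K(n)/(2n) bits
# and the number of bins grows as K(n) ~ n^nu, the accumulated subextensive
# entropy should scale as N^nu.
nu = 0.5
N_values = 10 ** np.arange(2, 7)

def S1(N_max):
    n = np.arange(1, N_max + 1, dtype=float)
    return np.sum(n ** nu / (2.0 * n))          # sum of per-sample contributions

totals = np.array([S1(int(N)) for N in N_values])
power = np.polyfit(np.log(N_values), np.log(totals), 1)[0]
print("fitted power:", power, "  expected nu =", nu)
```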

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar


to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/ℓ_0) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

\[
P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^2 \right]
\delta\!\left[ \frac{1}{\ell_0} \int dx\, e^{-\phi(x)} - 1 \right] ,
\tag{4.67}
\]

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.^11

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/ℓ_0) exp[−φ(x)], we can write the KL divergence as

\[
D_{\rm KL}[\bar\phi(x) \| \phi(x)] = \frac{1}{\ell_0} \int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)] .
\tag{4.68}
\]

We want to compute the density

\[
\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left( D - D_{\rm KL}[\bar\phi(x) \| \phi(x)] \right)
\tag{4.69}
\]
\[
= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left( MD - M D_{\rm KL}[\bar\phi(x) \| \phi(x)] \right) ,
\tag{4.70}
\]

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\[
\rho(D; \bar\phi) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{\rm KL})
\tag{4.71}
\]
\[
= M \int \frac{dz}{2\pi}\, \exp\!\left( izMD + \frac{izM}{\ell_0} \int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] \right)
\times \int [d\phi(x)]\, P[\phi(x)] \exp\!\left( -\frac{izM}{\ell_0} \int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right) .
\tag{4.72}
\]

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

\[
\int [d\phi(x)]\, P[\phi(x)] \exp\!\left( -\frac{izM}{\ell_0} \int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right)
= \exp\!\left( -\frac{izM}{\ell_0} \int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM] \right) ,
\tag{4.73}
\]
\[
S_{\rm eff}[\bar\phi; zM] = \frac{\ell}{2} \int dx \left( \frac{\partial\bar\phi}{\partial x} \right)^2
+ \frac{1}{2} \left( \frac{izM}{\ell \ell_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2] .
\tag{4.74}
\]

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\[
\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\!\left( -\frac{B[\bar\phi(x)]}{D} \right) ,
\tag{4.75}
\]
\[
A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\, \ell \ell_0}} \exp\!\left[ -\frac{\ell}{2} \int dx\, (\partial_x \bar\phi)^2 \right]
\times \int dx\, \exp[-\bar\phi(x)/2] ,
\tag{4.76}
\]
\[
B[\bar\phi(x)] = \frac{1}{16\, \ell \ell_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^2 .
\tag{4.77}
\]

Except for the factor of D^{−3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{−3/2} prefactor does not change the leading large N behavior of the predictive information, and we find that

\[
S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \frac{1}{\sqrt{\ell \ell_0}} \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} N^{1/2} ,
\tag{4.78}
\]

where ⟨· · ·⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

\[
S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \sqrt{N}\, \left( \frac{L}{\ell} \right)^{1/2} \quad {\rm bits} .
\tag{4.79}
\]

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(ℓ/[N Q(x)]) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
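Stated numerically: with bins of local size √(ℓ/[N Q(x)]), the number of bins that fit into [0, L] is ∫ dx √(N Q(x)/ℓ), which for the uniform distribution is √(NL/ℓ), the same N and L dependence as equation 4.79. The sketch below (with arbitrary illustrative values of ℓ and L) simply evaluates this integral.

```python
import numpy as np

# Illustration of the adaptive-binning count: bins of local size
# sqrt(ell / (N * Q(x))) tile [0, L], so the number of bins is the integral of
# 1/binsize, i.e. sqrt(N / ell) times the integral of sqrt(Q(x)) dx. For the
# uniform Q(x) = 1/L this is sqrt(N * L / ell), the scaling of equation 4.79.
ell, L = 0.1, 1.0
x = np.linspace(0.0, L, 10_001)
Q = np.full_like(x, 1.0 / L)                     # uniform target density on [0, L]
dx = x[1] - x[0]

for N in [100, 10_000, 1_000_000]:
    n_bins = np.sum(np.sqrt(N * Q / ell)) * dx   # Riemann sum for the integral of 1/binsize
    print(N, n_bins, "  vs sqrt(N*L/ell) =", np.sqrt(N * L / ell))
```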

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities, n ∼ √N, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). Real calculations for these cases, however, remain challenging.

5 I_pred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.^12 Rissanen (1984) and Clarke and Barron (1990) in full

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.
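This identification can be made concrete with a small illustration (a one-parameter Bernoulli example introduced here, not one treated in the text): the code length of a Bayes-mixture code with a uniform prior exceeds the code length evaluated at the best-fit parameter by roughly (1/2) log_2 N bits, the K = 1 instance of the parameter-coding term.

```python
import numpy as np
from scipy import special

# Illustrative Bernoulli example: the subextensive / parameter-coding part of
# the description length grows as (K/2) log2 N. For K = 1, the Bayes-mixture
# code (uniform prior) is longer than the best-fit code by about (1/2) log2 N.
rng = np.random.default_rng(4)
theta = 0.3

for N in [100, 1_000, 10_000, 100_000]:
    k = rng.binomial(N, theta)
    # code length of the Bayes mixture: -log2 of Integral theta^k (1-theta)^(N-k) dtheta
    mixture = -special.betaln(k + 1, N - k + 1) / np.log(2)
    # code length at the maximum-likelihood parameter
    p = k / N
    best_fit = -(k * np.log2(p) + (N - k) * np.log2(1 - p))
    print(N, mixture - best_fit, "  vs (1/2) log2 N =", 0.5 * np.log2(N))
```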

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on


a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.^13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.^14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;^15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.^16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number of walks \mathcal{N}(N) with N steps, we find that \mathcal{N}(N) ≈ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S0 = log2 z, a constant subextensive entropy log2 A, and a divergent subextensive term S1(N) → γ log2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
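As a quick illustration of this decomposition (ours, not the paper's), one can tabulate S(N) = log2[A N^γ z^N] and watch the extensive, logarithmically divergent, and constant pieces separate; the values of z, A, and γ below are made up for the example and carry no physical significance.

import numpy as np

z, A, gamma = 3.0, 2.0, 1.3          # made-up values for the connective constant, amplitude, and exponent

def walk_entropy_bits(N):
    # S(N) = log2[ A * N**gamma * z**N ], written as its three pieces.
    return N * np.log2(z) + gamma * np.log2(N) + np.log2(A)

for N in (10, 100, 1000, 10000):
    extensive = N * np.log2(z)       # grows linearly in N and is not universal
    divergent = gamma * np.log2(N)   # grows logarithmically; the universal piece
    constant  = np.log2(A)           # stays finite
    print(N, walk_entropy_bits(N), extensive, divergent, constant)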

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)] ,   (5.1)

but this is not well defined, because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2\!\left( \frac{P[x(t)]}{Q[x(t)]} \right) ,   (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters; again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information.


We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about the dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ Ipred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = Ipred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
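A minimal sketch of this compression idea (our illustration, not the authors' code): the self-consistent information bottleneck equations of Tishby, Pereira, and Bialek (1999) applied to a toy binary Markov chain, with the past defined as the previous two symbols and the future as the next symbol. Sweeping the trade-off parameter beta traces out points on the curve of I(x̂_past; x_future) versus I(x̂_past; x_past); the source distribution, the number of clusters, and the beta values are all arbitrary choices made for the example.

import numpy as np

rng = np.random.default_rng(0)

# Toy source: a binary first-order Markov chain; the "past" x is the previous two
# symbols and the "future" y is the next symbol (all of these choices are illustrative).
p_flip = 0.2
trans = np.array([[1 - p_flip, p_flip],
                  [p_flip, 1 - p_flip]])        # trans[a, b] = P(next = b | current = a)

p_xy = np.zeros((4, 2))                         # x encoded as 2*s1 + s2, y = s3
for s1 in (0, 1):
    for s2 in (0, 1):
        for s3 in (0, 1):
            p_xy[2 * s1 + s2, s3] += 0.5 * trans[s1, s2] * trans[s2, s3]

p_x = p_xy.sum(axis=1)
p_y_x = p_xy / p_x[:, None]                     # p(y | x)

def mi_bits(p_ab):
    # Mutual information (bits) of a joint distribution stored as a 2D array.
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    m = p_ab > 0
    return float(np.sum(p_ab[m] * np.log2(p_ab[m] / (pa * pb)[m])))

def ib_point(beta, n_clusters=2, n_iter=500):
    # One point on the bottleneck curve: iterate the self-consistent equations
    # q(t|x) ~ q(t) exp(-beta * KL[p(y|x) || q(y|t)]) at trade-off parameter beta.
    q_t_x = rng.dirichlet(np.ones(n_clusters), size=len(p_x))     # random soft assignment q(t | x)
    for _ in range(n_iter):
        q_t = q_t_x.T @ p_x + 1e-12                               # q(t), guarded against empty clusters
        q_y_t = (q_t_x * p_x[:, None]).T @ p_y_x / q_t[:, None]   # q(y | t)
        d = np.array([[np.sum(p_y_x[i] * np.log(p_y_x[i] / q_y_t[t]))   # KL in nats
                       for t in range(n_clusters)] for i in range(len(p_x))])
        logits = np.log(q_t)[None, :] - beta * d
        q_t_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        q_t_x /= q_t_x.sum(axis=1, keepdims=True)
    i_past = mi_bits(q_t_x * p_x[:, None])                        # I(T; past)
    i_future = mi_bits((q_t_x * p_x[:, None]).T @ p_y_x)          # I(T; future)
    return i_past, i_future

for beta in (0.1, 1.0, 5.0, 50.0):
    ix, iy = ib_point(beta)
    print(f"beta={beta:5.1f}   I(T; past)={ix:.3f} bits   I(T; future)={iy:.3f} bits")

At small beta the compressed variable keeps almost nothing about the past; at large beta it recovers the one bit of the past that actually matters for prediction in this toy chain.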


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17



It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.



The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ≈ A N^ν and one with K parameters so that Ipred(N) ≈ (K/2) log2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu} .   (6.1)

In the regime N ≪ Nc, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ Nc, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ Nc is the relevant one for much of biology.


If we consider, for example, 10 mm^2 of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < Nc. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ Nc is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
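For concreteness, a back-of-the-envelope evaluation of equation 6.1 (our sketch): K and N are the rough numbers quoted in this paragraph, while the amplitudes A and exponents ν are placeholders, since the text only constrains ν < 1.

import numpy as np

def n_critical(K, A, nu):
    # Crossover sample size of equation 6.1: for N below N_c the power-law
    # (nonparametric) class has less predictive information than the K-parameter class.
    return (K / (2.0 * A * nu) * np.log2(K / (2.0 * A))) ** (1.0 / nu)

K = 5e9      # rough synapse count quoted above for ~10 mm^2 of inferotemporal cortex
N = 1e9      # rough number of fixations in ten years of waking life
for A in (1.0, 10.0, 100.0):         # placeholder amplitudes
    for nu in (0.5, 0.9):            # placeholder exponents, respecting nu < 1
        print(f"A={A:6.1f}  nu={nu:.1f}  N_c={n_critical(K, A, nu):.2e}  vs  N={N:.0e}")

For all of these placeholder choices N_c comes out far above 10^9, consistent with the claim that N < Nc in this setting.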

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy at length N (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
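A rough sketch of the kind of reanalysis Hilberg's observation invites (ours, not Shannon's procedure): estimate block entropies S(N) for short letter blocks of a long text and fit an extensive-plus-N^{1/2} form to gauge the size of the subextensive component. Naive plug-in counting is trustworthy only for very small N, and the file name below is a placeholder for whatever corpus is at hand.

import numpy as np
from collections import Counter

def block_entropy_bits(text, N):
    # Naive plug-in estimate of the entropy (bits) of N-character blocks.
    counts = Counter(text[i:i + N] for i in range(len(text) - N + 1))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-np.sum(p * np.log2(p)))

with open("moby_dick.txt") as f:     # placeholder file name
    text = " ".join(f.read().lower().split())

Ns = np.arange(1, 7)                 # plug-in estimates are only safe for short blocks
S = np.array([block_entropy_bits(text, int(N)) for N in Ns])

# Least-squares fit of S(N) = S0 * N + B * sqrt(N) + C
X = np.column_stack([Ns, np.sqrt(Ns), np.ones_like(Ns, dtype=float)])
S0, B, C = np.linalg.lstsq(X, S, rcond=None)[0]
print("extensive rate S0 =", S0, "bits/char;  subextensive amplitude B =", B, ";  constant C =", C)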

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process; of course, it cannot be Markovian if it has a power-law divergence in the predictive information.



Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/, followed by the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.


Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

Page 21: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as

Predictability Complexity and Learning 2429

This density could be very different for different targets6 The density ofdivergences is normalized because the original distribution over parameterspace P(reg) is normalized

ZdDr (DI Nreg) D

ZdK

aP (reg) D 1 (436)

Finally the partition function takes the simple form

Z( NregI N) DZ

dDr (DI Nreg) exp[iexclND] (437)

We recall that in statistical mechanics the partition function is given by

Z(b ) DZ

dEr (E) exp[iexclbE] (438)

where r (E) is the density of states that have energy E and b is the inversetemperature Thus the subextensive entropy in our learning problem isanalogous to a system in which energy corresponds to the KL divergencerelative to the target model and temperature is inverse to the number ofexamples As we increase the length N of the time series we have observedwe ldquocoolrdquo the system and hence probe models that approach the targetthe dynamics of this approach is determined by the density of low-energystates that is the behavior of r (DI Nreg) as D 07

The structure of the partition function is determined by a competitionbetween the (Boltzmann) exponential term which favors models with smallD and the density term which favorsvalues of D that can be achieved by thelargest possible number of models Because there (typically) are many pa-rameters there are very few models with D 0 This picture of competitionbetween the Boltzmann factor and a density of states has been emphasizedin previous work on supervised learning (Haussler et al 1996)

The behavior of the density of states r (DI Nreg) at small D is related tothe more intuitive notion of dimensionality In a parameterized family ofdistributions the KL divergence between two distributions with nearbyparameters is approximately a quadratic form

DKL ( Nregkreg) frac1412

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm) C cent cent cent (439)

6 If parameter space is compact then a related description of the space of targets basedon metric entropy also called Kolmogorovrsquos 2 -entropy is used in the literature (see forexample Haussler amp Opper 1997) Metric entropy is the logarithm of the smallest numberof disjoint parts of the size not greater than 2 into which the target space can be partitioned

7 It may be worth emphasizing that this analogy to statistical mechanics emerges fromthe analysis of the relevant probability distributions rather than being imposed on theproblem through some assumptions about the nature of the learning algorithm

2430 W Bialek I Nemenman and N Tishby

where NF is the Fisher information matrix Intuitively if we have a reason-able parameterization of the distributions then similar distributions will benearby in parameter space and more important points that are far apartin parameter space will never correspond to similar distributions Clarkeand Barron (1990) refer to this condition as the parameterization forminga ldquosoundrdquo family of distributions If this condition is obeyed then we canapproximate the low D limit of the density r (DI Nreg)

r (DI Nreg) DZ

dKaP (reg)d[D iexcl DKL( Nregkreg)]

frac14Z

dKaP (reg)d

D iexcl

12

X

mordm

iexclNam iexcl am

centFmordm ( Naordm iexcl aordm)

DZ

dKaP ( Nreg C U cent raquo)d

D iexcl

12

X

m

Lmj2m

(440)

where U is a matrix that diagonalizes F

( U T cent F cent U )mordm D Lmdmordm (441)

The delta function restricts the components of raquo in equation 440 to be oforder

pD or less and so if P(reg) is smooth we can make a perturbation

expansion After some algebra the leading term becomes

r (D 0I Nreg) frac14 P ( Nreg)2p

K2

C (K2)(det F )iexcl12 D(Kiexcl2)2 (442)

Here as before K is the dimensionality of the parameter vector Computingthe partition function from equation 437 we nd

Z( NregI N 1) frac14 f ( Nreg) cent C (K2)

NK2 (443)

where f ( Nreg) is some function of the target parameter values Finally thisallows us to evaluate the subextensive entropy from equations 433 and 434

S(a)1 (N) D iexcl

ZdK NaP ( Nreg) log2 Z( NregI N) (444)

K2

log2 N C cent cent cent (bits) (445)

where cent cent cent are nite as N 1 Thus general K-parameter model classeshave the same subextensive entropy as for the simplest example consideredin the previous section To the leading order this result is independent even

Predictability Complexity and Learning 2431

of the prior distribution P (reg) on the parameter space so that the predictiveinformation seems to count the number of parameters under some verygeneral conditions (cf Figure 3 for a very different numerical example ofthe logarithmic behavior)

Although equation 445 is true under a wide range of conditions thiscannot be the whole story Much of modern learning theory is concernedwith the fact that counting parameters is not quite enough to characterize thecomplexity of a model class the naive dimension of the parameter space Kshould be viewed in conjunction with the pseudodimension (also known asthe shattering dimension or Vapnik-Chervonenkis dimension dVC) whichmeasures capacity of the model class and with the phase-space dimension dwhich accounts for volumes in the space of models (Vapnik 1998 Opper1994) Both of these dimensions can differ from the number of parametersOne possibility is that dVC is innite when the number of parameters isnitea problem discussed below Another possibility is that the determinant of Fis zero and hence both dVC and d are smaller than the number of parametersbecause we have adopted a redundant description This sort of degeneracycould occur over a nite fraction but not all of the parameter space andthis is one way to generate an effective fractional dimensionality One canimagine multifractal models such that the effective dimensionality variescontinuously over the parameter space but it is not obvious where thiswould be relevant Finally models with d lt dVC lt 1 also are possible (seefor example Opper 1994) and this list probably is not exhaustive

Equation 442 lets us actually dene the phase-space dimension throughthe exponent in the small DKL behavior of the model density

r (D 0I Nreg) D(diexcl2)2 (446)

and then d appears in place of K as the coefcient of the log divergencein S1(N) (Clarke amp Barron 1990 Opper 1994) However this simple con-clusion can fail in two ways First it can happen that the density r (DI Nreg)accumulates a macroscopic weight at some nonzero value of D so that thesmall D behavior is irrelevant for the large N asymptotics Second the uc-tuations neglected here may be uncontrollably large so that the asymptoticsare never reached Since controllability of uctuations is a function of dVC(see Vapnik 1998 Haussler Kearns amp Schapire 1994 and later in this arti-cle) we may summarize this in the following way Provided that the smallD behavior of the density function is the relevant one the coefcient ofthe logarithmic divergence of Ipred measures the phase-space or the scalingdimension d and nothing else This asymptote is valid however only forN Agrave dVC It is still an open question whether the two pathologies that canviolate this asymptotic behavior are related

43 Learning a Parameterized Process Consider a process where sam-ples are not independent and our task is to learn their joint distribution

2432 W Bialek I Nemenman and N Tishby

Q(Ex1 ExN |reg) Again reg is an unknown parameter vector chosen ran-domly at the beginning of the series If reg is a K-dimensional vector thenone still tries to learn just K numbers and there are still N examples evenif there are correlations Therefore although such problems are much moregeneral than those already considered it is reasonable to expect that thepredictive information still is measured by (K2) log2 N provided that someconditions are met

One might suppose that conditions for simple results on the predictiveinformation are very strong for example that the distribution Q is a nite-order Markov model In fact all we really need are the following two con-ditions

S [fExig | reg] acute iexclZ

dN Ex Q(fExig | reg) log2 Q(fExig | reg)

NS0 C Scurren0 I Scurren

0 D O(1) (447)

DKL [Q(fExig | Nreg)kQ(fExig|reg)] NDKL ( Nregkreg) C o(N) (448)

Here the quantities S0 Scurren0 and DKL ( Nregkreg) are dened by taking limits

N 1 in both equations The rst of the constraints limits deviations fromextensivity to be of order unity so that if reg is known there are no long-rangecorrelations in the data All of the long-range predictability is associatedwith learning the parameters8 The second constraint equation 448 is aless restrictive one and it ensures that the ldquoenergyrdquo of our statistical systemis an extensive quantity

With these conditions it is straightforward to show that the results of theprevious subsection carry over virtually unchanged With the same cautiousstatements about uctuations and the distinction between K d and dVC onearrives at the result

S(N) D S0 cent N C S(a)1 (N) (449)

S(a)1 (N) D

K2

log2 N C cent cent cent (bits) (450)

where cent cent cent stands for terms of order one as N 1 Note again that for theresult of equation 450 to be valid the process considered is not required tobe a nite-order Markov process Memory of all previous outcomes may bekept provided that the accumulated memory does not contribute a diver-gent term to the subextensive entropy

8 Suppose that we observe a gaussian stochastic process and try to learn the powerspectrum If the class of possible spectra includes ratios of polynomials in the frequency(rational spectra) then this condition is met On the other hand if the class of possiblespectra includes 1 f noise then the condition may not be met For more on long-rangecorrelations see below

Predictability Complexity and Learning 2433

It is interesting to ask what happens if the condition in equation 447 isviolated so that there are long-range correlations even in the conditional dis-tribution Q(Ex1 ExN |reg) Suppose for example that Scurren

0 D (Kcurren2) log2 NThen the subextensive entropy becomes

S(a)1 (N) D

K C Kcurren

2log2 N C cent cent cent (bits) (451)

We see that the subextensive entropy makes no distinction between pre-dictability that comes from unknown parameters and predictability thatcomes from intrinsic correlations in the data in this sense two models withthe same KC Kcurren are equivalent This actually must be so As an example con-sider a chain of Ising spins with long-range interactions in one dimensionThis system can order (magnetize) and exhibit long-range correlations andso the predictive information will diverge at the transition to ordering Inone view there is no global parameter analogous to regmdashjust the long-rangeinteractions On the other hand there are regimes in which we can approxi-mate the effect of these interactions by saying that all the spins experience amean eld that is constant across the whole length of the system and thenformally we can think of the predictive information as being carried by themean eld itself In fact there are situations in which this is not just anapproximation but an exact statement Thus we can trade a description interms of long-range interactions (Kcurren 6D 0 but K D 0) for one in which thereare unknown parameters describing the system but given these parametersthere are no long-range correlations (K 6D 0 Kcurren D 0) The two descriptionsare equivalent and this is captured by the subextensive entropy9

44 Taming the Fluctuations The preceding calculations of the subex-tensive entropy S1 are worthless unless we prove that the uctuations y arecontrollable We are going to discuss when and if this indeed happens Welimit the discussion to the case of nding a probability density (section 42)the case of learning a process (section 43) is very similar

Clarke and Barron (1990) solved essentially the same problem They didnot make a separation into the annealed and the uctuation term and thequantity they were interested in was a bit different from ours but inter-preting loosely they proved that modulo some reasonable technical as-sumptions on differentiability of functions in question the uctuation termalways approaches zero However they did not investigate the speed ofthis approach and we believe that they therefore missed some importantqualitative distinctions between different problems that can arise due to adifference between d and dVC In order to illuminate these distinctions wehere go through the trouble of analyzing uctuations all over again

9 There are a number of interesting questions about how the coefcients in the diverg-ing predictive information relate to the usual critical exponents and we hope to return tothis problem in a later article

2434 W Bialek I Nemenman and N Tishby

Returning to equations 427 429 and the denition of entropy we canwrite the entropy S(N) exactly as

S(N) D iexclZ

dK NaP ( Nreg)Z NY

jD1

pounddExj Q(Exj | Nreg)curren

pound log2

NY

iD1Q(Exi | Nreg)

ZdK

aP (reg)

pound eiexclNDKL ( Naka)CNy (a NaIfExig)

(452)

This expression can be decomposed into the terms identied above plus anew contribution to the subextensive entropy that comes from the uctua-tions alone S

(f)1 (N)

S(N) D S0 cent N C S(a)1 (N) C S(f)

1 (N) (453)

S(f)1 D iexcl

ZdK NaP ( Nreg)

NY

jD1

pounddExj Q(Exj | Nreg)

curren

pound log2

microZ dKaP (reg)

Z( NregI N)eiexclNDKL ( Naka)CNy (a NaIfExig)

para (454)

where y is dened as in equation 429 and Z as in equation 434Some loose but useful bounds can be established First the predictive in-

formation is a positive (semidenite) quantity and so the uctuation termmay not be smaller (more negative) than the value of iexclS

(a)1 as calculated

in equations 445 and 450 Second since uctuations make it more dif-cult to generalize from samples the predictive information should alwaysbe reduced by uctuations so that S(f) is negative This last statement cor-responds to the fact that for the statistical mechanics of disordered systemsthe annealed free energy always is less than the average quenched free en-ergy and may be proven rigorously by applying Jensenrsquos inequality to the(concave) logarithm function in equation 454 essentially the same argu-ment was given by Haussler and Opper (1997) A related Jensenrsquos inequalityargument allows us to show that the total S1(N) is bounded

S1(N) middot NZ

dKa

ZdK NaP (reg)P ( Nreg)DKL( Nregkreg)

acute hNDKL( Nregkreg)iNaa (455)

so that if we have a class of models (and a prior P (reg)) such that the aver-age KL divergence among pairs of models is nite then the subextensiveentropy necessarily is properly denedIn its turn niteness of the average

Predictability Complexity and Learning 2435

KL divergence or of similar quantities is a usual constraint in the learningproblems of this type (see for example Haussler amp Opper 1997) Note alsothat hDKL( Nregkreg)i Naa includes S0 as one of its terms so that usually S0 andS1 are well or ill dened together

Tighter bounds require nontrivial assumptions about the class of distri-butions considered The uctuation term would be zero if y were zero andy is the difference between an expectation value (KL divergence) and thecorresponding empirical mean There is a broad literature that deals withthis type of difference (see for example Vapnik 1998)

We start with the case when the pseudodimension (dVC) of the set ofprobability densities fQ(Ex | reg)g is nite Then for any reasonable functionF(ExI b ) deviations of the empirical mean from the expectation value can bebounded by probabilistic bounds of the form

P

(sup

b

shyshyshyshyshy

1N

Pj F(ExjI b ) iexcl R

dEx Q(Ex | Nreg) F(ExI b )

L[F]

shyshyshyshyshygt 2

)

lt M(2 N dVC)eiexclcN2 2 (456)

where c and L[F] depend on the details of the particular bound used Typi-cally c is a constant of order one and L[F] either is some moment of F or therange of its variation In our case F is the log ratio of two densities so thatL[F] may be assumed bounded for almost all b without loss of generalityin view of equation 455 In addition M(2 N dVC) is nite at zero growsat most subexponentially in its rst two arguments and depends exponen-tially on dVC Bounds of this form may have different names in differentcontextsmdashfor example Glivenko-Cantelli Vapnik-Chervonenkis Hoeffd-ing Chernoff (for review see Vapnik 1998 and the references therein)

To start the proof of niteness of S(f)1 in this case we rst show that

only the region reg frac14 Nreg is important when calculating the inner integral inequation 454 This statement is equivalent to saying that at large values ofreg iexcl Nreg the KL divergence almost always dominates the uctuation termthat is the contribution of sequences of fExig with atypically large uctua-tions is negligible (atypicality is dened as y cedil d where d is some smallconstant independent of N) Since the uctuations decrease as 1

pN (see

equation 456) and DKL is of order one this is plausible To show this webound the logarithm in equation 454 by N times the supremum value ofy Then we realize that the averaging over Nreg and fExig is equivalent to inte-gration over all possible values of the uctuations The worst-case densityof the uctuations may be estimated by differentiating equation 456 withrespect to 2 (this brings down an extra factor of N2 ) Thus the worst-casecontribution of these atypical sequences is

S(f)atypical1 raquo

Z 1

d

d2 N22 2M(2 )eiexclcN2 2raquo eiexclcNd2

iquest 1 for large N (457)

2436 W Bialek I Nemenman and N Tishby

This bound lets us focus our attention on the region reg frac14 Nreg We expandthe exponent of the integrand of equation 454 around this point and per-form a simple gaussian integration In principle large uctuations mightlead to an instability (positive or zero curvature) at the saddle point butthis is atypical and therefore is accounted for already Curvatures at thesaddle points of both numerator and denominator are of the same orderand throwing away unimportant additive and multiplicative constants oforder unity we obtain the following result for the contribution of typicalsequences

S(f)typical1 raquo

ZdK NaP ( Nreg) dN Ex

Y

jQ(Exj | Nreg) N (B Aiexcl1 B)I

Bm D1N

X

i

log Q(Exi | Nreg)

Nam

hBiEx D 0I

(A)mordm D1N

X

i

2 log Q(Exi | Nreg)

Nam Naordm

hAiEx D F (458)

Here hcent cent centiEx means an averaging with respect to all Exirsquos keeping Nreg constantOne immediately recognizes that B and A are respectively rst and secondderivatives of the empirical KL divergence that was in the exponent of theinner integral in equation 454

We are dealing now with typical cases Therefore large deviations of Afrom F are not allowed and we may bound equation 458 by replacing Aiexcl1

with F iexcl1(1Cd) whered again is independent of N Now we have to averagea bunch of products like

log Q(Exi | Nreg)

Nam

(F iexcl1)mordm log Q(Exj | Nreg)

Naordm

(459)

over all Exirsquos Only the terms with i D j survive the averaging There areK2N such terms each contributing of order Niexcl1 This means that the totalcontribution of the typical uctuations is bounded by a number of orderone and does not grow with N

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε; N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ ··· can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪_n C_n.

The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|a) = \frac{1}{\sqrt{2\pi}}\, \exp\left[-\frac{1}{2}(x-a)^2\right], \quad x \in (-\infty, +\infty); \qquad (4.60)

Q(x|a) = \frac{\exp(-\sin ax)}{\int_0^{2\pi} dx\, \exp(-\sin ax)}, \quad x \in [0, 2\pi). \qquad (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < a < a_max and zero elsewhere, with a_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin ax. Adding a new data point would not help until that best, but wrong, parameter estimate is less than a_max.¹⁰ So the fluctuations are large, and the predictive information is small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of a would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, & Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.

10 Interestingly, since for the model in equation 4.61 the KL divergence is bounded from below and from above, for a_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.
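
The qualitative behavior described for the d_VC = ∞ model of equation 4.61 is easy to see numerically. The sketch below is not from the original paper; the true parameter a = 2, the box size a_max = 200, the sample sizes, and the grid resolution are all illustrative choices. It draws N samples from Q(x|a) and scans the likelihood over the prior box: for small N the maximum typically sits at a spuriously large a, and only for larger N does it settle near the true value.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_q(a, n):
        """Rejection-sample n points from Q(x|a) proportional to exp(-sin(a x)) on [0, 2*pi)."""
        out = []
        while len(out) < n:
            x = rng.uniform(0.0, 2*np.pi, size=4*n)
            accept = rng.uniform(0.0, 1.0, size=4*n) < np.exp(-np.sin(a*x)) / np.e
            out.extend(x[accept])
        return np.array(out[:n])

    def log_likelihood(a, xs):
        """Log-likelihood of the data under Q(x|a); the normalization is computed numerically."""
        grid = np.linspace(0.0, 2*np.pi, 4096, endpoint=False)
        z = np.mean(np.exp(-np.sin(a*grid))) * 2*np.pi
        return -np.sin(a*xs).sum() - len(xs)*np.log(z)

    a_true, a_max = 2.0, 200.0           # assumed values, chosen only for illustration
    a_grid = np.linspace(0.1, a_max, 4000)
    for n in (10, 30, 100, 1000):
        xs = sample_q(a_true, n)
        ll = np.array([log_likelihood(a, xs) for a in a_grid])
        print(f"N={n:5d}   maximum-likelihood a = {a_grid[np.argmax(ll)]:7.2f}   (true a = {a_true})")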

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned and, correspondingly, the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A\, \exp\left[-\frac{B}{D^\mu}\right], \quad \mu > 0, \qquad (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar\alpha; N) = \int dD\, \rho(D;\bar\alpha)\, \exp[-ND]

\approx A(\bar\alpha) \int dD\, \exp\left[-\frac{B(\bar\alpha)}{D^\mu} - ND\right]

\approx \tilde{A}(\bar\alpha)\, \exp\left[-\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)}\right], \qquad (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar\alpha) = A(\bar\alpha)\left[\frac{2\pi\,\mu^{1/(\mu+1)}}{\mu+1}\right]^{1/2} \cdot [B(\bar\alpha)]^{1/(2\mu+2)}, \qquad (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)}\left[\frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)}\right]. \qquad (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N:

S^{(a)}_1(N) \approx \frac{1}{\ln 2}\,\langle C(\bar\alpha)\rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \ \ {\rm (bits)}, \qquad (4.66)

where ⟨···⟩_ᾱ denotes an average over all the target models.
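
Since equations 4.63 through 4.65 rest on a saddle-point evaluation, they are easy to check numerically. The sketch below uses illustrative values of A, B, and μ (the text leaves A(ᾱ) and B(ᾱ) unspecified) and compares a brute-force evaluation of the integral defining Z(ᾱ; N) with the asymptotic form; the ratio approaches one as N grows.

    import numpy as np

    # Hypothetical constants for illustration; A(alpha), B(alpha), and mu are
    # left unspecified in the text, so these particular values are assumptions.
    A, B, mu = 1.0, 1.0, 1.0

    def compare(N):
        """Compare a brute-force Z(N) with the saddle-point result of eqs. 4.63-4.65."""
        C = B**(1/(mu + 1)) * (mu**(-mu/(mu + 1)) + mu**(1/(mu + 1)))                 # eq. 4.65
        Atil = A * np.sqrt(2*np.pi * mu**(1/(mu + 1)) / (mu + 1)) * B**(1/(2*mu + 2)) # eq. 4.64
        Dstar = (mu*B/N)**(1/(mu + 1))     # saddle point of B/D^mu + N*D
        shift = C * N**(mu/(mu + 1))       # factor exp(+shift) out of both sides to avoid underflow
        D = np.linspace(Dstar/50, 50*Dstar, 400001)
        numeric = np.sum(A*np.exp(-B/D**mu - N*D + shift)) * (D[1] - D[0])
        saddle = Atil * N**(-0.5*(mu + 2)/(mu + 1))
        return numeric, saddle

    for N in (10, 100, 1000, 10000):
        numeric, saddle = compare(N)
        print(f"N={N:6d}  numeric={numeric:.4e}  saddle={saddle:.4e}  ratio={numeric/saddle:.3f}")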


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively, and quantitatively, much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S₁(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario in which the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^{ν−1} bits. Summing this up over all samples, we find S^{(a)}_1 ∼ N^ν, and if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
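
A minimal numerical rendering of this counting argument (the exponent ν = 1/2 below is an arbitrary illustrative choice): summing the per-sample contributions ∼ N^{ν−1}/2 reproduces the N^ν scaling rather than the naive N^ν log N estimate.

    import numpy as np

    nu = 0.5                                   # assumed growth exponent, nu = mu/(mu+1) with mu = 1
    N = np.arange(1, 10**6 + 1)
    per_sample = N**(nu - 1.0) / 2.0           # ~ n^(nu-1)/2 bits carried by the nth sample
    S1 = np.cumsum(per_sample)                 # subextensive entropy accumulated after N samples

    for n in (10**3, 10**4, 10**5, 10**6):
        print(f"N={n:8d}   summed S1 = {S1[n-1]:9.1f}   N^nu/(2 nu) = {n**nu/(2*nu):9.1f}   "
              f"naive (N^nu/2) log2 N = {0.5*n**nu*np.log2(n):11.1f}")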

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S₁ are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l₀) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z}\, \exp\left[-\frac{l}{2}\int dx \left(\frac{\partial\phi}{\partial x}\right)^2\right] \delta\left[\frac{1}{l_0}\int dx\, e^{-\phi(x)} - 1\right], \qquad (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and l sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x₁, x₂, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D, φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l₀) exp[−φ(x)], we can write the KL divergence as

D_{KL}[\bar\phi(x)\|\phi(x)] = \frac{1}{l_0}\int dx\, \exp[-\bar\phi(x)]\,[\phi(x) - \bar\phi(x)]. \qquad (4.68)

We want to compute the density

\rho(D;\bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\big(D - D_{KL}[\bar\phi(x)\|\phi(x)]\big) \qquad (4.69)

= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\big(MD - M D_{KL}[\bar\phi(x)\|\phi(x)]\big), \qquad (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D;\bar\phi) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{KL}) \qquad (4.71)

= M \int \frac{dz}{2\pi}\, \exp\left(izMD + \frac{izM}{l_0}\int dx\, \bar\phi(x)\exp[-\bar\phi(x)]\right)

\times \int [d\phi(x)]\, P[\phi(x)]\, \exp\left(-\frac{izM}{l_0}\int dx\, \phi(x)\exp[-\bar\phi(x)]\right). \qquad (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)]\, \exp\left(-\frac{izM}{l_0}\int dx\, \phi(x)\exp[-\bar\phi(x)]\right)

= \exp\left(-\frac{izM}{l_0}\int dx\, \bar\phi(x)\exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM]\right), \qquad (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{l}{2}\int dx \left(\frac{\partial\bar\phi}{\partial x}\right)^2 + \frac{1}{2}\left(\frac{izM}{l\, l_0}\right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2]. \qquad (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\left(-\frac{B[\bar\phi(x)]}{D}\right), \qquad (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\, l\, l_0}}\, \exp\left[-\frac{l}{2}\int dx\, (\partial_x\bar\phi)^2\right] \times \int dx\, \exp[-\bar\phi(x)/2], \qquad (4.76)

B[\bar\phi(x)] = \frac{1}{16\, l\, l_0}\left(\int dx\, \exp[-\bar\phi(x)/2]\right)^2. \qquad (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S^{(a)}_1(N) \sim \frac{1}{2\ln 2}\,\frac{1}{\sqrt{l\, l_0}}\left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} N^{1/2}, \qquad (4.78)

where ⟨···⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S^{(a)}_1(N) \sim \frac{1}{2\ln 2}\,\sqrt{N}\left(\frac{L}{l}\right)^{1/2} \ {\rm bits}. \qquad (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure, in which we divide the range of x into bins of local size √(l/[N Q(x)]) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
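
To make the bin-counting picture concrete, here is a small numerical sketch (the range L, the smoothness scale l, and the example densities are assumptions, not taken from the text): the effective number of bins, ∫ dx [N Q(x)/l]^{1/2}, grows as N^{1/2} for any fixed density, and for the uniform density it equals (N L/l)^{1/2}, the combination appearing in equation 4.79.

    import numpy as np

    L, l = 1.0, 1e-3                         # assumed range and smoothness scale
    x = np.linspace(0.0, L, 20001)
    dx = x[1] - x[0]

    Q_uniform = np.ones_like(x) / L
    Q_bump = np.exp(-0.5*((x - 0.5*L)/(0.1*L))**2)
    Q_bump /= np.sum(Q_bump)*dx              # normalize the nonuniform example density

    for N in (10**2, 10**4, 10**6):
        n_uni = np.sum(np.sqrt(N*Q_uniform/l))*dx    # ~ integral dx sqrt(N Q(x)/l)
        n_bump = np.sum(np.sqrt(N*Q_bump/l))*dx
        print(f"N={N:8d}   bins(uniform)={n_uni:10.1f} ~ sqrt(N L/l)={np.sqrt(N*L/l):10.1f}"
              f"   bins(bump)={n_bump:10.1f}")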

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities, n ∼ √N, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter l describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of l (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing l to vary with x. In this case the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn), but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹² Rissanen (1984) and Clarke and Barron (1990) in full generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x₁, x₂, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x₁, x₂, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x₁, x₂, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log₂ N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy), since the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S₀ and thus are functions only of disorder, which is not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.¹³ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986) in particular has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.¹⁴ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;¹⁵ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.¹⁶ So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number 𝒩 of walks with N steps, we find that 𝒩(N) ≈ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S₀ = log₂ z, a constant subextensive entropy log₂ A, and a divergent subextensive term S₁(N) ≈ γ log₂ N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)]\, \log_2 P[x(t)], \qquad (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)]\, \log_2\!\left(\frac{P[x(t)]}{Q[x(t)]}\right), \qquad (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures: the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ Ipred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data; using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = Ipred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
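
A minimal worked example of these statements, with an arbitrarily chosen two-state Markov chain as the data stream: the block entropy is extensive plus a constant, so the mutual information between two adjacent blocks of length T, 2 S(T) − S(2T), is finite and independent of T and equals the mutual information between adjacent symbols; correspondingly, the current symbol alone is a sufficient summary of the past, a minimal x̂_past that preserves all of the predictive information.

    import numpy as np

    # Illustrative transition matrix (an assumption, not taken from the text).
    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])             # P[i, j] = Prob(next = j | current = i)

    evals, evecs = np.linalg.eig(P.T)      # stationary distribution: left eigenvector of P
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi /= pi.sum()

    def h(p):
        """Entropy of a discrete distribution, in bits."""
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p*np.log2(p))

    def S(T):
        """Entropy of a block of T consecutive symbols of the stationary chain."""
        return h(pi) + (T - 1)*sum(pi[i]*h(P[i]) for i in range(2))

    for T in (1, 2, 5, 10, 100):
        print(f"T={T:4d}   2 S(T) - S(2T) = {2*S(T) - S(2*T):.6f} bits")

    joint = pi[:, None]*P                  # joint distribution of (current, next)
    print("I(current; next) =", round(h(pi) + h(joint.sum(axis=0)) - h(joint), 6), "bits")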


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.¹⁷ It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,¹⁸ and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ∼ A N^ν and one with K parameters so that Ipred(N) ∼ (K/2) log₂ N, the infinite-parameter model has less predictive information for all N smaller than some critical value,

Nc raquomicro K

2Aordmlog2

sup3 K2A

acutepara1ordm (61)
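To see where this expression comes from (the following intermediate step is our own, not spelled out in the text), set the two asymptotics equal, A N_c^ν = (K/2) log₂ N_c, and solve to leading logarithmic accuracy:

N_c \simeq \left[\frac{K}{2A}\,\log_2 N_c\right]^{1/\nu} \simeq \left[\frac{K}{2A\nu}\,\log_2\frac{K}{2A}\right]^{1/\nu},

where the last step substitutes log₂ N_c ≈ (1/ν) log₂(K/2A) from the first iteration.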

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ~ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ~ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
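A quick numerical reading of equation 6.1 supports this claim. The sketch below (ours) uses the K ≈ 5 × 10⁹ and N ≈ 10⁹ quoted above, while the values of A and ν are illustrative assumptions, since the text fixes only ν < 1:

import numpy as np

def n_critical(K, A, nu):
    """N_c from equation 6.1: [K/(2 A nu) * log2(K/(2 A))]**(1/nu)."""
    return (K / (2.0 * A * nu) * np.log2(K / (2.0 * A))) ** (1.0 / nu)

K, N = 5e9, 1e9   # synapses in ~10 mm^2 of cortex, lifetime foveations (numbers from the text)
for nu in (0.25, 0.5, 0.75):       # assumed exponents, nu < 1
    for A in (1.0, 10.0, 100.0):   # assumed prefactors
        print(f"nu={nu}, A={A}: N_c = {n_critical(K, A, nu):.1e}   vs   N = {N:.0e}")

For all of these choices, N_c exceeds N by several orders of magnitude, consistent with the suggestion that biological learning sits in the N ≪ N_c regime.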

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the preceding N letters (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S₀. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S₁(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
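Shannon's procedure is easy to mimic in silico. The sketch below (our illustration, not the experimental design the text calls for) replaces the human reader with a simple order-k character model and computes the entropy of the resulting guess-number distribution, which upper-bounds the conditional entropy for any deterministic guessing strategy; the corpus, the order k, the file name in the usage comment, and the train/test split are all arbitrary choices:

import collections
import math

def guessing_entropy_bound(text, k=3):
    """Entropy (bits/letter) of the guess-number distribution, a la Shannon (1951)."""
    alphabet = sorted(set(text))
    split = len(text) // 2
    train, test = text[:split], text[split:]
    # order-k conditional letter counts stand in for the human guesser
    counts = collections.defaultdict(collections.Counter)
    for i in range(len(train) - k):
        counts[train[i:i + k]][train[i + k]] += 1
    ranks = collections.Counter()
    for i in range(len(test) - k):
        context, true_char = test[i:i + k], test[i + k]
        guesses = [c for c, _ in counts[context].most_common()]
        guesses += [c for c in alphabet if c not in guesses]  # deterministic fallback order
        ranks[guesses.index(true_char) + 1] += 1
    total = sum(ranks.values())
    return -sum(n / total * math.log2(n / total) for n in ranks.values())

# usage (hypothetical file name):
# print(guessing_entropy_bound(open("moby_dick.txt").read().lower(), k=4))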

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.

Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.

Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.

Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.

Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.

Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.

Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.

Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.

Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.

Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.

Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.

Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.

Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.

Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.

Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.

Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.

Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.

Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.

Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.

Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.

Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.

Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.

Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.

Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.

Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.

Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.

Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.

Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.

Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.

Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.

Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.

Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.

Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.

de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.

Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.

Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.

Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.

Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.

Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.

Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.

Haussler, D., & Opper, M. (1997). Mutual information, metric entropy, and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.

Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.

Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.

Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).

Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.

Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.

Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.

Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.

Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.

Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.

Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.

Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.

Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.

Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.

Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.

MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.

Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.

Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.

Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.

Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.

Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.

Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.

Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.

Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.

Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.

Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.

Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.

Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.

Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.

Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.

Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.

Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.

Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.

Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.

Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.

Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.

Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.

Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.

Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.

Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.

Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.

Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.

Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.

Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.

Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.

Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.

Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



where F is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low D limit of the density ρ(D; ᾱ):

\rho(D; \bar\alpha) = \int d^K\alpha\, P(\alpha)\, \delta\!\left[ D - D_{\rm KL}(\bar\alpha\|\alpha) \right]

\approx \int d^K\alpha\, P(\alpha)\, \delta\!\left[ D - \frac{1}{2}\sum_{\mu\nu} (\bar\alpha_\mu - \alpha_\mu)\, F_{\mu\nu}\, (\bar\alpha_\nu - \alpha_\nu) \right]

= \int d^K\xi\, P(\bar\alpha + U\cdot\xi)\, \delta\!\left[ D - \frac{1}{2}\sum_{\mu} \Lambda_\mu \xi_\mu^2 \right],   (4.40)

where U is a matrix that diagonalizes F,

(U^T \cdot F \cdot U)_{\mu\nu} = \Lambda_\mu \delta_{\mu\nu}.   (4.41)

The delta function restricts the components of ξ in equation 4.40 to be of order √D or less, and so, if P(α) is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

\rho(D \to 0; \bar\alpha) \approx P(\bar\alpha) \cdot \frac{2\pi^{K/2}}{\Gamma(K/2)}\, (\det F)^{-1/2}\, D^{(K-2)/2}.   (4.42)

Here, as before, K is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

Z(\bar\alpha; N \to \infty) \approx f(\bar\alpha) \cdot \frac{\Gamma(K/2)}{N^{K/2}},   (4.43)

where f(ᾱ) is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

S_1^{(a)}(N) = -\int d^K\bar\alpha\, P(\bar\alpha)\, \log_2 Z(\bar\alpha; N)   (4.44)

\to \frac{K}{2}\,\log_2 N + \cdots \quad {\rm (bits)},   (4.45)

where ⋯ are finite as N → ∞. Thus, general K-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even of the prior distribution P(α) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).
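The (K/2) log₂ N behavior is easy to check numerically in the simplest case. The Monte Carlo sketch below (ours) estimates the subextensive entropy, here equal to the mutual information between N coin flips and the unknown bias, for a one-parameter (K = 1) Bernoulli model with a uniform prior, and compares it with (1/2) log₂ N, from which it should differ only by an N-independent constant:

import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)

def subextensive_entropy(N, samples=50000):
    """Monte Carlo estimate of I({x_i}; alpha) in bits for a coin with a uniform prior."""
    alpha = rng.uniform(1e-9, 1 - 1e-9, size=samples)   # true biases drawn from the prior
    n = rng.binomial(N, alpha)                           # heads observed in N flips
    log_q = n * np.log(alpha) + (N - n) * np.log(1.0 - alpha)      # log Q(x^N | alpha)
    log_p = gammaln(n + 1) + gammaln(N - n + 1) - gammaln(N + 2)   # log of the marginal P(x^N)
    return np.mean(log_q - log_p) / np.log(2.0)

for N in (10, 100, 1000, 10000):
    print(N, subextensive_entropy(N), 0.5 * np.log2(N))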

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension d_VC), which measures capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of F is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < ∞ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small D_KL behavior of the model density,

\rho(D \to 0; \bar\alpha) \propto D^{(d-2)/2},   (4.46)

and then d appears in place of K as the coefficient of the log divergence in S₁(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density ρ(D; ᾱ) accumulates a macroscopic weight at some nonzero value of D, so that the small D behavior is irrelevant for the large N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_pred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N ≫ d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution


Q(x₁, …, x_N | α). Again, α is an unknown parameter vector chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log₂ N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S\left[\{x_i\}\,|\,\alpha\right] \equiv -\int d^N x\; Q(\{x_i\}|\alpha)\, \log_2 Q(\{x_i\}|\alpha) \;\to\; N S_0 + S_0^*;\qquad S_0^* = O(1),   (4.47)

D_{\rm KL}\left[ Q(\{x_i\}|\bar\alpha)\,\|\,Q(\{x_i\}|\alpha) \right] \;\to\; N D_{\rm KL}(\bar\alpha\|\alpha) + o(N).   (4.48)

Here the quantities S₀, S₀*, and D_KL(ᾱ‖α) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data; all of the long-range predictability is associated with learning the parameters.8 The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result

S(N) = S_0 \cdot N + S_1^{(a)}(N),   (4.49)

S_1^{(a)}(N) = \frac{K}{2}\,\log_2 N + \cdots \quad {\rm (bits)},   (4.50)

where ⋯ stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(x₁, …, x_N | α). Suppose, for example, that S₀* = (K*/2) log₂ N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2}\,\log_2 N + \cdots \quad {\rm (bits)}.   (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α; there are just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters there are no long-range correlations (K ≠ 0, K* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S₁ are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ dx_j\, Q(x_j|\bar\alpha) \right] \times \log_2\!\left[ \prod_{i=1}^{N} Q(x_i|\bar\alpha) \int d^K\alpha\, P(\alpha)\, e^{-N D_{\rm KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{x_i\})} \right].   (4.52)

This expression can be decomposed into the terms identified above plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)} = -\int d^K\bar\alpha\, P(\bar\alpha) \prod_{j=1}^{N} \left[ dx_j\, Q(x_j|\bar\alpha) \right] \times \log_2\!\left[ \int \frac{d^K\alpha\, P(\alpha)}{Z(\bar\alpha; N)}\, e^{-N D_{\rm KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{x_i\})} \right],   (4.54)

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of −S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S₁(N) is bounded:

S_1(N) \le N \int d^K\alpha \int d^K\bar\alpha\; P(\alpha)\, P(\bar\alpha)\, D_{\rm KL}(\bar\alpha\|\alpha) \equiv \left\langle N D_{\rm KL}(\bar\alpha\|\alpha) \right\rangle_{\bar\alpha\alpha},   (4.55)

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱα} includes S₀ as one of its terms, so that usually S₀ and S₁ are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(x | α)} is finite. Then for any reasonable function F(x; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{ \sup_{\beta} \frac{ \left| \frac{1}{N}\sum_j F(x_j;\beta) - \int dx\, Q(x|\bar\alpha)\, F(x;\beta) \right| }{ L[F] } > \epsilon \right\} < M(\epsilon, N, d_{\rm VC})\; e^{-c N \epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality, in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {x_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus the worst-case contribution of these atypical sequences is

S_1^{(f),\,{\rm atypical}} \sim \int_{\delta}^{\infty} d\epsilon\; N^2 \epsilon^2\, M(\epsilon)\, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \quad {\rm for\ large\ } N.   (4.57)


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\,{\rm typical}} \sim \int d^K\bar\alpha\, P(\bar\alpha)\, d^N x \prod_j Q(x_j|\bar\alpha)\; N\,(B \cdot A^{-1} \cdot B);

B_\mu = -\frac{1}{N}\sum_i \frac{\partial \log Q(x_i|\bar\alpha)}{\partial\bar\alpha_\mu}, \qquad \langle B \rangle_x = 0;

(A)_{\mu\nu} = -\frac{1}{N}\sum_i \frac{\partial^2 \log Q(x_i|\bar\alpha)}{\partial\bar\alpha_\mu\,\partial\bar\alpha_\nu}, \qquad \langle A \rangle_x = F.   (4.58)

Here ⟨⋯⟩_x means an averaging with respect to all x_i's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(x_i|\bar\alpha)}{\partial\bar\alpha_\mu}\; (F^{-1})_{\mu\nu}\; \frac{\partial \log Q(x_j|\bar\alpha)}{\partial\bar\alpha_\nu}   (4.59)

over all x_i's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C₁ ⊂ C₂ ⊂ C₃ ⊂ ⋯ can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ⋃ C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|a) = \frac{1}{\sqrt{2\pi}}\,\exp\left[ -\frac{1}{2}(x-a)^2 \right], \qquad x \in (-\infty; +\infty);   (4.60)

Q(x|a) = \frac{\exp(-\sin ax)}{\int_0^{2\pi} dx\, \exp(-\sin ax)}, \qquad x \in [0; 2\pi).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(a). The second one does not allow this. Suppose that the prior is uniform in a box 0 < a < a_max and zero elsewhere, with a_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin ax. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than a_max.10 So the fluctuations are large, and the predictive information is

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for a_max → ∞ the weight in ρ(D; ā) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of a would swiftly approach the actual value. At this point, the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
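The pathology described above is easy to reproduce in a small simulation. The sketch below (ours) draws samples from equation 4.61, scans the log likelihood over a wide prior box, and shows that for small N the maximizing a typically sits far from the true value, while for large N it locks on; the specific values of a_true, a_max, and the grids are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
XGRID = np.linspace(0.0, 2.0 * np.pi, 2048, endpoint=False)

def log_partition(a):
    """log of the normalization integral of exp(-sin(a x)) over [0, 2*pi)."""
    return np.log(np.mean(np.exp(-np.sin(a * XGRID))) * 2.0 * np.pi)

def sample(a, N):
    """Rejection sampling from Q(x|a); exp(-sin) <= e supplies the envelope."""
    xs = np.empty(0)
    while xs.size < N:
        x = rng.uniform(0.0, 2.0 * np.pi, size=4 * N)
        keep = rng.uniform(0.0, 1.0, size=x.size) < np.exp(-np.sin(a * x)) / np.e
        xs = np.concatenate([xs, x[keep]])
    return xs[:N]

def ml_estimate(xs, a_max, grid=4000):
    """Maximum-likelihood a on a grid over the prior box (0, a_max]."""
    a_vals = np.linspace(1e-3, a_max, grid)
    loglik = [-np.sin(a * xs).sum() - xs.size * log_partition(a) for a in a_vals]
    return a_vals[int(np.argmax(loglik))]

a_true, a_max = 1.0, 100.0
for N in (5, 20, 100, 1000):
    print(N, ml_estimate(sample(a_true, N), a_max))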

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario in which the number of examples eventually overwhelms the number of parameters or dimensions is too weak to deal with many real-world problems. Certainly in the present context, there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ~ D^{(d-2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A\,\exp\left[ -\frac{B}{D^{\mu}} \right], \qquad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar\alpha; N) = \int dD\, \rho(D;\bar\alpha)\, \exp[-ND]

\approx A(\bar\alpha) \int dD\, \exp\left[ -\frac{B(\bar\alpha)}{D^{\mu}} - ND \right]

\approx \tilde{A}(\bar\alpha)\, \exp\left[ -\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar\alpha) = A(\bar\alpha) \left[ \frac{2\pi\,\mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} \cdot [B(\bar\alpha)]^{1/(2\mu+2)},   (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N:

S_1^{(a)}(N) \to \frac{1}{\ln 2}\, \langle C(\bar\alpha) \rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \quad {\rm (bits)},   (4.66)

where ⟨⋯⟩_ᾱ denotes an average over all the target models.
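For completeness, the saddle point behind equation 4.63 (this intermediate algebra is ours): extremizing the exponent −B/D^μ − ND over D gives

D_* = \left( \frac{\mu B}{N} \right)^{1/(\mu+1)}, \qquad -\frac{B}{D_*^{\mu}} - N D_* = -\left[ \mu^{-\mu/(\mu+1)} + \mu^{1/(\mu+1)} \right] B^{1/(\mu+1)}\, N^{\mu/(\mu+1)} = -C(\bar\alpha)\, N^{\mu/(\mu+1)},

in agreement with equation 4.65; the Gaussian fluctuations around D_* supply the prefactor Ã(ᾱ) of equation 4.64 and the −(1/2)[(μ+2)/(μ+1)] ln N term.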


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is, intuitively and quantitatively, much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ~ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S₁(N) ~ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ~ K/2N bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ~ N^{ν-1} bits. Summing this up over all samples, we find S_1^{(a)} ~ N^ν, and, if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
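The sum sketched in the paragraph above can be written out explicitly (our arithmetic): with K(n) ~ n^ν parameters in play when the nth sample arrives, that sample contributes ~ K(n)/(2n) bits, so

S_1^{(a)}(N) \sim \sum_{n=1}^{N} \frac{K(n)}{2n} \sim \frac{1}{2}\sum_{n=1}^{N} n^{\nu-1} \simeq \frac{N^{\nu}}{2\nu},

which grows as N^ν; setting ν = μ/(μ+1) then matches the N^{μ/(μ+1)} scaling of equation 4.66.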

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S₁ are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l₀) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z}\,\exp\left[ -\frac{l}{2}\int dx \left( \frac{\partial\phi}{\partial x} \right)^2 \right] \delta\!\left[ \frac{1}{l_0}\int dx\, e^{-\phi(x)} - 1 \right],   (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and l sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x₁, x₂, …, x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.11
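To build intuition for this prior, one can draw sample densities from it numerically. The rough sketch below (ours) samples the Gaussian part of equation 4.67 as a random walk in φ and then enforces normalization by rescaling exp(−φ), which fixes the constant mode but ignores the Jacobian of the delta-function constraint; the grid, l, and the box size L are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sample_density(l=1.0, L=1.0, n_grid=1000):
    """Draw one Q(x) on a grid over [0, L] from (an approximation to) the prior of equation 4.67."""
    dx = L / n_grid
    # action (l/2) * integral (d phi/dx)^2  =>  independent Gaussian increments of variance dx / l
    phi = np.cumsum(rng.normal(0.0, np.sqrt(dx / l), size=n_grid))
    q = np.exp(-phi)
    return q / (q.sum() * dx)        # the constant shift of phi drops out here

Q = sample_density(l=0.01)           # smaller l gives rougher densities
print(Q.min(), Q.max(), Q.sum() * (1.0 / 1000))   # the last number should be ~1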

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D, φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l₀) exp[−φ(x)], we can write the KL divergence as

D_{\rm KL}[\bar\phi(x)\|\phi(x)] = \frac{1}{l_0}\int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)].   (4.68)

We want to compute the density

\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left( D - D_{\rm KL}[\bar\phi(x)\|\phi(x)] \right)   (4.69)

= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left( MD - M D_{\rm KL}[\bar\phi(x)\|\phi(x)] \right),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar\phi) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{\rm KL})   (4.71)

= M \int \frac{dz}{2\pi}\, \exp\!\left( izMD + \frac{izM}{l_0}\int dx\, \bar\phi(x)\, e^{-\bar\phi(x)} \right) \times \int [d\phi(x)]\, P[\phi(x)]\, \exp\!\left( -\frac{izM}{l_0}\int dx\, \phi(x)\, e^{-\bar\phi(x)} \right).   (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)]\, \exp\!\left( -\frac{izM}{l_0}\int dx\, \phi(x)\, e^{-\bar\phi(x)} \right) = \exp\!\left( -\frac{izM}{l_0}\int dx\, \bar\phi(x)\, e^{-\bar\phi(x)} - S_{\rm eff}[\bar\phi(x); zM] \right),   (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{l}{2}\int dx \left( \frac{\partial\bar\phi}{\partial x} \right)^2 + \frac{1}{2}\left( \frac{izM}{l\, l_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2}\, \exp\!\left( -\frac{B[\bar\phi(x)]}{D} \right),   (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\, l\, l_0}}\, \exp\!\left[ -\frac{l}{2}\int dx\, (\partial_x\bar\phi)^2 \right] \times \int dx\, \exp[-\bar\phi(x)/2],   (4.76)

B[\bar\phi(x)] = \frac{1}{16\, l\, l_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^2.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\sqrt{\frac{1}{l\, l_0}}\, \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi}\, N^{1/2},   (4.78)

where ⟨⋯⟩_φ̄ denotes an average over the target distributions φ̄(x) weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\sqrt{N}\,\left( \frac{L}{l} \right)^{1/2} \quad {\rm bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(l/(N Q(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
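Making the bin counting explicit (our arithmetic): with local bin size ℓ(x) = √(l/(N Q(x))), the number of bins in the range of x is

N_{\rm bins} = \int \frac{dx}{\ell(x)} = \int dx\, \sqrt{\frac{N\,Q(x)}{l}} \;\to\; \sqrt{\frac{N L}{l}} \quad {\rm for}\ Q(x) = 1/L,

which is, up to the 1/(2 ln 2) prefactor, exactly the result of equation 4.79.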

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ~ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities n ~ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter l describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of l (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing l to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). Real calculations for these cases, however, remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.12 Rissanen (1984) and Clarke and Barron (1990), in full generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.
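The (K/2) log_2 N behavior can be checked exactly in the simplest case. The sketch below is our own numerical illustration, not part of the original text: for coin flips with an unknown bias and a uniform prior on that bias (a K = 1 parameter model), the subextensive entropy S_1(N) = S(N) − N S_0 tracks (1/2) log_2 N up to an N-independent constant.

```python
# Exact subextensive entropy for a one-parameter (K = 1) model: coin flips with
# unknown bias theta and a uniform prior on theta.  Our own sketch, meant only
# to illustrate the (K/2) log2 N behavior quoted in the text.
import numpy as np
from math import lgamma, log, log2

S0 = 1.0 / (2.0 * log(2.0))   # extensive entropy per flip (bits), averaged over the prior

def subextensive_entropy(N):
    k = np.arange(N + 1)
    # log marginal probability of any particular sequence with k heads: k!(N-k)!/(N+1)!
    logp = np.array([lgamma(ki + 1) + lgamma(N - ki + 1) - lgamma(N + 2) for ki in k])
    # log of the number of sequences with k heads
    logc = np.array([lgamma(N + 1) - lgamma(ki + 1) - lgamma(N - ki + 1) for ki in k])
    S_N = -np.sum(np.exp(logc + logp) * logp) / log(2.0)   # total entropy S(N), in bits
    return S_N - N * S0

for N in [10, 100, 1000, 10000]:
    print(f"N = {N:6d}   S_1(N) = {subextensive_entropy(N):7.3f}   (1/2) log2 N = {0.5 * log2(N):7.3f}")
# The two columns differ by a constant that does not grow with N.
```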

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986) in particular has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ, the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number 𝒩(N) of walks with N steps, we find that 𝒩(N) ≈ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) ≃ γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
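A small enumeration makes the polymer example concrete. The sketch below is our own illustration: it counts self-avoiding walks on the square lattice by direct search and subtracts the extensive part of the entropy using a commonly quoted literature value of the connective constant z (an assumption on our part, not something computed here); what remains is the slowly growing subextensive piece whose log-divergent coefficient γ is the universal quantity.

```python
# Count self-avoiding walks on the square lattice and strip off the extensive
# part of log2 N(N).  Our own illustration; z ~ 2.638 is the commonly quoted
# connective constant of the square lattice, taken here as an assumption.
from math import log2

def count_saw(n):
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    def walk(pos, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for dx, dy in steps:
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in visited:
                total += walk(nxt, visited | {nxt}, remaining - 1)
        return total
    return walk((0, 0), {(0, 0)}, n)

z = 2.6381585          # literature value for the square lattice
for n in range(2, 11):
    walks = count_saw(n)
    leftover = log2(walks) - n * log2(z)   # subextensive part of the entropy
    print(f"n = {n:2d}   walks = {walks:8d}   subextensive part = {leftover:6.3f} bits")
# In the scaling limit the leftover piece behaves as gamma * log2(n) plus a
# lattice-dependent constant; only the log-divergent coefficient is universal.
```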

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],   (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy, or KL divergence, between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right),   (5.2)

which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

16 Throughout this discussion we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.
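The invariance claim behind equations 5.1 and 5.2 is easy to check numerically. The following sketch is our own illustration (P, Q, and the change of variables are arbitrary choices, and we use a single real variable rather than a whole trajectory): the differential entropy shifts under a monotone change of coordinates, while the relative entropy does not (computed here with the opposite sign convention from equation 5.2, as an ordinary KL divergence).

```python
# Differential entropy is not coordinate invariant; relative entropy is.
# Monte Carlo sketch of our own with arbitrary choices: P = N(0, 1),
# Q = N(0, 4), and the monotone change of variables y = x + x**3.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)               # samples from P

def log2_gauss(x, sigma):
    return (-0.5 * (x / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))) / np.log(2)

logp = log2_gauss(x, 1.0)                               # log2 P(x)
logq = log2_gauss(x, 2.0)                               # log2 Q(x), the reference
jac = np.log2(1.0 + 3.0 * x ** 2)                       # log2 |dy/dx| for y = x + x**3

H_x = -np.mean(logp)                                    # differential entropy in x
H_y = -np.mean(logp - jac)                              # same process, y coordinate
D_x = np.mean(logp - logq)                              # relative entropy, x coordinate
D_y = np.mean((logp - jac) - (logq - jac))              # y coordinate: the Jacobians cancel

print(f"entropy:          {H_x:.3f} bits (x) vs {H_y:.3f} bits (y)  -> not invariant")
print(f"relative entropy: {D_x:.3f} bits (x) vs {D_y:.3f} bits (y)  -> invariant")
```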

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
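As a concrete case where equality is achieved, consider a stationary two-state Markov chain. The sketch below is our own (the transition matrix is an arbitrary choice): the most recent symbol is a sufficient statistic of the past, so compressing the entire past down to that one symbol preserves all of the predictive information, which for a Markov chain saturates at a finite value.

```python
# Predictive information for a stationary two-state Markov chain.  Because the
# chain is Markov, the most recent symbol is a sufficient statistic of the past.
# Our own sketch; the transition matrix below is an arbitrary choice.
import numpy as np
from itertools import product

T = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # T[i, j] = P(next = j | current = i)

# stationary distribution (left eigenvector of T with eigenvalue 1)
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

def seq_prob(first, rest):
    """Probability of the future string 'rest' given the current symbol 'first'."""
    p, s = 1.0, first
    for nxt in rest:
        p *= T[s, nxt]
        s = nxt
    return p

def I_current_vs_future(k):
    """Mutual information (bits) between the current symbol and the next k symbols."""
    info = 0.0
    for rest in product([0, 1], repeat=k):
        p_rest = sum(pi[s] * seq_prob(s, rest) for s in (0, 1))
        for s in (0, 1):
            joint = pi[s] * seq_prob(s, rest)
            if joint > 0:
                info += joint * np.log2(joint / (pi[s] * p_rest))
    return info

for k in [1, 2, 4, 8, 12]:
    print(f"I(current symbol; next {k:2d} symbols) = {I_current_vs_future(k):.4f} bits")
# The values saturate: I_pred(T -> infinity) is a constant for a Markov chain,
# and all of it is captured by the one-symbol summary of the past.
```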


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ∼ A N^ν and one with K parameters, so that I_pred(N) ∼ (K/2) log_2 N, the infinite-parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2 A \nu} \log_2\left( \frac{K}{2A} \right) \right]^{1/\nu}.   (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
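The crossover in equation 6.1 is easy to explore numerically. In the sketch below (our own), K and N are the rough numbers quoted above for inferotemporal cortex, while the amplitude A and the power ν of the nonparametric model are arbitrary stand-ins, since the argument requires only ν < 1.

```python
# Where does the crossover N_c of equation 6.1 fall for the numbers quoted in
# the text?  K ~ 5e9 synapses and N ~ 1e9 examples come from the text; the
# amplitude A and power nu are arbitrary stand-ins (the text only needs nu < 1).
import numpy as np

def I_param(N, K):            # predictive information of a K-parameter model
    return 0.5 * K * np.log2(N)

def I_nonparam(N, A, nu):     # power-law (nonparametric) predictive information
    return A * N ** nu

def N_crossover(K, A, nu):    # equation 6.1
    return (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1.0 / nu)

K, N = 5e9, 1e9
for A, nu in [(1.0, 0.5), (10.0, 0.5), (100.0, 0.9)]:
    Nc = N_crossover(K, A, nu)
    side = "N < N_c" if N < Nc else "N > N_c"
    print(f"A = {A:6.1f}  nu = {nu:.1f}  N_c ~ {Nc:.2e}  ({side});  "
          f"I_param = {I_param(N, K):.2e} bits, I_nonparam = {I_nonparam(N, A, nu):.2e} bits")
# For these choices the biological numbers sit well below N_c, consistent with
# the argument in the text.
```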

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the preceding N letters (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Poschel (1994; see also Poschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
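Shannon's guessing procedure is straightforward to simulate if a statistical model stands in for the native speaker. The sketch below is our own toy version (the corpus and the bigram "reader" are stand-ins, not anything used by Shannon or by Hilberg): candidate letters are ranked by conditional probability, the number of guesses is recorded, and the entropy of the guess-count distribution gives an upper bound on the conditional entropy of the next letter.

```python
# A toy version of Shannon's guessing game.  A bigram model trained on a small
# corpus plays the role of the native speaker (an assumption of ours); the
# entropy of the distribution of guess counts upper-bounds the conditional
# entropy of the next letter given the past.
from collections import Counter, defaultdict
from math import log2

corpus = ("the predictive information measures the amount of structure "
          "in a time series and the complexity of the dynamics that generate it ") * 50
corpus = "".join(c for c in corpus.lower() if c.isalpha() or c == " ")

# train the bigram "reader"
bigrams = defaultdict(Counter)
for a, b in zip(corpus[:-1], corpus[1:]):
    bigrams[a][b] += 1

alphabet = sorted(set(corpus))
guess_counts = Counter()
for a, b in zip(corpus[:-1], corpus[1:]):
    # the reader guesses letters in order of decreasing conditional probability
    ranking = sorted(alphabet, key=lambda c: -bigrams[a][c])
    guess_counts[ranking.index(b) + 1] += 1

total = sum(guess_counts.values())
H_guess = -sum((n / total) * log2(n / total) for n in guess_counts.values())
print(f"entropy of the guess-count distribution: {H_guess:.3f} bits/letter")
# This is an upper bound on the conditional entropy of the next letter; with
# longer contexts the bound decreases, and a slow approach to the extensive
# entropy is what signals a large subextensive component.
```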

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/ followed by the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.

Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.

Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.

Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.

Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.

Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.

Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.

Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.

Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.

Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.

Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.

Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.

Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.

Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.

Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.

Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.

Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.

Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.

Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.

Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.

Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.

Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.

Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.

Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.

Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.

Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.

Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.

Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.

Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.

Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.

Ebeling, W., & Poschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.

Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.

Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.

de Gennes, P.–G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.

Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.

Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.

Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.

Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.

Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.

Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.

Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.

Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.

Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.

Herschkowitz, D., & Nadal, J.–P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).

Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.

Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.

Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.

Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.

Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.

Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.

Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.

Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.

Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.

Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.

Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.

MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.

Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.

Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.

Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.

Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.

Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.

Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.

Poschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.

Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.

Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.

Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.

Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.

Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.

Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.

Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.

Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.

Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.

Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.

Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.

Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.

Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.

Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.

Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.

Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.

Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.

Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.

Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.

Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.

Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.

Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.

Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.

Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

Page 23: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as


of the prior distribution P(α) on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space K should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension d_VC), which measures capacity of the model class, and with the phase-space dimension d, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these dimensions can differ from the number of parameters. One possibility is that d_VC is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of F is zero, and hence both d_VC and d are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with d < d_VC < ∞ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small-D_KL behavior of the model density,

\rho(D \to 0; \bar\alpha) \propto D^{(d-2)/2},   (4.46)

and then d appears in place of K as the coefficient of the log divergence in S_1(N) (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density ρ(D; ᾱ) accumulates a macroscopic weight at some nonzero value of D, so that the small-D behavior is irrelevant for the large-N asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of d_VC (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small-D behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of I_pred measures the phase-space or the scaling dimension d and nothing else. This asymptote is valid, however, only for N ≫ d_VC. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.
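To see concretely how the exponent in equation 4.46 sets the coefficient of the logarithm, one can carry out the integral over D explicitly. The following two lines are a sketch added here for orientation; they assume, as in equation 4.44 (used again at equation 4.66 below), that the annealed subextensive entropy is given to leading order by −⟨log₂ Z(ᾱ; N)⟩_ᾱ, with Z(ᾱ; N) the partition function of equation 4.37:

Z(\bar\alpha; N) = \int dD\, \rho(D;\bar\alpha)\, e^{-ND} \propto \int_0^{\infty} dD\, D^{(d-2)/2}\, e^{-ND} = \Gamma\!\left(\frac{d}{2}\right)\, N^{-d/2},

S_1^{(a)}(N) \simeq -\left\langle \log_2 Z(\bar\alpha; N) \right\rangle_{\bar\alpha} = \frac{d}{2}\,\log_2 N + O(1),

where the O(1) terms collect the prior- and Fisher-dependent constants.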

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution Q(x_1, …, x_N | α). Again, α is an unknown parameter vector, chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log₂ N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

S[\{\vec{x}_i\}\,|\,\alpha] \equiv -\int d^N\vec{x}\; Q(\{\vec{x}_i\}|\alpha)\, \log_2 Q(\{\vec{x}_i\}|\alpha) \;\rightarrow\; N S_0 + S_0^*, \qquad S_0^* = O(1);   (4.47)

D_{KL}\left[ Q(\{\vec{x}_i\}|\bar\alpha) \,\big\|\, Q(\{\vec{x}_i\}|\alpha) \right] \;\rightarrow\; N D_{KL}(\bar\alpha\|\alpha) + o(N).   (4.48)

Here the quantities S_0, S_0^*, and D_KL(ᾱ‖α) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data; all of the long-range predictability is associated with learning the parameters.⁸ The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result

S(N) = S_0 \cdot N + S_1^{(a)}(N),   (4.49)

S_1^{(a)}(N) = \frac{K}{2}\,\log_2 N + \cdots \quad \text{(bits)},   (4.50)

where ··· stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.
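As a quick numerical illustration of equation 4.50 in its simplest setting, consider independent coin flips with an unknown bias drawn from a uniform prior, so that K = 1. This is a sketch we add here, not part of the original argument. The marginal probability of any particular sequence with k heads out of N is k!(N−k)!/(N+1)!, the prior-averaged entropy rate is S_0 = 1/(2 ln 2) bits per flip, and the remainder S(N) − S_0 N should differ from (1/2) log₂ N only by an N-independent constant as N grows.

import numpy as np
from scipy.special import gammaln

def subextensive_entropy(N):
    """S_1(N) = S(N) - S_0*N for Bernoulli data with a uniform prior on the bias."""
    k = np.arange(N + 1)
    # log of the marginal probability of one particular sequence with k heads
    log_p_seq = gammaln(k + 1) + gammaln(N - k + 1) - gammaln(N + 2)
    # log of the number of such sequences, C(N, k)
    log_mult = gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)
    S_N = -np.sum(np.exp(log_mult + log_p_seq) * log_p_seq) / np.log(2)
    S0 = 1.0 / (2.0 * np.log(2))   # prior-averaged entropy rate, bits per flip
    return S_N - S0 * N

for N in (10, 100, 1000, 10000):
    print(N, subextensive_entropy(N), 0.5 * np.log2(N))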

⁸ Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(x_1, …, x_N | α). Suppose, for example, that S_0^* = (K^*/2) log₂ N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2}\,\log_2 N + \cdots \quad \text{(bits)}.   (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K ≠ 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.⁹
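A standard identity makes this trade explicit in the simplest, infinite-range (Curie–Weiss) version of the Ising example; the following lines are an illustration we add here, assuming a ferromagnetic coupling J > 0, and are not part of the original calculation. Using a gaussian (Hubbard–Stratonovich) transformation on the interaction energy,

P(\{\sigma_i\}) \propto \exp\!\left[\frac{J}{2N}\Big(\sum_{i=1}^{N}\sigma_i\Big)^{2}\right] \propto \int dm\; \exp\!\left[-\frac{N J m^2}{2}\right]\,\prod_{i=1}^{N} \exp\!\left[J m\,\sigma_i\right],

so that, conditional on the auxiliary mean field m, the spins are independent: the description with long-range interactions and no parameters (K^* ≠ 0, K = 0) has been rewritten as one with a single unknown parameter m and no residual long-range correlations (K ≠ 0, K^* = 0).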

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

⁹ There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar\alpha\; P(\bar\alpha) \int \prod_{j=1}^{N}\left[ d\vec{x}_j\, Q(\vec{x}_j|\bar\alpha) \right] \times \log_2\left[ \prod_{i=1}^{N} Q(\vec{x}_i|\bar\alpha) \int d^K\alpha\; P(\alpha)\; e^{-N D_{KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})} \right].   (4.52)

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)} = -\int d^K\bar\alpha\; P(\bar\alpha) \int \prod_{j=1}^{N}\left[ d\vec{x}_j\, Q(\vec{x}_j|\bar\alpha) \right] \times \log_2\left[ \int \frac{d^K\alpha\; P(\alpha)}{Z(\bar\alpha;N)}\; e^{-N D_{KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})} \right],   (4.54)

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of −S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that, for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) \le N \int d^K\alpha \int d^K\bar\alpha\; P(\alpha)\, P(\bar\alpha)\, D_{KL}(\bar\alpha\|\alpha) \equiv \left\langle N D_{KL}(\bar\alpha\|\alpha) \right\rangle_{\bar\alpha,\alpha},   (4.55)

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱ,α} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(x | α)} is finite. Then for any reasonable function F(x; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{ \sup_\beta \left| \frac{ \frac{1}{N}\sum_j F(\vec{x}_j;\beta) \;-\; \int d\vec{x}\, Q(\vec{x}|\bar\alpha)\, F(\vec{x};\beta) }{ L[F] } \right| > \epsilon \right\} < M(\epsilon, N, d_{VC})\; e^{-c N \epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality, in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review, see Vapnik, 1998, and the references therein).
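For a single bounded function, with no supremum over β, the familiar Hoeffding inequality is an instance of this form; we quote it here only to make the structure of equation 4.56 concrete (the constants below are those of the standard Hoeffding bound, not of the uniform bound in the text). If 0 ≤ F(x) ≤ L[F] for all x, then

P\left\{ \left| \frac{1}{N}\sum_{j=1}^{N} F(\vec{x}_j) - \int d\vec{x}\, Q(\vec{x}|\bar\alpha)\, F(\vec{x}) \right| > \epsilon\, L[F] \right\} \le 2\, e^{-2N\epsilon^2},

that is, M = 2 and c = 2; the uniform bound over a class of functions with pseudodimension d_VC replaces the prefactor 2 by the subexponentially growing factor M(ε, N, d_VC).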

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {x_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus, the worst-case contribution of these atypical sequences is

S_1^{(f),\,\rm atypical} \sim \int_{\delta}^{\infty} d\epsilon\; N^2 \epsilon^2\, M(\epsilon)\, e^{-cN\epsilon^2} \sim e^{-cN\delta^2} \ll 1 \quad \text{for large } N.   (4.57)


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\,\rm typical} \sim \int d^K\bar\alpha\; P(\bar\alpha)\, d^N\vec{x}\, \prod_j Q(\vec{x}_j|\bar\alpha)\; N\,(B\,A^{-1}B);

B_\mu = \frac{1}{N}\sum_i \frac{\partial \log Q(\vec{x}_i|\bar\alpha)}{\partial\bar\alpha_\mu}, \qquad \langle B\rangle_{\vec{x}} = 0;

(A)_{\mu\nu} = -\frac{1}{N}\sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar\alpha)}{\partial\bar\alpha_\mu\,\partial\bar\alpha_\nu}, \qquad \langle A\rangle_{\vec{x}} = F.   (4.58)

Here ⟨···⟩_x means an averaging with respect to all x_i's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and the second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i|\bar\alpha)}{\partial\bar\alpha_\mu}\; (F^{-1})_{\mu\nu}\; \frac{\partial \log Q(\vec{x}_j|\bar\alpha)}{\partial\bar\alpha_\nu}   (4.59)

over all x_i's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ ··· can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|a) = \frac{1}{\sqrt{2\pi}}\, \exp\left[ -\frac{1}{2}(x-a)^2 \right], \qquad x \in (-\infty; +\infty);   (4.60)

Q(x|a) = \frac{\exp\left( -\sin ax \right)}{\int_0^{2\pi} dx\, \exp\left( -\sin ax \right)}, \qquad x \in [0; 2\pi).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < a < a_max and zero elsewhere, with a_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin ax. Adding a new data point would not help until that best, but wrong, parameter estimate is less than a_max.¹⁰ So the fluctuations are large, and the predictive information is small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of a would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.

¹⁰ Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for a_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.
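The overfitting scenario for the model of equation 4.61 is easy to see numerically. The sketch below is our illustration, not part of the original analysis; the sample sizes, the grid over a, and a_max are arbitrary choices, and the outcome varies with the random seed. For small N, the maximum-likelihood a typically lands at some large spurious value where all the sample points happen to sit near the crests of −sin ax; only as N grows does the maximum settle near the true value.

import numpy as np

rng = np.random.default_rng(0)

def logZ(a, ngrid=4096):
    # log of the normalization integral of exp(-sin(a*x)) on [0, 2*pi)
    x = np.linspace(0.0, 2.0 * np.pi, ngrid, endpoint=False)
    return np.log(2.0 * np.pi * np.mean(np.exp(-np.sin(a * x))))

def sample(a, N):
    # rejection sampling from Q(x|a) on [0, 2*pi); note exp(-sin(a*x)) <= e
    out = []
    while len(out) < N:
        x = rng.uniform(0.0, 2.0 * np.pi, size=4 * N)
        keep = rng.uniform(0.0, np.e, size=x.size) < np.exp(-np.sin(a * x))
        out.extend(x[keep][: N - len(out)])
    return np.array(out)

def ml_estimate(xs, a_max=200.0, na=20000):
    # maximum-likelihood a on a dense grid in (0, a_max]
    grid = np.linspace(a_max / na, a_max, na)
    loglik = np.array([-np.sum(np.sin(a * xs)) - len(xs) * logZ(a) for a in grid])
    return grid[np.argmax(loglik)]

a_true = 2.0
for N in (3, 10, 100):
    xs = sample(a_true, N)
    print(N, ml_estimate(xs))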

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly, in the present context, there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ~ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^{\mu}} \right], \qquad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar\alpha; N) = \int dD\; \rho(D;\bar\alpha)\, \exp[-ND]
\approx A(\bar\alpha) \int dD\, \exp\left[ -\frac{B(\bar\alpha)}{D^{\mu}} - ND \right]
\approx \tilde{A}(\bar\alpha)\, \exp\left[ -\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar\alpha) = A(\bar\alpha)\left[ \frac{2\pi\,\mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} \cdot [B(\bar\alpha)]^{1/(2\mu+2)},   (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)}\left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S_1^{(a)}(N) \approx \frac{1}{\ln 2}\,\left\langle C(\bar\alpha) \right\rangle_{\bar\alpha}\; N^{\mu/(\mu+1)} \quad \text{(bits)},   (4.66)

where ⟨···⟩_ᾱ denotes an average over all the target models.
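Equation 4.63 is easy to check numerically for a single target model. The script below is a sketch; the values μ = 1/2, A = 1, B = 1 are arbitrary illustrative choices. It integrates the partition function directly and compares ln Z with the saddle-point expression built from equations 4.63 through 4.65; the discrepancy should shrink as N grows.

import numpy as np
from scipy.integrate import quad

mu, A, B = 0.5, 1.0, 1.0   # arbitrary illustrative choices

def lnZ_numeric(N):
    # ln Z(N), with Z = A * int dD exp(-B/D**mu - N*D), integrated directly;
    # the exponent is shifted by its value f0 at the saddle to keep the integrand O(1)
    Dstar = (mu * B / N) ** (1.0 / (mu + 1.0))      # saddle point
    f0 = B / Dstar**mu + N * Dstar                  # exponent at the saddle
    val, _ = quad(lambda D: np.exp(f0 - B / D**mu - N * D),
                  1e-6 * Dstar, 50.0 * Dstar, points=[Dstar], limit=200)
    return np.log(A) + np.log(val) - f0

def lnZ_saddle(N):
    # saddle-point form of equations 4.63-4.65
    A_tilde = A * np.sqrt(2.0 * np.pi * mu ** (1.0 / (mu + 1.0)) / (mu + 1.0)) \
              * B ** (1.0 / (2.0 * mu + 2.0))
    C = B ** (1.0 / (mu + 1.0)) * (mu ** (-mu / (mu + 1.0)) + mu ** (1.0 / (mu + 1.0)))
    return np.log(A_tilde) - 0.5 * (mu + 2.0) / (mu + 1.0) * np.log(N) \
           - C * N ** (mu / (mu + 1.0))

for N in (10, 100, 1000, 10000):
    print(N, lnZ_numeric(N), lnZ_saddle(N))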


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively, and quantitatively, much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ~ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ~ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ~ K/2N bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ~ N^{ν−1} bits. Summing this up over all samples, we find S_1^{(a)} ~ N^ν, and if we let ν = μ/(μ + 1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ + 1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
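The bookkeeping behind this estimate can be made explicit with an elementary sum (a step we add for completeness; K(n) ~ n^ν is the parameter count in force when the nth sample arrives):

S_1^{(a)}(N) \sim \sum_{n=1}^{N} \frac{K(n)}{2n}\,\log_2 e \sim \sum_{n=1}^{N} n^{\nu-1} \simeq \frac{N^{\nu}}{\nu}, \qquad 0 < \nu < 1,

so a parameter count growing as N^ν yields a subextensive entropy growing as N^ν rather than as N^ν log N.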

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/ℓ₀) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z}\, \exp\left[ -\frac{\lambda}{2}\int dx\, \left( \frac{\partial\phi}{\partial x} \right)^2 \right]\, \delta\!\left[ \frac{1}{\ell_0}\int dx\, e^{-\phi(x)} - 1 \right],   (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and λ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, …, x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹

¹¹ We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D, φ̄) in the model defined by equation 4.67.

With Q(x) = (1/ℓ₀) exp[−φ(x)], we can write the KL divergence as

D_{KL}[\bar\phi(x)\|\phi(x)] = \frac{1}{\ell_0}\int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)].   (4.68)

We want to compute the density

\rho(D;\bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( D - D_{KL}[\bar\phi(x)\|\phi(x)] \big)   (4.69)

= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( MD - M D_{KL}[\bar\phi(x)\|\phi(x)] \big),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D;\bar\phi) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{KL})   (4.71)

= M \int \frac{dz}{2\pi}\, \exp\left( izMD + \frac{izM}{\ell_0}\int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] \right) \times \int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{\ell_0}\int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right).   (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{\ell_0}\int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right)
= \exp\left( -\frac{izM}{\ell_0}\int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM] \right),   (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{\lambda}{2}\int dx\, \left( \frac{\partial\bar\phi}{\partial x} \right)^2 + \frac{1}{2}\left( \frac{izM}{\lambda\ell_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2}\, \exp\left( -\frac{B[\bar\phi(x)]}{D} \right),   (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\,\lambda\ell_0}}\, \exp\left[ -\frac{\lambda}{2}\int dx\, \left( \partial_x\bar\phi \right)^2 \right] \times \int dx\, \exp[-\bar\phi(x)/2],   (4.76)

B[\bar\phi(x)] = \frac{1}{16\lambda\ell_0}\left( \int dx\, \exp[-\bar\phi(x)/2] \right)^2.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2\ln 2\, \sqrt{\lambda\ell_0}}\, \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi}\; N^{1/2},   (4.78)

where ⟨···⟩_φ̄ denotes an average over the target distributions φ̄(x) weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \approx \frac{1}{2\ln 2}\, \sqrt{N}\, \left( \frac{L}{\lambda} \right)^{1/2} \quad \text{bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure, in which we divide the range of x into bins of local size √(λ/(N Q(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ~√N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities, n ~ √N, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter λ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of λ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing λ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information (all have power-law growth, smoother functions have smaller powers, with less to learn, and higher-dimensional problems have larger powers, with more to learn), but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹² Rissanen (1984) and Clarke and Barron (1990) in full generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

¹² Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, …, x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, …, x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, …, x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log₂ N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or of not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.¹³ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.¹⁴ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

¹³ The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

¹⁴ This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;¹⁵ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

¹⁵ Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ, the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.¹⁶ So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

¹⁶ Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number 𝒩(N) of walks with N steps, we find that 𝒩(N) ~ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log₂ z, a constant subextensive entropy log₂ A, and a divergent subextensive term S_1(N) → γ log₂ N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

$$ S_{\rm cont} = -\int dx(t)\; P[x(t)]\,\log_2 P[x(t)], \tag{5.1} $$

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy, or KL divergence, between the distribution P[x(t)] and some reference distribution Q[x(t)],

$$ S_{\rm rel} = -\int dx(t)\; P[x(t)]\,\log_2\!\left(\frac{P[x(t)]}{Q[x(t)]}\right), \tag{5.2} $$

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].
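A minimal numerical sketch of this invariance (our illustration, not part of the paper): under a change of variables y = 3x, a grid-based estimate of the differential entropy shifts by log_2 3, while the relative entropy between two distributions is unchanged. The two gaussian distributions and the factor of 3 are arbitrary choices.

    import numpy as np
    from scipy.stats import norm

    x = np.linspace(-10, 10, 200001)
    dx = x[1] - x[0]
    p = norm.pdf(x, loc=0.0, scale=1.0)       # "true" distribution P
    q = norm.pdf(x, loc=0.5, scale=2.0)       # reference distribution Q

    def entropy(p, dx):
        return -np.sum(p * np.log2(p + 1e-300)) * dx

    def rel_entropy(p, q, dx):
        return np.sum(p * np.log2((p + 1e-300) / (q + 1e-300))) * dx

    # change of coordinates y = 3x (Jacobian dy/dx = 3); densities transform as p(y) = p(x)/3
    y, dy = 3.0 * x, 3.0 * dx
    p_y, q_y = p / 3.0, q / 3.0

    print(entropy(p, dx), entropy(p_y, dy))                    # differ by log2(3): not invariant
    print(rel_entropy(p, q, dx), rel_entropy(p_y, q_y, dy))    # identical: invariant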

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have


seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about the dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem) we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
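A toy example of such lossless compression, sketched below under assumptions not taken from the paper: for a binary first-order Markov chain, the most recent symbol is a sufficient statistic for prediction, so replacing the whole past window by its last symbol preserves all of the predictive information. The transition probability and window lengths are arbitrary, and the check is done by exact enumeration.

    import itertools
    import numpy as np

    p_stay = 0.9                      # assumed transition probability P(x_{t+1} = x_t)
    T_past, T_future = 3, 3           # window lengths (kept small so we can enumerate)

    def seq_prob(seq):
        """Probability of a symbol sequence under the stationary symmetric Markov chain."""
        p = 0.5                       # stationary distribution is uniform by symmetry
        for a, b in zip(seq, seq[1:]):
            p *= p_stay if a == b else 1.0 - p_stay
        return p

    # joint distribution over (past window, immediately following future window)
    joint = {}
    for s in itertools.product([0, 1], repeat=T_past + T_future):
        joint[(s[:T_past], s[T_past:])] = seq_prob(s)

    def mutual_information(joint, map_past=lambda p: p):
        """I(f(past); future) in bits for a joint dict {(past, future): prob}."""
        pxy, px, py = {}, {}, {}
        for (past, fut), pr in joint.items():
            key = map_past(past)
            pxy[(key, fut)] = pxy.get((key, fut), 0.0) + pr
            px[key] = px.get(key, 0.0) + pr
            py[fut] = py.get(fut, 0.0) + pr
        return sum(pr * np.log2(pr / (px[k] * py[f]))
                   for (k, f), pr in pxy.items() if pr > 0)

    I_full = mutual_information(joint)                            # I(x_past; x_future)
    I_last = mutual_information(joint, map_past=lambda p: p[-1])  # I(last symbol; x_future)
    print(I_full, I_last)   # the two values agree (up to floating-point rounding)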


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world (as suggested by Attneave, 1954, Barlow, 1959, 1961, and others; Atick, 1992), the nervous system provides an efficient representation of the predictive information.^17 It should be possible to test this directly by

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,^18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods: What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ∼ A N^ν and one with K parameters so that I_pred(N) ∼ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

$$ N_c \approx \left[ \frac{K}{2A\nu}\,\log_2\frac{K}{2A} \right]^{1/\nu}. \tag{6.1} $$

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5×10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
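A back-of-the-envelope check of this argument (ours, not the authors'), using equation 6.1 with the numbers quoted above, K ∼ 5×10⁹ and N ∼ 10⁹: since the amplitude A and exponent ν of the hypothetical nonparametric model are unknown, the sketch simply scans a few illustrative values.

    import numpy as np

    K, N = 5e9, 1e9

    def N_c(K, A, nu):
        """Critical number of samples from equation 6.1."""
        return (K / (2.0 * A * nu) * np.log2(K / (2.0 * A))) ** (1.0 / nu)

    # for these illustrative choices, N stays below N_c, consistent with the text
    for nu in [0.25, 0.5, 0.75]:
        for A in [1.0, 1e3]:
            print(f"nu={nu}, A={A:g}: N_c={N_c(K, A, nu):.3g}, N < N_c: {N < N_c(K, A, nu)}")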

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.^19
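The sketch below illustrates the logic of Shannon's guessing experiment, with a simple empirical n-gram model standing in for the human reader; it is our illustration, not a reconstruction of Shannon's or Hilberg's analysis. The file name is hypothetical, and because the same text trains the model and is then "guessed," the resulting numbers are optimistic in-sample bounds; a real experiment would use held-out text or human subjects.

    from collections import Counter, defaultdict
    import numpy as np

    def guess_rank_entropy(text, N):
        """Entropy (bits/letter) of Shannon guess ranks for context length N."""
        context_counts = defaultdict(Counter)
        for i in range(len(text) - N):
            context_counts[text[i:i+N]][text[i+N]] += 1   # empirical model of P(next | context)

        ranks = Counter()
        for i in range(len(text) - N):
            ctx, nxt = text[i:i+N], text[i+N]
            # letters ordered by decreasing estimated probability = order of guesses
            ordered = [c for c, _ in context_counts[ctx].most_common()]
            ranks[ordered.index(nxt) + 1] += 1

        total = sum(ranks.values())
        p = np.array([v / total for v in ranks.values()])
        return -np.sum(p * np.log2(p))

    text = open("sample_text.txt").read().lower()   # hypothetical corpus file
    for N in [0, 1, 2, 3, 4]:
        # the bound on the conditional entropy decreases as the context length grows
        print(N, guess_rank_entropy(text, N))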

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/, followed by the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An Introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



Q(x_1, ..., x_N | α). Again, α is an unknown parameter vector, chosen randomly at the beginning of the series. If α is a K-dimensional vector, then one still tries to learn just K numbers, and there are still N examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by (K/2) log_2 N, provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution Q is a finite-order Markov model. In fact, all we really need are the following two conditions:

$$ S\left[\{\vec{x}_i\}\,|\,\alpha\right] \equiv -\int d^N\vec{x}\;\, Q(\{\vec{x}_i\}\,|\,\alpha)\,\log_2 Q(\{\vec{x}_i\}\,|\,\alpha) \;\rightarrow\; N S_0 + S_0^{*}, \qquad S_0^{*} = O(1), \tag{4.47} $$
$$ D_{\rm KL}\!\left[ Q(\{\vec{x}_i\}\,|\,\bar{\alpha}) \,\big\|\, Q(\{\vec{x}_i\}\,|\,\alpha) \right] \;\rightarrow\; N D_{\rm KL}(\bar{\alpha}\|\alpha) + o(N). \tag{4.48} $$

Here the quantities S_0, S_0^*, and D_KL(ᾱ‖α) are defined by taking limits N → ∞ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if α is known, there are no long-range correlations in the data. All of the long-range predictability is associated with learning the parameters.^8 The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between K, d, and d_VC, one arrives at the result:

$$ S(N) = S_0 \cdot N + S_1^{(a)}(N), \tag{4.49} $$
$$ S_1^{(a)}(N) = \frac{K}{2}\,\log_2 N + \cdots \quad \mbox{(bits)}, \tag{4.50} $$

where ··· stands for terms of order one as N → ∞. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.
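A small numerical check of this logarithmic growth (our addition, with an assumed uniform prior): for coin flips with an unknown bias drawn from a Beta(1,1) prior, the exact sequence entropy S(N) can be summed over the number of heads, and subtracting the extensive part S_0 N leaves a term growing as (1/2) log_2 N, as expected for K = 1.

    import numpy as np
    from scipy.special import gammaln

    def S_total(N):
        """Entropy (bits) of N exchangeable coin flips with bias q ~ Uniform(0,1)."""
        n1 = np.arange(N + 1)
        # log probability of one particular sequence with n1 heads: B(n1+1, N-n1+1)
        log_p_seq = gammaln(n1 + 1) + gammaln(N - n1 + 1) - gammaln(N + 2)
        log_mult = gammaln(N + 1) - gammaln(n1 + 1) - gammaln(N - n1 + 1)   # log C(N, n1)
        return -np.sum(np.exp(log_mult + log_p_seq) * log_p_seq) / np.log(2)

    # extensive entropy rate: prior-averaged entropy of a single flip,
    # S0 = < -q log2 q - (1-q) log2(1-q) > over q ~ Uniform(0,1) = 1/(2 ln 2) bits
    S0 = 1.0 / (2.0 * np.log(2.0))

    for N in [10, 100, 1000, 10000]:
        S1 = S_total(N) - S0 * N
        # S1 grows like (K/2) log2 N with K = 1, plus an N-independent constant
        print(N, S1, 0.5 * np.log2(N))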

8 Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes 1/f noise, then the condition may not be met. For more on long-range correlations, see below.


It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(x_1, ..., x_N | α). Suppose, for example, that S_0^* = (K^*/2) log_2 N. Then the subextensive entropy becomes

$$ S_1^{(a)}(N) = \frac{K + K^{*}}{2}\,\log_2 N + \cdots \quad \mbox{(bits)}. \tag{4.51} $$

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters there are no long-range correlations (K ≠ 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.^9

4.4 Taming the Fluctuations

The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27 and 4.29 and the definition of entropy, we can write the entropy S(N) exactly as

$$ S(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \int \prod_{j=1}^{N}\left[ d\vec{x}_j\, Q(\vec{x}_j\,|\,\bar{\alpha})\right] \times \log_2\!\left[\, \prod_{i=1}^{N} Q(\vec{x}_i\,|\,\bar{\alpha}) \int d^K\alpha\, P(\alpha)\, e^{-N D_{\rm KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})} \right]. \tag{4.52} $$

This expression can be decomposed into the terms identified above plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

$$ S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N), \tag{4.53} $$
$$ S_1^{(f)} = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \prod_{j=1}^{N}\left[ d\vec{x}_j\, Q(\vec{x}_j\,|\,\bar{\alpha})\right] \times \log_2\!\left[ \int \frac{d^K\alpha\, P(\alpha)}{Z(\bar{\alpha}; N)}\; e^{-N D_{\rm KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})} \right], \tag{4.54} $$

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that, for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

$$ S_1(N) \le N \int d^K\alpha \int d^K\bar{\alpha}\; P(\alpha)\, P(\bar{\alpha})\, D_{\rm KL}(\bar{\alpha}\|\alpha) \equiv \big\langle N D_{\rm KL}(\bar{\alpha}\|\alpha) \big\rangle_{\bar{\alpha},\alpha}, \tag{4.55} $$

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average


KL divergence, or of similar quantities, is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱ,α} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(x | α)} is finite. Then for any reasonable function F(x; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

$$ P\left\{ \sup_{\beta}\; \frac{\left|\, \frac{1}{N}\sum_j F(\vec{x}_j;\beta) \,-\, \int d\vec{x}\; Q(\vec{x}\,|\,\bar{\alpha})\, F(\vec{x};\beta)\, \right|}{L[F]} > \epsilon \right\} < M(\epsilon, N, d_{\rm VC})\; e^{-c N \epsilon^2}, \tag{4.56} $$

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality, in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {x_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus, the worst-case contribution of these atypical sequences is

$$ S_1^{(f),\,{\rm atypical}} \sim \int_{\delta}^{\infty} d\epsilon\; N^2 \epsilon^2\, M(\epsilon)\, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \quad \mbox{for large } N. \tag{4.57} $$


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

$$ S_1^{(f),\,{\rm typical}} \sim \int d^K\bar{\alpha}\, P(\bar{\alpha}) \int d^N\vec{x}\, \prod_{j} Q(\vec{x}_j\,|\,\bar{\alpha})\;\; N\,\big(B\, A^{-1} B\big); $$
$$ B_{\mu} = \frac{1}{N}\sum_{i} \frac{\partial \log Q(\vec{x}_i\,|\,\bar{\alpha})}{\partial\bar{\alpha}_{\mu}}, \qquad \langle B \rangle_{\vec{x}} = 0; $$
$$ (A)_{\mu\nu} = -\frac{1}{N}\sum_{i} \frac{\partial^2 \log Q(\vec{x}_i\,|\,\bar{\alpha})}{\partial\bar{\alpha}_{\mu}\,\partial\bar{\alpha}_{\nu}}, \qquad \langle A \rangle_{\vec{x}} = F. \tag{4.58} $$

Here ⟨···⟩_x means an averaging with respect to all the x_i's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

$$ \frac{\partial \log Q(\vec{x}_i\,|\,\bar{\alpha})}{\partial\bar{\alpha}_{\mu}}\; \big(F^{-1}\big)_{\mu\nu}\; \frac{\partial \log Q(\vec{x}_j\,|\,\bar{\alpha})}{\partial\bar{\alpha}_{\nu}} \tag{4.59} $$

over all the x_i's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite


structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ ··· can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪_n C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

$$ Q(x\,|\,\alpha) = \frac{1}{\sqrt{2\pi}}\,\exp\!\left[ -\frac{1}{2}(x-\alpha)^2 \right], \qquad x \in (-\infty, +\infty); \tag{4.60} $$
$$ Q(x\,|\,\alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \qquad x \in [0, 2\pi). \tag{4.61} $$

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < α_max and zero elsewhere, with α_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin αx. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than α_max.^{10} So the fluctuations are large, and the predictive information is

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for α_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point, the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, & Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
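The following rough simulation (ours; all numbers are illustrative) shows the fluctuation problem described above for the model of equation 4.61: with few samples and a wide prior box, the likelihood over α is typically maximized at a spurious large value that places the observed points near the crests of −sin(αx), and only with many samples does the estimate settle near the true α.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha_true = 2.0
    alpha_grid = np.linspace(0.05, 200.0, 5000)          # prior "box" 0 < alpha < alpha_max
    xs = np.linspace(0.0, 2 * np.pi, 4000, endpoint=False)
    # normalization log Z(alpha) for every alpha on the grid
    logZ = np.array([np.log(np.mean(np.exp(-np.sin(a * xs))) * 2 * np.pi) for a in alpha_grid])

    def sample_Q(alpha, n):
        """Rejection sampling from Q(x|alpha) proportional to exp(-sin(alpha x)) on [0, 2*pi)."""
        out = []
        while len(out) < n:
            x = rng.uniform(0.0, 2 * np.pi)
            if rng.uniform(0.0, np.e) < np.exp(-np.sin(alpha * x)):
                out.append(x)
        return np.array(out)

    for N in [3, 10, 30, 300]:
        x = sample_Q(alpha_true, N)
        loglik = -np.sin(np.outer(alpha_grid, x)).sum(axis=1) - N * logZ
        # the maximum-likelihood alpha is often far from alpha_true for small N,
        # and approaches it only once N is large
        print(N, alpha_grid[np.argmax(loglik)])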

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations

The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned and, correspondingly, the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which


the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that, beyond the class of finitely parameterizable problems, there is a class of regularized infinite dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

$$ \rho(D \to 0) \approx A \exp\!\left[ -\frac{B}{D^{\mu}} \right], \qquad \mu > 0, \tag{4.62} $$

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

$$ Z(\bar{\alpha}; N) = \int dD\, \rho(D;\bar{\alpha})\, \exp[-N D] $$
$$ \approx A(\bar{\alpha}) \int dD\, \exp\!\left[ -\frac{B(\bar{\alpha})}{D^{\mu}} - N D \right] $$
$$ \approx \tilde{A}(\bar{\alpha})\, \exp\!\left[ -\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N \;-\; C(\bar{\alpha})\, N^{\mu/(\mu+1)} \right], \tag{4.63} $$

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

$$ \tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[ \frac{2\pi\,\mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} \cdot [B(\bar{\alpha})]^{1/(2\mu+2)}, \tag{4.64} $$
$$ C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right]. \tag{4.65} $$

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N:

$$ S_1^{(a)}(N) \rightarrow \frac{1}{\ln 2}\,\langle C(\bar{\alpha}) \rangle_{\bar{\alpha}}\; N^{\mu/(\mu+1)} \quad \mbox{(bits)}, \tag{4.66} $$

where ⟨···⟩_ᾱ denotes an average over all the target models.


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively, and quantitatively, much richer and more complex.
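A direct numerical check of the saddle-point result (our addition, with arbitrary illustrative constants A, B, and μ): computing Z(N) = ∫ dD A exp(−B/D^μ) e^{−ND} on a grid and comparing −ln Z(N) with C N^{μ/(μ+1)}, where C is given by equation 4.65.

    import numpy as np

    A, B, mu = 1.0, 2.0, 1.0          # illustrative constants in rho(D) = A exp(-B / D^mu)

    def logZ(N):
        """log of Z(N) = int dD A exp(-B/D^mu) exp(-N D), evaluated on a grid around the saddle."""
        Dstar = (mu * B / N) ** (1.0 / (mu + 1.0))       # saddle point of the exponent
        D = np.linspace(Dstar / 50.0, Dstar * 50.0, 200001)
        logf = np.log(A) - B / D**mu - N * D
        m = logf.max()                                    # factor out the peak to avoid underflow
        return m + np.log(np.sum(np.exp(logf - m)) * (D[1] - D[0]))

    # coefficient C from equation 4.65 (with B independent of the target model here)
    C = B ** (1.0 / (mu + 1.0)) * (mu ** (-mu / (mu + 1.0)) + mu ** (1.0 / (mu + 1.0)))

    for N in [10, 100, 1000, 10000, 100000]:
        # the ratio approaches C slowly, with logarithmic corrections from the prefactor
        print(N, -logZ(N) / N ** (mu / (mu + 1.0)), C)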

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^(ν−1) bits. Summing this up over all samples, we find S_1^(a) ∼ N^ν, and if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing numbers of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
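The per-sample counting in the preceding paragraph is easy to check in a few lines (our illustration; the exponent ν is an arbitrary choice): summing contributions of ∼ n^(ν−1)/2 bits per sample reproduces the N^ν growth rather than N^ν log N.

# If the n-th sample contributes ~ K(n)/(2n) bits with K(n) = n^nu, the accumulated
# subextensive entropy grows as N^nu / (2 nu), i.e., a pure power law.
import numpy as np

nu = 0.5                                   # arbitrary illustrative exponent, 0 < nu < 1
N_values = [10**2, 10**3, 10**4, 10**5, 10**6]
n = np.arange(1, N_values[-1] + 1, dtype=float)
increments = 0.5 * n ** (nu - 1.0)         # ~K(n)/(2n) bits carried by the n-th sample
cumulative = np.cumsum(increments)

for N in N_values:
    print(f"N={N:8d}   summed bits = {cumulative[N - 1]:10.2f}   "
          f"N^nu/(2 nu) = {N**nu / (2 * nu):10.2f}")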

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning: in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description (effectively increasing the number of parameters) as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/ℓ₀) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

\[
P[\phi(x)] = \frac{1}{\mathcal{Z}} \exp\!\left[-\frac{\ell}{2}\int dx\, \left(\frac{\partial\phi}{\partial x}\right)^{2}\right] \delta\!\left[\frac{1}{\ell_0}\int dx\, e^{-\phi(x)} - 1\right], \tag{4.67}
\]

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x₁, x₂, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/ℓ₀) exp[−φ(x)], we can write the KL divergence as
\[
D_{\mathrm{KL}}[\bar\phi(x)\|\phi(x)] = \frac{1}{\ell_0}\int dx\, e^{-\bar\phi(x)}\,[\phi(x) - \bar\phi(x)]. \tag{4.68}
\]

We want to compute the density
\[
\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\bigl(D - D_{\mathrm{KL}}[\bar\phi(x)\|\phi(x)]\bigr) \tag{4.69}
\]
\[
= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\bigl(MD - M D_{\mathrm{KL}}[\bar\phi(x)\|\phi(x)]\bigr), \tag{4.70}
\]

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\[
\rho(D; \bar\phi) = M \int \frac{dz}{2\pi}\, e^{izMD} \int [d\phi(x)]\, P[\phi(x)]\, e^{-izM D_{\mathrm{KL}}} \tag{4.71}
\]
\[
= M \int \frac{dz}{2\pi}\, \exp\!\left(izMD + \frac{izM}{\ell_0}\int dx\, \bar\phi(x)\, e^{-\bar\phi(x)}\right)
\times \int [d\phi(x)]\, P[\phi(x)]\, \exp\!\left(-\frac{izM}{\ell_0}\int dx\, \phi(x)\, e^{-\bar\phi(x)}\right). \tag{4.72}
\]

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

\[
\int [d\phi(x)]\, P[\phi(x)]\, \exp\!\left(-\frac{izM}{\ell_0}\int dx\, \phi(x)\, e^{-\bar\phi(x)}\right)
= \exp\!\left(-\frac{izM}{\ell_0}\int dx\, \bar\phi(x)\, e^{-\bar\phi(x)} - S_{\mathrm{eff}}[\bar\phi(x); zM]\right), \tag{4.73}
\]
\[
S_{\mathrm{eff}}[\bar\phi; zM] = \frac{\ell}{2}\int dx\, \left(\frac{\partial\bar\phi}{\partial x}\right)^{2} + \frac{1}{2}\left(\frac{izM}{\ell\ell_0}\right)^{1/2} \int dx\, e^{-\bar\phi(x)/2}. \tag{4.74}
\]

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^(3/2) ≫ 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\[
\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\!\left(-\frac{B[\bar\phi(x)]}{D}\right), \tag{4.75}
\]
\[
A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\,\ell\ell_0}}\, \exp\!\left[-\frac{\ell}{2}\int dx\, \bigl(\partial_x\bar\phi\bigr)^{2}\right] \int dx\, e^{-\bar\phi(x)/2}, \tag{4.76}
\]
\[
B[\bar\phi(x)] = \frac{1}{16\,\ell\ell_0} \left(\int dx\, e^{-\bar\phi(x)/2}\right)^{2}. \tag{4.77}
\]
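As a consistency check (ours; the value of B below is arbitrary), note that the Laplace transform of this density has the closed form ∫₀^∞ dD D^(−3/2) exp(−B/D − ND) = √(π/B) exp(−2√(BN)), so −ln Z still grows as 2√(BN), exactly the μ = 1 behavior of the previous section; the sketch below verifies this numerically.

# The D^(-3/2) prefactor does not alter the leading sqrt(N) growth:
# int_0^inf dD D^(-3/2) exp(-B/D - N*D) = sqrt(pi/B) * exp(-2*sqrt(B*N)),
# so -ln Z = 2*sqrt(B*N) + const.  B is an arbitrary illustrative value.
import numpy as np
from scipy.integrate import quad

B = 0.25

def minus_log_Z(N):
    f = lambda D: B / D + N * D + 1.5 * np.log(D)   # full exponent, incl. prefactor
    Dstar = np.sqrt(B / N)                          # saddle point of B/D + N*D
    f0 = f(Dstar)
    val, _ = quad(lambda D: np.exp(-(f(D) - f0)),
                  1e-2 * Dstar, 50 * Dstar, points=[Dstar], limit=200)
    return f0 - np.log(val)

for N in [1e2, 1e3, 1e4, 1e5]:
    exact = 2 * np.sqrt(B * N) - 0.5 * np.log(np.pi / B)
    print(f"N={N:7.0e}   numerical -ln Z = {minus_log_Z(N):9.3f}   "
          f"closed form = {exact:9.3f}")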

Except for the factor of D^(−3/2), this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^(−3/2) prefactor does not change the leading large-N behavior of the predictive information, and we find that

\[
S_1^{(a)}(N) \sim \frac{1}{2\ln 2\,\sqrt{\ell\ell_0}} \left\langle \int dx\, e^{-\bar\phi(x)/2} \right\rangle_{\bar\phi} N^{1/2}, \tag{4.78}
\]

where ⟨···⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

\[
S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\,\sqrt{N}\left(\frac{L}{\ell}\right)^{1/2} \ \text{bits}. \tag{4.79}
\]

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size ∼ √(ℓ/(N Q̄(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
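A short numerical illustration of this bin counting (ours, under the local bin-size scaling quoted above; the density Q(x) and the smoothness scale ℓ are arbitrary choices): the number of bins, ∫ dx √(N Q(x)/ℓ), grows as √N and is bounded by √(NL/ℓ), the value reached for the uniform distribution.

# Count adaptive bins of local size ~ sqrt(l / (N Q(x))): the number of bins is
# sqrt(N / l) * integral sqrt(Q(x)) dx, which grows as sqrt(N), matching the
# N^(1/2) growth of the predictive information in equation 4.79.
import numpy as np

L, l = 1.0, 0.01
x = np.linspace(0.0, L, 2001)
dx = x[1] - x[0]
Q = 1.0 + 0.5 * np.cos(2 * np.pi * x / L)       # a smooth illustrative density on [0, L]
Q /= np.sum(Q) * dx                              # normalize

for N in [10**3, 10**4, 10**5, 10**6]:
    n_bins = np.sum(np.sqrt(N * Q / l)) * dx     # integral dx * sqrt(N Q(x) / l)
    print(f"N={N:8d}   adaptive bins ~ {n_bins:10.1f}   "
          f"sqrt(N L / l) = {np.sqrt(N * L / l):10.1f}")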

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities n ∼ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). Real calculations for these cases, however, remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹²

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


Rissanen (1984) and Clarke and Barron (1990), in full generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x₁, x₂, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x₁, x₂, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x₁, x₂, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log₂ N bits to the leading order.
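For the simplest case, K = 1, this (K/2) log₂ N behavior can be seen directly. The sketch below (our own check, for i.i.d. Bernoulli observations with a uniform prior on the single parameter) computes the entropy of the Bayesian joint distribution of N observations exactly and subtracts the extensive part, N⟨H₂(θ)⟩ = N/(2 ln 2) bits; the remainder approaches (1/2) log₂ N plus a constant.

# Subextensive entropy for a one-parameter (K = 1) model class: i.i.d. Bernoulli(theta)
# with a uniform prior.  The Bayes-mixture probability of a particular sequence with
# n1 ones among N tosses is n1! (N - n1)! / (N + 1)!, so the joint entropy S(N) can be
# summed exactly.  Subtracting the extensive part N / (2 ln 2) bits leaves roughly
# (1/2) log2 N + const, i.e., (K/2) log2 N with K = 1.
import math

def joint_entropy_bits(N):
    S = 0.0
    for n1 in range(N + 1):
        # natural log of the probability of one sequence with n1 ones
        logp = (math.lgamma(n1 + 1) + math.lgamma(N - n1 + 1) - math.lgamma(N + 2))
        # natural log of the number of such sequences
        logmult = math.lgamma(N + 1) - math.lgamma(n1 + 1) - math.lgamma(N - n1 + 1)
        S -= math.exp(logmult + logp) * logp
    return S / math.log(2.0)

for N in [10, 100, 1000, 10000]:
    S1 = joint_entropy_bits(N) - N / (2.0 * math.log(2.0))   # subtract extensive part
    print(f"N={N:6d}   subextensive part = {S1:7.3f}   "
          f"(1/2) log2 N = {0.5 * math.log2(N):7.3f}")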

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy), so that the complexity is zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or of not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S₀ and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.
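For orientation, this finite-excess-entropy class is easy to exhibit concretely. The sketch below (our illustration, for a binary symmetric Markov chain with an arbitrary flip probability) computes block entropies S(T) from the transition matrix and shows that the subextensive part S(T) − T S₀ saturates at the excess entropy, here 1 − H₂(p) bits, rather than diverging; for a first-order chain the saturation is immediate, while longer-memory processes approach their excess entropy more slowly.

# Block entropies of a binary symmetric Markov chain started in its stationary state:
# the subextensive part S(T) - T*S0 equals the excess entropy 1 - H2(p) bits,
# illustrating the Ipred(T -> infinity) = constant class.
# The flip probability p is an arbitrary illustrative choice.
import itertools
import numpy as np

p = 0.1                                     # probability of flipping the symbol
P = np.array([[1 - p, p], [p, 1 - p]])      # transition matrix
pi = np.array([0.5, 0.5])                   # stationary distribution

def block_entropy_bits(T):
    S = 0.0
    for s in itertools.product([0, 1], repeat=T):
        prob = pi[s[0]]
        for a, b in zip(s[:-1], s[1:]):
            prob *= P[a, b]
        S -= prob * np.log2(prob)
    return S

S0 = -(1 - p) * np.log2(1 - p) - p * np.log2(p)   # entropy rate, bits per symbol
for T in range(1, 13):
    print(f"T={T:2d}   S(T) - T*S0 = {block_entropy_bits(T) - T * S0:.6f}   "
          f"(excess entropy = {1 - S0:.6f})")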

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure the complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.¹³ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.¹⁴ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;¹⁵ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.
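The division of invariance between the extensive and subextensive pieces can be illustrated with the simplest continuous example we know of (a correlated Gaussian pair standing in for two segments of the data stream; this sketch is ours, and the correlation and scale factor are arbitrary choices): rescaling the signal shifts its differential entropy but leaves the mutual information untouched.

# Differential entropy is not coordinate invariant, but mutual information is.
# For jointly Gaussian variables, h(X) = (1/2) log2(2 pi e Var[X]) and
# I(X;Y) = -(1/2) log2(1 - rho^2); rescaling X -> a*X shifts h by log2|a|
# while leaving I unchanged.
import numpy as np

rng = np.random.default_rng(0)
rho, a = 0.8, 10.0
cov = [[1.0, rho], [rho, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=200_000).T

def gaussian_entropy_bits(z):
    return 0.5 * np.log2(2 * np.pi * np.e * np.var(z))

def gaussian_mi_bits(z, w):
    r = np.corrcoef(z, w)[0, 1]
    return -0.5 * np.log2(1 - r**2)

for label, z in [("x", x), ("a*x", a * x)]:
    print(f"{label:4s}  h = {gaussian_entropy_bits(z):6.3f} bits   "
          f"I(.;y) = {gaussian_mi_bits(z, y):6.3f} bits")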

The fact that the extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms?

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations.¹⁶ On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations. So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number N(N) of walks with N steps, we find that N(N) ∼ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S₀ = log₂ z, a constant subextensive entropy log₂ A, and a divergent subextensive term S₁(N) → γ log₂ N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as
\[
S_{\mathrm{cont}} = -\int dx(t)\, P[x(t)]\, \log_2 P[x(t)], \tag{5.1}
\]
but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],
\[
S_{\mathrm{rel}} = -\int dx(t)\, P[x(t)]\, \log_2\!\left(\frac{P[x(t)]}{Q[x(t)]}\right), \tag{5.2}
\]

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters; again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).
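A minimal sketch of this trade-off uses the self-consistent equations of Tishby, Pereira, and Bialek (1999) on a small artificial joint distribution of "past" and "future" symbols (the distribution, the number of clusters, and the values of the trade-off parameter β below are all arbitrary choices of ours, and the iteration may stop in a local optimum): sweeping β traces out the curve of I(x̂; future) versus I(x̂; past). Small β favors compression, with both informations near zero; large β recovers as much of the total I(x; y) as the limited number of clusters allows, consistent with the inequality discussed next.

# Information bottleneck: compress x (the past) into xhat while preserving
# information about y (the future), by iterating the self-consistent equations
# of Tishby, Pereira & Bialek (1999) at several trade-off parameters beta.
import numpy as np

rng = np.random.default_rng(1)
EPS = 1e-30

def mutual_info(p_joint):
    px = p_joint.sum(axis=1, keepdims=True)
    py = p_joint.sum(axis=0, keepdims=True)
    return np.sum(p_joint * np.log2(np.maximum(p_joint, EPS) /
                                    np.maximum(px * py, EPS)))

def ib(p_xy, beta, n_clusters=3, n_iter=300):
    nx, ny = p_xy.shape
    p_x = p_xy.sum(axis=1)
    p_y_x = p_xy / p_x[:, None]                      # p(y|x)
    q = rng.random((nx, n_clusters))                 # q(xhat|x), random start
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_xhat = q.T @ p_x                                            # p(xhat)
        p_x_xhat = (q * p_x[:, None]) / np.maximum(p_xhat, EPS)       # p(x|xhat) by column
        p_y_xhat = p_x_xhat.T @ p_y_x                                 # p(y|xhat)
        # D_KL[p(y|x) || p(y|xhat)] for each pair (x, xhat)
        kl = np.sum(p_y_x[:, None, :] *
                    np.log(np.maximum(p_y_x[:, None, :], EPS) /
                           np.maximum(p_y_xhat[None, :, :], EPS)), axis=2)
        logits = np.log(np.maximum(p_xhat, EPS))[None, :] - beta * kl
        q = np.exp(logits - logits.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
    joint_x_xhat = q * p_x[:, None]                  # p(x, xhat)
    joint_xhat_y = joint_x_xhat.T @ p_y_x            # p(xhat, y)
    return mutual_info(joint_x_xhat), mutual_info(joint_xhat_y)

# an arbitrary 6-state "past" predicting a 4-state "future"
p_xy = rng.random((6, 4)) ** 3
p_xy /= p_xy.sum()

print(f"total I(x;y) = {mutual_info(p_xy):.3f} bits")
for beta in [0.5, 1, 2, 5, 20, 100]:
    ixx, ixy = ib(p_xy, beta)
    print(f"beta={beta:6.1f}   I(xhat;x) = {ixx:6.3f}   I(xhat;y) = {ixy:6.3f}")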

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ Ipred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = Ipred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.¹⁷ It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,¹⁸ and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ∼ AN^ν and one with K parameters so that Ipred(N) ∼ (K/2) log₂ N, the infinite parameter model has less predictive information for all N smaller than some critical value,

\[
N_c \approx \left[\frac{K}{2A\nu}\,\log_2\!\left(\frac{K}{2A}\right)\right]^{1/\nu}. \tag{6.1}
\]

In the regime N ≪ Nc, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ Nc, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ Nc is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < Nc. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ Nc is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
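For concreteness, equation 6.1 is easy to evaluate. In the sketch below, K and N follow the rough numbers quoted in the text, while the values of A and ν are arbitrary placeholders of ours (the text does not supply them for this system):

# Rough evaluation of equation 6.1: the crossover N_c below which a K-parameter model
# carries more predictive information than a power-law ("nonparametric") model with
# Ipred(N) ~ A * N^nu.  K and N are the order-of-magnitude estimates from the text;
# A and nu are arbitrary placeholders chosen only for illustration.
import numpy as np

K = 5e9            # synapses in ~10 mm^2 of inferotemporal cortex
N = 1e9            # ~3 foveations per second over ~10 years of waking life

for nu in [0.5, 0.75]:
    for A in [1.0, 100.0]:
        Nc = ((K / (2 * A * nu)) * np.log2(K / (2 * A))) ** (1 / nu)
        print(f"nu={nu:4.2f}  A={A:6.1f}   N_c = {Nc:10.3e}   N/N_c = {N / Nc:9.2e}")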

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S₀. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S₁(N) ∝ N^(1/2), and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.¹⁹
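The logic of Shannon's guessing experiment is simple to simulate. In the toy version below (ours; the "native speaker" is just a bigram model trained on an arbitrary placeholder text, standing in for a human reader), each letter is replaced by the rank at which the predictor would have guessed it, and the entropy of the rank distribution bounds the conditional entropy of the text:

# Toy version of Shannon's (1951) guessing-game bound: replace each letter by the rank
# at which a predictor would have guessed it; the entropy of the ranks is an upper
# bound on the conditional entropy per letter.  The corpus and the bigram "guesser"
# are arbitrary stand-ins for real text and a native speaker.
from collections import Counter, defaultdict
import math

text = "the quick brown fox jumps over the lazy dog " * 200   # placeholder corpus

bigrams = defaultdict(Counter)
for a, b in zip(text[:-1], text[1:]):
    bigrams[a][b] += 1

ranks = Counter()
for a, b in zip(text[:-1], text[1:]):
    ordering = [c for c, _ in bigrams[a].most_common()]   # predictor's guess order
    ranks[ordering.index(b) + 1] += 1

total = sum(ranks.values())
H_ranks = -sum((n / total) * math.log2(n / total) for n in ranks.values())
print(f"entropy of guess ranks = {H_ranks:.3f} bits/letter "
      f"(upper bound on the conditional entropy)")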

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/ followed by the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton Lectures on Biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.

Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging Syntheses in Science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, Entropy and the Physics of Information (pp. 137–148). Redwood City, CA: Addison–Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.

2460 W Bialek I Nemenman and N Tishby

Crutcheld J P Feldman D P amp Shalizi C R (2000) Comment I onldquoSimple measure for complexityrdquo Phys Rev E 62 2996ndash2997 See also chao-dyn9907001

Crutcheld J P amp Shalizi C R (1999)Thermodynamic depth of causal statesObjective complexity via minimal representation Phys Rev E 59 275ndash283See also cond-mat9808147

Crutcheld J P amp Young K (1989) Inferring statistical complexity Phys RevLett 63 105ndash108

Eagleman D M amp Sejnowski T J (2000) Motion integration and postdictionin visual awareness Science 287 2036ndash2038

Ebeling W amp Poschel T (1994)Entropy and long-range correlations in literaryEnglish Europhys Lett 26 241ndash246

Feldman D P amp Crutcheld J P (1998) Measures of statistical complexityWhy Phys Lett A 238 244ndash252 See also cond-mat9708186

Gell-Mann M amp Lloyd S (1996) Information measures effective complexityand total information Complexity 2 44ndash52

de Gennes PndashG (1979) Scaling concepts in polymer physics Ithaca NY CornellUniversity Press

Grassberger P (1986) Toward a quantitative theory of self-generated complex-ity Int J Theor Phys 25 907ndash938

Grassberger P (1991) Information and complexity measures in dynamical sys-tems In H Atmanspacher amp H Schreingraber (Eds) Information dynamics(pp 15ndash33) New York Plenum Press

Hall P amp Hannan E (1988)On stochastic complexity and nonparametric den-sity estimation Biometrica 75 705ndash714

Haussler D Kearns M amp Schapire R (1994)Bounds on the sample complex-ity of Bayesian learning using information theory and the VC dimensionMachine Learning 14 83ndash114

Haussler D Kearns M Seung S amp Tishby N (1996)Rigorous learning curvebounds from statistical mechanics Machine Learning 25 195ndash236

Haussler D amp Opper M (1995) General bounds on the mutual informationbetween a parameter and n conditionally independent events In Proceedingsof the Eighth Annual Conference on ComputationalLearning Theory (pp 402ndash411)New York ACM Press

Haussler D amp Opper M (1997) Mutual information metric entropy and cu-mulative relative entropy risk Ann Statist 25 2451ndash2492

Haussler D amp Opper M (1998) Worst case prediction over sequences un-der log loss In G Cybenko D OrsquoLeary amp J Rissanen (Eds) The mathe-matics of information coding extraction and distribution New York Springer-Verlag

Hawken M J amp Parker A J (1991)Spatial receptive eld organization in mon-key V1 and its relationship to the cone mosaic InM S Landy amp J A Movshon(Eds) Computational models of visual processing (pp 83ndash93) Cambridge MAMIT Press

Herschkowitz D amp Nadal JndashP (1999)Unsupervised and supervised learningMutual information between parameters and observations Phys Rev E 593344ndash3360

Predictability Complexity and Learning 2461

Hilberg W (1990) The well-known lower bound of information in written lan-guage Is it a misinterpretation of Shannon experiments Frequenz 44 243ndash248 (in German)

Holy T E (1997) Analysis of data from continuous probability distributionsPhys Rev Lett 79 3545ndash3548 See also physics9706015

Ibragimov I amp Hasminskii R (1972) On an information in a sample about aparameter In Second International Symposium on Information Theory (pp 295ndash309) New York IEEE Press

Kang K amp Sompolinsky H (2001) Mutual information of population codes anddistancemeasures in probabilityspace Preprint Available at cond-mat0101161

Kemeney J G (1953)The use of simplicity in induction Philos Rev 62 391ndash315Kolmogoroff A N (1939) Sur lrsquointerpolation et extrapolations des suites sta-

tionnaires C R Acad Sci Paris 208 2043ndash2045Kolmogorov A N (1941) Interpolation and extrapolation of stationary random

sequences Izv Akad Nauk SSSR Ser Mat 5 3ndash14 (in Russian) Translation inA N Shiryagev (Ed) Selectedworks of A N Kolmogorov (Vol 23 pp 272ndash280)Norwell MA Kluwer

Kolmogorov A N (1965) Three approaches to the quantitative denition ofinformation Prob Inf Trans 1 4ndash7

Li M amp Vit Acircanyi P (1993) An Introduction to Kolmogorov complexity and its appli-cations New York Springer-Verlag

Lloyd S amp Pagels H (1988)Complexity as thermodynamic depth Ann Phys188 186ndash213

Logothetis N K amp Sheinberg D L (1996) Visual object recognition AnnuRev Neurosci 19 577ndash621

Lopes L L amp Oden G C (1987) Distinguishing between random and non-random events J Exp Psych Learning Memory and Cognition 13 392ndash400

Lopez-Ruiz R Mancini H L amp Calbet X (1995) A statistical measure ofcomplexity Phys Lett A 209 321ndash326

MacKay D J C (1992) Bayesian interpolation Neural Comp 4 415ndash447Nemenman I (2000) Informationtheory and learningA physicalapproach Unpub-

lished doctoral dissertation Princeton University See also physics0009032Nemenman I amp Bialek W (2001) Learning continuous distributions Simu-

lations with eld theoretic priors In T K Leen T G Dietterich amp V Tresp(Eds) Advances in neural information processing systems 13 (pp 287ndash295)MITPress Cambridge MA MIT Press See also cond-mat0009165

Opper M (1994) Learning and generalization in a two-layer neural networkThe role of the VapnikndashChervonenkis dimension Phys Rev Lett 72 2113ndash2116

Opper M amp Haussler D (1995) Bounds for predictive errors in the statisticalmechanics of supervised learning Phys Rev Lett 75 3772ndash3775

Periwal V (1997)Reparametrization invariant statistical inference and gravityPhys Rev Lett 78 4671ndash4674 See also hep-th9703135

Periwal V (1998) Geometrical statistical inference Nucl Phys B 554[FS] 719ndash730 See also adap-org9801001

Poschel T Ebeling W amp Ros Acirce H (1995) Guessing probability distributionsfrom small samples J Stat Phys 80 1443ndash1452

2462 W Bialek I Nemenman and N Tishby

Reinagel P amp Reid R C (2000) Temporal coding of visual information in thethalamus J Neurosci 20 5392ndash5400

Renyi A (1964) On the amount of information concerning an unknown pa-rameter in a sequence of observations Publ Math Inst Hungar Acad Sci 9617ndash625

Rieke F Warland D de Ruyter van Steveninck R amp Bialek W (1997) SpikesExploring the Neural Code (MIT Press Cambridge)

Rissanen J (1978) Modeling by shortest data description Automatica 14 465ndash471

Rissanen J (1984) Universal coding information prediction and estimationIEEE Trans Inf Thy 30 629ndash636

Rissanen J (1986) Stochastic complexity and modeling Ann Statist 14 1080ndash1100

Rissanen J (1987)Stochastic complexity J Roy Stat Soc B 49 223ndash239253ndash265Rissanen J (1989) Stochastic complexity and statistical inquiry Singapore World

ScienticRissanen J (1996) Fisher information and stochastic complexity IEEE Trans

Inf Thy 42 40ndash47Rissanen J Speed T amp Yu B (1992) Density estimation by stochastic com-

plexity IEEE Trans Inf Thy 38 315ndash323Saffran J R Aslin R N amp Newport E L (1996) Statistical learning by 8-

month-old infants Science 274 1926ndash1928Saffran J R Johnson E K Aslin R H amp Newport E L (1999) Statistical

learning of tone sequences by human infants and adults Cognition 70 27ndash52

Schmidt D M (2000) Continuous probability distributions from nite dataPhys Rev E 61 1052ndash1055

Seung H S Sompolinsky H amp Tishby N (1992)Statistical mechanics of learn-ing from examples Phys Rev A 45 6056ndash6091

Shalizi C R amp Crutcheld J P (1999) Computational mechanics Pattern andprediction structure and simplicity Preprint Available at cond-mat9907176

Shalizi C R amp Crutcheld J P (2000) Information bottleneck causal states andstatistical relevance bases How to represent relevant information in memorylesstransduction Available at nlinAO0006025

Shannon C E (1948) A mathematical theory of communication Bell Sys TechJ 27 379ndash423 623ndash656 Reprinted in C E Shannon amp W Weaver The mathe-matical theory of communication Urbana University of Illinois Press 1949

Shannon C E (1951) Prediction and entropy of printed English Bell Sys TechJ 30 50ndash64 Reprinted in N J A Sloane amp A D Wyner (Eds) Claude ElwoodShannon Collected papers New York IEEE Press 1993

Shiner J Davison M amp Landsberger P (1999) Simple measure for complexityPhys Rev E 59 1459ndash1464

Smirnakis S Berry III M J Warland D K Bialek W amp Meister M (1997)Adaptation of retinal processing to image contrast and spatial scale Nature386 69ndash73

Sole R V amp Luque B (1999) Statistical measures of complexity for strongly inter-acting systems Preprint Available at adap-org9909002

Predictability Complexity and Learning 2463

Solomonoff R J (1964) A formal theory of inductive inference Inform andControl 7 1ndash22 224ndash254

Strong S P Koberle R de Ruyter van Steveninck R amp Bialek W (1998)Entropy and information in neural spike trains Phys Rev Lett 80 197ndash200

Tishby N Pereira F amp Bialek W (1999) The information bottleneck methodIn B Hajek amp R S Sreenivas (Eds) Proceedings of the 37th Annual AllertonConference on Communication Control and Computing (pp 368ndash377) UrbanaUniversity of Illinois See also physics0004057

Vapnik V (1998) Statistical learning theory New York WileyVit Acircanyi P amp Li M (2000)Minimum description length induction Bayesianism

and Kolmogorov complexity IEEE Trans Inf Thy 46 446ndash464 See alsocsLG9901014

WallaceC Samp Boulton D M (1968)An information measure forclassicationComp J 11 185ndash195

Weigend A Samp Gershenfeld NA eds (1994)Time seriespredictionForecastingthe future and understanding the past Reading MA AddisonndashWesley

Wiener N (1949) Extrapolation interpolation and smoothing of time series NewYork Wiley

Wolfram S (1984) Computation theory of cellular automata Commun MathPhysics 96 15ndash57

Wong W amp Shen X (1995) Probability inequalities for likelihood ratios andconvergence rates of sieve MLErsquos Ann Statist 23 339ndash362

Received October 11, 2000; accepted January 31, 2001.

It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution Q(x_1, ..., x_N | α). Suppose, for example, that S_0^* = (K^*/2) log_2 N. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2} \log_2 N + \cdots \ \ \text{(bits)}.   (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K^* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K^* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters there are no long-range correlations (K ≠ 0, K^* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.^9

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.

Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

^9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.


Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar\alpha) \right] \times \log_2 \left[ \prod_{i=1}^{N} Q(\vec{x}_i|\bar\alpha) \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})} \right].   (4.52)

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)} = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar\alpha) \right] \times \log_2 \left[ \int d^K\alpha\, \frac{P(\alpha)}{Z(\bar\alpha; N)}\, e^{-N D_{KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})} \right],   (4.54)

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S_1^{(f)} is negative. This last statement corresponds to the fact that, for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) \le N \int d^K\alpha \int d^K\bar\alpha\, P(\alpha)\, P(\bar\alpha)\, D_{KL}(\bar\alpha\|\alpha) \equiv \left\langle N D_{KL}(\bar\alpha\|\alpha) \right\rangle_{\bar\alpha,\alpha},   (4.55)

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence, or of similar quantities, is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱ,α} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(x | α)} is finite. Then for any reasonable function F(x; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{ \sup_\beta \frac{ \left| \frac{1}{N}\sum_j F(\vec{x}_j;\beta) - \int d\vec{x}\, Q(\vec{x}|\bar\alpha)\, F(\vec{x};\beta) \right| }{ L[F] } > \epsilon \right\} < M(\epsilon, N, d_{VC})\, e^{-c N \epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α - ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {x_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus the worst-case contribution of these atypical sequences is

S_1^{(f),\,\rm atypical} \sim \int_\delta^\infty d\epsilon\, N^2 \epsilon^2\, M(\epsilon)\, e^{-c N \epsilon^2} \sim e^{-c N \delta^2} \ll 1 \quad \text{for large } N.   (4.57)

This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and, throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\,\rm typical} \sim \int d^K\bar\alpha\, P(\bar\alpha)\, d^N\vec{x} \prod_j Q(\vec{x}_j|\bar\alpha)\; N\, (B\, A^{-1} B);
B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i|\bar\alpha)}{\partial \bar\alpha_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0;
(A)_{\mu\nu} = \frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar\alpha)}{\partial \bar\alpha_\mu\, \partial \bar\alpha_\nu}, \qquad \langle A \rangle_{\vec{x}} = F.   (4.58)

Here ⟨· · ·⟩_x means an averaging with respect to all x_i's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and the second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i|\bar\alpha)}{\partial \bar\alpha_\mu}\, \left( F^{-1} \right)_{\mu\nu}\, \frac{\partial \log Q(\vec{x}_j|\bar\alpha)}{\partial \bar\alpha_\nu}   (4.59)

over all x_i's. Only the terms with i = j survive the averaging. There are K^2 N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ · · · can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x - \alpha)^2 \right], \quad x \in (-\infty; +\infty);   (4.60)

Q(x|\alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \quad x \in [0; 2\pi).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < α_max and zero elsewhere, with α_max rather large. Then, for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of -sin αx. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than α_max.^10 So the fluctuations are large, and the predictive information is small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.

^10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for α_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.
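To make the contrast between the two examples concrete, here is a small numerical sketch (ours, not part of the original analysis): it draws samples from each of the models in equations 4.60 and 4.61 and maximizes the likelihood over a grid of α values playing the role of the box prior 0 < α < α_max. The particular α_max, grid, and sample sizes are arbitrary choices. For the Gaussian family the estimate settles near the true α almost immediately, while for the sin model the best-fitting α typically wanders to large values until N is fairly large, illustrating the large fluctuations described above.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true = 1.0                                # true parameter for both toy models
alpha_grid = np.linspace(0.01, 50.0, 2000)       # "box prior": 0 < alpha < alpha_max = 50

def loglik_gauss(x, a):
    # summed log Q(x|a) for the Gaussian location family, eq. 4.60 (constants dropped)
    return np.sum(-0.5 * (x[:, None] - a) ** 2, axis=0)

def loglik_sin(x, a):
    # summed log Q(x|a) for Q(x|a) ~ exp(-sin(a x)) on [0, 2*pi), eq. 4.61,
    # with the normalization computed numerically for each a on a grid
    xq = np.linspace(0.0, 2 * np.pi, 1001)
    logZ = np.log(np.mean(np.exp(-np.sin(a[None, :] * xq[:, None])), axis=0) * 2 * np.pi)
    return np.sum(-np.sin(x[:, None] * a[None, :]), axis=0) - len(x) * logZ

def sample_sin(n, a):
    # rejection sampling from Q(x|a): accept uniform proposals with prob exp(-sin(a x))/e
    out = []
    while len(out) < n:
        x = rng.uniform(0.0, 2 * np.pi, size=4 * n)
        keep = rng.uniform(0.0, 1.0, size=x.size) < np.exp(-np.sin(a * x)) / np.e
        out.extend(x[keep].tolist())
    return np.array(out[:n])

for N in [10, 30, 100, 300, 1000]:
    xg = rng.normal(alpha_true, 1.0, size=N)     # data from eq. 4.60
    xs = sample_sin(N, alpha_true)               # data from eq. 4.61
    a_g = alpha_grid[np.argmax(loglik_gauss(xg, alpha_grid))]
    a_s = alpha_grid[np.argmax(loglik_sin(xs, alpha_grid))]
    print(f"N={N:5d}   Gaussian ML alpha={a_g:6.2f}   sin-model ML alpha={a_s:6.2f}")
```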

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ~ D^{(d-2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right], \quad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar\alpha; N) = \int dD\, \rho(D; \bar\alpha)\, \exp[-ND]
\approx A(\bar\alpha) \int dD\, \exp\left[ -\frac{B(\bar\alpha)}{D^\mu} - ND \right]
\approx \tilde{A}(\bar\alpha)\, \exp\left[ -\frac{1}{2}\, \frac{\mu+2}{\mu+1}\, \ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar\alpha) = A(\bar\alpha) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} \cdot [B(\bar\alpha)]^{1/(2\mu+2)},   (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N:

S_1^{(a)}(N) \approx \frac{1}{\ln 2}\, \langle C(\bar\alpha) \rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \quad \text{(bits)},   (4.66)

where ⟨· · ·⟩_ᾱ denotes an average over all the target models.
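As a quick numerical check of this asymptotic (our own sketch; the constants A, B, and the value of μ below are arbitrary choices), one can evaluate Z(N) = ∫ dD ρ(D) e^{-ND} directly for the density of equation 4.62 and confirm that -ln Z(N) grows as N^{μ/(μ+1)}:

```python
import numpy as np

A, B, mu = 1.0, 1.0, 1.0                 # arbitrary constants in rho(D) = A*exp(-B/D**mu)
D = np.logspace(-6, 2, 200_000)          # log-spaced grid over the divergence D

def minus_lnZ(N):
    # Z(N) = integral dD rho(D) exp(-N D); factor out the max of the log-integrand
    # (log-sum-exp) so the result does not underflow at large N
    logf = -B / D**mu - N * D
    m = logf.max()
    w = np.exp(logf - m)
    Z_scaled = np.sum(0.5 * (w[1:] + w[:-1]) * np.diff(D))
    return -(m + np.log(Z_scaled))

Ns = np.array([10**k for k in range(2, 8)], dtype=float)
vals = np.array([minus_lnZ(N) for N in Ns])

# local slopes of log(-ln Z) versus log N should approach mu/(mu+1) = 0.5 for mu = 1
slopes = np.diff(np.log(vals)) / np.diff(np.log(Ns))
print("predicted exponent:", mu / (mu + 1))
print("measured local slopes:", np.round(slopes, 3))
```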

This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ~ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ~ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ~ K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ~ N^{ν-1} bits. Summing this up over all samples, we find S_1^{(a)} ~ N^ν, and if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
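The counting argument above can be checked in a few lines (a sketch of ours, with ν = 0.5 as an arbitrary choice): summing the roughly K(n)/(2n ln 2) bits carried by the nth sample, with K(n) ~ n^ν, reproduces the N^ν scaling rather than N^ν log N.

```python
import numpy as np

nu = 0.5                             # arbitrary growth exponent, K(n) ~ n**nu
n = np.arange(1, 10**6 + 1)

# bits carried by the n-th sample when there are K(n) ~ n**nu effective parameters:
# roughly K(n) / (2 n ln 2) bits per sample (cf. equation 4.45)
per_sample_bits = n**nu / (2 * n * np.log(2))
S1 = np.cumsum(per_sample_bits)

# compare the cumulative sum against the predicted N**nu scaling
for N in [10**3, 10**4, 10**5, 10**6]:
    print(f"N={N:>8d}   S1 ~ {S1[N-1]:10.1f} bits   S1 / N**nu = {S1[N-1] / N**nu:6.3f}")
# the ratio settles near 1/(2*nu*ln 2) ~ 1.443 for nu = 0.5, confirming S1 ~ N**nu
```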

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l_0) exp[-φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{l}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^2 \right] \delta\left[ \frac{1}{l_0} \int dx\, e^{-\phi(x)} - 1 \right],   (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and l sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.^11

^11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l_0) exp[-φ(x)], we can write the KL divergence as

D_{KL}[\bar\phi(x) \| \phi(x)] = \frac{1}{l_0} \int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)].   (4.68)

We want to compute the density

\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( D - D_{KL}[\bar\phi(x)\|\phi(x)] \big)   (4.69)
= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( MD - M D_{KL}[\bar\phi(x)\|\phi(x)] \big),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar\phi) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{KL})   (4.71)
= M \int \frac{dz}{2\pi} \exp\left( izMD + \frac{izM}{l_0} \int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] \right) \times \int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right).   (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right) = \exp\left( -\frac{izM}{l_0} \int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM] \right),   (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{l}{2} \int dx \left( \frac{\partial\bar\phi}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{l\, l_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2}\, \exp\left( -\frac{B[\bar\phi(x)]}{D} \right),   (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\, l\, l_0}}\, \exp\left[ -\frac{l}{2} \int dx\, (\partial_x\bar\phi)^2 \right] \times \int dx\, \exp[-\bar\phi(x)/2],   (4.76)

B[\bar\phi(x)] = \frac{1}{16\, l\, l_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^2.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2 \ln 2\, \sqrt{l\, l_0}} \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} N^{1/2},   (4.78)

where ⟨· · ·⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \sim \frac{1}{2 \ln 2} \sqrt{\frac{N L}{l}} \ \ \text{bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(l/(N Q(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ~ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities n ~ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter l describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of l (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing l to vary with x. In this case the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information (all have power-law growth; smoother functions have smaller powers, since there is less to learn, and higher-dimensional problems have larger powers, since there is more to learn), but real calculations for these cases remain challenging.

5 I_pred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.^12 Rissanen (1984) and Clarke and Barron (1990) in full generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

^12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or of not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, which is not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.
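For a concrete picture of the regime in which the excess entropy is finite, the following sketch (ours; the two-state transition matrix is an arbitrary choice) computes I_pred(T), the mutual information between a length-T past and a length-T future, exactly for a binary Markov chain. The values saturate quickly at a finite constant, which is the excess entropy of this simple process.

```python
import numpy as np
from itertools import product

# two-state Markov chain: row s of P gives the distribution of the next symbol given s
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
# stationary distribution of a 2-state chain, pi = (p10, p01) / (p01 + p10)
p01, p10 = P[0, 1], P[1, 0]
pi = np.array([p10, p01]) / (p01 + p10)

def seq_prob(seq):
    # probability of a particular symbol sequence under the stationary chain
    p = pi[seq[0]]
    for a, b in zip(seq[:-1], seq[1:]):
        p *= P[a, b]
    return p

def I_pred(T):
    # mutual information (bits) between the first T and last T symbols of a 2T window
    joint, past, future = {}, {}, {}
    for seq in product([0, 1], repeat=2 * T):
        p = seq_prob(seq)
        joint[seq] = p
        past[seq[:T]] = past.get(seq[:T], 0.0) + p
        future[seq[T:]] = future.get(seq[T:], 0.0) + p
    return sum(p * np.log2(p / (past[s[:T]] * future[s[T:]]))
               for s, p in joint.items() if p > 0)

for T in range(1, 9):
    print(f"T={T}   I_pred(T) = {I_pred(T):.4f} bits")
# the printed values converge to a finite constant as T grows
```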

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


53 A Unique Measure of Complexity We recall that entropy providesa measure of information that is unique in satisfying certain plausible con-straints (Shannon 1948) It would be attractive if we could prove a sim-ilar uniqueness theorem for the predictive information or any part of itas a measure of the complexity or richness of a time-dependent signalx(0 lt t lt T) drawn from a distribution P[x(t)] Before proceeding alongthese lines we have to be more specic about what we mean by ldquocomplex-ityrdquo In most cases including the learning problems discussed above it isclear that we want to measure complexity of the dynamics underlying thesignal or equivalently the complexity of a model that might be used todescribe the signal13 There remains a question however as to whether wewant to attach measures of complexity to a particular signal x(t) or whetherwe are interested in measures (like the entropy itself) that are dened by anaverage over the ensemble P[x(t)]

One problem in assigning complexity to single realizations is that therecan be atypical data streams Either we must treat atypicality explicitlyarguing that atypical data streams from one source should be viewed astypical streams from another source as discussed by Vit Acircanyi and Li (2000)or we have to look at average quantities Grassberger (1986) in particularhas argued that our visual intuition about the complexity of spatial patternsis an ensemble concept even if the ensemble is only implicit (see also Tongin the discussion session of Rissanen 1987) In Rissanenrsquos formulation ofMDL one tries to compute the description length of a single string withrespect to some class of possible models for that string but if these modelsare probabilistic we can always think about these models as generating anensemble of possible strings The fact that we admit probabilistic models iscrucial Even at a colloquial level if we allow for probabilistic models thenthere is a simple description for a sequence of truly random bits but if weinsist on a deterministic model then it may be very complicated to generateprecisely the observed string of bits14 Furthermore in the context of proba-bilistic models it hardly makes sense to ask for a dynamics that generates aparticular data streamwe must ask fordynamics that generate the data withreasonable probability which is more or less equivalent to asking that thegiven string be a typical member of the ensemble generated by the modelAll of these paths lead us to thinking not about single strings but aboutensembles in the tradition of statistical mechanics and so we shall searchfor measures of complexity that are averages over the distribution P[x(t)]

13 The problem of finding this model, or of reconstructing the underlying dynamics, may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number N(N) of walks with N steps, we find that N(N) ≈ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) → γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
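As an illustration that is not part of the original argument, this decomposition can be checked by brute-force enumeration of self-avoiding walks on the square lattice, fitting log_2 N(N) to the three terms above; at the short chain lengths accessible to enumeration the fitted exponent is only a rough estimate.

```python
import numpy as np

# Count self-avoiding walks of n_steps on the square lattice by depth-first search.
def count_saw(n_steps):
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    def recurse(pos, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for dx, dy in steps:
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in visited:
                visited.add(nxt)
                total += recurse(nxt, visited, remaining - 1)
                visited.remove(nxt)
        return total
    return recurse((0, 0), {(0, 0)}, n_steps)

Ns = np.arange(2, 12)
counts = np.array([count_saw(n) for n in Ns], dtype=float)

# Fit log2 N(N) = N log2 z + log2 A + gamma log2 N (extensive, constant, divergent terms).
X = np.column_stack([Ns, np.ones_like(Ns), np.log2(Ns)])
coef, *_ = np.linalg.lstsq(X, np.log2(counts), rcond=None)
print("log2 z ~ %.3f   log2 A ~ %.3f   gamma ~ %.3f" % tuple(coef))
```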

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)]\, \log_2 P[x(t)],   (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)]\, \log_2\!\left( \frac{P[x(t)]}{Q[x(t)]} \right),   (5.2)

which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

16 Throughout this discussion we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.
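To make the invariance statement concrete, here is a small numerical sketch (not from the paper): the KL divergence between two densities of a single variable is unchanged by a smooth change of coordinates, while the plain continuum entropy is not. The particular densities and the map y = x + 0.2 x^3 are arbitrary choices.

```python
import numpy as np

# Two arbitrary densities of x on a fine grid.
x = np.linspace(-8.0, 8.0, 200001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)               # P(x)
q = np.exp(-0.125 * (x - 1.0)**2) / np.sqrt(8 * np.pi)     # Q(x), wider and shifted

# Change of coordinates y = g(x) = x + 0.2 x^3; densities pick up a factor 1/g'(x).
gprime = 1.0 + 0.6 * x**2
p_y, q_y = p / gprime, q / gprime
dy = gprime * dx                                           # dy = g'(x) dx on the same grid

kl_x = np.sum(p * np.log2(p / q)) * dx
kl_y = np.sum(p_y * np.log2(p_y / q_y)) * dy               # Jacobians cancel inside the log
ent_x = -np.sum(p * np.log2(p)) * dx
ent_y = -np.sum(p_y * np.log2(p_y)) * dy

print("KL divergence:", kl_x, kl_y)      # equal: coordinate invariant
print("plain entropy:", ent_x, ent_y)    # not equal: depends on the coordinate system
```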

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters; again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures: the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?
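As a toy example that is not in the original text, the predictive information between past and future windows can be computed exactly for a short binary Markov chain from its block entropies; the transition matrix below is an arbitrary choice for the sketch.

```python
import numpy as np
from itertools import product

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])     # transition probabilities (arbitrary for the sketch)
pi = np.array([0.8, 0.2])      # its stationary distribution

def block_entropy(T):
    """Entropy (in bits) of words of length T emitted by the stationary chain."""
    S = 0.0
    for word in product((0, 1), repeat=T):
        p = pi[word[0]]
        for a, b in zip(word, word[1:]):
            p *= P[a, b]
        S -= p * np.log2(p)
    return S

for T in (1, 2, 4, 8):
    print(T, 2 * block_entropy(T) - block_entropy(2 * T))   # Ipred(T) = 2 S(T) - S(2T)
# For a Markov chain the result is the same finite number for every T
# (the stationary entropy minus the entropy rate): finite predictive information.
```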

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ Ipred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = Ipred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods: What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ~ A N^ν and one with K parameters so that Ipred(N) ~ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}.   (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
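As a hedged numerical illustration of equation 6.1 (the constants A and ν below are assumed purely for the sketch; K anticipates the synapse count quoted in the next paragraph):

```python
import numpy as np

# Compare the predictive information of a K-parameter model, (K/2) log2 N,
# with that of a nonparametric model, A N^nu, and locate the crossover N_c.
K, A, nu = 5e9, 1.0, 0.5          # illustrative numbers, not fitted to anything

N_c = (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1 / nu)
print("N_c ~ %.2e" % N_c)

for N in (1e6, 1e9, N_c, 1e25):
    parametric = 0.5 * K * np.log2(N)
    nonparametric = A * N ** nu
    print("N=%.1e   (K/2) log2 N = %.2e   A N^nu = %.2e" % (N, parametric, nonparametric))
# For N well below N_c the power-law (nonparametric) model carries far less
# predictive information, consistent with the discussion in the text.
```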

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm^2 of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ≈ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ≈ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the Nth-order conditional entropy (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
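A crude computational analog of Shannon's guessing experiment, with a simple order-2 letter model standing in for the human reader, might look like the following sketch; the corpus file name is a placeholder, and training and scoring on the same text (done here only for brevity) makes the bound optimistic.

```python
import collections
import math

text = open("corpus.txt").read().lower()        # placeholder corpus, assumed to exist

# Order-2 model: rank candidate next letters by frequency after each two-letter context.
context_counts = collections.defaultdict(collections.Counter)
for i in range(2, len(text)):
    context_counts[text[i-2:i]][text[i]] += 1

# Convert each letter into the number of guesses the model would need.
rank_counts = collections.Counter()
for i in range(2, len(text)):
    ranking = [c for c, _ in context_counts[text[i-2:i]].most_common()]
    rank_counts[ranking.index(text[i]) + 1] += 1

total = sum(rank_counts.values())
bound = -sum((n / total) * math.log2(n / total) for n in rank_counts.values())
print("entropy of the guess numbers: %.2f bits/letter" % bound)
```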

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/<reference>; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.

Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.

Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.

Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.

Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.

Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.

Balasubramanian, V. (1997). Statistical inference, Occam's razor and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.

Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.

Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.

Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.

Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.

Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.

Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.

Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.

Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.

Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.

Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.

Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.

Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.

Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.

Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.

Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.

Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.

Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.

Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.

Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.

Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.

Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.

Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.

Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.

Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.

Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.

Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.

Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.

de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.

Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.

Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.

Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.

Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.

Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.

Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.

Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.

Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.

Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.

Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.

Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).

Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.

Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.

Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.

Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.

Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.

Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.

Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.

Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.

Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.

Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.

Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.

MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.

Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.

Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.

Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.

Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.

Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.

Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.

Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.

Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.

Rényi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.

Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.

Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.

Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.

Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.

Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.

Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.

Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.

Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.

Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.

Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.

Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.

Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.

Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.

Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.

Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.

Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.

Solé, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.

Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.

Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.

Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.

Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.

Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.

Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.

Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



Returning to equations 4.27, 4.29, and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar{\alpha}) \right]
\;\times \log_2 \left\{ \left[ \prod_{i=1}^{N} Q(\vec{x}_i|\bar{\alpha}) \right] \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})} \right\}.   (4.52)

This expression can be decomposed into the terms identified above plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)} = -\int d^K\bar{\alpha}\, P(\bar{\alpha}) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j|\bar{\alpha}) \right]
\;\times \log_2 \left[ \frac{1}{Z(\bar{\alpha}; N)} \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar{\alpha}\|\alpha) + N\psi(\alpha,\bar{\alpha};\{\vec{x}_i\})} \right],   (4.54)

where ψ is defined as in equation 4.29 and Z as in equation 4.34.

Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S^{(f)} is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) \le N \int d^K\alpha \int d^K\bar{\alpha}\, P(\alpha)\, P(\bar{\alpha})\, D_{KL}(\bar{\alpha}\|\alpha) \equiv \langle N D_{KL}(\bar{\alpha}\|\alpha) \rangle_{\bar{\alpha}\alpha},   (4.55)

so that if we have a class of models (and a prior P(α)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱα} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.
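The Jensen's inequality step used above is easy to verify numerically; the lognormal weights below are just an arbitrary positive random variable standing in for the fluctuating integrand.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = np.exp(rng.normal(size=100_000))     # any positive random variable will do

# log of the average ("annealed") always exceeds the average of the log ("quenched"),
# which is why the fluctuation correction S1^(f) cannot be positive.
print(np.log2(Z.mean()), np.log2(Z).mean())
```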

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(x⃗|α)} is finite. Then for any reasonable function F(x⃗; β), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{ \sup_\beta \frac{ \left| \frac{1}{N}\sum_j F(\vec{x}_j;\beta) - \int d\vec{x}\, Q(\vec{x}|\bar{\alpha})\, F(\vec{x};\beta) \right| }{ L[F] } > \epsilon \right\} < M(\epsilon, N, d_{VC})\, e^{-cN\epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all β without loss of generality, in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α − ᾱ, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of {x⃗_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x⃗_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus the worst-case contribution of these atypical sequences is

S_1^{(f),\,\text{atypical}} \sim \int_\delta^\infty d\epsilon\, N^2\epsilon^2\, M(\epsilon)\, e^{-cN\epsilon^2} \sim e^{-cN\delta^2} \ll 1 \quad \text{for large } N.   (4.57)


This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\,\text{typical}} \sim \int d^K\bar{\alpha}\, P(\bar{\alpha})\, d^N\vec{x}\, \prod_j Q(\vec{x}_j|\bar{\alpha})\; N\,(B\, A^{-1} B);

B_\mu = \frac{1}{N} \sum_i \frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0;

(A)_{\mu\nu} = \frac{1}{N} \sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu\, \partial \bar{\alpha}_\nu}, \qquad \langle A \rangle_{\vec{x}} = F.   (4.58)

Here ⟨· · ·⟩_x⃗ means an averaging with respect to all x⃗_i's, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with F^{-1}(1+δ), where δ again is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i|\bar{\alpha})}{\partial \bar{\alpha}_\mu}\, (F^{-1})_{\mu\nu}\, \frac{\partial \log Q(\vec{x}_j|\bar{\alpha})}{\partial \bar{\alpha}_\nu}   (4.59)

over all x⃗_i's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N⁻¹. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ · · · can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪ C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x|\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x - \alpha)^2 \right], \qquad x \in (-\infty, +\infty);   (4.60)

Q(x|\alpha) = \frac{ \exp(-\sin \alpha x) }{ \int_0^{2\pi} dx\, \exp(-\sin \alpha x) }, \qquad x \in [0, 2\pi).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < α_max and zero elsewhere, with α_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin αx. Adding a new data point would not help until that best, but wrong, parameter estimate is less than α_max.10 So the fluctuations are large, and the predictive information is small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, & Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.

10 Interestingly, since for the model equation 4.61 the KL divergence is bounded from below and from above, for α_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.
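The behavior described for the second example (equation 4.61) can be seen in a small simulation; this is an illustration only, and the prior box, grid resolution, and sample sizes below are assumptions of the sketch rather than values from the paper. With a wide prior, the maximum-likelihood estimate of α wanders far from the true value until N becomes fairly large.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_true, alpha_max = 3.0, 200.0
xgrid = np.linspace(0.0, 2.0 * np.pi, 4001)

def log_norm(alpha):
    # log of the normalization integral of exp(-sin(alpha x)) over [0, 2*pi)
    return np.log(np.mean(np.exp(-np.sin(alpha * xgrid))) * 2.0 * np.pi)

def sample(n):
    # rejection sampling from Q(x|alpha_true); exp(-sin(.)) <= e on the interval
    out = []
    while len(out) < n:
        x = rng.uniform(0.0, 2.0 * np.pi)
        if rng.uniform(0.0, np.e) < np.exp(-np.sin(alpha_true * x)):
            out.append(x)
    return np.array(out)

alphas = np.linspace(0.01, alpha_max, 20001)    # grid search over the prior box
logZ = np.array([log_norm(a) for a in alphas])

for N in (3, 10, 30, 100, 300):
    x = sample(N)
    loglik = np.array([-np.sin(a * x).sum() for a in alphas]) - N * logZ
    print("N = %3d   maximum-likelihood alpha = %6.2f" % (N, alphas[np.argmax(loglik)]))
```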

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario in which the number of examples eventually overwhelms the number of parameters or dimensions is too weak to deal with many real-world problems. Certainly, in the present context, there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ~ D^{(d-2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^\mu} \right], \qquad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37 we can write the partition function as

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha})\, \exp[-ND]
  \approx A(\bar{\alpha}) \int dD\, \exp\left[ -\frac{B(\bar{\alpha})}{D^\mu} - ND \right]
  \approx \tilde{A}(\bar{\alpha}) \exp\left[ -\frac{1}{2}\, \frac{\mu+2}{\mu+1}\, \ln N - C(\bar{\alpha})\, N^{\mu/(\mu+1)} \right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[ \frac{2\pi\, \mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} \cdot [B(\bar{\alpha})]^{1/(2\mu+2)},   (4.64)

C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right].   (4.65)
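A quick numerical sanity check of this saddle-point result, with assumed constants A = B = 1 and μ = 1, compares direct quadrature of the integral with the asymptotic form in equations 4.63 through 4.65:

```python
import numpy as np

mu, A, B = 1.0, 1.0, 1.0
D = np.linspace(1e-6, 50.0, 2_000_000)      # integration grid for the divergence D

def lnZ_quadrature(N):
    # direct (Riemann) evaluation of Z = A * int dD exp(-B/D^mu - N D)
    return np.log(A * np.mean(np.exp(-B / D**mu - N * D)) * (D[-1] - D[0]))

def lnZ_saddle(N):
    A_tilde = A * np.sqrt(2.0 * np.pi * mu**(1.0/(mu+1)) / (mu+1)) * B**(1.0/(2*mu+2))
    C = B**(1.0/(mu+1)) * (mu**(-mu/(mu+1)) + mu**(1.0/(mu+1)))
    return np.log(A_tilde) - 0.5 * (mu+2)/(mu+1) * np.log(N) - C * N**(mu/(mu+1))

for N in (10, 100, 1000):
    print(N, round(lnZ_quadrature(N), 3), round(lnZ_saddle(N), 3))
# the two columns approach each other as N grows
```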

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N:

S_1^{(a)}(N) \to \frac{1}{\ln 2}\, \langle C(\bar{\alpha}) \rangle_{\bar{\alpha}}\, N^{\mu/(\mu+1)} \quad \text{(bits)},   (4.66)

where ⟨· · ·⟩_ᾱ denotes an average over all the target models.


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a nitely parameter-izable system with a growing number of parameters For example supposethat we approximate the distribution of a random variable by a histogramwith K bins and we let K grow with the quantity of available samples asK raquo Nordm A straightforward generalization of equation 445 seems to implythen that S1(N) raquo Nordm log N (Hall amp Hannan 1988 Barron amp Cover 1991)While not necessarily wrong analogies of this type are dangerous If thenumber of parameters grows constantly then the scenario where the dataoverwhelm all the unknowns is far from certain Observations may providemuch less information about features that were introduced into the model atsome large N than about those that have existed since the very rst measure-ments Indeed in a K-parameter system the Nth sample point contributesraquo K2N bits to the subextensive entropy (cf equation 445) If K changesas mentioned the Nth example then carries raquo Nordmiexcl1 bits Summing this upover all samples we nd S

(a)1 raquo Nordm and if we let ordm D m (m C 1) we obtain

equation 466 note that the growth of the number of parameters is slowerthan N (ordm D m (m C 1) lt 1) which makes sense Rissanen Speed and Yu(1992) made a similar observation According to them for models with in-creasing number of parameters predictive codes which are optimal at eachparticular N (cf equation 38) provide a strictly shorter coding length thannonpredictive codes optimized for all data simultaneously This has to becontrasted with the nite-parameter model classes for which these codesare asymptotically equal
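To make the counting explicit (a small arithmetic check, under the same K ~ N^ν growth assumed above):

\[
S_1^{(a)}(N) \sim \sum_{n=1}^{N} \frac{K(n)}{2n} \sim \frac{1}{2} \sum_{n=1}^{N} n^{\nu-1} \approx \frac{N^{\nu}}{2\nu},
\]

so the subextensive entropy scales as N^ν rather than N^ν log N once the late appearance of new parameters is taken into account.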

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning: in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar


to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/ℓ_0) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

\[
P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial \phi}{\partial x} \right)^{2} \right] \delta\!\left[ \frac{1}{\ell_0} \int dx\, e^{-\phi(x)} - 1 \right], \tag{4.67}
\]

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the


development of singular peaks at the locations of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.^11

From the discussion above, we know that the predictive information is related to the density of KL divergences, and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/ℓ_0) exp[−φ(x)], we can write the KL divergence as

\[
D_{\rm KL}[\bar\phi(x) \| \phi(x)] = \frac{1}{\ell_0} \int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)]. \tag{4.68}
\]

We want to compute the density

\[
\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left( D - D_{\rm KL}[\bar\phi(x) \| \phi(x)] \right) \tag{4.69}
\]
\[
= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left( MD - M D_{\rm KL}[\bar\phi(x) \| \phi(x)] \right), \tag{4.70}
\]

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\[
\rho(D; \bar\phi) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{\rm KL}) \tag{4.71}
\]
\[
= M \int \frac{dz}{2\pi}\, \exp\left( izMD + \frac{izM}{\ell_0} \int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] \right)
\times \int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{\ell_0} \int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right). \tag{4.72}
\]

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

^11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

\[
\int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{\ell_0} \int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right)
= \exp\left( -\frac{izM}{\ell_0} \int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM] \right), \tag{4.73}
\]
\[
S_{\rm eff}[\bar\phi; zM] = \frac{\ell}{2} \int dx \left( \frac{\partial \bar\phi}{\partial x} \right)^{2}
+ \frac{1}{2} \left( \frac{izM}{\ell \ell_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2]. \tag{4.74}
\]

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} ≫ 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\[
\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar\phi(x)]}{D} \right), \tag{4.75}
\]
\[
A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi \ell \ell_0}}\, \exp\left[ -\frac{\ell}{2} \int dx\, (\partial_x \bar\phi)^{2} \right] \int dx\, \exp[-\bar\phi(x)/2], \tag{4.76}
\]
\[
B[\bar\phi(x)] = \frac{1}{16\, \ell \ell_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^{2}. \tag{4.77}
\]
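To see where the essential singularity in equation 4.77 comes from, here is a compressed version of the z saddle point (the smooth kinetic term of S_eff is z-independent and ends up in the prefactor A[φ̄]). Writing u = izM and Q = ∫ dx exp[−φ̄(x)/2], the z integral takes the form

\[
\rho \propto \int du\, \exp\left[ uD - \frac{Q}{2} \sqrt{\frac{u}{\ell \ell_0}} \right],
\]

which is stationary at √u_* = Q / (4D√(ℓℓ_0)); at this point the exponent equals −Q²/(16 ℓ ℓ_0 D) = −B[φ̄]/D, and the Gaussian fluctuations around u_* supply the D^{−3/2} prefactor.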

Except for the factor of D^{−3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{−3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

\[
S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \sqrt{\frac{1}{\ell \ell_0}}\, \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} N^{1/2}, \tag{4.78}
\]

where ⟨· · ·⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

\[
S_1^{(a)}(N) \sim \frac{1}{2 \ln 2}\, \sqrt{N} \left( \frac{L}{\ell} \right)^{1/2} \quad \text{bits}. \tag{4.79}
\]
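The step from equation 4.78 to equation 4.79 is just the evaluation of the average for the uniform distribution on [0, L] (a quick check): normalization requires (1/ℓ_0) ∫_0^L dx exp(−φ̄) = 1, so exp(−φ̄) = ℓ_0/L and

\[
\int_0^L dx\, \exp[-\bar\phi/2] = \sqrt{L \ell_0},
\]

which cancels the ℓ_0 dependence in equation 4.78 and leaves S_1^{(a)}(N) ≈ (1/(2 ln 2)) √(NL/ℓ).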

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size Δx ~ √(ℓ/(N Q(x))) (Bialek,


Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
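The proportionality can be checked directly from the local bin size quoted above (a short consistency check):

\[
N_{\rm bins} \approx \int_0^L \frac{dx}{\Delta x(x)} = \int_0^L dx\, \sqrt{\frac{N Q(x)}{\ell}}\ \longrightarrow\ \sqrt{\frac{NL}{\ell}} \quad \text{for } Q(x) = 1/L,
\]

reproducing the N and L/ℓ dependence of equation 4.79 up to the overall constant.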

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ~ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities n ~ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). Real calculations for these cases, however, remain challenging.

5 I_pred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.^12 Rissanen (1984) and Clarke and Barron (1990), in full

^12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our


approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy), since the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on


a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, which is not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure the complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.^13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.^14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

^13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

^14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;^15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.
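Schematically, one can see this for a signal sampled at times t_1, ..., t_n within the window T (a small illustration; we assume an invertible, instantaneous change of variables y = f(x)):

\[
S_y(T) = S_x(T) + \sum_{i=1}^{n} \left\langle \log_2 \left| \frac{\partial f}{\partial x}\big(x(t_i)\big) \right| \right\rangle,
\]

so the Jacobian correction grows with the number of samples, that is, extensively in T, while mutual informations such as I_pred(T) are untouched because the same correction appears in the joint and marginal entropies and cancels.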

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

^15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.^16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number \mathcal{N}(N) of walks with N steps, we find that \mathcal{N}(N) ~ A N^{\gamma} z^{N} (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) → γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
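Writing out the logarithm makes the three pieces explicit:

\[
S(N) = \log_2 \mathcal{N}(N) \approx N \log_2 z + \gamma \log_2 N + \log_2 A,
\]

with the extensive, divergent subextensive, and constant contributions appearing term by term.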

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

\[
S_{\rm cont} = -\int dx(t)\, P[x(t)]\, \log_2 P[x(t)], \tag{5.1}
\]

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

\[
S_{\rm rel} = -\int dx(t)\, P[x(t)]\, \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right), \tag{5.2}
\]

^16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures: the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have


seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about the dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).
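In the formulation of Tishby, Pereira, and Bialek (1999), this one-parameter family arises from a variational principle, which can be written schematically as

\[
\mathcal{L}\big[ p(\hat{x}_{\rm past} \mid x_{\rm past}) \big]
= I(\hat{x}_{\rm past}; x_{\rm past}) - \beta\, I(\hat{x}_{\rm past}; x_{\rm future}),
\]

to be minimized over the compression mapping; varying the multiplier β traces out the optimal curve, with small β favoring maximal compression of the past and large β favoring retention of the predictive bits.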

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data; using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.^17 It should be possible to test this directly by

^17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,^18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

^18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ~ A N^ν and one with K parameters, so that I_pred(N) ~ (K/2) log_2 N, the infinite-parameter model has less predictive information for all N smaller than some critical value,

\[
N_c \sim \left[ \frac{K}{2 A \nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}. \tag{6.1}
\]
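Equation 6.1 follows by equating the two asymptotic forms and solving to leading logarithmic order (a brief sketch):

\[
A N_c^{\nu} = \frac{K}{2} \log_2 N_c = \frac{K}{2\nu} \log_2 N_c^{\nu}
\quad \Longrightarrow \quad
N_c^{\nu} = \frac{K}{2A\nu} \log_2 N_c^{\nu} \approx \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right),
\]

where in the last step the leading estimate N_c^ν ≈ K/(2A) is substituted back into the slowly varying logarithm and log-log corrections are dropped.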

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cor-


tex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ~ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ~ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the preceding N letters (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.^19
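In this guessing-game construction, if p_g denotes the fraction of positions at which the reader needed exactly g guesses after seeing the preceding N letters, then, because the guess number is an invertible function of the letter once the guessing strategy is fixed, the recorded data bound the conditional entropy as

\[
S(\text{letter} \mid N\text{-letter context}) \;\le\; -\sum_{g \ge 1} p_g \log_2 p_g,
\]

so a slow decrease of this bound with N is direct evidence for a large subextensive component.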

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conven-

^19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


tional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/, followed by the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.

Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.

Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.

Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.

Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.

Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.

Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.

Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Sympo-


sium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.

Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.

Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.

Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.

Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.

Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.

Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.

Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.

Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.

Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.

Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.

Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.

Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.

Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.

Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.

Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.

Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.

Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.

Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.

Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that $\langle D_{\rm KL}(\bar\alpha\|\alpha)\rangle_{\bar\alpha\alpha}$ includes $S_0$ as one of its terms, so that usually $S_0$ and $S_1$ are well or ill defined together.

Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if $\psi$ were zero, and $\psi$ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).

We start with the case when the pseudodimension ($d_{\rm VC}$) of the set of probability densities $\{Q(\vec x\,|\,\alpha)\}$ is finite. Then for any reasonable function $F(\vec x;\beta)$, deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

$$
P\left\{ \sup_\beta \frac{\left| \frac{1}{N}\sum_j F(\vec x_j;\beta) - \int d\vec x\, Q(\vec x\,|\,\bar\alpha)\, F(\vec x;\beta) \right|}{L[F]} > \epsilon \right\} < M(\epsilon, N, d_{\rm VC})\, e^{-cN\epsilon^2},
\tag{4.56}
$$

where $c$ and $L[F]$ depend on the details of the particular bound used. Typically, $c$ is a constant of order one, and $L[F]$ either is some moment of $F$ or the range of its variation. In our case, $F$ is the log ratio of two densities, so that $L[F]$ may be assumed bounded for almost all $\beta$ without loss of generality in view of equation 4.55. In addition, $M(\epsilon, N, d_{\rm VC})$ is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on $d_{\rm VC}$. Bounds of this form may have different names in different contexts: for example, Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for review, see Vapnik, 1998, and the references therein).

To start the proof of finiteness of $S_1^{(f)}$ in this case, we first show that only the region $\alpha \approx \bar\alpha$ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of $\alpha - \bar\alpha$, the KL divergence almost always dominates the fluctuation term; that is, the contribution of sequences of $\{\vec x_i\}$ with atypically large fluctuations is negligible (atypicality is defined as $\psi \geq \delta$, where $\delta$ is some small constant independent of $N$). Since the fluctuations decrease as $1/\sqrt{N}$ (see equation 4.56) and $D_{\rm KL}$ is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by $N$ times the supremum value of $\psi$. Then we realize that the averaging over $\bar\alpha$ and $\{\vec x_i\}$ is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to $\epsilon$ (this brings down an extra factor of $N\epsilon$). Thus the worst-case contribution of these atypical sequences is

$$
S_1^{(f),\,\rm atypical} \sim \int_\delta^\infty d\epsilon\, N^2\epsilon^2\, M(\epsilon)\, e^{-cN\epsilon^2} \sim e^{-cN\delta^2} \ll 1 \quad \text{for large } N.
\tag{4.57}
$$


This bound lets us focus our attention on the region $\alpha \approx \bar\alpha$. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

$$
S_1^{(f),\,\rm typical} \sim \int d^K\bar\alpha\, P(\bar\alpha)\, d^N\vec x \prod_j Q(\vec x_j\,|\,\bar\alpha)\; N\,(B\,A^{-1}B);
$$
$$
B_\mu = \frac{1}{N}\sum_i \frac{\partial \log Q(\vec x_i\,|\,\bar\alpha)}{\partial\bar\alpha_\mu}, \qquad \langle B\rangle_{\vec x} = 0;
$$
$$
(A)_{\mu\nu} = -\frac{1}{N}\sum_i \frac{\partial^2 \log Q(\vec x_i\,|\,\bar\alpha)}{\partial\bar\alpha_\mu\,\partial\bar\alpha_\nu}, \qquad \langle A\rangle_{\vec x} = F.
\tag{4.58}
$$

Here $\langle\cdots\rangle_{\vec x}$ means an averaging with respect to all $\vec x_i$'s keeping $\bar\alpha$ constant. One immediately recognizes that $B$ and $A$ are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of $A$ from $F$ are not allowed, and we may bound equation 4.58 by replacing $A^{-1}$ with $F^{-1}(1+\delta)$, where $\delta$ again is independent of $N$. Now we have to average a bunch of products like

$$
\frac{\partial\log Q(\vec x_i\,|\,\bar\alpha)}{\partial\bar\alpha_\mu}\,(F^{-1})_{\mu\nu}\,\frac{\partial\log Q(\vec x_j\,|\,\bar\alpha)}{\partial\bar\alpha_\nu}
\tag{4.59}
$$

over all $\vec x_i$'s. Only the terms with $i = j$ survive the averaging. There are $K^2 N$ such terms, each contributing of order $N^{-1}$. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with $N$.
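As an aside not in the original text, this order-one bound on the typical fluctuations is easy to check numerically in the simplest case. The sketch below is an illustration under arbitrary assumed settings: it uses the one-parameter gaussian model of equation 4.60, for which $K = 1$, the score is $\partial\log Q/\partial\alpha = x - \alpha$, and the Fisher information is $F = 1$, so the Monte Carlo average of $N\,B\,F^{-1}B$ should stay near $K = 1$ for any $N$.

```python
import numpy as np

# Minimal sketch (not from the paper): for the one-parameter gaussian model
# Q(x|alpha) = N(alpha, 1), the score is d log Q / d alpha = x - alpha, the
# Fisher information is F = 1, and the typical-fluctuation term of
# equation 4.58 is N * (B F^{-1} B) with B = (1/N) sum_i (x_i - alpha_bar).
# Its Monte Carlo average should stay of order one (here, K = 1) as N grows.

rng = np.random.default_rng(0)
alpha_bar = 0.0                       # target parameter (arbitrary)
trials = 1000                         # Monte Carlo repetitions (arbitrary)
for N in (10, 100, 1000, 10000):
    x = rng.normal(alpha_bar, 1.0, size=(trials, N))
    B = (x - alpha_bar).mean(axis=1)  # empirical average of the score
    fluct = N * B**2                  # N * B F^{-1} B with F = 1
    print(f"N = {N:6d}   <N B^2> = {fluct.mean():.3f}   (stays O(1))")
```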

This concludes the proof of controllability of fluctuations for $d_{\rm VC} < \infty$. One may notice that we never used the specific form of $M(\epsilon, N, d_{\rm VC})$, which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of $\alpha$ and $\bar\alpha$ that lead to violation of the bound. That is, if the prior $P(\alpha)$ has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite structure $C$ of nested subsets $C_1 \subset C_2 \subset C_3 \subset \cdots$ can be imposed onto the set $C$ of all admissible solutions of a learning problem, such that $C = \bigcup C_n$. The idea is that, having a finite number of observations $N$, one is confined to the choices within some particular structure element $C_n$, $n = n(N)$, when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, $n$ grows. If $d_{\rm VC}$ for learning in any $C_n$, $n < \infty$, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set $C$ is not described by a finite $d_{\rm VC}$.

In the context of SRM, the role of the prior $P(\alpha)$ is to induce a structure on the set of all admissible densities, and the fight between the number of samples $N$ and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, $C_n$, $n = n(N)$, grows with $N$. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable $x$ is distributed according to the following probability density functions:

$$
Q(x\,|\,\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\alpha)^2\right], \qquad x \in (-\infty; +\infty);
\tag{4.60}
$$
$$
Q(x\,|\,\alpha) = \frac{\exp(-\sin\alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin\alpha x)}, \qquad x \in [0; 2\pi).
\tag{4.61}
$$

Learning the parameter in the first case is a $d_{\rm VC} = 1$ problem; in the second case, $d_{\rm VC} = \infty$. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior $P(\alpha)$. The second one does not allow this. Suppose that the prior is uniform in a box $0 < \alpha < \alpha_{\max}$ and zero elsewhere, with $\alpha_{\max}$ rather large. Then for not too many sample points $N$, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of $-\sin\alpha x$. Adding a new data point would not help, until that best but wrong parameter estimate is less than $\alpha_{\max}$.^{10} So the fluctuations are large, and the predictive information is small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of $\alpha$ would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of $(1/2)\log N$. On the other hand, there is no uniform bound on the value of $N$ for which this convergence will occur; it is guaranteed only for $N \gg d_{\rm VC}$, which is never true if $d_{\rm VC} = \infty$. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to $d_{\rm VC}$.

^{10} Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for $\alpha_{\max}\to\infty$ the weight in $\rho(D;\bar\alpha)$ at small $D_{\rm KL}$ vanishes, and a finite weight accumulates at some nonzero value of $D$. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.
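To make the preceding argument concrete, here is a small simulation, not part of the original analysis, of the model in equation 4.61: samples are drawn from $Q(x\,|\,\bar\alpha)$, and the log likelihood is scanned over a wide "prior box" of candidate $\alpha$. The true parameter, box size, grids, and sample sizes below are arbitrary illustrative choices; for small $N$ the best-fitting $\alpha$ often lands at a spurious large value, and only for larger $N$ does it settle near $\bar\alpha$.

```python
import numpy as np

# Sketch: brute-force maximum likelihood for Q(x|alpha) ~ exp(-sin(alpha x))
# on [0, 2*pi), equation 4.61, with candidate alpha scanned over a wide box.
# For small N the likelihood is often maximized by a spurious large alpha
# whose crests of -sin(alpha x) pass near all of the observed points.
# alpha_true, the box size, the grids, and the sample sizes are arbitrary.

rng = np.random.default_rng(1)
alpha_true = 1.0
alpha_grid = np.linspace(0.01, 200.0, 4001)           # wide "prior box"
x_fine = np.linspace(0.0, 2.0 * np.pi, 20001)
dx = x_fine[1] - x_fine[0]
# log of the normalization integral, log Z(alpha), for every candidate alpha
logZ = np.array([np.log(np.exp(-np.sin(a * x_fine)).sum() * dx)
                 for a in alpha_grid])

def draw_samples(n):
    # crude rejection sampling from Q(x|alpha_true); note exp(-sin) <= e
    out = np.empty(0)
    while out.size < n:
        prop = rng.uniform(0.0, 2.0 * np.pi, 10 * n)
        accept = rng.uniform(0.0, 1.0, prop.size) < np.exp(-np.sin(alpha_true * prop)) / np.e
        out = np.concatenate([out, prop[accept]])
    return out[:n]

for N in (5, 20, 100, 1000):
    x = draw_samples(N)
    loglik = -np.sin(np.outer(alpha_grid, x)).sum(axis=1) - N * logZ
    alpha_hat = alpha_grid[np.argmax(loglik)]
    print(f"N = {N:5d}   ML estimate of alpha = {alpha_hat:8.2f}"
          f"   (true value = {alpha_true})")
```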

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite $d_{\rm VC}$ fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with $N$ as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which


the predictive information grows with $N$ more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as $\rho \sim D^{(d-2)/2}$. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density $\rho(D \to 0)$ vanishes more rapidly than any power of $D$. As an example, we could have

$$
\rho(D \to 0) \approx A \exp\left[-\frac{B}{D^\mu}\right], \qquad \mu > 0,
\tag{4.62}
$$

that is, an essential singularity at $D = 0$. For simplicity, we assume that the constants $A$ and $B$ can depend on the target model, but that the nature of the essential singularity ($\mu$) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

$$
Z(\bar\alpha; N) = \int dD\, \rho(D;\bar\alpha)\, \exp[-ND]
\approx A(\bar\alpha) \int dD\, \exp\left[-\frac{B(\bar\alpha)}{D^\mu} - ND\right]
\approx \tilde A(\bar\alpha)\, \exp\left[-\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)}\right],
\tag{4.63}
$$

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large $N$, and the coefficients are

$$
\tilde A(\bar\alpha) = A(\bar\alpha)\left[\frac{2\pi\,\mu^{1/(\mu+1)}}{\mu+1}\right]^{1/2} \cdot [B(\bar\alpha)]^{1/(2\mu+2)},
\tag{4.64}
$$
$$
C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)}\left[\frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)}\right].
\tag{4.65}
$$

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large $N$,

$$
S_1^{(a)}(N) \to \frac{1}{\ln 2}\,\langle C(\bar\alpha)\rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \quad \text{(bits)},
\tag{4.66}
$$

where $\langle\cdots\rangle_{\bar\alpha}$ denotes an average over all the target models.
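The saddle-point formula can be checked directly by numerical integration. The following sketch (an illustration only; the constants $A$, $B$, and $\mu$ are arbitrary choices, not values derived in the text) compares $-\ln Z$ computed from the integral in the first line of equation 4.63 with the asymptotic expression in its last line.

```python
import numpy as np
from scipy.integrate import quad

# Sketch: numerical check of the saddle-point evaluation in equation 4.63
# for rho(D) = A exp(-B / D^mu); A, B, mu below are arbitrary illustrative
# constants, and N is the number of samples.
A, B, mu = 1.0, 1.0, 1.0

def Z_exact(N):
    D_star = (mu * B / N) ** (1.0 / (mu + 1))      # saddle-point location
    val, _ = quad(lambda D: A * np.exp(-B / D**mu - N * D),
                  0.0, 50.0, points=[D_star], limit=200)
    return val

def Z_saddle(N):
    A_tilde = A * (2.0 * np.pi * mu**(1.0 / (mu + 1)) / (mu + 1))**0.5 \
                * B**(1.0 / (2.0 * mu + 2.0))
    C = B**(1.0 / (mu + 1)) * (1.0 / mu**(mu / (mu + 1.0)) + mu**(1.0 / (mu + 1.0)))
    return A_tilde * np.exp(-0.5 * (mu + 2.0) / (mu + 1.0) * np.log(N)
                            - C * N**(mu / (mu + 1.0)))

for N in (10, 100, 1000):
    print(f"N = {N:5d}   -ln Z: exact = {-np.log(Z_exact(N)):8.3f}, "
          f"saddle point = {-np.log(Z_saddle(N)):8.3f}")
```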


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with $K$ bins, and we let $K$ grow with the quantity of available samples as $K \sim N^\nu$. A straightforward generalization of equation 4.45 seems to imply then that $S_1(N) \sim N^\nu \log N$ (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large $N$ than about those that have existed since the very first measurements. Indeed, in a $K$-parameter system, the $N$th sample point contributes $\sim K/2N$ bits to the subextensive entropy (cf. equation 4.45). If $K$ changes as mentioned, the $N$th example then carries $\sim N^{\nu-1}$ bits. Summing this up over all samples, we find $S_1^{(a)} \sim N^\nu$, and if we let $\nu = \mu/(\mu+1)$, we obtain equation 4.66; note that the growth of the number of parameters is slower than $N$ ($\nu = \mu/(\mu+1) < 1$), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular $N$ (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
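A short numerical check of this bookkeeping, not in the original text (the choice $\nu = 1/2$ and the ranges of $N$ are arbitrary): summing the per-sample contributions $K(n)/2n$ with $K(n) = n^\nu$ gives a total that grows as $N^\nu$, not as $N^\nu\log N$.

```python
import numpy as np

# Sketch: if the number of histogram bins grows as K(n) ~ n^nu, the n-th
# sample contributes ~ K(n)/(2n) bits, and the running total grows as N^nu
# (compared here with N^nu / (2 nu)), not as N^nu * log2(N).
nu = 0.5
n = np.arange(1, 10**6 + 1)
running_total = np.cumsum(n**nu / (2.0 * n))
for N in (10**2, 10**4, 10**6):
    print(f"N = {N:8d}   sum K(n)/2n = {running_total[N-1]:9.2f}   "
          f"N^nu/(2 nu) = {N**nu / (2 * nu):9.2f}   "
          f"N^nu log2 N = {N**nu * np.log2(N):11.1f}")
```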

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As $\mu$ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large $N$, the predictive information increases with $\mu$. However, if $\mu \to \infty$, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar


to, but sometimes very nontrivially related to, $S_1$ are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution $Q(x)$ for a continuous variable $x$, but rather than writing a parametric form of $Q(x)$, we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write $Q(x) = (1/\ell_0)\exp[-\phi(x)]$, so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function $\phi(x)$ is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

$$
P[\phi(x)] = \frac{1}{Z}\exp\left[-\frac{\ell}{2}\int dx\left(\frac{\partial\phi}{\partial x}\right)^2\right]\,\delta\!\left[\frac{1}{\ell_0}\int dx\, e^{-\phi(x)} - 1\right],
\tag{4.67}
$$

where $Z$ is the normalization constant for $P[\phi]$, the delta function ensures that each distribution $Q(x)$ is normalized, and $\ell$ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of $Q(x)$ from a set of samples $x_1, x_2, \ldots, x_N$ converges to the correct answer, and the distribution at finite $N$ is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of $P[\phi(x)]$, have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.^{11}

^{11} We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.
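Although the analytical treatment below does not require it, it may help to see the prior of equation 4.67 in action. The following sketch is only loosely modeled on the numerical work cited above and is not the authors' implementation: it computes a discretized MAP estimate of $Q(x) = e^{-\phi(x)}/Z$ under the smoothness penalty $(\ell/2)\int dx\,(\partial_x\phi)^2$, with normalization enforced by construction rather than by the delta function. The grid, box size, target density, $\ell$, and $N$ are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch (simplified; not the paper's field-theoretic calculation): MAP
# estimate of Q(x) = exp(-phi(x)) / Z on a grid, under the smoothness
# penalty (ell/2) * integral (d phi / dx)^2 dx of equation 4.67. The box
# size L, grid, smoothness scale ell, sample size N, and the "true"
# density used to generate data are all illustrative assumptions.

rng = np.random.default_rng(2)
L, M, ell, N = 1.0, 200, 0.05, 1000
x = np.linspace(0.0, L, M)
dx = x[1] - x[0]

# a smooth "true" density and N samples drawn from it by inverse-CDF
q_true = np.exp(np.sin(2 * np.pi * x / L) + 0.5 * np.sin(6 * np.pi * x / L))
q_true /= q_true.sum() * dx
cdf = np.cumsum(q_true) * dx
samples = np.interp(rng.uniform(0.0, cdf[-1], N), cdf, x)
counts, _ = np.histogram(samples, bins=M, range=(0.0, L))  # samples per grid cell

def energy_and_grad(phi):
    Z = np.exp(-phi).sum() * dx
    Q = np.exp(-phi) / Z                                   # normalized density
    E = 0.5 * ell * np.sum(np.diff(phi) ** 2) / dx \
        + np.dot(counts, phi) + N * np.log(Z)              # minus log posterior (+ const)
    curv = np.zeros_like(phi)                              # discrete second-difference term
    curv[1:-1] = 2 * phi[1:-1] - phi[:-2] - phi[2:]
    curv[0], curv[-1] = phi[0] - phi[1], phi[-1] - phi[-2]
    grad = (ell / dx) * curv + counts - N * Q * dx
    return E, grad

res = minimize(energy_and_grad, np.zeros(M), jac=True, method="L-BFGS-B")
Q_map = np.exp(-res.x) / (np.exp(-res.x).sum() * dx)
print("converged:", res.success,
      "  mean |Q_map - Q_true| =", float(np.mean(np.abs(Q_map - q_true))))
```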

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate $\rho(D;\bar\phi)$ in the model defined by equation 4.67.

With $Q(x) = (1/\ell_0)\exp[-\phi(x)]$, we can write the KL divergence as

$$
D_{\rm KL}[\bar\phi(x)\|\phi(x)] = \frac{1}{\ell_0}\int dx\, \exp[-\bar\phi(x)]\,[\phi(x) - \bar\phi(x)].
\tag{4.68}
$$

We want to compute the density

$$
\rho(D;\bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left(D - D_{\rm KL}[\bar\phi(x)\|\phi(x)]\right)
\tag{4.69}
$$
$$
= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left(MD - M D_{\rm KL}[\bar\phi(x)\|\phi(x)]\right),
\tag{4.70}
$$

where we introduce a factor $M$, which we will allow to become large so that we can focus our attention on the interesting limit $D \to 0$. To compute this integral over all functions $\phi(x)$, we introduce a Fourier representation for the delta function and then rearrange the terms:

$$
\rho(D;\bar\phi) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{\rm KL})
\tag{4.71}
$$
$$
= M \int \frac{dz}{2\pi}\, \exp\left(izMD + \frac{izM}{\ell_0}\int dx\, \bar\phi(x)\, e^{-\bar\phi(x)}\right)
\times \int [d\phi(x)]\, P[\phi(x)]\, \exp\left(-\frac{izM}{\ell_0}\int dx\, \phi(x)\, e^{-\bar\phi(x)}\right).
\tag{4.72}
$$

The inner integral over the functions $\phi(x)$ is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that $zM$ is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

$$
\int [d\phi(x)]\, P[\phi(x)]\, \exp\left(-\frac{izM}{\ell_0}\int dx\, \phi(x)\, e^{-\bar\phi(x)}\right)
= \exp\left(-\frac{izM}{\ell_0}\int dx\, \bar\phi(x)\, e^{-\bar\phi(x)} - S_{\rm eff}[\bar\phi(x); zM]\right),
\tag{4.73}
$$
$$
S_{\rm eff}[\bar\phi; zM] = \frac{\ell}{2}\int dx\left(\frac{\partial\bar\phi}{\partial x}\right)^2 + \frac{1}{2}\left(\frac{izM}{\ell\ell_0}\right)^{1/2}\int dx\, \exp[-\bar\phi(x)/2].
\tag{4.74}
$$

Now we can do the integral over $z$, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that $D \to 0$ and $MD^{3/2} \gg 1$; we are interested precisely in the first limit, and we are free to set $M$ as we wish, so this gives us a good approximation for $\rho(D \to 0;\bar\phi)$. This results in

$$
\rho(D \to 0;\bar\phi) = A[\bar\phi(x)]\, D^{-3/2}\exp\left(-\frac{B[\bar\phi(x)]}{D}\right),
\tag{4.75}
$$
$$
A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\,\ell\ell_0}}\,\exp\left[-\frac{\ell}{2}\int dx\,(\partial_x\bar\phi)^2\right] \times \int dx\, \exp[-\bar\phi(x)/2],
\tag{4.76}
$$
$$
B[\bar\phi(x)] = \frac{1}{16\,\ell\ell_0}\left(\int dx\, \exp[-\bar\phi(x)/2]\right)^2.
\tag{4.77}
$$

Except for the factor of $D^{-3/2}$, this is exactly the sort of essential singularity that we considered in the previous section, with $\mu = 1$. The $D^{-3/2}$ prefactor does not change the leading large-$N$ behavior of the predictive information, and we find that

$$
S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\sqrt{\frac{\pi}{\ell\ell_0}}\,\left\langle \int dx\, \exp[-\bar\phi(x)/2]\right\rangle_{\bar\phi}\, N^{1/2},
\tag{4.78}
$$

where $\langle\cdots\rangle_{\bar\phi}$ denotes an average over the target distributions $\bar\phi(x)$, weighted once again by $P[\bar\phi(x)]$. Notice that if $x$ is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable $x$ range from 0 to $L$, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

$$
S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\sqrt{\pi N}\left(\frac{L}{\ell}\right)^{1/2} \quad \text{bits}.
\tag{4.79}
$$

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of $x$ into bins of local size $\sqrt{\ell/[NQ(x)]}$ (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of $x$. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
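Two small numerical observations, added here for illustration and not part of the original text, connect equations 4.78 and 4.79 to this binning picture: the number of adaptive bins is $\int dx\,\sqrt{NQ(x)/\ell}$, which involves the same integral $\int dx\,\sqrt{Q(x)}$ that enters equation 4.78 through $e^{-\bar\phi/2} \propto \sqrt{Q}$, and this integral is largest, equal to $\sqrt{L}$, for the uniform distribution, which is why the uniform case dominates the average. The example densities below are arbitrary.

```python
import numpy as np

# Sketch: the integral of sqrt(Q(x)) over a box of length L controls both the
# average in equation 4.78 (through exp(-phi/2) ~ sqrt(Q)) and the number of
# adaptive bins, integral dx sqrt(N Q(x) / ell). By Cauchy-Schwarz it is
# maximized by the uniform density, where it equals sqrt(L).
L = 1.0
x = np.linspace(0.0, L, 100001)
dx = x[1] - x[0]

def normalized(q):
    return q / (q.sum() * dx)

examples = {
    "uniform":     normalized(np.ones_like(x)),
    "broad bump":  normalized(np.exp(-0.5 * ((x - 0.5) / 0.10) ** 2)),
    "narrow bump": normalized(np.exp(-0.5 * ((x - 0.5) / 0.01) ** 2)),
}
for name, Q in examples.items():
    print(f"{name:12s}  integral sqrt(Q) dx = {np.sqrt(Q).sum() * dx:.3f}"
          f"   (sqrt(L) = {np.sqrt(L):.3f})")
```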

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step $\sim\sqrt{N}$ bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure $C$ on the set of all possible densities, so that the set $C_n$ is formed of all continuous piecewise linear functions that have not more than $n$ kinks. Learning such functions for finite $n$ is a $d_{\rm VC} = n$ problem. Now, as $N$ grows, the elements with higher capacities, $n \sim \sqrt{N}$, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter $\ell$ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of $\ell$ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing $\ell$ to vary with $x$. In this case, the whole theory has no length scale, and there also is no need to confine the variable $x$ to a box (here of size $L$). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying $\sqrt{N}$ in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn); but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.^{12} Rissanen (1984) and Clarke and Barron (1990) in full generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

^{12} Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution $P$ is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream $x_1, x_2, \ldots, x_N$ will require an amount of space equal to the entropy of the joint distribution $P(x_1, x_2, \ldots, x_N)$. The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have $N$ symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution $P(x_1, x_2, \ldots, x_N)$.

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of $N$ points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a $K$-parameter model, both quantities are equal to $(K/2)\log_2 N$ bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that $I_{\rm pred}$ provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density $S_0$ and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) $I_{\rm pred}(T\to\infty) = \text{constant}$.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite, whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal $x(0 < t < T)$ drawn from a distribution $P[x(t)]$. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.^{13} There remains a question, however, as to whether we want to attach measures of complexity to a particular signal $x(t)$ or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble $P[x(t)]$.

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.^{14} Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles, in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution $P[x(t)]$.

^{13} The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration $T$ of our observations. We leave these algorithmic issues aside for this discussion.

^{14} This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are $N$ equally likely signals, then the measure should be monotonic in $N$; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, $S(T)$. We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal $x$ is continuous, then the entropy is not invariant under transformations of $x$. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function $S(T)$ that depends on the coordinate system for $x$;^{15} it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines we can think about the asymptotic ex-pansion of the entropy at large T The extensive term is the rst term inthis series and we have seen that it must be discarded What about the

15 Here we consider instantaneous transformations of x not ltering or other transfor-mations that mix points at different times

Predictability Complexity and Learning 2451

In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number $\mathcal{N}$ of walks with N steps, we find that $\mathcal{N}(N) \approx A N^{\gamma} z^{N}$ (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy $S_0 = \log_2 z$, a constant subextensive entropy $\log_2 A$, and a divergent subextensive term $S_1(N) \rightarrow \gamma \log_2 N$. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
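A minimal numerical sketch of this decomposition (our own illustration, not part of the original argument; z ≈ 2.638 and γ = 43/32 are the values usually quoted for two-dimensional self-avoiding walks, while A is an arbitrary constant) generates S(N) = log₂ N(N) from the scaling form, discards the extensive piece, and recovers the logarithmic divergence:

    import numpy as np

    # Illustrative constants in the scaling form N(N) ~ A * N**gamma * z**N.
    # z and gamma are roughly the values quoted for 2D self-avoiding walks;
    # A is arbitrary, since only the divergent term is universal.
    z, A, gamma = 2.638, 1.2, 43.0 / 32.0

    N = np.arange(10, 10001)
    S = N * np.log2(z) + np.log2(A) + gamma * np.log2(N)   # S(N) = log2 N(N)

    # Discard the (nonuniversal) extensive piece and fit the remainder to log2 N.
    subextensive = S - N * np.log2(z)
    coeff, const = np.polyfit(np.log2(N), subextensive, 1)
    print(f"divergent coefficient ~ {coeff:.3f} (input gamma = {gamma:.3f})")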

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

$$ S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)], \qquad (5.1) $$

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

$$ S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2\!\left( \frac{P[x(t)]}{Q[x(t)]} \right), \qquad (5.2) $$

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].
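The role of the Jacobian can be made explicit in a discretized check. In the following Python sketch (our own illustration; the Gaussians and the change of variables are arbitrary choices), the naive continuum entropy of equation 5.1 shifts under a smooth transformation of x, while the relative entropy of equation 5.2 does not, because the Jacobian factors cancel in the ratio P/Q:

    import numpy as np

    x = np.linspace(-8.0, 8.0, 200001)
    dx = x[1] - x[0]

    # Arbitrary "true" distribution P and reference Q (both Gaussians here).
    P = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
    Q = np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

    # Smooth change of coordinates y = x + x**3/3, so dy/dx = 1 + x**2;
    # densities become P/fp and Q/fp, the volume element becomes fp * dx.
    fp = 1.0 + x**2

    S_cont_x = -np.sum(P * np.log2(P) * dx)
    S_cont_y = -np.sum((P / fp) * np.log2(P / fp) * fp * dx)

    S_rel_x = -np.sum(P * np.log2(P / Q) * dx)
    S_rel_y = -np.sum((P / fp) * np.log2((P / fp) / (Q / fp)) * fp * dx)

    print("S_cont:", S_cont_x, "->", S_cont_y)   # shifts under the transformation
    print("S_rel: ", S_rel_x, "->", S_rel_y)     # invariant: Jacobians cancel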

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this; on the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters; again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information.


We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about the dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ Ipred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = Ipred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
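For a first-order Markov chain, the most recent symbol is itself a sufficient statistic for prediction, so compressing the past down to that single symbol loses none of the predictive information. The Python sketch below (our own toy example, with an arbitrary flip probability) checks this by exact enumeration of short past and future windows:

    import numpy as np
    from itertools import product

    p = 0.2                                        # flip probability (arbitrary)
    T = np.array([[1 - p, p], [p, 1 - p]])         # transition matrix

    def mutual_info(joint):
        # I(row; column) in bits for a joint probability table
        pr, pc = joint.sum(1), joint.sum(0)
        mask = joint > 0
        return np.sum(joint[mask] * np.log2(joint[mask] / np.outer(pr, pc)[mask]))

    past_states = list(product([0, 1], repeat=2))    # past window (x1, x2)
    future_states = list(product([0, 1], repeat=2))  # future window (x3, x4)
    joint_full = np.zeros((4, 4))
    joint_last = np.zeros((2, 4))                    # compress past -> last symbol x2
    for i, (x1, x2) in enumerate(past_states):
        for j, (x3, x4) in enumerate(future_states):
            prob = 0.5 * T[x1, x2] * T[x2, x3] * T[x3, x4]
            joint_full[i, j] += prob
            joint_last[x2, j] += prob

    print("I(full past; future)   =", mutual_info(joint_full))
    print("I(last symbol; future) =", mutual_info(joint_last))   # identical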


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods: What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ≈ AN^ν and one with K parameters so that Ipred(N) ≈ (K/2) log₂ N, the infinite parameter model has less predictive information for all N smaller than some critical value,

$$ N_c \approx \left[ \frac{K}{2 A \nu}\, \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}. \qquad (6.1) $$

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology.


If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ≈ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ≈ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
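As a rough numerical illustration (the values of A and ν below are placeholders of our own choosing, since the argument constrains only ν < 1, while K and N are the order-of-magnitude estimates quoted above), equation 6.1 indeed puts N_c far beyond the number of available examples:

    import numpy as np

    K = 5e9           # rough synapse count quoted above
    N = 1e9           # rough number of foveations over ten years
    A, nu = 1.0, 0.5  # placeholder values; the argument requires only nu < 1

    Nc = (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1.0 / nu)   # equation 6.1
    print(f"N_c ~ {Nc:.2e},  N / N_c ~ {N / Nc:.2e}")   # N << N_c for these values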

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy ℓ(N) (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S₀. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S₁(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
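The bookkeeping in Shannon's experiment is simple enough to write out. The sketch below (with made-up guess counts, purely for illustration) turns a record of how many guesses each letter required into the entropy of the guess distribution, which upper-bounds the conditional entropy ℓ(N):

    import numpy as np
    from collections import Counter

    # Hypothetical record: number of guesses a reader needed for each letter.
    guesses = [1, 1, 2, 1, 3, 1, 1, 5, 1, 2, 1, 1, 4, 1, 1, 2, 1, 1, 1, 3]

    counts = Counter(guesses)
    probs = np.array([c / len(guesses) for c in counts.values()])

    # Entropy of the guess-number distribution: an upper bound on l(N), in bits/letter.
    bound = -np.sum(probs * np.log2(probs))
    print(f"upper bound on the conditional entropy: {bound:.2f} bits per letter")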

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the U.S.-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/ followed by the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.
Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An Introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple Gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

$$ S_1^{(f),\,{\rm typical}} \sim \int d^K\bar\alpha\, P(\bar\alpha)\, d^N\vec{x}\, \prod_j Q(\vec{x}_j|\bar\alpha)\; N\,(B\,A^{-1}B); $$
$$ B_\mu = \frac{1}{N}\sum_i \frac{\partial \log Q(\vec{x}_i|\bar\alpha)}{\partial\bar\alpha_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0; $$
$$ (A)_{\mu\nu} = -\frac{1}{N}\sum_i \frac{\partial^2 \log Q(\vec{x}_i|\bar\alpha)}{\partial\bar\alpha_\mu\,\partial\bar\alpha_\nu}, \qquad \langle A \rangle_{\vec{x}} = F. \qquad (4.58) $$

Here $\langle \cdots \rangle_{\vec{x}}$ means an averaging with respect to all the $\vec{x}_i$'s, keeping ᾱ constant. One immediately recognizes that B and A are, respectively, the first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.

We are dealing now with typical cases. Therefore, large deviations of A from F are not allowed, and we may bound equation 4.58 by replacing A⁻¹ with F⁻¹(1 + δ), where δ again is independent of N. Now we have to average a bunch of products like

$$ \frac{\partial \log Q(\vec{x}_i|\bar\alpha)}{\partial\bar\alpha_\mu}\, (F^{-1})_{\mu\nu}\, \frac{\partial \log Q(\vec{x}_j|\bar\alpha)}{\partial\bar\alpha_\nu} \qquad (4.59) $$

over all the $\vec{x}_i$'s. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N⁻¹. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.

This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε; N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.

A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite


structure C of nested subsets C₁ ⊂ C₂ ⊂ C₃ ⊂ · · · can be imposed onto the set C* of all admissible solutions of a learning problem, such that C* = ∪ₙ Cₙ. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element Cₙ, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any Cₙ, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C* is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, Cₙ, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.

Consider two examples. A variable x is distributed according to the following probability density functions:

$$ Q(x|\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x - \alpha)^2 \right], \qquad x \in (-\infty; +\infty); \qquad (4.60) $$
$$ Q(x|\alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \qquad x \in [0; 2\pi). \qquad (4.61) $$

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < α_max and zero elsewhere, with α_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin αx. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than α_max.10 So the fluctuations are large, and the predictive information is

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for α_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.


small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
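The pathology of the d_VC = ∞ example can be seen directly in a small simulation. The sketch below (our own illustration, with arbitrary choices of the true parameter, sample size, and prior box) scans the likelihood of equation 4.61 over a wide box of α values; for modest N the maximum typically sits at a large, spurious α rather than near the true value:

    import numpy as np

    rng = np.random.default_rng(0)
    a_true, N = 1.0, 20                      # arbitrary true parameter and sample size

    # Draw N samples from Q(x|a_true) on [0, 2*pi) by rejection sampling.
    samples = []
    while len(samples) < N:
        x = rng.uniform(0.0, 2 * np.pi)
        if rng.uniform(0.0, 1.0) < np.exp(-np.sin(a_true * x)) / np.e:
            samples.append(x)
    samples = np.array(samples)

    grid = np.linspace(0.0, 2 * np.pi, 4001)  # for the numerical normalization

    def log_likelihood(a):
        logZ = np.log(np.trapz(np.exp(-np.sin(a * grid)), grid))
        return np.sum(-np.sin(a * samples)) - N * logZ

    a_values = np.linspace(0.1, 200.0, 4000)  # a wide prior box, 0 < a < a_max
    lls = np.array([log_likelihood(a) for a in a_values])
    print("true a =", a_true, "  maximum-likelihood a in the box =",
          a_values[np.argmax(lls)])
    # For modest N the maximum typically lies at a large, spurious value of a.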

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario in which the number of examples eventually overwhelms the number of parameters or dimensions is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which


the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

$$ \rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^{\mu}} \right], \qquad \mu > 0, \qquad (4.62) $$

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

$$ Z(\bar\alpha; N) = \int dD\, \rho(D; \bar\alpha)\, \exp[-ND] $$
$$ \approx A(\bar\alpha) \int dD\, \exp\left[ -\frac{B(\bar\alpha)}{D^{\mu}} - ND \right] $$
$$ \approx \tilde{A}(\bar\alpha)\, \exp\left[ -\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)} \right], \qquad (4.63) $$

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

$$ \tilde{A}(\bar\alpha) = A(\bar\alpha) \left[ \frac{2\pi\,\mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} [B(\bar\alpha)]^{1/(2\mu+2)}, \qquad (4.64) $$
$$ C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right]. \qquad (4.65) $$

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

$$ S_1^{(a)}(N) \rightarrow \frac{1}{\ln 2}\, \langle C(\bar\alpha) \rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \quad {\rm (bits)}, \qquad (4.66) $$

where ⟨· · ·⟩_ᾱ denotes an average over all the target models.
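The saddle-point estimate can be checked against direct quadrature. In the sketch below (illustrative constants A, B, and μ of our own choosing), the ratio of the numerically integrated partition function to the asymptotic form of equations 4.63 through 4.65 approaches one as N grows:

    import numpy as np
    from scipy.integrate import quad

    A, B, mu = 1.0, 1.0, 1.0   # illustrative constants in rho(D) = A * exp(-B / D**mu)

    def Z_numeric(N):
        D_star = (mu * B / N) ** (1.0 / (mu + 1))          # saddle-point location
        integrand = lambda D: A * np.exp(-B / D**mu - N * D)
        value, _ = quad(integrand, 1e-12, 50.0, points=[D_star],
                        limit=200, epsabs=0.0, epsrel=1e-10)
        return value

    def Z_saddle(N):
        A_tilde = A * np.sqrt(2 * np.pi * mu ** (1.0 / (mu + 1)) / (mu + 1)) \
                    * B ** (1.0 / (2 * mu + 2))
        C = B ** (1.0 / (mu + 1)) * (mu ** (-mu / (mu + 1)) + mu ** (1.0 / (mu + 1)))
        return A_tilde * np.exp(-0.5 * (mu + 2) / (mu + 1) * np.log(N)
                                - C * N ** (mu / (mu + 1)))

    for N in [10, 100, 1000, 10000]:
        print(N, Z_numeric(N) / Z_saddle(N))   # ratio tends to 1 as N grows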


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S₁(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^{ν−1} bits. Summing this up over all samples, we find S₁^{(a)} ∼ N^ν, and if we let ν = μ/(μ + 1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ + 1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
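The summation argument is easy to verify numerically; in the following check (with ν = 1/2 chosen arbitrarily), the accumulated per-sample contributions of ∼ n^{ν−1}/2 bits indeed scale as N^ν:

    import numpy as np

    nu = 0.5
    for n_max in np.logspace(2, 6, 5).astype(int):
        n = np.arange(1, n_max + 1)
        S1 = np.sum(n ** (nu - 1.0)) / 2.0     # sum of per-sample contributions
        print(n_max, S1, S1 / n_max**nu)       # last column approaches 1/(2*nu)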

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar


to, but sometimes very nontrivially related to, S₁ are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/ℓ₀) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

$$ P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^{2} \right] \delta\!\left[ \frac{1}{\ell_0} \int dx\, e^{-\phi(x)} - 1 \right], \qquad (4.67) $$

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x₁, x₂, . . . , x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the


development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.11
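To make the prior of equation 4.67 concrete, one can sample from a discretized version of it. In the sketch below (our own discretization, with arbitrary values of ℓ, ℓ₀, and the grid), the quadratic action makes the increments of φ independent Gaussians, and the delta-function constraint is imposed by shifting φ by a constant, which leaves the action unchanged:

    import numpy as np

    rng = np.random.default_rng(1)
    L, ell, ell0 = 1.0, 0.1, 1.0   # box size, smoothness scale, normalization (arbitrary)
    n = 1000
    x = np.linspace(0.0, L, n)
    dx = x[1] - x[0]

    # The action (ell/2) * int (dphi/dx)^2 makes the increments of phi independent
    # Gaussians of variance dx/ell; the constant offset of phi is not constrained
    # by the action, so we fix it afterwards from the normalization condition.
    increments = rng.normal(0.0, np.sqrt(dx / ell), size=n - 1)
    phi = np.concatenate([[0.0], np.cumsum(increments)])
    phi += np.log(np.trapz(np.exp(-phi), x) / ell0)   # enforce (1/ell0) int e^{-phi} dx = 1

    Q = np.exp(-phi) / ell0                           # one sample distribution Q(x)
    print("normalization check:", np.trapz(Q, x))     # ~ 1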

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/ℓ₀) exp[−φ(x)], we can write the KL divergence as

$$ D_{\rm KL}[\bar\phi(x) \| \phi(x)] = \frac{1}{\ell_0} \int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)]. \qquad (4.68) $$

We want to compute the density

$$ \rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\bigl( D - D_{\rm KL}[\bar\phi(x) \| \phi(x)] \bigr) \qquad (4.69) $$
$$ \qquad\quad = M \int [d\phi(x)]\, P[\phi(x)]\, \delta\bigl( MD - M D_{\rm KL}[\bar\phi(x) \| \phi(x)] \bigr), \qquad (4.70) $$

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

$$ \rho(D; \bar\phi) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{\rm KL}) \qquad (4.71) $$
$$ = M \int \frac{dz}{2\pi}\, \exp\!\left( izMD + \frac{izM}{\ell_0} \int dx\, \bar\phi(x)\, \exp[-\bar\phi(x)] \right) $$
$$ \quad \times \int [d\phi(x)]\, P[\phi(x)]\, \exp\!\left( -\frac{izM}{\ell_0} \int dx\, \phi(x)\, \exp[-\bar\phi(x)] \right). \qquad (4.72) $$

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point The result is that

Z[dw (x)]P[w (x)] exp

sup3iexcl

izMl0

Zdx w (x) exp[iexcl Nw (x)]

acute

D expsup3

iexclizMl0

Zdx Nw (x) exp[iexcl Nw (x)] iexcl Seff[ Nw (x)I zM]

acute (473)

Seff[ NwI zM] Dl2

Zdx

sup3 Nwx

acute2

C12

sup3 izMll0

acute12Zdx exp[iexcl Nw (x)2] (474)

Now we can do the integral over z again by a saddle-point method Thetwo saddle-point approximations are both valid in the limit that D 0 andMD32 1 we are interested precisely in the rst limit and we are free toset M as we wish so this gives us a good approximation for r (D 0I Nw )This results in

\rho(D \to 0; \bar\varphi) = A[\bar\varphi(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar\varphi(x)]}{D} \right),    (4.75)

A[\bar\varphi(x)] = \frac{1}{\sqrt{16\pi\, l\, l_0}} \exp\left[ -\frac{l}{2} \int dx\, (\partial_x\bar\varphi)^2 \right] \times \int dx\, \exp[-\bar\varphi(x)/2],    (4.76)

B[\bar\varphi(x)] = \frac{1}{16\, l\, l_0} \left( \int dx\, \exp[-\bar\varphi(x)/2] \right)^2.    (4.77)
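As an aside that is not spelled out in the original text, the harmlessness of the prefactor at large N can be seen from the standard Laplace-type (Bessel) integral that controls the subextensive entropy built from this density,

\int_0^\infty dD\, D^{-3/2} \exp\left( -\frac{B}{D} - N D \right) = \sqrt{\frac{\pi}{B}}\; e^{-2\sqrt{BN}},

so that minus the logarithm of this quantity grows as 2\sqrt{BN}/\ln 2 bits, with the prefactor contributing only terms subleading in N; substituting B[\bar\varphi] from equation 4.77 reproduces the \sqrt{N} growth quoted in equation 4.78 below.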

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \approx \frac{1}{2 \ln 2\, \sqrt{l\, l_0}} \left\langle \int dx\, \exp[-\bar\varphi(x)/2] \right\rangle_{\bar\varphi} N^{1/2},    (4.78)

where ⟨· · ·⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \approx \frac{1}{2 \ln 2}\, \sqrt{N}\, \left( \frac{L}{l} \right)^{1/2} \ {\rm bits}.    (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size \sqrt{l/(N Q(x))} (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
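The bin-counting picture is easy to check numerically. The following is a minimal sketch (ours, not the authors'): for a made-up smooth density Q(x) on [0, L], we count the adaptive bins of local size \sqrt{l/(N Q(x))} and verify that the count, and hence the predictive information in this picture, grows as \sqrt{N}. All parameter values are illustrative.

import numpy as np

# Count adaptive bins of local size sqrt(l / (N Q(x))) for a toy smooth density
# and check that the number of bins grows as sqrt(N).
L, ell = 1.0, 0.05                               # illustrative box size and smoothness scale l
x = np.linspace(0.0, L, 2000)
Q = 1.0 + 0.5 * np.sin(2.0 * np.pi * x / L)      # a smooth toy density on [0, L]
Q /= np.trapz(Q, x)                              # normalize

for N in [10**3, 10**4, 10**5, 10**6]:
    n_bins = np.trapz(np.sqrt(N * Q / ell), x)   # integral of 1 / (local bin size)
    print(f"N = {N:>7d}   bins = {n_bins:9.1f}   bins/sqrt(N) = {n_bins/np.sqrt(N):6.3f}")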

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step roughly \sqrt{N} bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_{VC} = n problem. Now, as N grows, the elements with higher capacities n ∼ \sqrt{N} are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter l describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of l (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing l to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying \sqrt{N} in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information (all have power-law growth; smoother functions have smaller powers, with less to learn, and higher-dimensional problems have larger powers, with more to learn), but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.12 Rissanen (1984) and Clarke and Barron (1990), in full generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimation with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, which is not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure the complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number \mathcal{N} of walks with N steps, we find that \mathcal{N}(N) ≈ A N^{γ} z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) ≈ γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],    (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy, or KL divergence, between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = \int dx(t)\, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right),    (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures: the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about the dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).
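In the notation of the information bottleneck method cited here (Tishby, Pereira, & Bialek, 1999), this one-parameter family can be written, schematically, as the variational problem

\min_{P(\hat{x}_{\rm past}\,|\,x_{\rm past})} \Big[ I(\hat{x}_{\rm past}; x_{\rm past}) - \beta\, I(\hat{x}_{\rm past}; x_{\rm future}) \Big],

where the Lagrange multiplier β ≥ 0 sets the tradeoff between compression and prediction and traces out the optimal curve as it is varied; this is a sketch of the cited formulation rather than a result derived in this paper.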

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ Ipred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = Ipred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet, one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning, in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ∼ A N^ν and one with K parameters, so that Ipred(N) ∼ (K/2) log_2 N, the infinite-parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2 A \nu} \log_2\left( \frac{K}{2A} \right) \right]^{1/\nu}.    (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
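A rough numerical check of this arithmetic, using equation 6.1 with the numbers quoted above (K ∼ 5 × 10⁹ synapses and N ∼ 10⁹ examples), is sketched below; the values of A and ν are illustrative guesses on our part, since the argument only requires ν < 1.

import numpy as np

K, N = 5e9, 1e9                       # synapse count and lifetime example count from the text
for A in [1.0, 10.0, 100.0]:          # illustrative prefactors
    for nu in [0.25, 0.5, 0.75]:      # illustrative power-law exponents (nu < 1)
        Nc = (K / (2.0 * A * nu) * np.log2(K / (2.0 * A))) ** (1.0 / nu)  # equation 6.1
        print(f"A = {A:6.1f}  nu = {nu:.2f}  Nc = {Nc:.2e}  N << Nc: {N < 1e-2 * Nc}")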

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
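A miniature version of Shannon's guessing-game bound is easy to simulate. The sketch below (ours, not Shannon's protocol) replaces the human reader with a simple character n-gram model trained on the same text it predicts (which inflates predictability), converts each letter into the rank at which it would be guessed, and uses the plug-in entropy of the rank distribution as an upper bound on the conditional entropy per letter. The corpus file name is hypothetical.

import math
from collections import Counter, defaultdict

def guess_ranks(text, order=3):
    # Rank of each actual letter under an order-`order` n-gram predictor,
    # a crude stand-in for Shannon's native speakers.
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    alphabet = sorted(set(text))
    ranks = []
    for i in range(order, len(text)):
        ctx, actual = text[i - order:i], text[i]
        guesses = sorted(alphabet, key=lambda c: (-counts[ctx][c], c))
        ranks.append(guesses.index(actual) + 1)
    return ranks

def entropy_bits(samples):
    # Plug-in entropy (bits) of the empirical rank distribution; this upper
    # bounds the conditional entropy per letter.
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

text = open("corpus.txt").read().lower()   # hypothetical text file
print(entropy_bits(guess_ranks(text, order=3)), "bits per letter (upper bound)")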

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process; of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the U.S.-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/ followed by the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.

Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.

Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.

Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.

Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.

Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.

Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.

Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.

Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

structure C of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ ··· can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪_n C_n. The idea is that, having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.

In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure, C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.
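
A minimal sketch of the SRM idea (our own illustration, not part of the original argument): nest model classes by polynomial degree and let held-out error select the structure element n(N). The target function, noise level, and degree cap below are arbitrary choices; the point is only that the admitted capacity typically grows with the number of samples.

```python
# A minimal sketch of structural risk minimization (SRM): nested classes
# C_n = {polynomials of degree <= n}, with the element n(N) picked by held-out error.
# Target function, noise level, and degree cap are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def choose_degree(N, max_degree=15):
    """Fit within each structure element C_n and return the degree chosen by validation."""
    x = rng.uniform(-1.0, 1.0, size=N)
    y = np.sin(3.0 * x) + 0.3 * rng.standard_normal(N)
    half = N // 2
    best_n, best_err = 0, np.inf
    for n in range(min(max_degree, half - 1) + 1):       # C_0 subset C_1 subset ...
        coeffs = np.polyfit(x[:half], y[:half], deg=n)   # approximation within C_n
        err = np.mean((np.polyval(coeffs, x[half:]) - y[half:]) ** 2)
        if err < best_err:
            best_n, best_err = n, err
    return best_n

for N in [20, 80, 320, 1280]:
    picks = [choose_degree(N) for _ in range(25)]
    print(N, np.mean(picks))        # the admitted capacity n(N) tends to grow with N
```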

Consider two examples. A variable x is distributed according to the following probability density functions:

$$
Q(x|\alpha) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\alpha)^2\right], \qquad x \in (-\infty, +\infty); \tag{4.60}
$$

$$
Q(x|\alpha) = \frac{\exp(-\sin \alpha x)}{\int_0^{2\pi} dx\, \exp(-\sin \alpha x)}, \qquad x \in [0, 2\pi). \tag{4.61}
$$

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations, irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < α < α_max and zero elsewhere, with α_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of −sin αx. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than α_max.¹⁰ So the fluctuations are large, and the predictive information is

¹⁰ Interestingly, since for the model, equation 4.61, the KL divergence is bounded from below and from above, for α_max → ∞ the weight in ρ(D; ᾱ) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.

small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of α would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, & Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
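
The overfitting-by-large-α effect described above is easy to see numerically. The sketch below is our own illustration (grid spacings, sample sizes, and the prior box are arbitrary): for a handful of samples drawn with α = 1, the maximum-likelihood value over the wide prior box is typically far from 1, and it approaches the true value only once N is large.

```python
# Numerical illustration of eq. 4.61: for few samples, some large alpha typically fits
# the data better than the true value, because all points can sit near crests of -sin(alpha x).
import numpy as np

rng = np.random.default_rng(1)
x_grid = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)

def sample(alpha_true, N):
    """Draw N points from Q(x|alpha_true) by discretizing the density on x_grid."""
    p = np.exp(-np.sin(alpha_true * x_grid))
    p /= p.sum()
    return rng.choice(x_grid, size=N, p=p)

def log_likelihood(alpha, xs):
    logZ = np.log(np.trapz(np.exp(-np.sin(alpha * x_grid)), x_grid))
    return -np.sin(alpha * xs).sum() - len(xs) * logZ

alpha_true, alpha_max = 1.0, 1000.0
alphas = np.arange(0.05, alpha_max, 0.05)          # uniform prior box 0 < alpha < alpha_max
for N in [5, 50, 500]:
    xs = sample(alpha_true, N)
    logL = np.array([log_likelihood(a, xs) for a in alphas])
    print(N, alphas[np.argmax(logL)])              # typically huge for small N, near 1 for large N
```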

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned and, correspondingly, the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which

the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ~ D^((d−2)/2). Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that, beyond the class of finitely parameterizable problems, there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

$$
\rho(D \to 0) \approx A \exp\left[-\frac{B}{D^{\mu}}\right], \qquad \mu > 0, \tag{4.62}
$$

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

$$
\begin{aligned}
Z(\bar\alpha; N) &= \int dD\, \rho(D; \bar\alpha) \exp[-ND] \\
&\approx A(\bar\alpha) \int dD \exp\left[-\frac{B(\bar\alpha)}{D^{\mu}} - ND\right] \\
&\approx \tilde{A}(\bar\alpha) \exp\left[-\frac{1}{2}\,\frac{\mu+2}{\mu+1}\ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)}\right],
\end{aligned} \tag{4.63}
$$

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

$$
\tilde{A}(\bar\alpha) = A(\bar\alpha) \left[\frac{2\pi\,\mu^{1/(\mu+1)}}{\mu+1}\right]^{1/2} [B(\bar\alpha)]^{1/(2\mu+2)}, \tag{4.64}
$$

$$
C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)} \left[\frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)}\right]. \tag{4.65}
$$

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

$$
S_1^{(a)}(N) \approx \frac{1}{\ln 2}\,\langle C(\bar\alpha)\rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \quad \text{(bits)}, \tag{4.66}
$$

where ⟨···⟩_ᾱ denotes an average over all the target models.
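
The saddle-point evaluation in equations 4.63 through 4.65 can be checked directly. The short sketch below is our own illustration with arbitrary values of A, B, and μ; it compares the numerically computed −ln Z(N) against C N^(μ/(μ+1)) plus the logarithmic correction, and the difference settles to an N-independent constant as predicted.

```python
# Numerical check of the saddle-point result (eqs. 4.63-4.65) for an essential
# singularity rho(D) = A exp(-B / D**mu).  A, B, mu are arbitrary test values.
import numpy as np
from scipy.integrate import quad

A, B, mu = 1.0, 2.0, 1.0

def minus_lnZ(N):
    Dstar = (B * mu / N) ** (1.0 / (mu + 1))             # saddle point of the exponent
    integrand = lambda D: A * np.exp(-B / D**mu - N * D)
    # the tail D > 1 contributes at most ~ exp(-N) and is negligible for these N
    val, _ = quad(integrand, 1e-12, 1.0, points=[Dstar], limit=200)
    return -np.log(val)

C = B ** (1.0 / (mu + 1)) * (mu ** (-mu / (mu + 1)) + mu ** (1.0 / (mu + 1)))  # eq. 4.65

for N in [10**2, 10**3, 10**4]:
    exact = minus_lnZ(N)
    approx = C * N ** (mu / (mu + 1)) + 0.5 * (mu + 2) / (mu + 1) * np.log(N)
    print(N, exact, exact - approx)   # the difference tends to a constant (-ln A-tilde)
```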

This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ~ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ~ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ~ K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ~ N^(ν−1) bits. Summing this up over all samples, we find S_1^(a) ~ N^ν, and if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
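
The counting argument can be checked in a few lines; in this sketch (our own, with an arbitrary choice of ν), summing the per-sample contribution K(n)/(2n) with K(n) = n^ν indeed gives a total that grows as N^ν rather than N^ν log N.

```python
# If the nth example carries ~ K(n)/(2n) bits with K(n) = n**nu, the accumulated
# subextensive entropy scales as N**nu.  Purely illustrative; nu is arbitrary.
import numpy as np

nu = 0.5
for N in [10**3, 10**4, 10**5, 10**6]:
    n = np.arange(1, N + 1)
    S1 = np.sum(n**nu / (2.0 * n))     # sum of per-sample contributions
    print(N, S1, S1 / N**nu)           # the ratio approaches a constant (1 / (2 nu))
```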

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar

to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l_0) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

$$
P[\phi(x)] = \frac{1}{Z} \exp\left[-\frac{\ell}{2}\int dx \left(\frac{\partial \phi}{\partial x}\right)^2\right] \delta\left[\frac{1}{l_0}\int dx\, e^{-\phi(x)} - 1\right], \tag{4.67}
$$

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D, φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l_0) exp[−φ(x)], we can write the KL divergence as

$$
D_{\rm KL}[\bar\phi(x)\|\phi(x)] = \frac{1}{l_0}\int dx\, \exp[-\bar\phi(x)]\,[\phi(x) - \bar\phi(x)]. \tag{4.68}
$$

We want to compute the density

$$
\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\big(D - D_{\rm KL}[\bar\phi(x)\|\phi(x)]\big) \tag{4.69}
$$

$$
= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\big(MD - M\,D_{\rm KL}[\bar\phi(x)\|\phi(x)]\big), \tag{4.70}
$$

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

$$
\rho(D; \bar\phi) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)] \exp(-izM D_{\rm KL}) \tag{4.71}
$$

$$
= M \int \frac{dz}{2\pi} \exp\left(izMD + \frac{izM}{l_0}\int dx\, \bar\phi(x) \exp[-\bar\phi(x)]\right) \times \int [d\phi(x)]\, P[\phi(x)] \exp\left(-\frac{izM}{l_0}\int dx\, \phi(x)\exp[-\bar\phi(x)]\right). \tag{4.72}
$$

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

¹¹ We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

around the saddle point. The result is that

$$
\int [d\phi(x)]\, P[\phi(x)] \exp\left(-\frac{izM}{l_0}\int dx\, \phi(x) \exp[-\bar\phi(x)]\right)
= \exp\left(-\frac{izM}{l_0}\int dx\, \bar\phi(x) \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM]\right), \tag{4.73}
$$

$$
S_{\rm eff}[\bar\phi; zM] = \frac{\ell}{2}\int dx \left(\frac{\partial\bar\phi}{\partial x}\right)^2 + \frac{1}{2}\left(\frac{izM}{\ell l_0}\right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2]. \tag{4.74}
$$

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^(3/2) ≫ 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

$$
\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\left(-\frac{B[\bar\phi(x)]}{D}\right), \tag{4.75}
$$

$$
A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi \ell l_0}} \exp\left[-\frac{\ell}{2}\int dx\,(\partial_x\bar\phi)^2\right] \times \int dx\, \exp[-\bar\phi(x)/2], \tag{4.76}
$$

$$
B[\bar\phi(x)] = \frac{1}{16 \ell l_0}\left(\int dx\, \exp[-\bar\phi(x)/2]\right)^2. \tag{4.77}
$$

Except for the factor of D^(−3/2), this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^(−3/2) prefactor does not change the leading large-N behavior of the predictive information, and we find that

$$
S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\sqrt{\frac{\pi}{\ell l_0}} \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi}\, N^{1/2}, \tag{4.78}
$$

where ⟨···⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

$$
S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\sqrt{\pi N}\left(\frac{L}{\ell}\right)^{1/2} \ \text{bits}. \tag{4.79}
$$

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(ℓ/(N Q(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
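
A quick numerical version of this bin-counting argument (our own sketch, with an arbitrary target density and smoothness scale) shows the √N growth directly.

```python
# Count adaptive bins of local size sqrt(l / (N Q(x))) on [0, L]; their number,
# integral dx sqrt(N Q(x) / l), grows as sqrt(N), matching eq. 4.79 up to constants.
import numpy as np

L, ell = 1.0, 0.01
x = np.linspace(0.0, L, 10001)
Q = np.exp(-0.5 * ((x - 0.5) / 0.15) ** 2)
Q /= np.trapz(Q, x)                              # an arbitrary smooth, normalized Q(x)

for N in [10**2, 10**3, 10**4, 10**5]:
    n_bins = np.trapz(np.sqrt(N * Q / ell), x)   # number of bins that fit in [0, L]
    print(N, n_bins, n_bins / np.sqrt(N))        # the ratio is independent of N
```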

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ~ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities, n ~ √N, are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn); but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures

(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹² Rissanen (1984) and Clarke and Barron (1990) in full

¹² Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.
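
For a concrete check of the (K/2) log_2 N behavior with K = 1, one can compute the Bayesian (prequential) code length for Gaussian data with an unknown mean and subtract the ideal code length that knows the mean; the averaged difference is the subextensive piece and grows like (1/2) log_2 N. The sketch below is our own illustration, with an arbitrary prior variance.

```python
# Prequential (Bayesian predictive) code length for x_i ~ N(theta, 1), with a N(0, s0^2)
# prior on theta, minus the code length that knows theta.  The averaged difference is the
# subextensive entropy and grows like (1/2) log2 N for this K = 1 model.
import numpy as np

rng = np.random.default_rng(2)
s0_sq = 4.0                                    # prior variance on the unknown mean (arbitrary)

def log2_gauss(x, mean, var):
    return -0.5 * np.log2(2 * np.pi * var) - (x - mean) ** 2 / (2 * var * np.log(2))

def subextensive_bits(N):
    """Code length with the true mean known minus the Bayes predictive code length (in bits)."""
    theta = rng.normal(0.0, np.sqrt(s0_sq))    # target model drawn from the prior
    x = rng.normal(theta, 1.0, size=N)
    bits, sum_x = 0.0, 0.0
    for n in range(N):
        post_var = s0_sq / (n * s0_sq + 1.0)   # posterior variance of the mean after n points
        post_mean = post_var * sum_x           # posterior mean (noise variance is 1)
        bits += log2_gauss(x[n], theta, 1.0) - log2_gauss(x[n], post_mean, 1.0 + post_var)
        sum_x += x[n]
    return bits

for N in [100, 1000, 10000]:
    est = np.mean([subextensive_bits(N) for _ in range(200)])
    print(N, est, 0.5 * np.log2(N))            # est tracks (1/2) log2 N plus a constant
```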

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our

approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on

a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.

5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.¹³ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.¹⁴ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

¹³ The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

¹⁴ This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.

Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;¹⁵ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.
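
A simple numeric illustration of this split (our own toy example, with a pair of correlated Gaussian variables standing in for two segments of the data stream): rescaling the coordinate shifts the entropies but leaves the mutual information untouched.

```python
# Differential entropies change under a rescaling x -> a x, but the mutual information
# between two segments of the signal does not.  Gaussian toy example, analytic formulas.
import numpy as np

def gauss_entropy_bits(var):
    return 0.5 * np.log2(2 * np.pi * np.e * var)

c = 0.8                                   # correlation between "past" and "future" samples
for a in [1.0, 10.0, 0.001]:              # coordinate rescalings
    var = a**2                            # each variable has variance a^2 after rescaling
    S_single = gauss_entropy_bits(var)
    S_joint = gauss_entropy_bits(var) + gauss_entropy_bits(var * (1 - c**2))
    I = 2 * S_single - S_joint            # mutual information, equal to -0.5 log2(1 - c^2)
    print(a, S_single, I)                 # the entropy moves with a; I stays put
```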

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

¹⁵ Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ, the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.¹⁶ So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number N(N) of walks with N steps, we find that N(N) ~ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) → γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

$$
S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)], \tag{5.1}
$$

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

$$
S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2\left(\frac{P[x(t)]}{Q[x(t)]}\right), \tag{5.2}
$$

¹⁶ Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.

which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have

seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related almost by definition to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
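
A concrete case in which the compression saturates the bound, offered as our own toy illustration: in a stationary Gaussian AR(1) process, the most recent sample is a sufficient statistic for the entire future, so keeping only x_t from the past preserves all of I_pred = −(1/2) log_2(1 − c²). The sketch below compares the analytic value with a sample-based estimate that uses only the last sample.

```python
# Gaussian AR(1) toy example: x_{t+1} = c x_t + noise.  The last sample x_t is a
# sufficient statistic for the future, so compressing the entire past down to x_t
# preserves all of the predictive information I_pred = -0.5 log2(1 - c^2).
import numpy as np

rng = np.random.default_rng(3)
c, T = 0.9, 200000
x = np.empty(T)
x[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - c**2))   # start in the stationary state
for t in range(T - 1):
    x[t + 1] = c * x[t] + rng.normal()

def gauss_mi_bits(a, b):
    """I(a; b) in bits for jointly Gaussian scalars, from the sample correlation."""
    r = np.corrcoef(a, b)[0, 1]
    return -0.5 * np.log2(1.0 - r**2)

I_pred_analytic = -0.5 * np.log2(1.0 - c**2)
I_last_sample = gauss_mi_bits(x[:-1], x[1:])        # keep only x_t from the past
print(I_pred_analytic, I_last_sample)               # the two agree: nothing is lost
```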

In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.¹⁷ It should be possible to test this directly by

¹⁷ If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much

studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could beused to analyze experiments on learning in animals or humans We have em-phasized the problem of learning probability distributions or probabilisticmodels rather than learning deterministic functions associations or rulesIt is known that the nervous system adapts to the statistics of its inputs andthis adaptation is evident in the responses of single neurons (Smirnakis et al1997 Brenner Bialek amp de Ruyter van Steveninck 2000) these experimentsprovide a simple example of the system learning a parameterized distri-bution When making saccadic eye movements human subjects alter theirdistribution of reaction times in relation to the relative probabilities of differ-ent targets as if they had learned an estimate of the relevant likelihoodratios(Carpenter amp Williams 1995) Humans also can learn to discriminate almostoptimally between random sequences (fair coin tosses) and sequences thatare correlated or anticorrelated according to a Markov process this learningcan be accomplished from examples alone with no other feedback (Lopes ampOden 1987) Acquisition of language may require learning the jointdistribu-tion of successive phonemes syllables orwords and there is direct evidencefor learning of conditional probabilities from articial sound sequences byboth infants and adults (Saffran Aslin amp Newport 1996Saffran et al 1999)These examples which are not exhaustive indicate that the nervous systemcan learn an appropriate probabilistic model18 and this offers the oppor-tunity to analyze the dynamics of this learning using information-theoreticmethods What is the entropy of N successive reaction times following aswitch to a new set of relative probabilities in the saccade experiment How

remains to be learned about these cells it is difcult to imagine that the activity of so manyneurons constitutes an efcient representation of the current sensory inputs But if we livein a world where the predictive information in the movies reaching our retina diverges itis perfectly possible that an efcient representation of the predictive information availableto us at one instant requires thousands of times more space than an efcient representationof the image currently falling on our retina

18 As emphasized above many other learning problems including learning a functionfrom noisy examples can be seen as the learning of a probabilistic model Thus we expectthat this description applies to a much wider range of biological learning tasks

2456 W Bialek I Nemenman and N Tishby

much information does a single reaction time provide about the relevantprobabilities Following the arguments above such analysis could lead toa measurement of the universal learning curve L (N)

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997), and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ∼ A N^ν and one with K parameters so that Ipred(N) ∼ (K/2) log2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}. (6.1)

In the regime N ≪ Nc, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ Nc, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
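To make the crossover in equation 6.1 concrete, the short Python sketch below compares the two growth laws numerically. The values of K, A, and ν are hypothetical, chosen only for illustration, and the closed-form Nc is the leading-order estimate of equation 6.1 rather than an exact crossover point.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical coefficients: K-parameter model vs. power-law (nonparametric) model.
K, A, nu = 1000, 1.0, 0.5

power_law = lambda N: A * N**nu               # Ipred(N) ~ A N^nu
parametric = lambda N: 0.5 * K * np.log2(N)   # Ipred(N) ~ (K/2) log2 N

# Leading-order estimate of the crossover, equation 6.1.
Nc_estimate = (K / (2 * A * nu) * np.log2(K / (2 * A)))**(1 / nu)

# Numerical crossover of the two expressions, for comparison.
Nc_numeric = brentq(lambda N: power_law(N) - parametric(N), 2.0, 1e15)

print("Nc (equation 6.1): %.3g" % Nc_estimate)
print("Nc (numerical):    %.3g" % Nc_numeric)
# Below Nc the finite-parameter class carries more predictive information;
# above Nc the power-law (nonparametric) class dominates.
```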

It is tempting to suggest that the regime N ≪ Nc is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < Nc. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ Nc is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the preceding N letters (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
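As an illustration of the guessing-game bound described above, here is a minimal Python sketch. The order-0, frequency-based predictor and the toy text are assumptions of this sketch (Shannon's experiment used human readers); the point is only that the entropy of the guess-count distribution upper-bounds the per-letter entropy achievable with that predictor.

```python
import numpy as np
from collections import Counter

def guess_ranks(text, alphabet):
    # Rank (number of guesses) of each character under a simple order-0 predictor
    # that guesses letters in order of decreasing frequency seen so far.
    counts = Counter({a: 1 for a in alphabet})   # uniform pseudocounts to start
    ranks = []
    for ch in text:
        order = sorted(alphabet, key=lambda a: -counts[a])
        ranks.append(order.index(ch) + 1)
        counts[ch] += 1
    return ranks

def rank_entropy(ranks):
    # Entropy (bits per character) of the empirical guess-count distribution.
    p = np.bincount(ranks).astype(float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

text = "the cat sat on the mat and the dog sat on the log " * 20
alphabet = sorted(set(text))
print(rank_entropy(guess_ranks(text, alphabet)), "bits per character (upper bound)")
```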

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*/*, where */* is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints this is a primary reference; for published papers there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiado.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.
Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Phys., 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of a would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ dVC, which is never true if dVC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then, by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, and Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to dVC.

While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario, in which the number of examples eventually overwhelms the number of parameters or dimensions, is too weak to deal with many real-world problems. Certainly in the present context there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite dVC fall in essentially different universality classes with respect to the predictive information.

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned, and correspondingly the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero, since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient, and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^{(d−2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[ -\frac{B}{D^{\mu}} \right], \qquad \mu > 0, (4.62)

that is, an essential singularity at D → 0. For simplicity, we assume that the constants A and B can depend on the target model, but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37, we can write the partition function as

Z(\bar{\alpha}; N) = \int dD\, \rho(D; \bar{\alpha}) \exp[-ND]
\approx A(\bar{\alpha}) \int dD \exp\left[ -\frac{B(\bar{\alpha})}{D^{\mu}} - ND \right]
\approx \tilde{A}(\bar{\alpha}) \exp\left[ -\frac{1}{2}\,\frac{\mu+2}{\mu+1} \ln N - C(\bar{\alpha})\, N^{\mu/(\mu+1)} \right], (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde{A}(\bar{\alpha}) = A(\bar{\alpha}) \left[ \frac{2\pi \mu^{1/(\mu+1)}}{\mu+1} \right]^{1/2} \times [B(\bar{\alpha})]^{1/(2\mu+2)}, (4.64)

C(\bar{\alpha}) = [B(\bar{\alpha})]^{1/(\mu+1)} \left[ \frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)} \right]. (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S_1^{(a)}(N) \approx \frac{1}{\ln 2}\, \langle C(\bar{\alpha}) \rangle_{\bar{\alpha}}\, N^{\mu/(\mu+1)} \quad \text{(bits)}, (4.66)

where ⟨· · ·⟩_ᾱ denotes an average over all the target models.
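The saddle-point result in equations 4.63 through 4.65 can be checked by direct numerical integration. The sketch below uses arbitrary, hypothetical values of A, B, and μ and compares ln Z from quadrature with the asymptotic expression; the two should agree increasingly well as N grows.

```python
import numpy as np
from scipy.integrate import quad

A, B, mu = 1.0, 1.0, 1.0   # hypothetical coefficients of the density rho(D -> 0)

def log_Z_numeric(N):
    # Z(N) = A * integral over D of exp(-B/D^mu - N*D), cf. the first line of equation 4.63
    D_star = (mu * B / N)**(1.0 / (mu + 1))          # saddle point of the exponent
    integrand = lambda D: A * np.exp(-B / D**mu - N * D)
    val, _ = quad(integrand, 1e-12, 50.0, points=[D_star], limit=400)
    return np.log(val)

def log_Z_saddle(N):
    C = B**(1/(mu+1)) * (mu**(-mu/(mu+1)) + mu**(1/(mu+1)))                 # eq. 4.65
    A_t = A * np.sqrt(2*np.pi*mu**(1/(mu+1))/(mu+1)) * B**(1/(2*mu+2))      # eq. 4.64
    return np.log(A_t) - 0.5*(mu+2)/(mu+1)*np.log(N) - C*N**(mu/(mu+1))     # eq. 4.63

for N in [10, 100, 1000, 10000]:
    print(N, round(log_Z_numeric(N), 3), round(log_Z_saddle(N), 3))
```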


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively, and quantitatively, much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S1(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/2N bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^{ν−1} bits. Summing this up over all samples, we find S_1^{(a)} ∼ N^ν, and if we let ν = μ/(μ + 1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ + 1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
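The summation argument above is easy to check numerically. In the sketch below, the per-sample contribution of roughly K(n)/(2n ln 2) bits is summed with K(n) growing as n^ν (the exponent value is arbitrary, chosen only for illustration); the total tracks a power law N^ν/(2ν ln 2) rather than N^ν log N.

```python
import numpy as np

nu, N = 0.5, 10**6                      # illustrative exponent and sample size
n = np.arange(1, N + 1, dtype=float)
K_n = n**nu                             # number of histogram bins when the nth sample arrives

S1_summed = np.sum(K_n / (2.0 * n * np.log(2)))     # sum of ~K(n)/2n bit contributions
S1_power_law = N**nu / (2.0 * nu * np.log(2))       # leading-order power law

print(S1_summed, S1_power_law)          # close, and both much smaller than N**nu * log2(N)
```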

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/ℓ0) exp[−φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^2 \right] \delta\!\left[ \frac{1}{\ell_0} \int dx\, e^{-\phi(x)} - 1 \right], (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x1, x2, . . . , xN converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.11

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/ℓ0) exp[−φ(x)], we can write the KL divergence as

D_{KL}[\bar{\phi}(x) \| \phi(x)] = \frac{1}{\ell_0} \int dx\, \exp[-\bar{\phi}(x)]\, [\phi(x) - \bar{\phi}(x)]. (4.68)

We want to compute the density

\rho(D; \bar{\phi}) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( D - D_{KL}[\bar{\phi}(x) \| \phi(x)] \big) (4.69)
= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( MD - M D_{KL}[\bar{\phi}(x) \| \phi(x)] \big), (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar{\phi}) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)] \exp(-izM D_{KL}) (4.71)
= M \int \frac{dz}{2\pi} \exp\left( izMD + \frac{izM}{\ell_0} \int dx\, \bar{\phi}(x) \exp[-\bar{\phi}(x)] \right)
\times \int [d\phi(x)]\, P[\phi(x)] \exp\left( -\frac{izM}{\ell_0} \int dx\, \phi(x) \exp[-\bar{\phi}(x)] \right). (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)] \exp\left( -\frac{izM}{\ell_0} \int dx\, \phi(x) \exp[-\bar{\phi}(x)] \right)
= \exp\left( -\frac{izM}{\ell_0} \int dx\, \bar{\phi}(x) \exp[-\bar{\phi}(x)] - S_{\rm eff}[\bar{\phi}(x); zM] \right), (4.73)

S_{\rm eff}[\bar{\phi}; zM] = \frac{\ell}{2} \int dx \left( \frac{\partial\bar{\phi}}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{\ell \ell_0} \right)^{1/2} \int dx\, \exp[-\bar{\phi}(x)/2]. (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar{\phi}) = A[\bar{\phi}(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar{\phi}(x)]}{D} \right), (4.75)

A[\bar{\phi}(x)] = \frac{1}{\sqrt{16\pi \ell \ell_0}} \exp\left[ -\frac{\ell}{2} \int dx\, (\partial_x \bar{\phi})^2 \right] \times \int dx\, \exp[-\bar{\phi}(x)/2], (4.76)

B[\bar{\phi}(x)] = \frac{1}{16 \ell \ell_0} \left( \int dx\, \exp[-\bar{\phi}(x)/2] \right)^2. (4.77)

Except for the factor of D^{−3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{−3/2} prefactor does not change the leading large N behavior of the predictive information, and we find that

S_1^{(a)}(N) \approx \frac{1}{2 \ln 2} \sqrt{\frac{\pi}{\ell \ell_0}}\, \left\langle \int dx\, \exp[-\bar{\phi}(x)/2] \right\rangle_{\bar{\phi}} N^{1/2}, (4.78)

where ⟨· · ·⟩_φ̄ denotes an average over the target distributions φ̄(x) weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \approx \frac{1}{2 \ln 2} \sqrt{\pi N} \left( \frac{L}{\ell} \right)^{1/2} \text{bits}. (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(ℓ/(NQ(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
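The bin-counting picture can be made concrete in a few lines of Python. The smoothness scale ℓ, the box size L, and the example densities below are all hypothetical; the sketch simply integrates dx over the local bin size √(ℓ/(NQ(x))) and shows the √N growth, which for the uniform distribution reduces to √(NL/ℓ), the quantity that equation 4.79 multiplies by √π/(2 ln 2).

```python
import numpy as np

ell, L = 0.1, 1.0                       # hypothetical smoothness scale and box size
x = np.linspace(1e-4, L, 4000)
dx = x[1] - x[0]

def n_bins(Q, N):
    # number of adaptive bins: integral of dx / sqrt(ell/(N Q(x))) = integral of sqrt(N Q(x)/ell) dx
    return np.sum(np.sqrt(N * Q(x) / ell)) * dx

Q_uniform = lambda x: np.ones_like(x) / L
Q_tilted = lambda x: 2.0 * x / L**2     # another normalized density on [0, L]

for N in [10**2, 10**3, 10**4]:
    print(N, n_bins(Q_uniform, N), np.sqrt(N * L / ell), n_bins(Q_tilted, N))
```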

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a dVC = n problem. Now, as N grows, the elements with higher capacities n ∼ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn); but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.12 Rissanen (1984) and Clarke and Barron (1990) in full generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x1, x2, . . . , xN will require an amount of space equal to the entropy of the joint distribution P(x1, x2, . . . , xN). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x1, x2, . . . , xN).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log2 N bits to the leading order.
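A minimal illustration of the (K/2) log2 N parameter-coding term is the familiar two-part code for polynomial regression sketched below. The data, the noise level, and the candidate model orders are hypothetical, and constants common to every model are dropped; the point is only that the residual-coding cost decreases with K while the (K/2) log2 N term penalizes extra parameters, so the total is minimized near the true number of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = np.linspace(-1.0, 1.0, N)
y = 1.0 - 0.5 * x + 0.3 * x**2 + 0.1 * rng.standard_normal(N)   # data from a 3-parameter model

def description_length(K):
    # residual-coding cost (up to K-independent constants) plus the (K/2) log2 N parameter cost
    coeffs = np.polyfit(x, y, K - 1)                 # K free parameters
    mse = np.mean((np.polyval(coeffs, x) - y)**2)
    return 0.5 * N * np.log2(mse) + 0.5 * K * np.log2(N)

for K in range(1, 8):
    print(K, round(description_length(K), 1))        # minimum is reached near K = 3
```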

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model, or of reconstructing the underlying dynamics, may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms?

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number N(N) of walks with N steps, we find that N(N) ≈ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) ≈ γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
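To make the decomposition concrete, here is a minimal numerical sketch (the values of A, γ, and z below are illustrative placeholders, not measured lattice constants): two hypothetical lattices that share the same exponent γ but differ in their connective constants give different extensive and constant pieces of S(N) = log_2 N(N), while the coefficient of the divergent log_2 N term is the same.

```python
import numpy as np

# Illustrative (not measured) parameters: number of N-step walks modeled as
#   N(N) ~ A * N**gamma * z**N.
lattices = {"lattice 1": dict(A=1.2, gamma=0.34, z=2.64),
            "lattice 2": dict(A=0.7, gamma=0.34, z=4.15)}

N = np.arange(10, 2000, 10).astype(float)
for name, p in lattices.items():
    S = np.log2(p["A"]) + p["gamma"] * np.log2(N) + N * np.log2(p["z"])
    # Recover the three terms by least-squares fitting S(N) = a*N + b*log2(N) + c.
    X = np.vstack([N, np.log2(N), np.ones_like(N)]).T
    a, b, c = np.linalg.lstsq(X, S, rcond=None)[0]
    print(f"{name}: extensive slope = {a:.3f} bits/step, "
          f"divergent coefficient = {b:.3f}, constant = {c:.3f}")
```

Only the fitted coefficient of log_2 N (the analog of γ) agrees between the two "lattices"; the slope and the constant do not.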

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],   (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = \int dx(t)\, P[x(t)] \log_2\!\left(\frac{P[x(t)]}{Q[x(t)]}\right),   (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].
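The coordinate invariance of the relative entropy, and the noninvariance of the plain continuum entropy, can be checked numerically. The sketch below is a simple Monte Carlo illustration with arbitrary choices (two Gaussians and the map y = x^3), not a calculation from the paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two densities on the line; the particular choices are purely illustrative.
P = stats.norm(loc=0.0, scale=1.0)
Q = stats.norm(loc=0.5, scale=1.5)

x = P.rvs(size=200_000, random_state=rng)

# Differential entropy and relative entropy in the original coordinate x.
H_x  = -np.mean(np.log2(P.pdf(x)))
KL_x =  np.mean(np.log2(P.pdf(x) / Q.pdf(x)))

# Smooth, invertible change of coordinates y = x**3.
y = x**3
jac = np.abs(1.0 / (3.0 * np.cbrt(y)**2))      # |dx/dy| at the sampled points
p_y = P.pdf(np.cbrt(y)) * jac                  # transformed densities
q_y = Q.pdf(np.cbrt(y)) * jac

H_y  = -np.mean(np.log2(p_y))
KL_y =  np.mean(np.log2(p_y / q_y))            # Jacobian factors cancel in the ratio

print(f"entropy:          {H_x:.3f} bits  vs  {H_y:.3f} bits after y = x^3")
print(f"relative entropy: {KL_x:.4f} bits vs  {KL_y:.4f} bits (coordinate invariant)")
```

The entropy shifts under the change of variables, while the relative entropy does not, precisely because the Jacobian cancels in the ratio P/Q.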

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information.


We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about the dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related almost by definition to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
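A toy case makes the "sufficient statistic" point tangible. The sketch below (a stand-alone illustration, not from the paper) uses a stationary Gaussian AR(1) process, for which the single most recent sample is a sufficient statistic for prediction, so compressing the entire past down to x_t preserves all of the predictive information I(past; future) = I(x_t; x_{t+1}) = -(1/2) log_2(1 - a^2). The plug-in histogram estimator is biased, so only rough agreement with the analytic value should be expected.

```python
import numpy as np

rng = np.random.default_rng(1)
a, N = 0.9, 500_000

# Stationary Gaussian AR(1):  x_{t+1} = a x_t + xi_t,  with unit stationary variance.
x = np.empty(N)
x[0] = rng.normal()
noise = rng.normal(scale=np.sqrt(1 - a**2), size=N)
for t in range(N - 1):
    x[t + 1] = a * x[t] + noise[t]

def mutual_info_bits(u, v, bins=60):
    """Plug-in estimate of I(u; v) in bits from a 2D histogram."""
    joint, _, _ = np.histogram2d(u, v, bins=bins)
    joint = joint / joint.sum()
    pu, pv = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / np.outer(pu, pv)[nz]))

# For an AR(1) chain, all predictive information is carried by the last sample.
est = mutual_info_bits(x[:-1], x[1:])
exact = -0.5 * np.log2(1 - a**2)
print(f"I(x_t; x_t+1): histogram estimate {est:.3f} bits, exact {exact:.3f} bits")
```

Here the compression x_past → x̂_past = x_t achieves equality in I(x̂_past; x_future) ≤ I_pred; for processes without such sufficient statistics, the inequality is strict.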


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.


It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment?


18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ~ A N^ν and one with K parameters so that I_pred(N) ~ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2A\nu}\, \log_2\!\left(\frac{K}{2A}\right) \right]^{1/\nu}.   (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
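The crossover in equation 6.1 is easy to evaluate. The sketch below uses arbitrary illustrative values of K, A, and ν (nothing is taken from the paper's examples); it shows that below N_c the K-parameter model carries more predictive information, while above N_c the power-law (nonparametric) model dominates, with the two curves of comparable size near N_c as the asymptotic estimate predicts.

```python
import numpy as np

# Illustrative values only: K-parameter model vs. power-law (nonparametric) model.
K, A, nu = 1.0e4, 1.0, 0.5

# Equation 6.1: approximate sample size at which the two predictive informations cross.
Nc = (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1.0 / nu)
print(f"N_c ~ {Nc:.3e}")

for N in [1e4, 1e6, Nc, 1e12]:
    I_param    = 0.5 * K * np.log2(N)   # ~ (K/2) log2 N
    I_nonparam = A * N ** nu            # ~ A N^nu
    print(f"N = {N:.2e}:  (K/2) log2 N = {I_param:.3e} bits,  A N^nu = {I_nonparam:.3e} bits")
```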

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology.


If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ~ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ~ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy per letter (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
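A toy version of Shannon's guessing experiment can be run with a statistical model standing in for the human reader. In the sketch below (all choices are placeholders: a tiny repeated corpus and a bigram "reader" trained and tested on the same string), the entropy of the guess-number distribution upper-bounds the conditional entropy per letter achievable by that reader; detecting the slow, subextensive approach to extensivity would of course require far longer texts and far better predictors.

```python
import numpy as np
from collections import Counter, defaultdict

# Placeholder corpus; a real experiment would use a long natural text.
text = ("the predictive information measures how much of what we have seen "
        "tells us about what we have not yet seen " * 200)

# Bigram "reader": ranks candidate next letters by P(next | current letter).
bigram = defaultdict(Counter)
for a, b in zip(text[:-1], text[1:]):
    bigram[a][b] += 1

ranks = Counter()
for a, b in zip(text[:-1], text[1:]):
    ordering = [c for c, _ in bigram[a].most_common()]
    ranks[ordering.index(b) + 1] += 1       # number of guesses needed for the true letter

total = sum(ranks.values())
p = np.array([n / total for n in ranks.values()])
H_guess = -np.sum(p * np.log2(p))
print(f"entropy of guess numbers: {H_guess:.2f} bits/letter "
      f"(upper bound on this reader's conditional entropy)")
```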

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/, followed by the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.
Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



The predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.

Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ~ D^{(d-2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\!\left[-\frac{B}{D^{\mu}}\right], \quad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.

From equation 4.37 we can write the partition function as

Z(\bar\alpha; N) = \int dD\, \rho(D; \bar\alpha)\, \exp[-ND]
\approx A(\bar\alpha) \int dD\, \exp\!\left[-\frac{B(\bar\alpha)}{D^{\mu}} - ND\right]
\approx \tilde A(\bar\alpha)\, \exp\!\left[-\frac{1}{2}\,\frac{\mu+2}{\mu+1}\,\ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)}\right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde A(\bar\alpha) = A(\bar\alpha)\left[\frac{2\pi\,\mu^{1/(\mu+1)}}{\mu+1}\right]^{1/2} [B(\bar\alpha)]^{1/(2\mu+2)},   (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)}\left[\mu^{-\mu/(\mu+1)} + \mu^{1/(\mu+1)}\right].   (4.65)

Finally, we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S_1^{(a)}(N) \to \frac{1}{\ln 2}\, \langle C(\bar\alpha)\rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \quad \text{(bits)},   (4.66)

where ⟨···⟩_{\bar\alpha} denotes an average over all the target models.
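The saddle-point result can be checked directly. The short sketch below (with arbitrary illustrative values of A, B, and μ, not taken from any model in the paper) integrates Z(N) numerically, in a numerically stable way, and compares -ln Z with the prediction of equations 4.63 through 4.65.

```python
import numpy as np
from scipy.integrate import quad

# Illustrative density of KL divergences with an essential singularity at D = 0:
# rho(D) ~ A * exp(-B / D**mu).  A, B, mu are arbitrary choices for this check.
A, B, mu = 1.0, 1.0, 1.0

def minus_ln_Z(N):
    """-ln Z(N), with Z(N) = int dD rho(D) exp(-N D), factoring out the saddle value."""
    f = lambda D: B / D**mu + N * D               # exponent ("action")
    Dstar = (mu * B / N) ** (1.0 / (mu + 1.0))    # saddle point of f
    f0 = f(Dstar)
    val, _ = quad(lambda D: A * np.exp(f0 - f(D)),
                  1e-6 * Dstar, 20 * Dstar, points=[Dstar])
    return f0 - np.log(val)

# Saddle-point prediction, equations 4.63-4.65.
C = B ** (1.0 / (mu + 1)) * (mu ** (-mu / (mu + 1)) + mu ** (1.0 / (mu + 1)))
A_tilde = A * np.sqrt(2.0 * np.pi * mu ** (1.0 / (mu + 1)) / (mu + 1)) * B ** (1.0 / (2 * mu + 2))

for N in [1e2, 1e4, 1e6]:
    approx = C * N ** (mu / (mu + 1)) + 0.5 * (mu + 2) / (mu + 1) * np.log(N) - np.log(A_tilde)
    print(f"N = {N:8.0e}:  -ln Z = {minus_ln_Z(N):10.3f}   saddle-point: {approx:10.3f}")
```

The dominant C N^{μ/(μ+1)} term, which is what survives in equation 4.66, agrees with the numerical integral already at modest N.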


This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.

Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ~ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ~ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ~ K/(2N) bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ~ N^{ν-1} bits. Summing this up over all samples, we find S_1^{(a)} ~ N^ν, and if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense. Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
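The bookkeeping in this paragraph is easy to reproduce numerically. In the sketch below, ν is an arbitrary illustrative exponent; summing the ~K(n)/(2n) bits contributed by the n-th sample, with K(n) = n^ν, gives a total that scales as N^ν (the ratio S_1/N^ν approaches the constant 1/(2ν)), not as N^ν log N.

```python
import numpy as np

nu = 0.5                              # illustrative exponent, 0 < nu < 1
for n_max in np.logspace(2, 6, 5).astype(int):
    n = np.arange(1, n_max + 1, dtype=float)
    K = n ** nu                       # number of parameters when the n-th sample arrives
    S1 = np.sum(K / (2.0 * n))        # accumulate ~K(n)/(2n) bits per sample
    print(f"N = {n_max:8d}:  S_1 = {S1:10.1f},   S_1 / N^nu = {S1 / n_max**nu:.3f}")
```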

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/ℓ_0) exp[-φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{\mathcal{Z}}\, \exp\!\left[-\frac{\lambda}{2}\int dx\left(\frac{\partial\phi}{\partial x}\right)^{2}\right] \delta\!\left[\frac{1}{\ell_0}\int dx\, e^{-\phi(x)} - 1\right],   (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and λ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.11
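To get a feeling for this prior, one can draw sample distributions from it. The sketch below is one simple discretization chosen for illustration (it is not the numerical scheme of the papers cited): the Gaussian weight on (∂φ/∂x)² makes the increments of φ independent Gaussians with variance dx/λ, and the delta-function constraint is imposed by normalizing Q explicitly; the boundary condition (a walk started at φ = 0) is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)
L, M, lam = 1.0, 1000, 100.0         # box size, grid points, smoothness scale (illustrative)
dx = L / M

def sample_Q():
    # P[phi] ~ exp[-(lambda/2) int dx (dphi/dx)^2]: increments of phi are
    # independent Gaussians with variance dx/lambda, i.e., a random walk in x.
    phi = np.cumsum(rng.normal(scale=np.sqrt(dx / lam), size=M))
    Q = np.exp(-phi)
    return Q / (Q.sum() * dx)        # enforce the delta function: int dx Q(x) = 1

for i in range(3):
    Q = sample_Q()
    print(f"sample {i}: int Q dx = {Q.sum() * dx:.3f}, "
          f"min Q = {Q.min():.3f}, max Q = {Q.max():.3f}")
```

Larger λ (stiffer prior) produces sample distributions that fluctuate less around the uniform density 1/L.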

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/ℓ_0) exp[-φ(x)], we can write the KL divergence as

D_{\rm KL}[\bar\phi(x)\|\phi(x)] = \frac{1}{\ell_0}\int dx\, e^{-\bar\phi(x)}\,[\phi(x) - \bar\phi(x)].   (4.68)

We want to compute the density

\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left(D - D_{\rm KL}[\bar\phi(x)\|\phi(x)]\right)   (4.69)
= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left(MD - M D_{\rm KL}[\bar\phi(x)\|\phi(x)]\right),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar\phi) = M \int \frac{dz}{2\pi}\, \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{\rm KL})   (4.71)
= M \int \frac{dz}{2\pi}\, \exp\!\left(izMD + \frac{izM}{\ell_0}\int dx\, \bar\phi(x)\, e^{-\bar\phi(x)}\right)
\times \int [d\phi(x)]\, P[\phi(x)]\, \exp\!\left(-\frac{izM}{\ell_0}\int dx\, \phi(x)\, e^{-\bar\phi(x)}\right).   (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.


around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)]\, \exp\!\left(-\frac{izM}{\ell_0}\int dx\, \phi(x)\, e^{-\bar\phi(x)}\right)
= \exp\!\left(-\frac{izM}{\ell_0}\int dx\, \bar\phi(x)\, e^{-\bar\phi(x)} - S_{\rm eff}[\bar\phi(x); zM]\right),   (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{\lambda}{2}\int dx \left(\frac{\partial\bar\phi}{\partial x}\right)^{2} + \frac{1}{2}\left(\frac{izM}{\lambda\ell_0}\right)^{1/2} \int dx\, e^{-\bar\phi(x)/2}.   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and M D^{3/2} ≫ 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2}\, \exp\!\left(-\frac{B[\bar\phi(x)]}{D}\right),   (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\lambda\ell_0}}\, \exp\!\left[-\frac{\lambda}{2}\int dx\, (\partial_x\bar\phi)^{2}\right] \int dx\, e^{-\bar\phi(x)/2},   (4.76)

B[\bar\phi(x)] = \frac{1}{16\lambda\ell_0}\left(\int dx\, e^{-\bar\phi(x)/2}\right)^{2}.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2\ln 2\,\sqrt{\lambda\ell_0}}\left\langle \int dx\, e^{-\bar\phi(x)/2}\right\rangle_{\bar\phi} N^{1/2},   (4.78)

where ⟨···⟩_{\bar\phi} denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\, \sqrt{N}\left(\frac{L}{\lambda}\right)^{1/2} \ \text{bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size \sqrt{\lambda/(N Q(x))} (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
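The bin-counting argument can be made quantitative: the number of bins of local width sqrt(λ/(N Q(x))) that fit into the range of x is ∫ dx sqrt(N Q(x)/λ), which for a uniform distribution on [0, L] equals sqrt(N L/λ), reproducing the N and L/λ dependence of equation 4.79 up to a constant of order one. A two-line numerical check, with purely illustrative values:

```python
import numpy as np

L, lam, N = 2.0, 0.05, 10_000              # illustrative range, smoothness scale, sample size
x = np.linspace(0.0, L, 100_001)
Q = np.full_like(x, 1.0 / L)               # uniform target distribution on [0, L]

dx = x[1] - x[0]
n_bins = np.sum(np.sqrt(N * Q / lam)) * dx  # integral of the local bin density
print(f"adaptive bins ~ {n_bins:.1f},  sqrt(N L / lambda) = {np.sqrt(N * L / lam):.1f}")
```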

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ~√N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities n ~ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter λ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of λ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing λ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). Real calculations for these cases, however, remain challenging.

5 I_pred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.12 Rissanen (1984) and Clarke and Barron (1990), in full

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x1, x2, ..., xN will require an amount of space equal to the entropy of the joint distribution P(x1, x2, ..., xN). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x1, x2, ..., xN).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log2 N bits to the leading order.
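
As a minimal numerical check of this statement (our illustration, not part of the original argument), take the simplest case K = 1: a coin with unknown bias and a uniform prior. The exact joint entropy S(N) of N exchangeable flips can be computed and compared with the extensive part N S0, where S0 is the prior-averaged entropy per flip; the remainder should grow as (1/2) log2 N up to an additive constant. A short Python sketch, using only the standard library:

    # Sketch: subextensive entropy of a Bernoulli model with a uniform (Beta(1,1)) prior.
    # S(N) is the entropy of the Bayesian mixture P(x_1, ..., x_N); its subextensive part
    # S1(N) = S(N) - N*S0 should grow as (1/2) log2 N, up to an N-independent constant.
    import math

    def joint_entropy(N):
        # P(sequence with n ones) = n! (N - n)! / (N + 1)!, and there are C(N, n) such sequences
        S = 0.0
        for n in range(N + 1):
            logp = (math.lgamma(n + 1) + math.lgamma(N - n + 1)
                    - math.lgamma(N + 2)) / math.log(2)      # log2 of one sequence's probability
            S -= math.comb(N, n) * (2.0 ** logp) * logp
        return S

    S0 = 1.0 / (2.0 * math.log(2))      # prior-averaged entropy per flip, in bits
    for N in (10, 100, 1000):
        S1 = joint_entropy(N) - N * S0
        print(N, round(S1, 3), round(0.5 * math.log2(N), 3))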

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our


approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy), since the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on


a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.
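
To make the finite case concrete (a toy example of ours, not taken from the paper), the sketch below looks at a noisy two-state Markov process, a hidden Markov model with arbitrarily chosen parameters, and enumerates symbol blocks exactly. The mutual information between the length-n past and the length-n future grows with n but saturates at a finite value, the excess entropy; this is the class with Ipred(T → ∞) = constant.

    # Sketch: finite excess entropy for a simple hidden Markov process.
    # I(n-symbol past; n-symbol future) grows with n but saturates at a finite value.
    import itertools, math
    import numpy as np

    A = np.array([[0.9, 0.1],            # hidden-state transition matrix
                  [0.4, 0.6]])
    eps = 0.2                             # observation noise: symbol flip probability
    B = np.array([[1 - eps, eps],
                  [eps, 1 - eps]])        # B[s, x] = P(observe x | hidden state s)
    w, v = np.linalg.eig(A.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    pi /= pi.sum()                        # stationary distribution of the hidden chain

    def prob(obs):
        # forward algorithm: probability of an observation block in the stationary process
        alpha = pi * B[:, obs[0]]
        for x in obs[1:]:
            alpha = (alpha @ A) * B[:, x]
        return alpha.sum()

    def block_mutual_info(n):
        joint, past, future = {}, {}, {}
        for seq in itertools.product([0, 1], repeat=2 * n):
            p = prob(seq)
            joint[seq] = p
            past[seq[:n]] = past.get(seq[:n], 0.0) + p
            future[seq[n:]] = future.get(seq[n:], 0.0) + p
        return sum(p * math.log2(p / (past[s[:n]] * future[s[n:]]))
                   for s, p in joint.items())

    for n in (1, 2, 3, 4, 5):
        print(n, round(block_mutual_info(n), 5))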

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.
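
The invariance claim is easy to check in a toy setting (our illustration, with an arbitrary Gaussian example): under an instantaneous change of coordinates x → Ax, the differential entropy shifts by log2 |det A|, while the mutual information between two segments of the signal is unchanged. For jointly Gaussian segments both quantities have closed forms:

    # Sketch: differential entropy is coordinate dependent; mutual information is not.
    # Model two "segments" of a signal as jointly gaussian vectors (x1, x2).
    import numpy as np

    rng = np.random.default_rng(0)
    d = 2                                            # dimension of each segment
    M = rng.normal(size=(2 * d, 2 * d))
    Sigma = M @ M.T + np.eye(2 * d)                  # arbitrary positive-definite covariance

    def gauss_entropy(C):
        # differential entropy of a gaussian with covariance C, in bits
        k = C.shape[0]
        return 0.5 * np.log2((2 * np.pi * np.e) ** k * np.linalg.det(C))

    def gauss_mi(C, d):
        # I(x1; x2) = h(x1) + h(x2) - h(x1, x2)
        return gauss_entropy(C[:d, :d]) + gauss_entropy(C[d:, d:]) - gauss_entropy(C)

    A = np.diag([3.0, 0.5])                          # an instantaneous change of coordinates
    T = np.block([[A, np.zeros((d, d))], [np.zeros((d, d)), A]])
    Sigma2 = T @ Sigma @ T.T                         # covariance after x -> A x in both segments

    print("entropy before/after:", gauss_entropy(Sigma), gauss_entropy(Sigma2))
    print("mutual info before/after:", gauss_mi(Sigma, d), gauss_mi(Sigma2, d))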

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number N(N) of walks with N steps, we find that N(N) ∼ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S0 = log2 z, a constant subextensive entropy log2 A, and a divergent subextensive term S1(N) → γ log2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

    Scont = − ∫ dx(t) P[x(t)] log2 P[x(t)],    (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

    Srel = − ∫ dx(t) P[x(t)] log2 ( P[x(t)] / Q[x(t)] ),    (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, Srel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have


seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related almost by definition to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = xpast as the past data and y = xfuture as the future. When we compress xpast into some reduced description x̂past, we keep a certain amount of information about the past, I(x̂past; xpast), and we also preserve a certain amount of information about the future, I(x̂past; xfuture). There is no single correct compression xpast → x̂past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂past; xfuture) versus I(x̂past; xpast).
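
A minimal sketch of this trade-off, assuming a small discrete joint distribution p(xpast, xfuture) invented purely for illustration, is the standard self-consistent iteration of the information bottleneck method: for each value of the trade-off parameter β, alternate the update equations until they converge and then read off the pair (I(x̂; xpast), I(x̂; xfuture)).

    # Sketch: information bottleneck for compressing "past" about "future".
    # Iterate p(t|x) ∝ p(t) exp(-beta * KL[p(y|x) || p(y|t)]); as beta varies, the solutions
    # trace out the optimal curve of I(T; X_past) versus I(T; X_future).
    import numpy as np

    rng = np.random.default_rng(1)
    nx, ny, nt = 8, 4, 4                              # sizes of past, future, compressed variable
    pxy = rng.random((nx, ny)); pxy /= pxy.sum()      # arbitrary joint distribution (illustrative)
    px = pxy.sum(axis=1)
    py_x = pxy / px[:, None]                          # p(y|x)

    def mutual_info(pab):
        pa, pb = pab.sum(1), pab.sum(0)
        mask = pab > 0
        return float((pab[mask] * np.log2(pab[mask] / np.outer(pa, pb)[mask])).sum())

    def ib(beta, n_iter=300):
        qt_x = rng.random((nx, nt)); qt_x /= qt_x.sum(axis=1, keepdims=True)   # p(t|x)
        for _ in range(n_iter):
            qt = px @ qt_x                                        # p(t)
            qy_t = (qt_x * px[:, None]).T @ py_x / qt[:, None]    # p(y|t)
            kl = (py_x[:, None, :] * np.log(py_x[:, None, :] / qy_t[None, :, :])).sum(-1)
            qt_x = np.maximum(qt[None, :] * np.exp(-beta * kl), 1e-300)
            qt_x /= qt_x.sum(axis=1, keepdims=True)
        ptx = qt_x * px[:, None]                      # joint p(x, t)
        pty = ptx.T @ py_x                            # joint p(t, y)
        return mutual_info(ptx), mutual_info(pty)

    for beta in (0.5, 1, 2, 5, 20):
        i_past, i_future = ib(beta)
        print("beta =", beta, " I =", round(i_past, 3), round(i_future, 3))
    print("upper bound I(past; future) =", round(mutual_info(pxy), 3))

The last printed number is the bound discussed in the next paragraph: no compression can preserve more information about the future than the past itself carries.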

The predictive information preserved by compression must be less than the total, so that I(x̂past; xfuture) ≤ Ipred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂past such that I(x̂past; xfuture) = Ipred(T). The entropy of this minimal x̂past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet, one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991), although much


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods: What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ∼ A N^ν and one with K parameters so that Ipred(N) ∼ (K/2) log2 N, the infinite-parameter model has less predictive information for all N smaller than some critical value,

    Nc ≈ [ (K / (2Aν)) log2 (K / 2A) ]^(1/ν).    (6.1)

In the regime N ≪ Nc, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ Nc, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
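
To get a feeling for the scales involved (the numerical values below are ours, chosen only for illustration), one can evaluate equation 6.1 directly and compare the two forms of Ipred(N) on either side of the crossover:

    # Sketch: crossover between power-law and logarithmic predictive information,
    # Nc ≈ [ K/(2*A*nu) * log2(K/(2*A)) ]**(1/nu)   (equation 6.1).
    import numpy as np

    K, A, nu = 1.0e6, 1.0, 0.5                   # illustrative values, not from the paper

    Nc = (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1 / nu)
    print("crossover Nc ~ %.3g samples" % Nc)

    for N in (1e3, 1e6, Nc, 1e16):
        finite_param = 0.5 * K * np.log2(N)      # Ipred ~ (K/2) log2 N
        nonparametric = A * N ** nu              # Ipred ~ A N^nu
        print("N = %.3g:  (K/2) log2 N = %.3g,  A N^nu = %.3g"
              % (N, finite_param, nonparametric))

    # Below Nc the power-law (nonparametric) model carries less predictive information;
    # equation 6.1 solves A N^nu = (K/2) log2 N only to leading order, so the two curves
    # are comparable, not exactly equal, at Nc.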

It is tempting to suggest that the regime N ≪ Nc is the relevant one for much of biology. If we consider, for example, 10 mm^2 of inferotemporal


cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < Nc. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ Nc is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy at length N (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S1(N) ∝ N^(1/2), and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
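
A toy version of this experiment is easy to set up, with an n-gram letter model standing in for the human reader (our illustration; the corpus, model order, and fallback rule are arbitrary choices). Each letter is converted into the rank at which the model would have guessed it, and the entropy of the rank distribution bounds the conditional entropy of the text under that model; on the tiny, repetitive corpus below the model nearly memorizes the text, so only the pipeline, not the number, is meaningful.

    # Sketch: Shannon's guessing game with an order-k letter model as the "reader".
    # Each letter is replaced by the rank at which the predictor would have guessed it;
    # the entropy of the rank distribution is an upper bound on the conditional entropy.
    from collections import Counter, defaultdict
    import math

    text = ("the predictive information measures the amount of structure in a "
            "time series and the guessing game turns letters into guess ranks ") * 20
    k = 3                                         # context length (model order)

    counts = defaultdict(Counter)
    for i in range(len(text) - k):
        counts[text[i:i + k]][text[i + k]] += 1   # frequency of next letter given k-letter context

    alphabet = sorted(set(text))
    ranks = Counter()
    for i in range(k, len(text)):
        ctx = text[i - k:i]
        # guess letters in order of conditional frequency, breaking ties alphabetically
        order = sorted(alphabet, key=lambda c: (-counts[ctx][c], c))
        ranks[order.index(text[i]) + 1] += 1

    n = sum(ranks.values())
    H_ranks = -sum((c / n) * math.log2(c / n) for c in ranks.values())
    print("entropy of guess ranks: %.2f bits per letter" % H_ranks)
    # Training and testing on the same short text makes this bound optimistic; Shannon's
    # design used human readers and fresh text, and the interesting question is how slowly
    # such estimates approach the extensive entropy as the context length grows.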

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conven-

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


tional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.


Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Rényi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

Page 32: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as

2440 W Bialek I Nemenman and N Tishby

This behavior of the rst subextensive term is qualitatively differentfrom everything we have observed so far A power-law divergence is muchstronger than a logarithmic one Therefore a lot more predictive informa-tion is accumulated in an ldquoinnite parameterrdquo (or nonparametric) systemthe system is intuitively and quantitatively much richer and more complex

Subextensive entropy also grows as a power law in a nitely parameter-izable system with a growing number of parameters For example supposethat we approximate the distribution of a random variable by a histogramwith K bins and we let K grow with the quantity of available samples asK raquo Nordm A straightforward generalization of equation 445 seems to implythen that S1(N) raquo Nordm log N (Hall amp Hannan 1988 Barron amp Cover 1991)While not necessarily wrong analogies of this type are dangerous If thenumber of parameters grows constantly then the scenario where the dataoverwhelm all the unknowns is far from certain Observations may providemuch less information about features that were introduced into the model atsome large N than about those that have existed since the very rst measure-ments Indeed in a K-parameter system the Nth sample point contributesraquo K2N bits to the subextensive entropy (cf equation 445) If K changesas mentioned the Nth example then carries raquo Nordmiexcl1 bits Summing this upover all samples we nd S

(a)1 raquo Nordm and if we let ordm D m (m C 1) we obtain

equation 466 note that the growth of the number of parameters is slowerthan N (ordm D m (m C 1) lt 1) which makes sense Rissanen Speed and Yu(1992) made a similar observation According to them for models with in-creasing number of parameters predictive codes which are optimal at eachparticular N (cf equation 38) provide a strictly shorter coding length thannonpredictive codes optimized for all data simultaneously This has to becontrasted with the nite-parameter model classes for which these codesare asymptotically equal

Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As m increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with m. However, if m → ∞, the problem is too complex for learning: in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.

4.6 Beyond Finite Parameterization: Example. While literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l_0) exp[−φ(x)] so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\lambda}{2} \int dx\, (\partial_x \phi)^2 \right] \delta\left[ \frac{1}{l_0} \int dx\, e^{-\phi(x)} - 1 \right],   (4.67)

where Z is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and λ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.11

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.
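To make the prior of equation 4.67 concrete, the following sketch (our own illustration, not the numerics of the papers cited above) draws sample densities from it: the quadratic action makes φ(x) a Brownian path with increment variance dx/λ, and the delta-function constraint is imposed here simply by normalizing each sample, which is adequate for visualization though not an exact sampling scheme.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_density(lam=1.0, L=1.0, n_grid=1000):
        # Sample phi(x) as a Gaussian random walk consistent with the action
        # (lambda/2) * integral (d phi/dx)^2 dx, then normalize Q ~ exp(-phi).
        dx = L / n_grid
        dphi = rng.normal(scale=np.sqrt(dx / lam), size=n_grid)
        phi = np.cumsum(dphi)
        q = np.exp(-phi)
        q /= q.sum() * dx
        return np.linspace(0.0, L, n_grid), q

    # Smaller lambda means a rougher phi(x) and hence a more ragged density.
    for lam in (0.1, 1.0, 10.0):
        x, q = sample_density(lam=lam)
        print(f"lambda = {lam}: max(Q)/min(Q) = {q.max() / q.min():.1f}")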

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l_0) exp[−φ(x)], we can write the KL divergence as

D_{KL}[\bar\phi(x) \| \phi(x)] = \frac{1}{l_0} \int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)].   (4.68)

We want to compute the density

\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( D - D_{KL}[\bar\phi(x)\|\phi(x)] \big)   (4.69)

= M \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( MD - M D_{KL}[\bar\phi(x)\|\phi(x)] \big),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar\phi) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)]\, \exp(-izM D_{KL})   (4.71)

= M \int \frac{dz}{2\pi} \exp\left( izMD + \frac{izM}{l_0} \int dx\, \bar\phi(x) \exp[-\bar\phi(x)] \right)
  \times \int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar\phi(x)] \right).   (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar\phi(x)] \right)
= \exp\left( -\frac{izM}{l_0} \int dx\, \bar\phi(x) \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM] \right),   (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{\lambda}{2} \int dx\, (\partial_x \bar\phi)^2 + \frac{1}{2} \left( \frac{izM}{\lambda l_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2].   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar\phi(x)]}{D} \right),   (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi \lambda l_0}} \exp\left[ -\frac{\lambda}{2} \int dx\, (\partial_x \bar\phi)^2 \right] \times \int dx\, \exp[-\bar\phi(x)/2],   (4.76)

B[\bar\phi(x)] = \frac{1}{16 \lambda l_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^2.   (4.77)
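For orientation, the step from this essential singularity to the N^{1/2} growth quoted below follows from the steepest-descent evaluation of the Laplace transform of ρ(D; φ̄), exactly as in the previous section; restated in the present notation (the identification of the subextensive entropy with this quantity is the one used there), it is

\int_0^\infty dD\, D^{-3/2} \exp\left( -\frac{B}{D} - ND \right) = \sqrt{\frac{\pi}{B}}\, e^{-2\sqrt{BN}},
\qquad\Rightarrow\qquad
-\log_2 \int_0^\infty dD\, \rho(D; \bar\phi)\, e^{-ND} \approx \frac{2}{\ln 2}\sqrt{B[\bar\phi(x)]\, N} + O(\log N),

and substituting B[φ̄(x)] from equation 4.77 and averaging over φ̄ reproduces equation 4.78.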

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with m = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \approx \frac{1}{2 \ln 2\, \sqrt{\lambda l_0}} \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} N^{1/2},   (4.78)

where ⟨· · ·⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \approx \frac{1}{2 \ln 2} \sqrt{N} \left( \frac{L}{\lambda} \right)^{1/2} \ \text{bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(λ/(N Q(x))) (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
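As a quick consistency check (our own sketch, not from the papers cited above), counting bins of local size √(λ/(N Q(x))) for the uniform distribution on [0, L] reproduces the √(NL/λ) scaling of equation 4.79:

    import numpy as np

    def n_bins(N, lam=1.0, L=1.0, n_grid=10000):
        # Number of adaptive bins: integral of dx / bin_size(x), with
        # bin_size(x) = sqrt(lambda / (N Q(x))) and Q(x) = 1/L (uniform).
        x = np.linspace(0.0, L, n_grid)
        Q = np.full_like(x, 1.0 / L)
        bin_size = np.sqrt(lam / (N * Q))
        return np.sum((L / n_grid) / bin_size)

    for N in (10 ** 2, 10 ** 4, 10 ** 6):
        # The ratio to sqrt(N L / lambda) should be constant (equal to 1 here).
        print(N, n_bins(N) / np.sqrt(N * 1.0 / 1.0))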

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities n ∼ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter λ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of λ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing λ to vary with x. In this case the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn); but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.12 Rissanen (1984) and Clarke and Barron (1990), in full generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number N(N) of walks with N steps, we find that N(N) ∼ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) → γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
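Written out explicitly (our notation, with \mathcal{N}(N) the number of walks quoted above), the decomposition reads

S(N) = \log_2 \mathcal{N}(N) \approx N \log_2 z + \gamma \log_2 N + \log_2 A,

so that the extensive piece N \log_2 z and the constant \log_2 A depend on the details of the lattice, while only the coefficient γ of the divergent term is universal.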

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],   (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right),   (5.2)

which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).
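For a discrete toy problem, this curve can be traced out with the self-consistent equations of Tishby, Pereira, and Bialek (1999); the sketch below is a minimal implementation in which x plays the role of x_past, y of x_future, and t of the compressed variable x̂_past (the toy joint distribution and all variable names are our own choices, not taken from that paper).

    import numpy as np

    def information_bottleneck(p_xy, n_t, beta, n_iter=200, seed=0):
        rng = np.random.default_rng(seed)
        p_x = p_xy.sum(axis=1)                       # marginal over the past
        p_y_given_x = p_xy / p_x[:, None]
        q_t_given_x = rng.dirichlet(np.ones(n_t), size=p_xy.shape[0])  # random init

        for _ in range(n_iter):
            q_t = p_x @ q_t_given_x                                    # p(t)
            q_y_given_t = (q_t_given_x * p_x[:, None]).T @ p_y_given_x
            q_y_given_t /= q_t[:, None]                                # p(y|t)
            # KL divergence D[p(y|x) || p(y|t)] for every pair (x, t)
            log_ratio = np.log(p_y_given_x[:, None, :] / q_y_given_t[None, :, :])
            d_kl = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=2)
            q_t_given_x = q_t[None, :] * np.exp(-beta * d_kl)
            q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

        return q_t_given_x

    # Toy joint distribution p(x_past, x_future); larger beta keeps more
    # predictive information at the cost of less compression.
    p_xy = np.array([[0.30, 0.05], [0.05, 0.30], [0.15, 0.15]])
    print(information_bottleneck(p_xy, n_t=2, beta=5.0).round(2))

Sweeping the tradeoff parameter beta and recording I(t; x) and I(t; y) for the converged solutions traces out the optimal curve described above.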

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ∼ A N^ν and one with K parameters so that I_pred(N) ∼ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \sim \left[ \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}.   (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
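Putting rough numbers into equation 6.1 makes the point concrete; in the sketch below, only K ∼ 5 × 10^9 and N ∼ 10^9 come from the argument above, while the values of A and ν are purely illustrative assumptions.

    import math

    def N_c(K, A, nu):
        # Equation 6.1: N_c ~ [ K/(2 A nu) * log2( K/(2A) ) ]^(1/nu)
        return (K / (2.0 * A * nu) * math.log2(K / (2.0 * A))) ** (1.0 / nu)

    K, N = 5e9, 1e9          # synapse count and lifetime example count from the text
    for A, nu in [(1.0, 0.5), (10.0, 0.5), (100.0, 0.9)]:
        print(f"A = {A}, nu = {nu}: N_c ~ {N_c(K, A, nu):.1e}  (N/N_c ~ {N / N_c(K, A, nu):.1e})")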

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
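The computation implied by Shannon's design is simple; the sketch below (with made-up guess counts, purely to show the bookkeeping) turns a record of the number of guesses until the correct letter into an upper bound on the conditional entropy per letter.

    import numpy as np

    def guessing_entropy_bound(guess_counts):
        # Entropy (in bits) of the empirical distribution of the number of
        # guesses needed; this upper-bounds the conditional entropy per letter.
        counts = np.bincount(guess_counts)[1:]
        p = counts[counts > 0] / counts.sum()
        return -np.sum(p * np.log2(p))

    # Hypothetical record: guesses needed for each letter after seeing the
    # preceding N letters of text.
    guesses = np.array([1, 1, 2, 1, 1, 3, 1, 1, 1, 2, 1, 5, 1, 1, 2, 1, 1, 1, 4, 1])
    print(f"conditional entropy <= {guessing_entropy_bound(guesses):.2f} bits per letter")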

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/ followed by the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.

Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.

Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.

Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.

Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.

Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.

Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.

Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.

Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.

Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.

Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.

Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.

Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.

Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.

Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.

Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.

Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.

Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.

Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.

Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.

Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.

Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.

Carpenter R H S amp Williams M L L (1995) Neural computation of loglikelihood in control of saccadic eye movements Nature 377 59ndash62

Cesa-Bianchi N amp Lugosi G (in press) Worst-case bounds for the logarithmicloss of predictors Machine Learning

Chaitin G J (1975) A theory of program size formally identical to informationtheory J Assoc Comp Mach 22 329ndash340

Clarke B S amp Barron A R (1990) Information-theoretic asymptotics of Bayesmethods IEEE Trans Inf Thy 36 453ndash471

Cover T M amp King R C (1978)A convergent gambling estimate of the entropyof English IEEE Trans Inf Thy 24 413ndash421

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Crutcheld J P amp Feldman D P (1997) Statistical complexity of simple 1-Dspin systems Phys Rev E 55 1239ndash1243R See also cond-mat9702191

2460 W Bialek I Nemenman and N Tishby

Crutcheld J P Feldman D P amp Shalizi C R (2000) Comment I onldquoSimple measure for complexityrdquo Phys Rev E 62 2996ndash2997 See also chao-dyn9907001

Crutcheld J P amp Shalizi C R (1999)Thermodynamic depth of causal statesObjective complexity via minimal representation Phys Rev E 59 275ndash283See also cond-mat9808147

Crutcheld J P amp Young K (1989) Inferring statistical complexity Phys RevLett 63 105ndash108

Eagleman D M amp Sejnowski T J (2000) Motion integration and postdictionin visual awareness Science 287 2036ndash2038

Ebeling W amp Poschel T (1994)Entropy and long-range correlations in literaryEnglish Europhys Lett 26 241ndash246

Feldman D P amp Crutcheld J P (1998) Measures of statistical complexityWhy Phys Lett A 238 244ndash252 See also cond-mat9708186

Gell-Mann M amp Lloyd S (1996) Information measures effective complexityand total information Complexity 2 44ndash52

de Gennes PndashG (1979) Scaling concepts in polymer physics Ithaca NY CornellUniversity Press

Grassberger P (1986) Toward a quantitative theory of self-generated complex-ity Int J Theor Phys 25 907ndash938

Grassberger P (1991) Information and complexity measures in dynamical sys-tems In H Atmanspacher amp H Schreingraber (Eds) Information dynamics(pp 15ndash33) New York Plenum Press

Hall P amp Hannan E (1988)On stochastic complexity and nonparametric den-sity estimation Biometrica 75 705ndash714

Haussler D Kearns M amp Schapire R (1994)Bounds on the sample complex-ity of Bayesian learning using information theory and the VC dimensionMachine Learning 14 83ndash114

Haussler D Kearns M Seung S amp Tishby N (1996)Rigorous learning curvebounds from statistical mechanics Machine Learning 25 195ndash236

Haussler D amp Opper M (1995) General bounds on the mutual informationbetween a parameter and n conditionally independent events In Proceedingsof the Eighth Annual Conference on ComputationalLearning Theory (pp 402ndash411)New York ACM Press

Haussler D amp Opper M (1997) Mutual information metric entropy and cu-mulative relative entropy risk Ann Statist 25 2451ndash2492

Haussler D amp Opper M (1998) Worst case prediction over sequences un-der log loss In G Cybenko D OrsquoLeary amp J Rissanen (Eds) The mathe-matics of information coding extraction and distribution New York Springer-Verlag

Hawken M J amp Parker A J (1991)Spatial receptive eld organization in mon-key V1 and its relationship to the cone mosaic InM S Landy amp J A Movshon(Eds) Computational models of visual processing (pp 83ndash93) Cambridge MAMIT Press

Herschkowitz D amp Nadal JndashP (1999)Unsupervised and supervised learningMutual information between parameters and observations Phys Rev E 593344ndash3360

Predictability Complexity and Learning 2461

Hilberg W (1990) The well-known lower bound of information in written lan-guage Is it a misinterpretation of Shannon experiments Frequenz 44 243ndash248 (in German)

Holy T E (1997) Analysis of data from continuous probability distributionsPhys Rev Lett 79 3545ndash3548 See also physics9706015

Ibragimov I amp Hasminskii R (1972) On an information in a sample about aparameter In Second International Symposium on Information Theory (pp 295ndash309) New York IEEE Press

Kang K amp Sompolinsky H (2001) Mutual information of population codes anddistancemeasures in probabilityspace Preprint Available at cond-mat0101161

Kemeney J G (1953)The use of simplicity in induction Philos Rev 62 391ndash315Kolmogoroff A N (1939) Sur lrsquointerpolation et extrapolations des suites sta-

tionnaires C R Acad Sci Paris 208 2043ndash2045Kolmogorov A N (1941) Interpolation and extrapolation of stationary random

sequences Izv Akad Nauk SSSR Ser Mat 5 3ndash14 (in Russian) Translation inA N Shiryagev (Ed) Selectedworks of A N Kolmogorov (Vol 23 pp 272ndash280)Norwell MA Kluwer

Kolmogorov A N (1965) Three approaches to the quantitative denition ofinformation Prob Inf Trans 1 4ndash7

Li M amp Vit Acircanyi P (1993) An Introduction to Kolmogorov complexity and its appli-cations New York Springer-Verlag

Lloyd S amp Pagels H (1988)Complexity as thermodynamic depth Ann Phys188 186ndash213

Logothetis N K amp Sheinberg D L (1996) Visual object recognition AnnuRev Neurosci 19 577ndash621

Lopes L L amp Oden G C (1987) Distinguishing between random and non-random events J Exp Psych Learning Memory and Cognition 13 392ndash400

Lopez-Ruiz R Mancini H L amp Calbet X (1995) A statistical measure ofcomplexity Phys Lett A 209 321ndash326

MacKay D J C (1992) Bayesian interpolation Neural Comp 4 415ndash447Nemenman I (2000) Informationtheory and learningA physicalapproach Unpub-

lished doctoral dissertation Princeton University See also physics0009032Nemenman I amp Bialek W (2001) Learning continuous distributions Simu-

lations with eld theoretic priors In T K Leen T G Dietterich amp V Tresp(Eds) Advances in neural information processing systems 13 (pp 287ndash295)MITPress Cambridge MA MIT Press See also cond-mat0009165

Opper M (1994) Learning and generalization in a two-layer neural networkThe role of the VapnikndashChervonenkis dimension Phys Rev Lett 72 2113ndash2116

Opper M amp Haussler D (1995) Bounds for predictive errors in the statisticalmechanics of supervised learning Phys Rev Lett 75 3772ndash3775

Periwal V (1997)Reparametrization invariant statistical inference and gravityPhys Rev Lett 78 4671ndash4674 See also hep-th9703135

Periwal V (1998) Geometrical statistical inference Nucl Phys B 554[FS] 719ndash730 See also adap-org9801001

Poschel T Ebeling W amp Ros Acirce H (1995) Guessing probability distributionsfrom small samples J Stat Phys 80 1443ndash1452

2462 W Bialek I Nemenman and N Tishby

Reinagel P amp Reid R C (2000) Temporal coding of visual information in thethalamus J Neurosci 20 5392ndash5400

Renyi A (1964) On the amount of information concerning an unknown pa-rameter in a sequence of observations Publ Math Inst Hungar Acad Sci 9617ndash625

Rieke F Warland D de Ruyter van Steveninck R amp Bialek W (1997) SpikesExploring the Neural Code (MIT Press Cambridge)

Rissanen J (1978) Modeling by shortest data description Automatica 14 465ndash471

Rissanen J (1984) Universal coding information prediction and estimationIEEE Trans Inf Thy 30 629ndash636

Rissanen J (1986) Stochastic complexity and modeling Ann Statist 14 1080ndash1100

Rissanen J (1987)Stochastic complexity J Roy Stat Soc B 49 223ndash239253ndash265Rissanen J (1989) Stochastic complexity and statistical inquiry Singapore World

ScienticRissanen J (1996) Fisher information and stochastic complexity IEEE Trans

Inf Thy 42 40ndash47Rissanen J Speed T amp Yu B (1992) Density estimation by stochastic com-

plexity IEEE Trans Inf Thy 38 315ndash323Saffran J R Aslin R N amp Newport E L (1996) Statistical learning by 8-

month-old infants Science 274 1926ndash1928Saffran J R Johnson E K Aslin R H amp Newport E L (1999) Statistical

learning of tone sequences by human infants and adults Cognition 70 27ndash52

Schmidt D M (2000) Continuous probability distributions from nite dataPhys Rev E 61 1052ndash1055

Seung H S Sompolinsky H amp Tishby N (1992)Statistical mechanics of learn-ing from examples Phys Rev A 45 6056ndash6091

Shalizi C R amp Crutcheld J P (1999) Computational mechanics Pattern andprediction structure and simplicity Preprint Available at cond-mat9907176

Shalizi C R amp Crutcheld J P (2000) Information bottleneck causal states andstatistical relevance bases How to represent relevant information in memorylesstransduction Available at nlinAO0006025

Shannon C E (1948) A mathematical theory of communication Bell Sys TechJ 27 379ndash423 623ndash656 Reprinted in C E Shannon amp W Weaver The mathe-matical theory of communication Urbana University of Illinois Press 1949

Shannon C E (1951) Prediction and entropy of printed English Bell Sys TechJ 30 50ndash64 Reprinted in N J A Sloane amp A D Wyner (Eds) Claude ElwoodShannon Collected papers New York IEEE Press 1993

Shiner J Davison M amp Landsberger P (1999) Simple measure for complexityPhys Rev E 59 1459ndash1464

Smirnakis S Berry III M J Warland D K Bialek W amp Meister M (1997)Adaptation of retinal processing to image contrast and spatial scale Nature386 69ndash73

Sole R V amp Luque B (1999) Statistical measures of complexity for strongly inter-acting systems Preprint Available at adap-org9909002

Predictability Complexity and Learning 2463

Solomonoff R J (1964) A formal theory of inductive inference Inform andControl 7 1ndash22 224ndash254

Strong S P Koberle R de Ruyter van Steveninck R amp Bialek W (1998)Entropy and information in neural spike trains Phys Rev Lett 80 197ndash200

Tishby N Pereira F amp Bialek W (1999) The information bottleneck methodIn B Hajek amp R S Sreenivas (Eds) Proceedings of the 37th Annual AllertonConference on Communication Control and Computing (pp 368ndash377) UrbanaUniversity of Illinois See also physics0004057

Vapnik V (1998) Statistical learning theory New York WileyVit Acircanyi P amp Li M (2000)Minimum description length induction Bayesianism

and Kolmogorov complexity IEEE Trans Inf Thy 46 446ndash464 See alsocsLG9901014

WallaceC Samp Boulton D M (1968)An information measure forclassicationComp J 11 185ndash195

Weigend A Samp Gershenfeld NA eds (1994)Time seriespredictionForecastingthe future and understanding the past Reading MA AddisonndashWesley

Wiener N (1949) Extrapolation interpolation and smoothing of time series NewYork Wiley

Wolfram S (1984) Computation theory of cellular automata Commun MathPhysics 96 15ndash57

Wong W amp Shen X (1995) Probability inequalities for likelihood ratios andconvergence rates of sieve MLErsquos Ann Statist 23 339ndash362

Received October 11, 2000; accepted January 31, 2001.



to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.

The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description (effectively increasing the number of parameters) as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.

We write Q(x) = (1/l_0) exp[-φ(x)], so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

P[\phi(x)] = \frac{1}{\mathcal{Z}} \exp\left[ -\frac{\ell}{2} \int dx \left( \frac{\partial\phi}{\partial x} \right)^2 \right] \delta\!\left[ \frac{1}{l_0} \int dx\, e^{-\phi(x)} - 1 \right],    (4.67)

where \mathcal{Z} is the normalization constant for P[φ], the delta function ensures that each distribution Q(x) is normalized, and ℓ sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of P[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem, see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.11
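As a concrete illustration (our own minimal numerical sketch, not part of the original analysis), one can draw sample densities from the prior in equation 4.67 on a grid. The grid size, the value of ℓ, and the free boundary condition at x = 0 are arbitrary choices here; the overall constant of φ is a flat direction of the action, so it can be fixed afterward by normalizing Q.

import numpy as np

rng = np.random.default_rng(0)
L, n, ell = 1.0, 512, 0.05          # box size, grid points, smoothness scale (assumed values)
dx = L / n

# The action (ell/2) * integral of (dphi/dx)^2 discretizes to a Gaussian random walk:
# increments phi_{i+1} - phi_i ~ Normal(0, dx/ell).
phi = np.cumsum(rng.normal(0.0, np.sqrt(dx / ell), size=n))
Q = np.exp(-phi)
Q /= Q.sum() * dx                    # enforce the delta-function constraint: integral of Q dx = 1

print(Q.min(), Q.max(), (Q * dx).sum())   # a smooth, normalized random density

Drawing N samples from such a Q(x) and asking how well a Bayesian observer equipped with the same prior can reconstruct it is exactly the learning problem analyzed below.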

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l_0) exp[-φ(x)], we can write the KL divergence as

D_{KL}[\bar\phi(x) \| \phi(x)] = \frac{1}{l_0} \int dx\, \exp[-\bar\phi(x)]\, [\phi(x) - \bar\phi(x)].    (4.68)

We want to compute the density

\rho(D; \bar\phi) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( D - D_{KL}[\bar\phi(x)\|\phi(x)] \big)    (4.69)
              = M \int [d\phi(x)]\, P[\phi(x)]\, \delta\big( MD - M D_{KL}[\bar\phi(x)\|\phi(x)] \big),    (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar\phi) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)] \exp(-izM D_{KL})    (4.71)
              = M \int \frac{dz}{2\pi} \exp\!\left( izMD + \frac{izM}{l_0} \int dx\, \bar\phi(x) \exp[-\bar\phi(x)] \right)
                \times \int [d\phi(x)]\, P[\phi(x)] \exp\!\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar\phi(x)] \right).    (4.72)

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)] \exp\!\left( -\frac{izM}{l_0} \int dx\, \phi(x) \exp[-\bar\phi(x)] \right)
   = \exp\!\left( -\frac{izM}{l_0} \int dx\, \bar\phi(x) \exp[-\bar\phi(x)] - S_{\rm eff}[\bar\phi(x); zM] \right),    (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{\ell}{2} \int dx \left( \frac{\partial\bar\phi}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{\ell l_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2].    (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} ≫ 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\!\left( -\frac{B[\bar\phi(x)]}{D} \right),    (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi \ell l_0}} \exp\!\left[ -\frac{\ell}{2} \int dx\, (\partial_x \bar\phi)^2 \right] \times \int dx\, \exp[-\bar\phi(x)/2],    (4.76)

B[\bar\phi(x)] = \frac{1}{16 \ell l_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^2.    (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \approx \frac{1}{2 \ln 2\, \sqrt{\ell\, l_0}} \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} N^{1/2},    (4.78)

where ⟨· · ·⟩_φ̄ denotes an average over the target distributions φ̄(x), weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \approx \frac{1}{2 \ln 2} \sqrt{N} \left( \frac{L}{\ell} \right)^{1/2} \ \text{bits}.    (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size \sqrt{\ell/[N Q(x)]} (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
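As a quick consistency check (our own arithmetic, using only the bin size just quoted), counting the bins across a box of size L with the uniform Q(x) = 1/L reproduces the scaling of equation 4.79 up to the numerical prefactor:

N_{\rm bins} = \int_0^L \frac{dx}{\sqrt{\ell/[N Q(x)]}} = \int_0^L dx\, \sqrt{\frac{N Q(x)}{\ell}} \;\longrightarrow\; L \sqrt{\frac{N}{\ell L}} = \sqrt{\frac{N L}{\ell}} \;\propto\; S_1^{(a)}(N).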

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities n ∼ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter ℓ describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of ℓ (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing ℓ to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn); but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.12 Rissanen (1984) and Clarke and Barron (1990), in full generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.
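As a minimal numerical check of that last statement (our own illustration, not a calculation from the paper), take the simplest K = 1 example: a coin whose unknown bias is drawn from a uniform prior. The exact entropy of N flips, minus the extensive part, should grow like (1/2) log_2 N, and the printed difference from that curve settles to a constant.

import numpy as np
from scipy.special import gammaln

def subextensive_entropy(N):
    # S_1(N) = S(N) - N*S_0 for coin flips with a uniform prior on the bias.
    k = np.arange(N + 1)
    # marginal probability of any particular sequence with k heads:
    #   p_k = integral_0^1 q^k (1-q)^(N-k) dq = k! (N-k)! / (N+1)!
    log_pk = gammaln(k + 1) + gammaln(N - k + 1) - gammaln(N + 2)
    log_nseq = gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)   # log C(N,k)
    S_total = -np.sum(np.exp(log_nseq + log_pk) * log_pk) / np.log(2)  # bits
    S0 = 1.0 / (2.0 * np.log(2))   # average coin entropy over the uniform prior, bits per flip
    return S_total - N * S0

for N in (10, 100, 1000, 10000):
    print(N, round(subextensive_entropy(N), 3), round(0.5 * np.log2(N), 3))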

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986), in particular, has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles, in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number \mathcal{N}(N) of walks with N steps, we find that \mathcal{N}(N) ≈ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) ≈ γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
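Written out term by term (our own rearrangement of the counting just quoted, nothing beyond it), the decomposition reads

S(N) = \log_2 \mathcal{N}(N) \approx N \log_2 z + \gamma \log_2 N + \log_2 A,

with the extensive piece, the divergent subextensive piece, and the constant appearing in that order; only the coefficient γ of the logarithm is universal.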

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],    (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right),    (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).
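A minimal sketch of this trade-off for a discrete toy joint distribution of past and future (our own illustration, using the self-consistent equations of Tishby, Pereira, & Bialek, 1999; the particular numbers and variable names below are arbitrary assumptions):

import numpy as np

EPS = 1e-12

def information_bottleneck(p_xy, n_t, beta, n_iter=300, seed=0):
    # Iterate the information bottleneck equations on a discrete joint
    # distribution p_xy[x, y]: x is the "past", y the "future", t the
    # compressed description.  Returns p(t|x), I(T;X), I(T;Y) in bits.
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    p_y_x = p_xy / p_x[:, None]                       # p(y|x)
    q_t_x = rng.random((p_xy.shape[0], n_t))          # soft assignments p(t|x)
    q_t_x /= q_t_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        q_t = p_x @ q_t_x                             # p(t)
        q_y_t = (q_t_x * p_x[:, None]).T @ p_y_x / q_t[:, None]   # p(y|t)
        # KL[p(y|x) || p(y|t)] for every pair (x, t)
        kl = np.sum(p_y_x[:, None, :] *
                    np.log((p_y_x[:, None, :] + EPS) / q_y_t[None, :, :]), axis=2)
        log_q = np.log(q_t)[None, :] - beta * kl      # p(t|x) proportional to p(t) exp(-beta*KL)
        log_q -= log_q.max(axis=1, keepdims=True)
        q_t_x = np.exp(log_q)
        q_t_x /= q_t_x.sum(axis=1, keepdims=True)
    q_t = p_x @ q_t_x
    q_y_t = (q_t_x * p_x[:, None]).T @ p_y_x / q_t[:, None]
    i_tx = np.sum(p_x[:, None] * q_t_x * np.log2(q_t_x / q_t[None, :] + EPS))
    i_ty = np.sum(q_t[:, None] * q_y_t * np.log2((q_y_t + EPS) / p_y[None, :]))
    return q_t_x, i_tx, i_ty

# A tiny, made-up joint distribution of (past, future).  Sweeping beta traces
# out the trade-off curve I(T;X) versus I(T;Y) described in the text.
p_xy = np.array([[0.30, 0.05], [0.25, 0.10], [0.05, 0.25]])
for beta in (0.5, 2.0, 10.0):
    _, i_tx, i_ty = information_bottleneck(p_xy, n_t=2, beta=beta)
    print(f"beta = {beta}: I(T;X) = {i_tx:.3f} bits, I(T;Y) = {i_ty:.3f} bits")

Small beta keeps little about the past and therefore little about the future; large beta approaches the full predictive information contained in this toy distribution.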

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ Ipred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = Ipred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997), and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series, or learning problems, beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ≈ A N^ν and one with K parameters so that I_pred(N) ≈ (K/2) log₂ N, the infinite parameter model has less predictive information for all N smaller than some critical value,

$$
N_c \approx \left[ \frac{K}{2 A \nu \log_2\!\left( \frac{K}{2A} \right)} \right]^{1/\nu}. \qquad (6.1)
$$

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
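As a rough numerical illustration of this crossover, the following sketch compares the two asymptotic forms of the predictive information and evaluates the critical sample size of equation 6.1. The particular values of K, A, and ν are arbitrary choices made only for the illustration, not quantities taken from the text.

```python
import numpy as np

# Compare the two asymptotic forms of the predictive information:
# a K-parameter model, (K/2) log2 N, versus a nonparametric model, A N^nu.
K, A, nu = 1.0e4, 1.0, 0.5          # illustrative values only

# Critical sample size from equation 6.1
N_c = (K / (2 * A * nu * np.log2(K / (2 * A)))) ** (1.0 / nu)

for N in np.logspace(2, 12, 6):
    I_param = 0.5 * K * np.log2(N)      # (K/2) log2 N
    I_nonparam = A * N ** nu            # A N^nu
    smaller = "nonparametric" if I_nonparam < I_param else "parametric"
    print(f"N = {N:12.0f}   (K/2)log2 N = {I_param:12.1f}   A N^nu = {I_nonparam:12.1f}   smaller: {smaller}")

print(f"\nCrossover N_c ~ {N_c:.3g}")
```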

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex


devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ≈ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ≈ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
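A back-of-the-envelope check of these numbers is easy to write down. The waking-hours figure and the sample values of ν and A below are assumptions made only for this illustration.

```python
import numpy as np

# Rough check of the estimates quoted above: about three saccades per second
# over ~10 years of waking life (taken here as ~16 hours/day), compared with
# the crossover N_c of equation 6.1 for K ~ 5e9 synapses.
N = 3 * 10 * 365.25 * 16 * 3600       # number of fixations ("examples"), ~6e8
K = 5e9                                # synapses in ~10 mm^2 of cortex

for nu, A in [(0.5, 1.0), (0.75, 10.0)]:   # illustrative exponents and prefactors
    N_c = (K / (2 * A * nu * np.log2(K / (2 * A)))) ** (1.0 / nu)
    print(f"nu={nu}, A={A}:  N ~ {N:.1e},  N_c ~ {N_c:.1e},  N < N_c: {N < N_c}")
```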

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability: Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the previous N (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S₀. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S₁(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
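The mechanics of Shannon's guessing game are simple to simulate. In the toy sketch below, a bigram model stands in for the human reader and a short repeated phrase stands in for the corpus; both are placeholders chosen only to make the example self-contained, so the resulting number says nothing about English itself.

```python
from collections import Counter, defaultdict
import math

# Toy version of Shannon's guessing game: at each position the "reader"
# guesses letters in order of decreasing conditional probability, and the
# entropy of the distribution of guess counts upper-bounds the conditional
# entropy of the source.
text = "the quick brown fox jumps over the lazy dog " * 200   # placeholder corpus

bigram = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    bigram[a][b] += 1

guess_counts = Counter()
for a, b in zip(text, text[1:]):
    ranking = [c for c, _ in bigram[a].most_common()]
    guess_counts[ranking.index(b) + 1] += 1       # how many guesses were needed

total = sum(guess_counts.values())
H_guesses = -sum((n / total) * math.log2(n / total) for n in guess_counts.values())
print(f"Entropy of the guess distribution: {H_guesses:.3f} bits/letter (upper bound)")
```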

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process; of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/..., where ... is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.


Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Scheingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An Introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



opment of singular peaks at the location of observed data points (Bialek etal 1996) Further developments of the theory including alternative choicesof P[w (x)] have been given by Periwal (1997 1998) Holy (1997) Aida (1999)and Schmidt (2000) for a detailed numerical investigation of this problemsee Nemenman and Bialek (2001) Our goal here is to be illustrative ratherthan exhaustive11

From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.

With Q(x) = (1/l₀) exp[−φ(x)], we can write the KL divergence as

$$
D_{\rm KL}[\bar{\phi}(x) \,\|\, \phi(x)] = \frac{1}{l_0} \int dx\, e^{-\bar{\phi}(x)} \left[ \phi(x) - \bar{\phi}(x) \right]. \qquad (4.68)
$$

We want to compute the density

$$
\rho(D; \bar{\phi}) = \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left( D - D_{\rm KL}[\bar{\phi}(x) \,\|\, \phi(x)] \right) \qquad (4.69)
$$
$$
\qquad\quad = M \int [d\phi(x)]\, P[\phi(x)]\, \delta\!\left( MD - M D_{\rm KL}[\bar{\phi}(x) \,\|\, \phi(x)] \right), \qquad (4.70)
$$

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

$$
\rho(D; \bar{\phi}) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\phi(x)]\, P[\phi(x)] \exp(-izM D_{\rm KL}) \qquad (4.71)
$$
$$
= M \int \frac{dz}{2\pi} \exp\!\left( izMD + \frac{izM}{l_0} \int dx\, \bar{\phi}(x)\, e^{-\bar{\phi}(x)} \right)
\times \int [d\phi(x)]\, P[\phi(x)] \exp\!\left( -\frac{izM}{l_0} \int dx\, \phi(x)\, e^{-\bar{\phi}(x)} \right). \qquad (4.72)
$$

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations around the saddle point. The result is that

$$
\int [d\phi(x)]\, P[\phi(x)] \exp\!\left( -\frac{izM}{l_0} \int dx\, \phi(x)\, e^{-\bar{\phi}(x)} \right)
= \exp\!\left( -\frac{izM}{l_0} \int dx\, \bar{\phi}(x)\, e^{-\bar{\phi}(x)} - S_{\rm eff}[\bar{\phi}(x); zM] \right), \qquad (4.73)
$$
$$
S_{\rm eff}[\bar{\phi}; zM] = \frac{l}{2} \int dx \left( \frac{\partial \bar{\phi}}{\partial x} \right)^2
+ \frac{1}{2} \left( \frac{izMl}{l_0} \right)^{1/2} \int dx\, e^{-\bar{\phi}(x)/2}. \qquad (4.74)
$$

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} ≫ 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

$$
\rho(D \to 0; \bar{\phi}) = A[\bar{\phi}(x)]\, D^{-3/2} \exp\!\left( -\frac{B[\bar{\phi}(x)]}{D} \right), \qquad (4.75)
$$
$$
A[\bar{\phi}(x)] = \frac{1}{\sqrt{16\pi l l_0}} \exp\!\left[ -\frac{l}{2} \int dx\, \left( \partial_x \bar{\phi} \right)^2 \right]
\times \int dx\, e^{-\bar{\phi}(x)/2}, \qquad (4.76)
$$
$$
B[\bar{\phi}(x)] = \frac{1}{16 l l_0} \left( \int dx\, e^{-\bar{\phi}(x)/2} \right)^2. \qquad (4.77)
$$

Except for the factor of D^{−3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{−3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

$$
S_1^{(a)}(N) \simeq \frac{1}{2 \ln 2} \sqrt{\frac{\pi}{l l_0}} \left\langle \int dx\, e^{-\bar{\phi}(x)/2} \right\rangle_{\bar{\phi}} N^{1/2}, \qquad (4.78)
$$

where ⟨· · ·⟩_φ̄ denotes an average over the target distributions φ̄(x) weighted once again by P[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

$$
S_1^{(a)}(N) \simeq \frac{1}{2 \ln 2} \sqrt{\pi N} \left( \frac{L}{l} \right)^{1/2} \ {\rm bits}. \qquad (4.79)
$$

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size √(l/(N Q(x))) (Bialek,


Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of the predictive information arises from the number of parameters in the problem depending on the number of samples.
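A small numerical sketch makes the proportionality explicit, under the assumption of a uniform Q(x) = 1/L on [0, L]; the values of L and l below are arbitrary choices for the illustration.

```python
import numpy as np

# Binning picture of equation 4.79: with bins of local size sqrt(l/(N Q(x)))
# and uniform Q(x) = 1/L, the number of bins grows as sqrt(N L / l),
# the same sqrt(N) scaling as the subextensive entropy S1 of equation 4.79.
L_box, l_smooth = 1.0, 0.01          # illustrative values, not from the text

for N in [10**2, 10**4, 10**6, 10**8]:
    bin_size = np.sqrt(l_smooth / (N * (1.0 / L_box)))   # sqrt(l/(N Q))
    n_bins = L_box / bin_size                            # ~ sqrt(N L / l)
    S1 = (1.0 / (2 * np.log(2))) * np.sqrt(np.pi * N * L_box / l_smooth)
    print(f"N={N:>9d}  bins ~ {n_bins:10.1f}  S1 (eq. 4.79) ~ {S1:10.1f} bits")
```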

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step roughly √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities n ∼ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter l describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of l (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing l to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn), but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures


(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.12 Rissanen (1984) and Clarke and Barron (1990) in full

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.


generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x1, x2, ..., xN will require an amount of space equal to the entropy of the joint distribution P(x1, x2, ..., xN). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x1, x2, ..., xN).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log₂ N bits to the leading order.
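A minimal numerical check of this (K/2) log₂ N behavior is possible for the simplest parametric family: coin flips with an unknown bias and a uniform prior on that bias (K = 1). The sketch below is only an illustration of the scaling; the N-independent additive constant is not meant to match any formula in the text.

```python
import math

# Coin flips with unknown bias theta, uniform prior on [0,1] (K = 1 parameter).
# A sequence with k heads out of N has marginal probability k!(N-k)!/(N+1)!,
# so the total entropy S(N) can be summed exactly over k.
def total_entropy(N):
    S = 0.0
    for k in range(N + 1):
        log2_binom = (math.lgamma(N + 1) - math.lgamma(k + 1)
                      - math.lgamma(N - k + 1)) / math.log(2)
        log2_p = (math.lgamma(k + 1) + math.lgamma(N - k + 1)
                  - math.lgamma(N + 2)) / math.log(2)
        S -= 2 ** (log2_binom + log2_p) * log2_p     # sum over all sequences
    return S

S0 = 1 / (2 * math.log(2))   # extensive entropy per flip: <H(theta)> over the prior
for N in [10, 100, 1000, 10000]:
    S1 = total_entropy(N) - N * S0                   # subextensive part
    print(f"N={N:6d}   S1(N) = {S1:7.3f} bits   (1/2) log2 N = {0.5 * math.log2(N):7.3f}")
```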

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our


approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense, our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on


a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S₀ and thus are functions only of disorder, which is not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986) in particular has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number 𝒩(N) of walks with N steps, we find that 𝒩(N) ≈ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S₀ = log₂ z, a constant subextensive entropy log₂ A, and a divergent subextensive term S₁(N) → γ log₂ N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
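For concreteness, the following sketch evaluates the three contributions for a specific case; the connective constant and exponent used are commonly quoted estimates for the two-dimensional square-lattice self-avoiding walk, quoted from memory rather than taken from the text, and A is set to 1 arbitrarily.

```python
import math

# Entropy decomposition S(N) = N log2 z + gamma log2 N + log2 A for lattice
# walks, using commonly quoted square-lattice self-avoiding-walk values
# (connective constant z ~ 2.638, critical exponent gamma = 43/32).
z, gamma, A = 2.638, 43.0 / 32.0, 1.0   # assumed values for the illustration

for N in [10, 100, 1000, 10000]:
    extensive = N * math.log2(z)          # N log2 z
    divergent = gamma * math.log2(N)      # gamma log2 N  (the universal part)
    constant = math.log2(A)               # log2 A
    print(f"N={N:6d}  extensive={extensive:10.1f}  divergent={divergent:6.2f}  constant={constant:4.1f}")
```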

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

$$
S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)], \qquad (5.1)
$$

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

$$
S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right), \qquad (5.2)
$$

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have


seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).
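The trade-off described here can be made concrete in a toy setting. The sketch below takes the "past" to be the last two symbols of a binary Markov chain and the "future" to be the next symbol, then scans hard clusterings of the four past states to show how information about the past and about the future are traded against each other. The chain, its transition probabilities, and the restriction to hard clusterings are all simplifying assumptions made for the illustration; the full information bottleneck uses soft assignments.

```python
import itertools
import numpy as np

# Past = last two symbols of a binary Markov chain, future = next symbol.
p_stay = 0.85                                 # illustrative transition probability
T = np.array([[p_stay, 1 - p_stay],
              [1 - p_stay, p_stay]])

# Joint distribution P(past=(a,b), future=c) for the stationary (uniform) chain.
P = np.zeros((4, 2))
for a in range(2):
    for b in range(2):
        for c in range(2):
            P[2 * a + b, c] = 0.5 * T[a, b] * T[b, c]

def mutual_info(J):
    """I(X;Y) in bits for a joint distribution J[x, y]."""
    px, py = J.sum(1, keepdims=True), J.sum(0, keepdims=True)
    mask = J > 0
    return float((J[mask] * np.log2(J[mask] / (px @ py)[mask])).sum())

# Scan all hard clusterings of the four past states; for each number of
# clusters keep the assignment that preserves the most future information.
best = {}
for labels in itertools.product(range(4), repeat=4):
    relabel = {c: i for i, c in enumerate(sorted(set(labels)))}
    k = len(relabel)
    J_past = np.zeros((k, 4))      # joint of (compressed past, past)
    J_future = np.zeros((k, 2))    # joint of (compressed past, future)
    for s in range(4):
        J_past[relabel[labels[s]], s] += P[s].sum()
        J_future[relabel[labels[s]]] += P[s]
    i_p, i_f = mutual_info(J_past), mutual_info(J_future)
    if k not in best or i_f > best[k][1]:
        best[k] = (i_p, i_f)

for k in sorted(best):
    i_p, i_f = best[k]
    print(f"{k} clusters: I(past_hat; past) = {i_p:5.3f} bits,  best I(past_hat; future) = {i_f:5.3f} bits")
```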

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet, one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world (as suggested by Attneave, 1954, Barlow, 1959, 1961, and others, e.g., Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve L(N).

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.

The learning curve L(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing L(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997), and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ~ A N^ν and one with K parameters so that Ipred(N) ~ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2A\nu} \log_2\left( \frac{K}{2A} \right) \right]^{1/\nu}.    (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ~ 5×10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ~ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
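
To get a feel for the numbers in equation 6.1, the following sketch is our own: only K ~ 5×10⁹ and N ~ 10⁹ are taken from the text, while the values of A and ν are purely illustrative guesses.

    import numpy as np

    def n_critical(K, A, nu):
        # Equation 6.1: N_c ~ [ K/(2*A*nu) * log2( K/(2*A) ) ]**(1/nu)
        return (K / (2.0 * A * nu) * np.log2(K / (2.0 * A))) ** (1.0 / nu)

    K = 5e9        # synapses in ~10 mm^2 of inferotemporal cortex (from the text)
    N = 1e9        # rough bound on the number of foveations in 10 years (from the text)
    for A, nu in [(1.0, 0.5), (10.0, 0.5), (1.0, 0.9)]:      # illustrative guesses
        Nc = n_critical(K, A, nu)
        print(f"A={A:5.1f}  nu={nu:.1f}  N_c = {Nc:.2e}   N < N_c: {N < Nc}")
        # At N examples the nonparametric model indeed has less predictive information:
        print(f"          A*N^nu = {A * N ** nu:.2e} bits   (K/2)*log2(N) = {0.5 * K * np.log2(N):.2e} bits")

For these (assumed) parameter values, N_c exceeds 10⁹ by many orders of magnitude, consistent with the claim that biological learning operates with N < N_c.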

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
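
Shannon's procedure is easy to simulate. In the sketch below (our own construction; a crude bigram predictor built from a toy corpus stands in for Shannon's human readers, and the corpus itself is arbitrary), each letter is converted into the number of guesses the predictor needs, and the entropy of that guess distribution is an upper bound on the conditional entropy of the next letter given the previous one.

    from collections import Counter, defaultdict
    import math

    text = ("the predictive information measures the amount of structure in a sequence "
            "and the structure is related to the fact that there is predictability "
            "along the sequence ") * 5        # stand-in corpus; any long text will do

    # Build a bigram "reader": for each context letter, rank candidate next letters
    # by their conditional frequency (this stands in for Shannon's native speakers).
    bigram = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        bigram[a][b] += 1
    alphabet = sorted(set(text))

    def ranked_guesses(context):
        counts = bigram[context]
        return sorted(alphabet, key=lambda c: (-counts[c], c))

    # Turn each letter into the number of guesses needed before it is identified.
    guess_counts = Counter()
    for a, b in zip(text, text[1:]):
        guess_counts[ranked_guesses(a).index(b) + 1] += 1

    total = sum(guess_counts.values())
    entropy_of_guesses = -sum((n / total) * math.log2(n / total)
                              for n in guess_counts.values())
    print(f"entropy of the guess distribution: {entropy_of_guesses:.2f} bits per letter")

With longer contexts and a better predictor, the bound tightens; the subextensive terms discussed in the text would show up as a slow decrease of this bound with context length N.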

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations, and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/, where the remainder of the URL is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy, and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



around the saddle point. The result is that

\int [d\phi(x)]\, P[\phi(x)]\, \exp\left( -\frac{izM}{l_0} \int dx\, \phi(x)\, e^{-\bar\phi(x)} \right)
    = \exp\left( -\frac{izM}{l_0} \int dx\, \bar\phi(x)\, e^{-\bar\phi(x)} - S_{\rm eff}[\bar\phi(x); zM] \right),    (4.73)

S_{\rm eff}[\bar\phi; zM] = \frac{l}{2} \int dx \left( \frac{\partial\bar\phi}{\partial x} \right)^2 + \frac{1}{2} \left( \frac{izM}{l\, l_0} \right)^{1/2} \int dx\, \exp[-\bar\phi(x)/2].    (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D \to 0 and M D^{3/2} \gg 1; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for \rho(D \to 0; \bar\phi). This results in

\rho(D \to 0; \bar\phi) = A[\bar\phi(x)]\, D^{-3/2} \exp\left( -\frac{B[\bar\phi(x)]}{D} \right),    (4.75)

A[\bar\phi(x)] = \frac{1}{\sqrt{16\pi\, l\, l_0}} \exp\left[ -\frac{l}{2} \int dx\, (\partial_x\bar\phi)^2 \right] \times \int dx\, \exp[-\bar\phi(x)/2],    (4.76)

B[\bar\phi(x)] = \frac{1}{16\, l\, l_0} \left( \int dx\, \exp[-\bar\phi(x)/2] \right)^2.    (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with \mu = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \approx \frac{1}{2 \ln 2\, \sqrt{l\, l_0}} \left\langle \int dx\, \exp[-\bar\phi(x)/2] \right\rangle_{\bar\phi} N^{1/2},    (4.78)

where \langle \cdots \rangle_{\bar\phi} denotes an average over the target distributions \bar\phi(x) weighted once again by P[\bar\phi(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \approx \frac{1}{2 \ln 2}\, \sqrt{N} \left( \frac{L}{l} \right)^{1/2}\ {\rm bits}.    (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size \sqrt{l/(N Q(x))} (Bialek, Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x. This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
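
The proportionality between the predictive information and the number of adaptive bins can be checked directly. The sketch below is ours (the values of L, l, and N are arbitrary); it counts bins of size \sqrt{l/(N Q(x))} for a uniform Q(x) = 1/L and compares the count with equation 4.79, the ratio being the constant 1/(2 ln 2).

    import numpy as np

    def n_bins_uniform(N, L, l):
        # bins of local size sqrt(l / (N * Q(x))) with Q(x) = 1/L on [0, L]
        return L / np.sqrt(l * L / N)          # = sqrt(N * L / l)

    def s1_bits(N, L, l):
        # equation 4.79: S_1(N) ~ sqrt(N) * sqrt(L / l) / (2 ln 2)
        return np.sqrt(N * L / l) / (2.0 * np.log(2.0))

    L, l = 1.0, 1e-3                           # arbitrary illustrative scales
    for N in (10 ** 3, 10 ** 4, 10 ** 5):
        bins = n_bins_uniform(N, L, l)
        print(f"N={N:>7}  bins ~ {bins:9.1f}   S_1 ~ {s1_bits(N, L, l):9.1f} bits   "
              f"ratio = {s1_bits(N, L, l) / bins:.3f}")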

This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step roughly \sqrt{N} bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_{VC} = n problem. Now, as N grows, the elements with higher capacities n ~ \sqrt{N} are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).

One thing that remains troubling is that the predictive information depends on the arbitrary parameter l describing the scale of smoothness in the distribution. In the original work it was proposed that one should integrate over possible values of l (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing l to vary with x. In this case the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate-invariant approach will lead to a universal coefficient multiplying \sqrt{N} in the analog of equation 4.79, but this remains to be shown.

In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information (all have power-law growth; smoother functions have smaller powers, with less to learn, and higher-dimensional problems have larger powers, with more to learn), but real calculations for these cases remain challenging.

5 Ipred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures (roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an experiment, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.

These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit) but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.12 Rissanen (1984) and Clarke and Barron (1990), in full generality, were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimation with a "universal" prior (Li & Vitányi, 1993).

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.
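
A minimal version of this two-part trade-off can be written down directly. The sketch below is our construction, not Rissanen's actual coding scheme: it scores polynomial fits to noisy data by a description length equal to the bits needed for the residuals under an assumed Gaussian noise model plus (K/2) log_2 N bits for the K parameters; the data set and noise level are illustrative assumptions.

    import numpy as np

    def two_part_mdl(x, y, degree, sigma=0.1):
        # Goodness of fit: bits to encode the residuals under a Gaussian noise model.
        coeffs = np.polyfit(x, y, degree)
        resid = y - np.polyval(coeffs, x)
        fit_bits = np.sum(0.5 * (resid / sigma) ** 2 / np.log(2)
                          + 0.5 * np.log2(2 * np.pi * sigma ** 2))
        # Model cost: (K/2) log2 N bits to state the K parameters to the supported precision.
        K = degree + 1
        return fit_bits + 0.5 * K * np.log2(len(x))

    rng = np.random.default_rng(1)
    x = np.linspace(-1, 1, 200)
    y = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(0, 0.1, x.size)   # true model is quadratic

    for degree in range(8):
        print(f"degree {degree}: description length = {two_part_mdl(x, y, degree):8.1f} bits")

The description length is minimized near the true polynomial order: adding parameters beyond that point buys almost nothing in fit but costs roughly (1/2) log_2 N bits each.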

There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus it is tempting to say that an encoding of a stream x1, x2, ..., xN will require an amount of space equal to the entropy of the joint distribution P(x1, x2, ..., xN). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution, since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x1, x2, ..., xN).

As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.
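
For the simplest one-parameter case, a coin with unknown bias and a uniform prior, the information that N observations provide about the parameter can be computed exactly and compared with (1/2) log_2 N. The short check below is ours (it is not taken from the paper) and uses the fact that, under a uniform prior, the number of heads is uniformly distributed.

    import numpy as np
    from scipy.special import gammaln

    def info_param_data_bits(N):
        # I(theta; x_1..x_N) for a Bernoulli(theta) coin with a uniform prior on theta.
        # E[ln P(x^N | theta)] = -N/2 nats, since E[k | theta] = N*theta and
        # the integral of theta*ln(theta) over [0,1] is -1/4.
        e_log_conditional = -N / 2.0
        # A specific sequence with k heads has marginal probability k!(N-k)!/(N+1)!,
        # and under the joint distribution k is uniform on 0..N.
        k = np.arange(N + 1)
        log_marginal = gammaln(k + 1) + gammaln(N - k + 1) - gammaln(N + 2)
        return (e_log_conditional - np.mean(log_marginal)) / np.log(2.0)

    for N in (10, 100, 1000, 10000):
        print(f"N={N:>6}  I = {info_param_data_bits(N):6.3f} bits   "
              f"(1/2) log2 N = {0.5 * np.log2(N):6.3f} bits")

The difference between the two columns settles to an N-independent constant, which is the K = 1 instance of the (K/2) log_2 N behavior.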

The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that Ipred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well. Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.

The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.

New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space, or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, which is not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.

In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) Ipred(T → ∞) = constant.
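
The construction of causal states is easy to see in a toy example. The sketch below is ours (the two-state Markov chain and its transition matrix are arbitrary choices, not a system analyzed by Crutchfield and coworkers): it groups all histories of length 3 by their conditional distribution over the next symbol; the resulting groups are the causal states, and the entropy of their weights is the statistical complexity.

    import numpy as np
    from itertools import product

    # Transition matrix of a binary Markov chain: T[a, b] = P(next = b | current = a).
    T = np.array([[0.9, 0.1],
                  [0.4, 0.6]])
    # Stationary distribution (left eigenvector of T with eigenvalue 1).
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi /= pi.sum()

    # Group histories of length 3 by their conditional distribution of the next symbol.
    # For an order-1 chain this distribution depends only on the last symbol, so
    # exactly two causal states should emerge.
    states, history_prob = {}, {}
    for h in product((0, 1), repeat=3):
        p_future = tuple(float(v) for v in np.round(T[h[-1]], 10))
        states.setdefault(p_future, []).append(h)
        history_prob[h] = pi[h[0]] * T[h[0], h[1]] * T[h[1], h[2]]

    print(f"number of causal states: {len(states)}")
    complexity = 0.0
    for p_future, histories in states.items():
        weight = sum(history_prob[h] for h in histories)
        complexity -= weight * np.log2(weight)
        print(f"  state with P(next)={p_future}: {len(histories)} histories, weight {weight:.3f}")
    print(f"statistical complexity = {complexity:.3f} bits")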

In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions. He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.

Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.


5.3 A Unique Measure of Complexity. We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.13 There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].

One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986) in particular has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.14 Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.

14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.


Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.

First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;15 it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.
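
The split between the noninvariant extensive part and the invariant mutual information can be seen already for a pair of jointly Gaussian signals, where both quantities are known in closed form. The sketch below is our own illustration (the correlation and the rescaling factors are arbitrary); the differential entropy shifts under a rescaling of x while the mutual information stays fixed.

    import numpy as np

    rho = 0.8                        # correlation of two jointly Gaussian signals
    sigma_x = 1.0

    def differential_entropy_bits(sigma):
        return 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)

    def mutual_information_bits(rho):
        return -0.5 * np.log2(1 - rho ** 2)   # invariant under rescaling of x or y

    for scale in (1.0, 10.0, 1000.0):
        h_x = differential_entropy_bits(scale * sigma_x)
        print(f"x -> {scale:7.1f} x :  h(x) = {h_x:7.3f} bits   I(x;y) = {mutual_information_bits(rho):.3f} bits")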

The fact that the extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.16 So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number \mathcal{N}(N) of walks with N steps, we find that \mathcal{N}(N) \approx A N^{\gamma} z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) → γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],    (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy, or KL divergence, between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right),    (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = xpast as the past data and y = xfuture as the future. When we compress xpast into some reduced description x̂past, we keep a certain amount of information about the past, I(x̂past; xpast), and we also preserve a certain amount of information about the future, I(x̂past; xfuture). There is no single correct compression xpast → x̂past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂past; xfuture) versus I(x̂past; xpast).
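
The optimal curve described here can be traced out explicitly for discrete variables using the information bottleneck method of Tishby, Pereira, and Bialek (1999). The Python sketch below implements the standard self-consistent iteration, with the multiplier beta setting the position along the curve; the joint distribution of "past" and "future" is a small invented toy, and the code is meant only as an illustration of the idea, not as the construction used in the original work.

import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=300, seed=0):
    # self-consistent iteration for the information bottleneck of a discrete joint p(x, y)
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_x = p_xy / p_x[:, None]                          # p(y | x)
    p_t_x = rng.random((p_xy.shape[0], n_clusters))      # random soft assignment p(t | x)
    p_t_x /= p_t_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_t = p_x @ p_t_x                                                # p(t)
        p_y_t = (p_t_x * p_x[:, None]).T @ p_y_x / p_t[:, None]         # p(y | t)
        # KL divergence between p(y | x) and p(y | t) for every pair (x, t)
        d = (p_y_x[:, None, :] * np.log(p_y_x[:, None, :] / p_y_t[None, :, :])).sum(axis=2)
        p_t_x = p_t[None, :] * np.exp(-beta * d)
        p_t_x /= p_t_x.sum(axis=1, keepdims=True)
    return p_t_x

def mutual_information_bits(p_ab):
    pa, pb = p_ab.sum(axis=1), p_ab.sum(axis=0)
    mask = p_ab > 0
    return float((p_ab[mask] * np.log2(p_ab[mask] / np.outer(pa, pb)[mask])).sum())

# toy joint distribution of a four-state "past" and a three-state "future"
p_xy = np.array([[0.20, 0.04, 0.01],
                 [0.18, 0.05, 0.02],
                 [0.02, 0.05, 0.18],
                 [0.01, 0.04, 0.20]])
p_xy /= p_xy.sum()

for beta in (0.5, 2.0, 10.0):
    p_t_x = information_bottleneck(p_xy, n_clusters=2, beta=beta)
    p_xt = p_t_x * p_xy.sum(axis=1)[:, None]    # joint p(x, t)
    p_ty = p_t_x.T @ p_xy                       # joint p(t, y)
    print(f"beta = {beta}: I(t; past) = {mutual_information_bits(p_xt):.3f} bits, "
          f"I(t; future) = {mutual_information_bits(p_ty):.3f} bits")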

The predictive information preserved by compression must be less than the total, so that I(x̂past; xfuture) ≤ Ipred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem) we can ask for the minimal set of x̂past such that I(x̂past; xfuture) = Ipred(T). The entropy of this minimal x̂past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
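
The case where equality is achieved can be checked directly in a toy model (ours, with invented parameters). In the sketch below, the "past" and "future" are two blocks of coin flips whose common bias is drawn from a uniform prior; the number of heads in the past is a sufficient statistic, and the information it carries about the future coincides, up to rounding error, with the full I(xpast; xfuture).

import numpy as np
from math import gamma
from itertools import product

def beta_fn(a, b):
    # Euler Beta function, B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

def mutual_information_bits(p_ab):
    pa, pb = p_ab.sum(axis=1), p_ab.sum(axis=0)
    mask = p_ab > 0
    return float((p_ab[mask] * np.log2(p_ab[mask] / np.outer(pa, pb)[mask])).sum())

N_PAST, N_FUTURE = 5, 5
past = list(product([0, 1], repeat=N_PAST))
future = list(product([0, 1], repeat=N_FUTURE))

# joint distribution of past and future flips when the coin bias has a uniform prior:
# P(x_past, x_future) = integral dtheta theta^k (1 - theta)^(n - k) = B(k + 1, n - k + 1)
n = N_PAST + N_FUTURE
p_joint = np.array([[beta_fn(sum(xp) + sum(xf) + 1, n - sum(xp) - sum(xf) + 1)
                     for xf in future] for xp in past])
p_joint /= p_joint.sum()          # guard against rounding; it is already normalized

# collapse the past onto its sufficient statistic, the number of heads
p_heads_future = np.zeros((N_PAST + 1, len(future)))
for i, xp in enumerate(past):
    p_heads_future[sum(xp)] += p_joint[i]

print(mutual_information_bits(p_joint))         # I(x_past; x_future)
print(mutual_information_bits(p_heads_future))  # I(sufficient statistic; x_future): the same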


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ~ A N^ν and one with K parameters so that Ipred(N) ~ (K/2) log2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}. \qquad (6.1)

In the regime N ≪ Nc, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ Nc, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ Nc is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ~ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ~ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < Nc. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ Nc is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
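
A rough numerical check of this argument is easy to make, although the values of A and ν are not known and are invented here; only K ~ 5 × 10^9 and N ~ 10^9 are taken from the text. The short sketch below evaluates equation 6.1 and also locates the actual crossing of the two Ipred(N) curves; with these assumptions Nc exceeds the available N by many orders of magnitude, consistent with the claim that biology operates at N ≪ Nc.

import numpy as np

def n_crossover(K, A, nu):
    # leading-order estimate of N_c from equation 6.1
    return (K / (2.0 * A * nu) * np.log2(K / (2.0 * A))) ** (1.0 / nu)

K = 5e9              # synapses in ~10 mm^2 of inferotemporal cortex (number quoted in the text)
N_available = 1e9    # rough number of foveations in a decade of waking life (from the text)
A, nu = 100.0, 0.5   # invented values; the argument in the text only requires nu < 1

print(f"estimated N_c ~ {n_crossover(K, A, nu):.2e}, available N ~ {N_available:.0e}")

# locate the actual crossing of A N^nu and (K/2) log2 N, for comparison with the estimate
Ns = np.logspace(10, 24, 4000)
crossing = Ns[np.argmax(A * Ns**nu > 0.5 * K * np.log2(Ns))]
print(f"numerical crossing ~ {crossing:.2e}")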

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the N-dependent conditional entropy (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S1(N) ∝ N^(1/2), and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
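
The slow approach to extensivity can at least be glimpsed with a naive plug-in estimate of the block entropies of a long text, although such estimates are biased downward once the blocks are undersampled, which happens quickly for natural language. The sketch below is ours; the file name is only a placeholder for any long plain-text sample.

import numpy as np
from collections import Counter

def block_entropy_bits(text, n):
    # naive plug-in estimate of the entropy (in bits) of n-letter blocks
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

# any long plain-text file will do; the name is a placeholder
text = open("moby_dick.txt", encoding="utf-8").read().lower()

S = [block_entropy_bits(text, n) for n in range(1, 9)]
for n in range(1, len(S)):
    print(f"block length {n + 1}: S = {S[n]:.3f} bits, "
          f"conditional entropy S(n) - S(n-1) = {S[n] - S[n - 1]:.3f} bits/letter")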

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conven-

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


tional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.

Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.

Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.

Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.

Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

but this is not well dened because we are taking the logarithm of a dimen-sionful quantity Shannon gave the solution to this problem we use as ameasure of information the relative entropy or KL divergence between thedistribution P[x(t)] and some reference distribution Q[x(t)]

Srel D iexclZ

dx(t) P[x(t)] log2

sup3 P[x(t)]Q[x(t)]

acute (52)

16 Throughout this discussion we assume that the signal x at one point in time is nitedimensional There are subtleties if we allow x to represent the conguration of a spatiallyinnite system

2452 W Bialek I Nemenman and N Tishby

which is invariant under changes of our coordinate system on the space ofsignals The cost of this invariance is that we have introduced an arbitrarydistribution Q[x(t)] and so really we have a family of measures We cannd a unique complexity measure within this family by imposing invari-ance principles as above but in this language we must make our measureinvariant to different choices of the reference distribution Q[x(t)]

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, Srel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters; again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = xpast as the past data and y = xfuture as the future. When we compress xpast into some reduced description x̂past, we keep a certain amount of information about the past, I(x̂past; xpast), and we also preserve a certain amount of information about the future, I(x̂past; xfuture). There is no single correct compression xpast → x̂past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂past; xfuture) versus I(x̂past; xpast).

The predictive information preserved by compression must be less than the total, so that I(x̂past; xfuture) ≤ Ipred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data; using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem) we can ask for the minimal set of x̂past such that I(x̂past; xfuture) = Ipred(T). The entropy of this minimal x̂past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
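
For a concrete case where the equality can be achieved, consider a toy binary Markov chain (our own example, with an arbitrarily chosen transition matrix): the most recent symbol is a sufficient statistic for prediction, so compressing an n-symbol past down to its last symbol preserves all of the predictive information. The sketch below checks this by direct enumeration.

```python
# Sketch: for a two-state Markov chain, I(past; future) equals
# I(last symbol of past; future), so the last symbol is a minimal
# sufficient statistic for prediction. The transition matrix is illustrative.
import itertools
import numpy as np

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])                 # P[a, b] = Prob(next = b | current = a)
pi = np.array([0.75, 0.25])                # stationary distribution of P

def seq_prob(seq):
    """Probability of a symbol sequence under the stationary chain."""
    p = pi[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= P[a, b]
    return p

def mutual_info(joint):
    """Mutual information (bits) from a 2D array of joint probabilities."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

n = m = 4                                   # lengths of the past and future windows
pasts   = list(itertools.product([0, 1], repeat=n))
futures = list(itertools.product([0, 1], repeat=m))

# Joint distribution over (past window, future window).
joint_full = np.array([[seq_prob(p + f) for f in futures] for p in pasts])

# Compress the past to its last symbol only.
joint_last = np.zeros((2, len(futures)))
for i, p in enumerate(pasts):
    joint_last[p[-1]] += joint_full[i]

print(f"I(past; future)        = {mutual_info(joint_full):.6f} bits")
print(f"I(last symbol; future) = {mutual_info(joint_last):.6f} bits")
```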

In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ∼ A N^ν and one with K parameters so that Ipred(N) ∼ (K/2) log2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

Nc ∼ [ K/(2Aν) log2( K/(2A) ) ]^(1/ν). (6.1)

In the regime N ≪ Nc, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ Nc, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
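
To make the crossover concrete, the sketch below evaluates equation 6.1 for hypothetical values of K, A, and ν and compares it with the point where the two predictive-information curves actually cross; the numbers are illustrative only, and equation 6.1 is a leading-order estimate, so the two values agree only to order of magnitude.

```python
# Sketch: crossover N_c between a K-parameter model, Ipred ~ (K/2) log2 N,
# and a nonparametric model, Ipred ~ A * N**nu (equation 6.1). K, A, nu are
# illustrative choices, not values taken from the text.
import numpy as np
from scipy.optimize import brentq

K, A, nu = 100, 1.0, 0.5

def I_param(N):    return 0.5 * K * np.log2(N)
def I_nonparam(N): return A * N**nu

# Equation 6.1 estimate of the crossover.
Nc_eq61 = ((K / (2 * A * nu)) * np.log2(K / (2 * A))) ** (1 / nu)

# Numerical crossing point of the two curves (beyond the trivial small-N region).
Nc_cross = brentq(lambda N: I_param(N) - I_nonparam(N), 10, 1e12)

print(f"N_c from equation 6.1  : {Nc_eq61:.3e}")
print(f"N_c from curve crossing: {Nc_cross:.3e}")
```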

It is tempting to suggest that the regime N ≪ Nc is the relevant one for much of biology. If we consider, for example, 10 mm^2 of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < Nc. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ Nc is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
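
A back-of-the-envelope version of this arithmetic (our own, with an assumed 16 waking hours per day and illustrative values of A and ν):

```python
# Sketch: count of foveations over ten years versus N_c from equation 6.1.
# The waking hours per day and the values of A and nu are assumptions.
from math import log2

N = 3 * (16 * 3600) * 365 * 10             # saccades/s * waking s/day * days
K, A, nu = 5e9, 1.0, 0.9                    # K from the text; A, nu illustrative
Nc = ((K / (2 * A * nu)) * log2(K / (2 * A))) ** (1 / nu)

print(f"N  ~ {N:.1e} foveations")           # roughly 6e8, i.e., at most ~10^9
print(f"Nc ~ {Nc:.1e}; N < Nc is {N < Nc}")
```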

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the preceding N letters (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S1(N) ∝ N^(1/2), and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
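
The guessing-game bound is easy to simulate if a statistical model stands in for the human reader. The sketch below is a toy version with assumptions of our own: a character bigram model trained on a placeholder text (far weaker than a native speaker) ranks candidate letters by conditional probability, each letter is turned into the number of guesses needed, and the entropy of that guess-count distribution upper-bounds the conditional entropy of a letter given its (one-letter) context.

```python
# Sketch of Shannon's guessing-game estimator with a bigram model standing in
# for the human reader. `text` is a placeholder; a real experiment would use a
# long corpus and a much stronger predictor.
from collections import Counter, defaultdict
from math import log2

text = "the quick brown fox jumps over the lazy dog " * 200   # placeholder corpus
alphabet = sorted(set(text))

# Train bigram counts for P(next | current) with add-one smoothing.
bigram = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    bigram[a][b] += 1

def ranked_guesses(prev):
    """Candidate letters ordered from most to least probable given `prev`."""
    counts = bigram[prev]
    return sorted(alphabet, key=lambda c: -(counts[c] + 1))

# Turn each letter into the number of guesses the model would need.
guess_counts = Counter()
for a, b in zip(text, text[1:]):
    guess_counts[ranked_guesses(a).index(b) + 1] += 1

# Entropy of the guess-number distribution: an upper bound on the conditional
# entropy of the next letter given the past (here, a single preceding letter).
total = sum(guess_counts.values())
H_bound = -sum((n / total) * log2(n / total) for n in guess_counts.values())
print(f"upper bound on conditional entropy: {H_bound:.3f} bits per letter")
```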

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process; of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/..., where ... is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.

Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.

Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.

Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.

Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

If we continue along these lines we can think about the asymptotic ex-pansion of the entropy at large T The extensive term is the rst term inthis series and we have seen that it must be discarded What about the

15 Here we consider instantaneous transformations of x not ltering or other transfor-mations that mix points at different times

Predictability Complexity and Learning 2451

other terms In the context of learning a parameterized model most of theterms in this series depend in detail on our prior distribution in parameterspace which might seem odd for a measure of complexity More gener-ally if we consider transformations of the data stream x(t) that mix pointswithin a temporal window of size t then for T Agrave t the entropy S(T)may have subextensive terms that are constant and these are not invariantunder this class of transformations On the other hand if there are diver-gent subextensive terms these are invariant under such temporally localtransformations16 So if we insist that measures of complexity be invariantnot only under instantaneous coordinate transformations but also undertemporally local transformations then we can discard both the extensiveand the nite subextensive terms in the entropy leaving only the divergentsubextensive terms as a possible measure of complexity

An interesting example of these arguments is provided by the statisti-cal mechanics of polymers It is conventional to make models of polymersas random walks on a lattice with various interactions or self-avoidanceconstraints among different elements of the polymer chain If we count thenumber N of walks with N steps we nd that N (N) raquo ANc zN (de Gennes1979) Now the entropy is the logarithm of the number of states and so thereis an extensive entropy S0 D log2 z a constant subextensive entropy log2 Aand a divergent subextensive term S1(N) c log2 N Of these three termsonly the divergent subextensive term (related to the critical exponent c ) isuniversal that is independent of the detailed structure of the lattice Thusas in our general argument it is only the divergent subextensive terms inthe entropy that are invariant to changes in our description of the localsmall-scale dynamics

We can recast the invariance arguments in a slightly different form usingthe relative entropy We recall that entropy is dened cleanly only for dis-crete processes and that in the continuum there are ambiguities We wouldlike to write the continuum generalization of the entropy of a process x(t)distributed according to P[x(t)] as

Scont D iexclZ

dx(t) P[x(t)] log2 P[x(t)] (51)

but this is not well dened because we are taking the logarithm of a dimen-sionful quantity Shannon gave the solution to this problem we use as ameasure of information the relative entropy or KL divergence between thedistribution P[x(t)] and some reference distribution Q[x(t)]

Srel D iexclZ

dx(t) P[x(t)] log2

sup3 P[x(t)]Q[x(t)]

acute (52)

16 Throughout this discussion we assume that the signal x at one point in time is nitedimensional There are subtleties if we allow x to represent the conguration of a spatiallyinnite system

2452 W Bialek I Nemenman and N Tishby

which is invariant under changes of our coordinate system on the space ofsignals The cost of this invariance is that we have introduced an arbitrarydistribution Q[x(t)] and so really we have a family of measures We cannd a unique complexity measure within this family by imposing invari-ance principles as above but in this language we must make our measureinvariant to different choices of the reference distribution Q[x(t)]

The reference distribution Q[x(t)] embodies our expectations for the sig-nal x(t) in particular Srel measures the extra space needed to encode signalsdrawn from the distribution P[x(t)] if we use coding strategies that are op-timized for Q[x(t)] If x(t) is a written text two readers who expect differentnumbers of spelling errors will have different Qs but to the extent thatspelling errors can be corrected by reference to the immediate neighboringletters we insist that any measure of complexity be invariant to these dif-ferences in Q On the other hand readers who differ in their expectationsabout the global subject of the text may well disagree about the richness ofa newspaper article This suggests that complexity is a component of therelative entropy that is invariant under some class of local translations andmisspellings

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.
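
A minimal numerical check of this statement (our own illustration, with a two-state Markov chain standing in for a reference distribution built from local operators): the entropy of a window of length T is exactly an extensive term plus a T-independent constant, with no divergent subextensive piece.

```python
import numpy as np

def entropy_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# two-state Markov chain: a stand-in for a reference model built from
# local (short-ranged) operators
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])                      # transition probabilities
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()                              # stationary distribution

h = sum(pi[i] * entropy_bits(P[i]) for i in range(len(pi)))   # entropy rate

def S(T):
    """Exact entropy (bits) of a stationary window of T symbols."""
    return entropy_bits(pi) + (T - 1) * h

for T in (10, 100, 1000, 10000):
    print("T = %5d   S(T) - T*h = %.6f bits" % (T, S(T) - T * h))
# The subextensive part is a T-independent constant: such reference models
# contribute no divergent subextensive term, as claimed in the text.
```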

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information Ipred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ Ipred(T). Generically, no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = Ipred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
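
A toy discrete example (ours, not the construction of Tishby, Pereira, & Bialek, 1999) makes the bookkeeping explicit: compress a four-symbol past into two clusters and compare how much information about the future survives. A clustering that merges predictively equivalent pasts acts as a sufficient statistic and saturates the bound; a clustering that ignores the future does not.

```python
import numpy as np

def mutual_info_bits(joint):
    """I(X;Y) in bits from a joint probability table p[x, y]."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz]))

# toy joint distribution p(past, future): 4 past symbols, 2 future symbols;
# past symbols 0,1 predict the future exactly as well as each other, as do 2,3
p_joint = np.array([[0.20, 0.05],
                    [0.20, 0.05],
                    [0.05, 0.20],
                    [0.05, 0.20]])

def compress(joint, assignment):
    """Deterministic map past -> cluster; returns p(cluster, future)."""
    out = np.zeros((max(assignment) + 1, joint.shape[1]))
    for past, c in enumerate(assignment):
        out[c] += joint[past]
    return out

good = compress(p_joint, [0, 0, 1, 1])   # merges predictively equivalent pasts
bad  = compress(p_joint, [0, 1, 0, 1])   # merges pasts with different futures

print("I(past; future)                = %.4f bits" % mutual_info_bits(p_joint))
print("I(x_hat; future), good mapping = %.4f bits" % mutual_info_bits(good))
print("I(x_hat; future), bad  mapping = %.4f bits" % mutual_info_bits(bad))
# Both mappings compress the 2-bit past into a 1-bit description; only the
# first keeps all of the predictive information (it is a sufficient statistic).
```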

In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ≈ A N^ν and one with K parameters so that Ipred(N) ≈ (K/2) log₂ N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c ≈ [ K/(2Aν) · log₂( K/(2A) ) ]^(1/ν).    (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
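
Equation 6.1 is easy to check numerically (an illustration with arbitrary constants K, A, and ν of our own choosing):

```python
import numpy as np

def Nc_estimate(K, A, nu):
    """Closed-form crossover estimate, equation 6.1."""
    return (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1.0 / nu)

def I_power(N, A, nu):        # power-law (nonparametric) class
    return A * N**nu

def I_param(N, K):            # K-parameter model
    return 0.5 * K * np.log2(N)

K, A, nu = 1.0e4, 1.0, 0.5    # arbitrary illustrative values

# brute-force location of the large-N crossing of the two curves
N = np.logspace(1, 12, 200_000)
cross = N[np.argmin(np.abs(I_power(N, A, nu) - I_param(N, K)))]

print("equation 6.1 estimate : N_c ~ %.2g" % Nc_estimate(K, A, nu))
print("numerical crossing    : N_c ~ %.2g" % cross)
# The estimate reproduces the crossing to within a factor of order one, as
# expected for an asymptotic formula, and for N << N_c the power-law model
# indeed carries the smaller predictive information.
```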

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ≈ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ≈ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
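
The orders of magnitude quoted here follow from elementary arithmetic (the length of the waking day below is our own assumption, not taken from the text):

```python
# rough count of object-recognition "examples" accumulated over development
saccades_per_second = 3
waking_hours_per_day = 16          # an assumption of ours, not from the text
years = 10

N = saccades_per_second * 3600 * waking_hours_per_day * 365 * years
K = 5e9                            # synapses in ~10 mm^2 of inferotemporal cortex

print("N ~ %.1e fixations,  K ~ %.0e synapses" % (N, K))
# N ~ 6e8, i.e. at most ~1e9 examples; the text argues that with nu < 1
# this places the system below the N_c of equation 6.1 (N << N_c).
```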

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the N-th order conditional entropy (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S1(N) ∝ N^(1/2), and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
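
The bookkeeping behind Shannon's strategy can be sketched in a few lines (a toy version for illustration only: a bigram model fit to the same short text plays the role of the native speaker, so the resulting number says nothing about English itself):

```python
from collections import Counter, defaultdict
from math import log2

text = ("the predictive information measures the amount of structure "
        "in a sequence and the structure is what can be learned")   # toy corpus

# a bigram "reader": for each letter, rank candidate next letters
follow = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    follow[a][b] += 1
overall = Counter(text)

def ranked_guesses(prev):
    """Guessing order for the letter after `prev`, most likely first."""
    ranking = [c for c, _ in follow[prev].most_common()]
    ranking += [c for c, _ in overall.most_common() if c not in ranking]
    return ranking

# convert each letter into the number of guesses the reader needs
guesses = Counter()
for a, b in zip(text, text[1:]):
    guesses[ranked_guesses(a).index(b) + 1] += 1

total = sum(guesses.values())
H_guess = -sum((n / total) * log2(n / total) for n in guesses.values())
print("entropy of the guess numbers: %.3f bits per letter" % H_guess)
# This entropy upper-bounds the (here, first-order) conditional entropy;
# Shannon's human readers play the role of the bigram model, at much higher order.
```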

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the U.S.-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.
Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An Introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.


other terms In the context of learning a parameterized model most of theterms in this series depend in detail on our prior distribution in parameterspace which might seem odd for a measure of complexity More gener-ally if we consider transformations of the data stream x(t) that mix pointswithin a temporal window of size t then for T Agrave t the entropy S(T)may have subextensive terms that are constant and these are not invariantunder this class of transformations On the other hand if there are diver-gent subextensive terms these are invariant under such temporally localtransformations16 So if we insist that measures of complexity be invariantnot only under instantaneous coordinate transformations but also undertemporally local transformations then we can discard both the extensiveand the nite subextensive terms in the entropy leaving only the divergentsubextensive terms as a possible measure of complexity

An interesting example of these arguments is provided by the statisti-cal mechanics of polymers It is conventional to make models of polymersas random walks on a lattice with various interactions or self-avoidanceconstraints among different elements of the polymer chain If we count thenumber N of walks with N steps we nd that N (N) raquo ANc zN (de Gennes1979) Now the entropy is the logarithm of the number of states and so thereis an extensive entropy S0 D log2 z a constant subextensive entropy log2 Aand a divergent subextensive term S1(N) c log2 N Of these three termsonly the divergent subextensive term (related to the critical exponent c ) isuniversal that is independent of the detailed structure of the lattice Thusas in our general argument it is only the divergent subextensive terms inthe entropy that are invariant to changes in our description of the localsmall-scale dynamics

We can recast the invariance arguments in a slightly different form usingthe relative entropy We recall that entropy is dened cleanly only for dis-crete processes and that in the continuum there are ambiguities We wouldlike to write the continuum generalization of the entropy of a process x(t)distributed according to P[x(t)] as

Scont D iexclZ

dx(t) P[x(t)] log2 P[x(t)] (51)

but this is not well dened because we are taking the logarithm of a dimen-sionful quantity Shannon gave the solution to this problem we use as ameasure of information the relative entropy or KL divergence between thedistribution P[x(t)] and some reference distribution Q[x(t)]

Srel D iexclZ

dx(t) P[x(t)] log2

sup3 P[x(t)]Q[x(t)]

acute (52)

16 Throughout this discussion we assume that the signal x at one point in time is nitedimensional There are subtleties if we allow x to represent the conguration of a spatiallyinnite system

2452 W Bialek I Nemenman and N Tishby

which is invariant under changes of our coordinate system on the space ofsignals The cost of this invariance is that we have introduced an arbitrarydistribution Q[x(t)] and so really we have a family of measures We cannd a unique complexity measure within this family by imposing invari-ance principles as above but in this language we must make our measureinvariant to different choices of the reference distribution Q[x(t)]

The reference distribution Q[x(t)] embodies our expectations for the sig-nal x(t) in particular Srel measures the extra space needed to encode signalsdrawn from the distribution P[x(t)] if we use coding strategies that are op-timized for Q[x(t)] If x(t) is a written text two readers who expect differentnumbers of spelling errors will have different Qs but to the extent thatspelling errors can be corrected by reference to the immediate neighboringletters we insist that any measure of complexity be invariant to these dif-ferences in Q On the other hand readers who differ in their expectationsabout the global subject of the text may well disagree about the richness ofa newspaper article This suggests that complexity is a component of therelative entropy that is invariant under some class of local translations andmisspellings

Suppose that we leave aside global expectations and construct our ref-erence distribution Q[x(t)] by allowing only for short-ranged interactionsmdashcertain letters tend to followone another letters form words and so onmdashbutwe bound the range over which these rules are applied Models of this classcannot embody the full structure of most interesting time series (includ-ing language) but in the present context we are not asking for this On thecontrary we are looking for a measure that is invariant to differences inthis short-ranged structure In the terminology of eld theory or statisticalmechanics we are constructing our reference distribution Q[x(t)] from localoperators Because we are considering a one-dimensional signal (the one di-mension being time) distributions constructed from local operators cannothave any phase transitions as a function of parameters Again it is importantthat the signal x at one point in time is nite dimensional The absence ofcritical points means that the entropy of these distributions (or their contri-bution to the relative entropy) consists of an extensive term (proportionalto the time window T) plus a constant subextensive term plus terms thatvanish as T becomes large Thus if we choose different reference distribu-tions within the class constructible from local operators we can change theextensive component of the relative entropy and we can change constantsubextensive terms but the divergent subextensive terms are invariant

To summarize the usual constraints on information measures in thecontinuum produce a family of allowable complexity measures the relativeentropy to an arbitrary reference distribution If we insist that all observerswho choose reference distributions constructed from local operators arriveat the same measure of complexity or if we follow the rst line of argumentspresented above then this measure must be the divergent subextensive com-ponent of the entropy or equivalently the predictive information We have

Predictability Complexity and Learning 2453

seen that this component is connected to learning in a straightforward wayquantifying the amount that can be learned about dynamics that generatethe signal and to measures of complexity that have arisen in statistics anddynamical systems theory

6 Discussion

We have presented predictive information as a characterization of variousdata streams In the context of learning predictive information is relateddirectly to generalization More generally the structure or order in a timeseries or a sequence is related almost by denition to the fact that there ispredictability along the sequence The predictive information measures theamount of such structure but does not exhibit the structure in a concreteform Having collected a data stream of duration T what are the features ofthese data that carry the predictive information Ipred(T) From equation 37we know that most of what we have seen over the time T must be irrelevantto the problem of prediction so that the predictive information is a smallfraction of the total information Can we separate these predictive bits fromthe vast amount of nonpredictive data

The problem of separating predictive from nonpredictive information isa special case of the problem discussed recently (Tishby Pereira amp Bialek1999 Bialek amp Tishby in preparation) given some data x how do we com-press our description of x while preserving as much information as pos-sible about some other variable y Here we identify x D xpast as the pastdata and y D xfuture as the future When we compress xpast into some re-duced description Oxpast we keep a certain amount of information aboutthe past I( OxpastI xpast) and we also preserve a certain amount of informa-tion about the future I( OxpastI xfuture) There is no single correct compressionxpast Oxpast instead there is a one-parameter family of strategies that traceout an optimal curve in the plane dened by these two mutual informationsI( OxpastI xfuture) versus I( OxpastI xpast)

The predictive information preserved by compression must be less than the total, so that I(x̂past; xfuture) ≤ Ipred(T). Generically no compression can preserve all of the predictive information, so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data; using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem) we can ask for the minimal set of x̂past such that I(x̂past; xfuture) = Ipred(T). The entropy of this minimal x̂past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
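
For a source with a known sufficient statistic, the equality case is easy to check numerically. In the sketch below (again our own toy construction, not the authors'), the reduced description x̂past keeps only the most recent symbol of the past window of a first-order Markov chain, and I(x̂past; xfuture) comes out essentially equal to I(xpast; xfuture), up to estimation error.

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
n, p_flip, T = 1_000_000, 0.1, 6
x = np.bitwise_xor.accumulate((rng.random(n) < p_flip).astype(int))   # binary Markov chain

def entropy_bits(codes):
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def mutual_information(a, b, width):
    """Plug-in I(a; b), with b taking values in 0 .. 2**width - 1."""
    joint = a.astype(np.int64) * (1 << width) + b
    return entropy_bits(a) + entropy_bits(b) - entropy_bits(joint)

powers = 2 ** np.arange(T)
codes = sliding_window_view(x, T) @ powers     # integer label of each length-T window
past, future = codes[:-T], codes[T:]
last_symbol = (past >> (T - 1)) & 1            # compress the past to its most recent bit

print("I(full past window; future) ~", round(mutual_information(past, future, T), 3))
print("I(last symbol;      future) ~", round(mutual_information(last_symbol, future, T), 3))
# The two numbers agree because the last symbol is a sufficient statistic for a
# first-order Markov chain; any compression that discards it would fall
# strictly below Ipred(T).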


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world – as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992) – the nervous system provides an efficient representation of the predictive information.¹⁷ It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17. If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs; in the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991). Although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,¹⁸ and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

18. As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus we expect that this description applies to a much wider range of biological learning tasks.
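
The Lopes and Oden result has a simple ideal-observer baseline that is easy to compute. In the sketch below (our own illustration; the flip probability and sequence lengths are arbitrary choices, not values from the experiments), the expected log-likelihood ratio between a correlated Markov model and the fair-coin model grows linearly at the Kullback-Leibler rate, which sets how long a sequence is needed before discrimination becomes reliable.

import numpy as np

rng = np.random.default_rng(1)
p_flip = 0.3                         # hypothetical correlation strength (0.5 would be a fair coin)

def h_bits(p):
    """Binary entropy in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def markov_seq(n):
    """Binary sequence whose symbol flips with probability p_flip."""
    return np.bitwise_xor.accumulate((rng.random(n) < p_flip).astype(int))

def log_likelihood_ratio(seq):
    """log2 P(seq | Markov model) - log2 P(seq | fair coin), counting transitions only."""
    flips = np.abs(np.diff(seq))
    n_flip, n_stay = flips.sum(), flips.size - flips.sum()
    return n_flip * np.log2(p_flip) + n_stay * np.log2(1 - p_flip) - flips.size * np.log2(0.5)

print("KL rate:", round(1 - h_bits(p_flip), 3), "bits per symbol")
for n in (10, 30, 100, 300):
    llrs = [log_likelihood_ratio(markov_seq(n)) for _ in range(2000)]
    print(n, "symbols -> mean log-likelihood ratio", round(float(np.mean(llrs)), 2), "bits")
# Once the accumulated log-likelihood ratio reaches a few bits, an observer who
# has learned the correct probabilistic model can reliably tell the correlated
# source from coin flips; the required length scales as 1 / (1 - H(p_flip)).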

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning, in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ∼ A N^ν and one with K parameters so that Ipred(N) ∼ (K/2) log2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \approx \left[ \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}.    (6.1)

In the regime N ≪ Nc, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ Nc, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
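
Equation 6.1 is easy to evaluate numerically. The sketch below (with illustrative values of K, A, and ν chosen by us) computes Nc and compares the two asymptotic forms of Ipred(N) on either side of it; since equation 6.1 keeps only the leading logarithm, the exact crossover sits within a modest factor of Nc.

import numpy as np

def n_critical(K, A, nu):
    """Equation 6.1: crossover scale between the two asymptotic forms."""
    return (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1 / nu)

def ipred_parametric(N, K):
    return 0.5 * K * np.log2(N)          # (K/2) log2 N

def ipred_powerlaw(N, A, nu):
    return A * N ** nu                   # A N^nu

K, A, nu = 100.0, 1.0, 0.5               # illustrative values only
Nc = n_critical(K, A, nu)
print("Nc ~", f"{Nc:.3g}")
for N in (Nc / 10, Nc, 10 * Nc):
    print(f"N = {N:10.3g}   (K/2)log2 N = {ipred_parametric(N, K):8.1f}   A N^nu = {ipred_powerlaw(N, A, nu):8.1f}")
# Well below Nc the power-law (nonparametric) curve lies below the K-parameter
# one; well above Nc the ordering reverses, which is the sense in which the
# nonparametric class is asymptotically more complex.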

It is tempting to suggest that the regime N ≪ Nc is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < Nc. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ Nc is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
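
The arithmetic behind this estimate can be made explicit. The short check below is our own gloss on the numbers quoted in the text (the exponents ν are arbitrary choices with ν < 1): it asks how large the power-law amplitude A could be before the nonparametric form overtakes (K/2) log2 N at the available N.

import numpy as np

K, N = 5e9, 1e9                              # synapse count and example count quoted in the text
parametric_bits = 0.5 * K * np.log2(N)       # (K/2) log2 N
for nu in (0.25, 0.5, 0.75, 0.9):
    A_max = parametric_bits / N ** nu        # largest A with A * N**nu still below (K/2) log2 N
    print(f"nu = {nu:4.2f}:  A N^nu stays below (K/2) log2 N for A up to ~{A_max:9.3g}")
# For small exponents the condition tolerates enormous amplitudes, consistent
# with the claim that biological learning plausibly sits in the N << Nc regime.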

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy at length N (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S1(N) ∝ N^(1/2), and this may even dominate the extensive component for accessible N. Ebeling and Poschel (1994; see also Poschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.¹⁹
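
Shannon's guessing procedure is easy to simulate once the native speaker is replaced by any predictive model. In the sketch below (our own illustration; the n-gram predictor and the toy corpus stand in for human readers and real text, and the predictor is fit on the same text purely for convenience), each letter is converted into the rank at which the model would have guessed it, and the entropy of the rank distribution is the upper bound on the conditional entropy.

from collections import Counter, defaultdict
from math import log2

text = ("the quick brown fox jumps over the lazy dog " * 200).strip()   # toy stand-in for a corpus

def rank_entropy(text, N):
    """Entropy (bits/letter) of Shannon guess ranks produced by an N-gram predictor."""
    context_counts = defaultdict(Counter)
    for i in range(N, len(text)):
        context_counts[text[i - N:i]][text[i]] += 1
    alphabet = sorted(set(text))
    ranks = Counter()
    for i in range(N, len(text)):
        counts = context_counts[text[i - N:i]]
        ordered = sorted(alphabet, key=lambda c: (-counts[c], c))   # the model's guessing order
        ranks[ordered.index(text[i]) + 1] += 1
    total = sum(ranks.values())
    return -sum(c / total * log2(c / total) for c in ranks.values())

for N in (1, 2, 4):
    print(f"N = {N}: rank-entropy bound ~ {rank_entropy(text, N):.3f} bits per letter")
# With real text and human guessers, the slow decrease of this bound with N is
# what Hilberg (1990) reanalyzed as a large subextensive component.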

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

19. Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.
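
The point in the footnote, that a Markov chain forces the two-point mutual information to decay exponentially with separation, is easy to see numerically. The sketch below (our own illustration, using the same kind of binary first-order Markov chain as in the earlier examples) estimates I(x_t; x_{t+d}) as a function of the distance d; the decay is exponential, quite unlike the power laws reported for literary texts.

import numpy as np

rng = np.random.default_rng(2)
n, p_flip = 2_000_000, 0.2
x = np.bitwise_xor.accumulate((rng.random(n) < p_flip).astype(int))   # first-order binary Markov chain

def mi_at_distance(x, d):
    """Plug-in mutual information (bits) between symbols separated by d steps."""
    a, b = x[:-d], x[d:]
    joint = np.array([[np.mean((a == i) & (b == j)) for j in (0, 1)] for i in (0, 1)])
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return (joint[nz] * np.log2(joint[nz] / np.outer(pa, pb)[nz])).sum()

for d in (1, 2, 4, 6, 8):
    print(f"d = {d}:  I = {mi_at_distance(x, d):.2e} bits")
# The values fall roughly in proportion to (1 - 2 * p_flip) ** (2 * d), that is,
# exponentially in d; a power-law decay of this quantity is therefore already
# inconsistent with a Markov description of the source.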

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*/*, where */* is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.
Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Poschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Poschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.


Suppose that we leave aside global expectations and construct our ref-erence distribution Q[x(t)] by allowing only for short-ranged interactionsmdashcertain letters tend to followone another letters form words and so onmdashbutwe bound the range over which these rules are applied Models of this classcannot embody the full structure of most interesting time series (includ-ing language) but in the present context we are not asking for this On thecontrary we are looking for a measure that is invariant to differences inthis short-ranged structure In the terminology of eld theory or statisticalmechanics we are constructing our reference distribution Q[x(t)] from localoperators Because we are considering a one-dimensional signal (the one di-mension being time) distributions constructed from local operators cannothave any phase transitions as a function of parameters Again it is importantthat the signal x at one point in time is nite dimensional The absence ofcritical points means that the entropy of these distributions (or their contri-bution to the relative entropy) consists of an extensive term (proportionalto the time window T) plus a constant subextensive term plus terms thatvanish as T becomes large Thus if we choose different reference distribu-tions within the class constructible from local operators we can change theextensive component of the relative entropy and we can change constantsubextensive terms but the divergent subextensive terms are invariant

To summarize the usual constraints on information measures in thecontinuum produce a family of allowable complexity measures the relativeentropy to an arbitrary reference distribution If we insist that all observerswho choose reference distributions constructed from local operators arriveat the same measure of complexity or if we follow the rst line of argumentspresented above then this measure must be the divergent subextensive com-ponent of the entropy or equivalently the predictive information We have

Predictability Complexity and Learning 2453

seen that this component is connected to learning in a straightforward wayquantifying the amount that can be learned about dynamics that generatethe signal and to measures of complexity that have arisen in statistics anddynamical systems theory

6 Discussion

We have presented predictive information as a characterization of variousdata streams In the context of learning predictive information is relateddirectly to generalization More generally the structure or order in a timeseries or a sequence is related almost by denition to the fact that there ispredictability along the sequence The predictive information measures theamount of such structure but does not exhibit the structure in a concreteform Having collected a data stream of duration T what are the features ofthese data that carry the predictive information Ipred(T) From equation 37we know that most of what we have seen over the time T must be irrelevantto the problem of prediction so that the predictive information is a smallfraction of the total information Can we separate these predictive bits fromthe vast amount of nonpredictive data

The problem of separating predictive from nonpredictive information isa special case of the problem discussed recently (Tishby Pereira amp Bialek1999 Bialek amp Tishby in preparation) given some data x how do we com-press our description of x while preserving as much information as pos-sible about some other variable y Here we identify x D xpast as the pastdata and y D xfuture as the future When we compress xpast into some re-duced description Oxpast we keep a certain amount of information aboutthe past I( OxpastI xpast) and we also preserve a certain amount of informa-tion about the future I( OxpastI xfuture) There is no single correct compressionxpast Oxpast instead there is a one-parameter family of strategies that traceout an optimal curve in the plane dened by these two mutual informationsI( OxpastI xfuture) versus I( OxpastI xpast)

The predictive information preserved by compression must be less thanthe total so that I( OxpastI xfuture) middot Ipred(T) Generically no compression canpreserve all of the predictive information so that the inequality will be strictbut there are interesting special cases where equality can be achieved Ifprediction proceeds by learning a model with a nite number of parameterswe might have a regression formula that species the best estimate of theparameters given the past data Using the regression formula compressesthe data but preserves all of the predictive power In cases like this (moregenerally if there exist sufcient statistics for the prediction problem) wecan ask for the minimal set of Oxpast such that I( OxpastI xfuture) D Ipred(T) Theentropy of this minimal Oxpast is the true measure complexity dened byGrassberger (1986) or the statistical complexity dened by Crutcheld andYoung (1989) (in the framework of the causal states theory a very similarcomment was made recently by Shalizi amp Crutcheld 2000)

2454 W Bialek I Nemenman and N Tishby

In the context of statistical mechanics long-range correlations are charac-terized by computing the correlation functions of order parameters whichare coarse-grained functions of the systemrsquos microscopic variables Whenwe know something about the nature of the orderparameter (eg whether itis a vector or a scalar) then general principles allow a fairly complete classi-cation and description of long-range ordering and the nature of the criticalpoints at which this order can appear or change On the other hand deningthe order parameter itself remains something of an art For a ferromagnetthe order parameter is obtained by local averaging of the microscopic spinswhile for an antiferromagnet one must average the staggered magnetiza-tion to capture the fact that the ordering involves an alternation from site tosite and so on Since the order parameter carries all the information that con-tributes to long-range correlations in space and time it might be possible todene order parameters more generally as those variables that provide themost efcient compression of the predictive information and this should beespecially interesting for complex or disordered systems where the natureof the order is not obvious intuitively a rst try in this direction was madeby Bruder (1998) At critical points the predictive information will divergewith the size of the system and the coefcients of these divergences shouldbe related to the standard scaling dimensions of the order parameters butthe details of this connection need to be worked out

If we compress or extract the predictive information from a time serieswe are in effect discovering ldquofeaturesrdquo that capture the nature of the or-dering in time Learning itself can be seen as an example of this wherewe discover the parameters of an underlying model by trying to compressthe information that one sample of N points provides about the next andin this way we address directly the problem of generalization (Bialek andTishby in preparation) The fact that nonpredictive information is uselessto the organism suggests that one crucial goal of neural information pro-cessing is to separate predictive information from the background Perhapsrather than providing an efcient representation of the current state of theworldmdashas suggested by Attneave (1954) Barlow (1959 1961) and others(Atick 1992)mdashthe nervous system provides an efcient representation ofthe predictive information17 It should be possible to test this directly by

17 If as seems likely the stream of data reaching our senses has diverging predictiveinformation then the space required to write down our description grows and growsas we observe the world for longer periods of time In particular if we can observe fora very long time then the amount that we know about the future will exceed by anarbitrarily large factor the amount that we know about the present Thus representingthe predictive information may require many more neurons than would be required torepresent the current data If we imagine that the goal of primary sensory cortex is torepresent the current state of the sensory world then it is difcult to understand whythese cortices have so many more neurons than they have sensory inputs In the extremecase the region of primary visual cortex devoted to inputs from the fovea has nearly 30000neurons for each photoreceptor cell in the retina (Hawken amp Parker 1991) although much

Predictability Complexity and Learning 2455

studying the encoding of reasonably natural signals and asking if the infor-mation that neural responses provide about the future of the input is closeto the limit set by the statistics of the input itself given that the neuron onlycaptures a certain number of bits about the past Thus we might ask if un-der natural stimulus conditions a motion-sensitive visual neuron capturesfeatures of the motion trajectory that allow for optimal prediction or extrap-olation of that trajectory by using information-theoretic measures we bothtest the ldquoefcient representationrdquo hypothesis directly and avoid arbitraryassumptions about the metric for errors in prediction For more complexsignals such as communication sounds even identifying the features thatcapture the predictive information is an interesting problem

It is natural to ask if these ideas about predictive information could beused to analyze experiments on learning in animals or humans We have em-phasized the problem of learning probability distributions or probabilisticmodels rather than learning deterministic functions associations or rulesIt is known that the nervous system adapts to the statistics of its inputs andthis adaptation is evident in the responses of single neurons (Smirnakis et al1997 Brenner Bialek amp de Ruyter van Steveninck 2000) these experimentsprovide a simple example of the system learning a parameterized distri-bution When making saccadic eye movements human subjects alter theirdistribution of reaction times in relation to the relative probabilities of differ-ent targets as if they had learned an estimate of the relevant likelihoodratios(Carpenter amp Williams 1995) Humans also can learn to discriminate almostoptimally between random sequences (fair coin tosses) and sequences thatare correlated or anticorrelated according to a Markov process this learningcan be accomplished from examples alone with no other feedback (Lopes ampOden 1987) Acquisition of language may require learning the jointdistribu-tion of successive phonemes syllables orwords and there is direct evidencefor learning of conditional probabilities from articial sound sequences byboth infants and adults (Saffran Aslin amp Newport 1996Saffran et al 1999)These examples which are not exhaustive indicate that the nervous systemcan learn an appropriate probabilistic model18 and this offers the oppor-tunity to analyze the dynamics of this learning using information-theoreticmethods What is the entropy of N successive reaction times following aswitch to a new set of relative probabilities in the saccade experiment How

although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series, or learning problems, beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ≈ AN^ν and one with K parameters so that Ipred(N) ≈ (K/2) log₂ N, the infinite parameter model has less predictive information for all N smaller than some critical value,

\[
N_c \approx \left[ \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}. \tag{6.1}
\]

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
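To make the crossover concrete, the following numerical sketch (ours; the values of A, ν, and K are arbitrary illustrations, not estimates from any data) compares the two forms of Ipred(N) and checks the asymptotic estimate of equation 6.1 against the directly computed crossing point; for these values the two agree to within a modest factor, as expected for an asymptotic formula.

```python
# Compare I_pred ~ A N^nu (nonparametric) with (K/2) log2 N (K parameters)
# and locate the crossover, both from equation 6.1 and by brute force.
# A, nu, and K below are illustrative choices, not values from the paper.

import numpy as np

def I_powerlaw(N, A, nu):
    return A * N ** nu

def I_parametric(N, K):
    return 0.5 * K * np.log2(N)

def Nc_estimate(A, nu, K):
    """Asymptotic crossover scale from equation 6.1."""
    return (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1.0 / nu)

A, nu, K = 1.0, 0.5, 100

# Brute-force crossing: the first N where the power law overtakes (K/2) log2 N.
Ns = np.logspace(1, 10, 4000)
crossing = Ns[np.argmax(I_powerlaw(Ns, A, nu) > I_parametric(Ns, K))]

print(f"equation 6.1 estimate: N_c ~ {Nc_estimate(A, nu, K):.2e}")
print(f"direct crossing point: N  ~ {crossing:.2e}")
for N in (1e3, 1e5, 1e7, 1e9):
    print(f"N = {N:.0e}:  A N^nu = {I_powerlaw(N, A, nu):8.0f} bits,"
          f"  (K/2) log2 N = {I_parametric(N, K):6.0f} bits")
```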

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm² of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ≈ 5 × 10⁹. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ≈ 10⁹ examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
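A quick numerical check of this claim is sketched below; the values of A and ν are arbitrary choices added for illustration (the argument only requires ν < 1), and with K ≈ 5 × 10⁹ and N ≈ 10⁹ every case tried lands in the regime N < N_c.

```python
# Back-of-the-envelope check of the biological estimate using equation 6.1.
# The grid of A and nu values is an arbitrary illustration.
import math

K, N = 5e9, 1e9
for A in (1.0, 10.0, 100.0):
    for nu in (0.25, 0.5, 0.9):
        Nc = (K / (2 * A * nu) * math.log2(K / (2 * A))) ** (1 / nu)
        print(f"A = {A:6.1f}, nu = {nu:4.2f}:  N_c ~ {Nc:9.3e}  "
              f"{'N < N_c' if N < Nc else 'N > N_c'}")
```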

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the preceding N letters (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S₀. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S₁(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
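A toy version of Shannon's procedure is easy to simulate and may help make the bound concrete. In the sketch below (our illustration; the corpus, the bigram predictor standing in for the native speaker, and the plug-in entropy estimate are all assumptions), each letter is replaced by the rank at which the predictor would have guessed it, and the entropy of the rank distribution upper-bounds the conditional entropy of the next letter given the context used by the predictor.

```python
# Toy guessing-game bound: a bigram model plays the role of the reader.
from collections import Counter, defaultdict
import math

text = "the quick brown fox jumps over the lazy dog " * 200  # stand-in corpus

# Build bigram counts: how often each character follows a given character.
follow = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    follow[a][b] += 1

# Convert each letter into the rank of that letter in the model's guess list.
ranks = []
for a, b in zip(text, text[1:]):
    guesses = [c for c, _ in follow[a].most_common()]
    ranks.append(guesses.index(b) + 1)

# Entropy (bits per letter) of the distribution of guess ranks.
counts = Counter(ranks)
total = sum(counts.values())
H_ranks = -sum((n / total) * math.log2(n / total) for n in counts.values())

print(f"entropy of guess ranks: {H_ranks:.2f} bits/letter "
      f"(an upper bound on the conditional entropy for this predictor)")
```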

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/, followed by the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.


Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An Introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Rényi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

17 If as seems likely the stream of data reaching our senses has diverging predictiveinformation then the space required to write down our description grows and growsas we observe the world for longer periods of time In particular if we can observe fora very long time then the amount that we know about the future will exceed by anarbitrarily large factor the amount that we know about the present Thus representingthe predictive information may require many more neurons than would be required torepresent the current data If we imagine that the goal of primary sensory cortex is torepresent the current state of the sensory world then it is difcult to understand whythese cortices have so many more neurons than they have sensory inputs In the extremecase the region of primary visual cortex devoted to inputs from the fovea has nearly 30000neurons for each photoreceptor cell in the retina (Hawken amp Parker 1991) although much

Predictability Complexity and Learning 2455

studying the encoding of reasonably natural signals and asking if the infor-mation that neural responses provide about the future of the input is closeto the limit set by the statistics of the input itself given that the neuron onlycaptures a certain number of bits about the past Thus we might ask if un-der natural stimulus conditions a motion-sensitive visual neuron capturesfeatures of the motion trajectory that allow for optimal prediction or extrap-olation of that trajectory by using information-theoretic measures we bothtest the ldquoefcient representationrdquo hypothesis directly and avoid arbitraryassumptions about the metric for errors in prediction For more complexsignals such as communication sounds even identifying the features thatcapture the predictive information is an interesting problem

It is natural to ask if these ideas about predictive information could beused to analyze experiments on learning in animals or humans We have em-phasized the problem of learning probability distributions or probabilisticmodels rather than learning deterministic functions associations or rulesIt is known that the nervous system adapts to the statistics of its inputs andthis adaptation is evident in the responses of single neurons (Smirnakis et al1997 Brenner Bialek amp de Ruyter van Steveninck 2000) these experimentsprovide a simple example of the system learning a parameterized distri-bution When making saccadic eye movements human subjects alter theirdistribution of reaction times in relation to the relative probabilities of differ-ent targets as if they had learned an estimate of the relevant likelihoodratios(Carpenter amp Williams 1995) Humans also can learn to discriminate almostoptimally between random sequences (fair coin tosses) and sequences thatare correlated or anticorrelated according to a Markov process this learningcan be accomplished from examples alone with no other feedback (Lopes ampOden 1987) Acquisition of language may require learning the jointdistribu-tion of successive phonemes syllables orwords and there is direct evidencefor learning of conditional probabilities from articial sound sequences byboth infants and adults (Saffran Aslin amp Newport 1996Saffran et al 1999)These examples which are not exhaustive indicate that the nervous systemcan learn an appropriate probabilistic model18 and this offers the oppor-tunity to analyze the dynamics of this learning using information-theoreticmethods What is the entropy of N successive reaction times following aswitch to a new set of relative probabilities in the saccade experiment How

remains to be learned about these cells it is difcult to imagine that the activity of so manyneurons constitutes an efcient representation of the current sensory inputs But if we livein a world where the predictive information in the movies reaching our retina diverges itis perfectly possible that an efcient representation of the predictive information availableto us at one instant requires thousands of times more space than an efcient representationof the image currently falling on our retina

18 As emphasized above many other learning problems including learning a functionfrom noisy examples can be seen as the learning of a probabilistic model Thus we expectthat this description applies to a much wider range of biological learning tasks

2456 W Bialek I Nemenman and N Tishby

much information does a single reaction time provide about the relevantprobabilities Following the arguments above such analysis could lead toa measurement of the universal learning curve L (N)

The learning curve L (N) exhibited by a human observer is limited bythe predictive information in the time series of stimulus trials itself Com-paring L (N) to this limit denes an efciency of learning in the spirit ofthe discussion by Barlow (1983) While it is known that the nervous systemcan make efcient use of available information in signal processing tasks(cf Chapter 4 of Rieke et al 1997) and that it can represent this informationefciently in the spike trains of individual neurons (cf Chapter 3 of Riekeet al 1997 as well as Berry Warland amp Meister 1997 Strong et al 1998and Reinagel amp Reid 2000) it is not known whether the brain is an efcientlearning machine in the analogous sense Given our classication of learn-ing tasks by their complexity it would be natural to ask if the efciency oflearning were a critical function of task complexity Perhaps we can evenidentify a limitbeyond which efcient learning fails indicating a limit to thecomplexity of the internal model used by the brain during a class of learningtasks We believe that our theoretical discussion here at least frames a clearquestion about the complexity of internal models even if for the present wecan only speculate about the outcome of such experiments

An importantresult of our analysis is the characterization of time series orlearning problems beyond the class of nitely parameterizable models thatis the class with power-law-divergent predictive information Qualitativelythis class is more complex than any parametric model no matter how manyparameters there may be because of the more rapid asymptotic growth ofIpred(N) On the other hand with a nite number of observations N theactual amount of predictive information in such a nonparametric problemmay be smaller than in a model with a large but nite number of parametersSpecically if we have two models one with Ipred(N) raquo ANordm and one withK parameters so that Ipred(N) raquo (K2) log2 N the innite parameter modelhas less predictive information for all N smaller than some critical value

Nc raquomicro K

2Aordmlog2

sup3 K2A

acutepara1ordm (61)

In the regime N iquest Nc it is possible to achieve more efcient predictionby trying to learn the (asymptotically) more complex model as illustratedconcretely in simulations of the density estimation problem (Nemenman ampBialek 2001) Even if there are a nite number of parameters such as thenite number of synapses in a small volume of the brain this number maybe so large that we always have N iquest Nc so that it may be more effectiveto think of the many-parameter model as approximating a continuous ornonparametric one

It is tempting to suggest that the regime N iquest Nc is the relevant one formuch of biology If we consider for example 10 mm2 of inferotemporal cor-

Predictability Complexity and Learning 2457

tex devoted to object recognition(Logothetis amp Sheinberg 1996) the numberof synapses is K raquo 5pound109 On the other hand object recognition depends onfoveation and we move our eyes roughly three times per second through-out perhaps 10 years of waking life during which we master the art of objectrecognition This limits us to at most N raquo 109 examples Remembering thatwe must have ordm lt 1 even with large values of A equation 61 suggests thatwe operate with N lt Nc One can make similar arguments about very dif-ferent brains such as the mushroom bodies in insects (Capaldi Robinson ampFahrbach 1999) If this identication of biological learning with the regimeN iquest Nc is correct then the success of learning in animals must depend onstrategies that implement sensible priors over the space of possible models

There is one clear empirical hint that humans can make effective use ofmodels that are beyond nite parameterization (in the sense that predictiveinformation diverges as a power law) and this comes from studies of lan-guage Long ago Shannon (1951) used the knowledge of native speakersto place bounds on the entropy of written English and his strategy madeexplicit use of predictability Shannon showed N-letter sequences to nativespeakers (readers) asked them to guess the next letter and recorded howmany guesses were required before they got the right answer Thus eachletter in the text is turned into a number and the entropy of the distributionof these numbers is an upper bound on the conditional entropy (N) (cfequation 38) Shannon himself thought that the convergence as N becomeslarge was rather quick and quoted an estimate of the extensive entropy perletter S0 Many years later Hilberg (1990) reanalyzed Shannonrsquos data andfound that the approach to extensivity in fact was very slow certainly thereis evidence fora large component S1(N) N12 and this may even dominatethe extensive component for accessible N Ebeling and Poschel (1994 seealso Poschel Ebeling amp Ros Acirce 1995) studied the statistics of letter sequencesin long texts (such as Moby Dick) and found the same strong subextensivecomponent It would be attractive to repeat Shannonrsquos experiments with adesign that emphasizes the detection of subextensive terms at large N19

In summary we believe that our analysis of predictive information solvesthe problem of measuring the complexity of time series This analysis uni-es ideas from learning theory coding theory dynamical systems and sta-tistical mechanics In particular we have focused attention on a class ofprocesses that are qualitatively more complex than those treated in conven-

19 Associated with the slow approach to extensivity is a large mutual informationbetween words or characters separated by long distances and several groups have foundthat this mutual information declines as a power law Cover and King (1978) criticize suchobservations bynoting that this behavior is impossible in Markovchains of arbitraryorderWhile it is possible that existing mutual information data have not reached asymptotia thecriticism of Cover and King misses the possibility that language is not a Markov processOf course it cannot be Markovian if it has a power-law divergence in the predictiveinformation

2458 W Bialek I Nemenman and N Tishby

tional learning theory and there are several reasons to think that this classincludes many examples of relevance to biology and cognition

Acknowledgments

We thank V Balasubramanian A Bell S Bruder C Callan A FairhallG Garcia de Polavieja Embid R Koberle A Libchaber A MelikidzeA Mikhailov O Motrunich M Opper R Rumiati R de Ruyter van Steven-inck N Slonim T Spencer S Still S Strong and A Treves for many helpfuldiscussions We also thank M Nemenman and R Rubin for help with thenumerical simulations and an anonymous referee for pointing out yet moreopportunities to connect our work with the earlier literature Our collabo-ration was aided in part by a grant from the US-Israel Binational ScienceFoundation to the Hebrew University of Jerusalem and work at Princetonwas supported in part by funds from NEC

References

Where available we give references to the Los Alamos endashprint archive Pa-pers may be retrieved online from the web site httpxxxlanlgovabswhere is the reference for example Adami and Cerf (2000) is found athttpxxxlanlgovabsadap-org9605002For preprints this is a primary ref-erence for published papers there may be differences between the publishedand electronic versions

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.
Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Poschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Poschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.) (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.



In summary we believe that our analysis of predictive information solvesthe problem of measuring the complexity of time series This analysis uni-es ideas from learning theory coding theory dynamical systems and sta-tistical mechanics In particular we have focused attention on a class ofprocesses that are qualitatively more complex than those treated in conven-

19 Associated with the slow approach to extensivity is a large mutual informationbetween words or characters separated by long distances and several groups have foundthat this mutual information declines as a power law Cover and King (1978) criticize suchobservations bynoting that this behavior is impossible in Markovchains of arbitraryorderWhile it is possible that existing mutual information data have not reached asymptotia thecriticism of Cover and King misses the possibility that language is not a Markov processOf course it cannot be Markovian if it has a power-law divergence in the predictiveinformation

2458 W Bialek I Nemenman and N Tishby

tional learning theory and there are several reasons to think that this classincludes many examples of relevance to biology and cognition

Acknowledgments

We thank V Balasubramanian A Bell S Bruder C Callan A FairhallG Garcia de Polavieja Embid R Koberle A Libchaber A MelikidzeA Mikhailov O Motrunich M Opper R Rumiati R de Ruyter van Steven-inck N Slonim T Spencer S Still S Strong and A Treves for many helpfuldiscussions We also thank M Nemenman and R Rubin for help with thenumerical simulations and an anonymous referee for pointing out yet moreopportunities to connect our work with the earlier literature Our collabo-ration was aided in part by a grant from the US-Israel Binational ScienceFoundation to the Hebrew University of Jerusalem and work at Princetonwas supported in part by funds from NEC

References

Where available we give references to the Los Alamos endashprint archive Pa-pers may be retrieved online from the web site httpxxxlanlgovabswhere is the reference for example Adami and Cerf (2000) is found athttpxxxlanlgovabsadap-org9605002For preprints this is a primary ref-erence for published papers there may be differences between the publishedand electronic versions

Abarbanel H D I Brown R Sidorowich J J amp Tsimring L S (1993) Theanalysis of observed chaotic data in physical systems Revs Mod Phys 651331ndash1392

Adami C amp Cerf N J (2000) Physical complexity of symbolic sequencesPhysica D 137 62ndash69 See also adap-org9605002

Aida T (1999) Field theoretical analysis of onndashline learning of probability dis-tributions Phys Rev Lett 83 3554ndash3557 See also cond-mat9911474

Akaike H (1974a) Information theory and an extension of the maximum likeli-hood principle In B Petrov amp F Csaki (Eds) Second International Symposiumof Information Theory Budapest Akademia Kiedo

Akaike H (1974b)A new look at the statistical model identication IEEE TransAutomatic Control 19 716ndash723

Atick J J (1992) Could information theory provide an ecological theory ofsensory processing InW Bialek (Ed)Princeton lecturesonbiophysics(pp223ndash289) Singapore World Scientic

Attneave F (1954)Some informational aspects of visual perception Psych Rev61 183ndash193

Balasubramanian V (1997) Statistical inference Occamrsquos razor and statisticalmechanics on the space of probability distributions Neural Comp 9 349ndash368See also cond-mat9601030

Barlow H B (1959) Sensory mechanisms the reduction of redundancy andintelligence In D V Blake amp A M Uttley (Eds) Proceedings of the Sympo-

Predictability Complexity and Learning 2459

sium on the Mechanization of Thought Processes (Vol 2 pp 537ndash574) LondonH M Stationery Ofce

Barlow H B (1961) Possible principles underlying the transformation of sen-sory messages In W Rosenblith (Ed) Sensory Communication (pp 217ndash234)Cambridge MA MIT Press

Barlow H B (1983) Intelligence guesswork language Nature 304 207ndash209Barron A amp Cover T (1991) Minimum complexity density estimation IEEE

Trans Inf Thy 37 1034ndash1054Bennett C H (1985) Dissipation information computational complexity and

the denition of organization In D Pines (Ed)Emerging syntheses in science(pp 215ndash233) Reading MA AddisonndashWesley

Bennett C H (1990) How to dene complexity in physics and why InW H Zurek (Ed) Complexity entropy and the physics of information (pp 137ndash148) Redwood City CA AddisonndashWesley

Berry II M J Warland D K amp Meister M (1997) The structure and precisionof retinal spike trains Proc Natl Acad Sci (USA) 94 5411ndash5416

Bialek W (1995) Predictive information and the complexity of time series (TechNote) Princeton NJ NEC Research Institute

Bialek W Callan C G amp Strong S P (1996) Field theories for learningprobability distributions Phys Rev Lett 77 4693ndash4697 See also cond-mat9607180

Bialek W amp Tishby N (1999)Predictive information Preprint Available at cond-mat9902341

Bialek W amp Tishby N (in preparation) Extracting relevant informationBrenner N Bialek W amp de Ruyter vanSteveninck R (2000)Adaptive rescaling

optimizes information transmission Neuron 26 695ndash702Bruder S D (1998)Predictiveinformationin spiketrains fromtheblowy and monkey

visual systems Unpublished doctoral dissertation Princeton UniversityBrunel N amp Nadal P (1998) Mutual information Fisher information and

population coding Neural Comp 10 1731ndash1757Capaldi E A Robinson G E amp Fahrbach S E (1999) Neuroethology

of spatial learning The birds and the bees Annu Rev Psychol 50 651ndash682

Carpenter R H S amp Williams M L L (1995) Neural computation of loglikelihood in control of saccadic eye movements Nature 377 59ndash62

Cesa-Bianchi N amp Lugosi G (in press) Worst-case bounds for the logarithmicloss of predictors Machine Learning

Chaitin G J (1975) A theory of program size formally identical to informationtheory J Assoc Comp Mach 22 329ndash340

Clarke B S amp Barron A R (1990) Information-theoretic asymptotics of Bayesmethods IEEE Trans Inf Thy 36 453ndash471

Cover T M amp King R C (1978)A convergent gambling estimate of the entropyof English IEEE Trans Inf Thy 24 413ndash421

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Crutcheld J P amp Feldman D P (1997) Statistical complexity of simple 1-Dspin systems Phys Rev E 55 1239ndash1243R See also cond-mat9702191

2460 W Bialek I Nemenman and N Tishby

Crutcheld J P Feldman D P amp Shalizi C R (2000) Comment I onldquoSimple measure for complexityrdquo Phys Rev E 62 2996ndash2997 See also chao-dyn9907001

Crutcheld J P amp Shalizi C R (1999)Thermodynamic depth of causal statesObjective complexity via minimal representation Phys Rev E 59 275ndash283See also cond-mat9808147

Crutcheld J P amp Young K (1989) Inferring statistical complexity Phys RevLett 63 105ndash108

Eagleman D M amp Sejnowski T J (2000) Motion integration and postdictionin visual awareness Science 287 2036ndash2038

Ebeling W amp Poschel T (1994)Entropy and long-range correlations in literaryEnglish Europhys Lett 26 241ndash246

Feldman D P amp Crutcheld J P (1998) Measures of statistical complexityWhy Phys Lett A 238 244ndash252 See also cond-mat9708186

Gell-Mann M amp Lloyd S (1996) Information measures effective complexityand total information Complexity 2 44ndash52

de Gennes PndashG (1979) Scaling concepts in polymer physics Ithaca NY CornellUniversity Press

Grassberger P (1986) Toward a quantitative theory of self-generated complex-ity Int J Theor Phys 25 907ndash938

Grassberger P (1991) Information and complexity measures in dynamical sys-tems In H Atmanspacher amp H Schreingraber (Eds) Information dynamics(pp 15ndash33) New York Plenum Press

Hall P amp Hannan E (1988)On stochastic complexity and nonparametric den-sity estimation Biometrica 75 705ndash714

Haussler D Kearns M amp Schapire R (1994)Bounds on the sample complex-ity of Bayesian learning using information theory and the VC dimensionMachine Learning 14 83ndash114

Haussler D Kearns M Seung S amp Tishby N (1996)Rigorous learning curvebounds from statistical mechanics Machine Learning 25 195ndash236

Haussler D amp Opper M (1995) General bounds on the mutual informationbetween a parameter and n conditionally independent events In Proceedingsof the Eighth Annual Conference on ComputationalLearning Theory (pp 402ndash411)New York ACM Press

Haussler D amp Opper M (1997) Mutual information metric entropy and cu-mulative relative entropy risk Ann Statist 25 2451ndash2492

Haussler D amp Opper M (1998) Worst case prediction over sequences un-der log loss In G Cybenko D OrsquoLeary amp J Rissanen (Eds) The mathe-matics of information coding extraction and distribution New York Springer-Verlag

Hawken M J amp Parker A J (1991)Spatial receptive eld organization in mon-key V1 and its relationship to the cone mosaic InM S Landy amp J A Movshon(Eds) Computational models of visual processing (pp 83ndash93) Cambridge MAMIT Press

Herschkowitz D amp Nadal JndashP (1999)Unsupervised and supervised learningMutual information between parameters and observations Phys Rev E 593344ndash3360

Predictability Complexity and Learning 2461

Hilberg W (1990) The well-known lower bound of information in written lan-guage Is it a misinterpretation of Shannon experiments Frequenz 44 243ndash248 (in German)

Holy T E (1997) Analysis of data from continuous probability distributionsPhys Rev Lett 79 3545ndash3548 See also physics9706015

Ibragimov I amp Hasminskii R (1972) On an information in a sample about aparameter In Second International Symposium on Information Theory (pp 295ndash309) New York IEEE Press

Kang K amp Sompolinsky H (2001) Mutual information of population codes anddistancemeasures in probabilityspace Preprint Available at cond-mat0101161

Kemeney J G (1953)The use of simplicity in induction Philos Rev 62 391ndash315Kolmogoroff A N (1939) Sur lrsquointerpolation et extrapolations des suites sta-

tionnaires C R Acad Sci Paris 208 2043ndash2045Kolmogorov A N (1941) Interpolation and extrapolation of stationary random

sequences Izv Akad Nauk SSSR Ser Mat 5 3ndash14 (in Russian) Translation inA N Shiryagev (Ed) Selectedworks of A N Kolmogorov (Vol 23 pp 272ndash280)Norwell MA Kluwer

Kolmogorov A N (1965) Three approaches to the quantitative denition ofinformation Prob Inf Trans 1 4ndash7

Li M amp Vit Acircanyi P (1993) An Introduction to Kolmogorov complexity and its appli-cations New York Springer-Verlag

Lloyd S amp Pagels H (1988)Complexity as thermodynamic depth Ann Phys188 186ndash213

Logothetis N K amp Sheinberg D L (1996) Visual object recognition AnnuRev Neurosci 19 577ndash621

Lopes L L amp Oden G C (1987) Distinguishing between random and non-random events J Exp Psych Learning Memory and Cognition 13 392ndash400

Lopez-Ruiz R Mancini H L amp Calbet X (1995) A statistical measure ofcomplexity Phys Lett A 209 321ndash326

MacKay D J C (1992) Bayesian interpolation Neural Comp 4 415ndash447Nemenman I (2000) Informationtheory and learningA physicalapproach Unpub-

lished doctoral dissertation Princeton University See also physics0009032Nemenman I amp Bialek W (2001) Learning continuous distributions Simu-

lations with eld theoretic priors In T K Leen T G Dietterich amp V Tresp(Eds) Advances in neural information processing systems 13 (pp 287ndash295)MITPress Cambridge MA MIT Press See also cond-mat0009165

Opper M (1994) Learning and generalization in a two-layer neural networkThe role of the VapnikndashChervonenkis dimension Phys Rev Lett 72 2113ndash2116

Opper M amp Haussler D (1995) Bounds for predictive errors in the statisticalmechanics of supervised learning Phys Rev Lett 75 3772ndash3775

Periwal V (1997)Reparametrization invariant statistical inference and gravityPhys Rev Lett 78 4671ndash4674 See also hep-th9703135

Periwal V (1998) Geometrical statistical inference Nucl Phys B 554[FS] 719ndash730 See also adap-org9801001

Poschel T Ebeling W amp Ros Acirce H (1995) Guessing probability distributionsfrom small samples J Stat Phys 80 1443ndash1452

2462 W Bialek I Nemenman and N Tishby

Reinagel P amp Reid R C (2000) Temporal coding of visual information in thethalamus J Neurosci 20 5392ndash5400

Renyi A (1964) On the amount of information concerning an unknown pa-rameter in a sequence of observations Publ Math Inst Hungar Acad Sci 9617ndash625

Rieke F Warland D de Ruyter van Steveninck R amp Bialek W (1997) SpikesExploring the Neural Code (MIT Press Cambridge)

Rissanen J (1978) Modeling by shortest data description Automatica 14 465ndash471

Rissanen J (1984) Universal coding information prediction and estimationIEEE Trans Inf Thy 30 629ndash636

Rissanen J (1986) Stochastic complexity and modeling Ann Statist 14 1080ndash1100

Rissanen J (1987)Stochastic complexity J Roy Stat Soc B 49 223ndash239253ndash265Rissanen J (1989) Stochastic complexity and statistical inquiry Singapore World

ScienticRissanen J (1996) Fisher information and stochastic complexity IEEE Trans

Inf Thy 42 40ndash47Rissanen J Speed T amp Yu B (1992) Density estimation by stochastic com-

plexity IEEE Trans Inf Thy 38 315ndash323Saffran J R Aslin R N amp Newport E L (1996) Statistical learning by 8-

month-old infants Science 274 1926ndash1928Saffran J R Johnson E K Aslin R H amp Newport E L (1999) Statistical

learning of tone sequences by human infants and adults Cognition 70 27ndash52

Schmidt D M (2000) Continuous probability distributions from nite dataPhys Rev E 61 1052ndash1055

Seung H S Sompolinsky H amp Tishby N (1992)Statistical mechanics of learn-ing from examples Phys Rev A 45 6056ndash6091

Shalizi C R amp Crutcheld J P (1999) Computational mechanics Pattern andprediction structure and simplicity Preprint Available at cond-mat9907176

Shalizi C R amp Crutcheld J P (2000) Information bottleneck causal states andstatistical relevance bases How to represent relevant information in memorylesstransduction Available at nlinAO0006025

Shannon C E (1948) A mathematical theory of communication Bell Sys TechJ 27 379ndash423 623ndash656 Reprinted in C E Shannon amp W Weaver The mathe-matical theory of communication Urbana University of Illinois Press 1949

Shannon C E (1951) Prediction and entropy of printed English Bell Sys TechJ 30 50ndash64 Reprinted in N J A Sloane amp A D Wyner (Eds) Claude ElwoodShannon Collected papers New York IEEE Press 1993

Shiner J Davison M amp Landsberger P (1999) Simple measure for complexityPhys Rev E 59 1459ndash1464

Smirnakis S Berry III M J Warland D K Bialek W amp Meister M (1997)Adaptation of retinal processing to image contrast and spatial scale Nature386 69ndash73

Sole R V amp Luque B (1999) Statistical measures of complexity for strongly inter-acting systems Preprint Available at adap-org9909002

Predictability Complexity and Learning 2463

Solomonoff R J (1964) A formal theory of inductive inference Inform andControl 7 1ndash22 224ndash254

Strong S P Koberle R de Ruyter van Steveninck R amp Bialek W (1998)Entropy and information in neural spike trains Phys Rev Lett 80 197ndash200

Tishby N Pereira F amp Bialek W (1999) The information bottleneck methodIn B Hajek amp R S Sreenivas (Eds) Proceedings of the 37th Annual AllertonConference on Communication Control and Computing (pp 368ndash377) UrbanaUniversity of Illinois See also physics0004057

Vapnik V (1998) Statistical learning theory New York WileyVit Acircanyi P amp Li M (2000)Minimum description length induction Bayesianism

and Kolmogorov complexity IEEE Trans Inf Thy 46 446ndash464 See alsocsLG9901014

WallaceC Samp Boulton D M (1968)An information measure forclassicationComp J 11 185ndash195

Weigend A Samp Gershenfeld NA eds (1994)Time seriespredictionForecastingthe future and understanding the past Reading MA AddisonndashWesley

Wiener N (1949) Extrapolation interpolation and smoothing of time series NewYork Wiley

Wolfram S (1984) Computation theory of cellular automata Commun MathPhysics 96 15ndash57

Wong W amp Shen X (1995) Probability inequalities for likelihood ratios andconvergence rates of sieve MLErsquos Ann Statist 23 339ndash362

Received October 11 2000 accepted January 31 2001

Page 42: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as

2450 W Bialek I Nemenman and N Tishby

Once we focus on average quantities we can start by adopting Shannonrsquospostulates as constraints on a measure of complexity if there are N equallylikely signals then the measure should be monotonic in N if the signal isdecomposable into statistically independent parts then the measure shouldbe additive with respect to this decomposition and if the signal can bedescribed as a leaf on a tree of statistically independent decisions then themeasure should be a weighted sum of the measures at each branching pointWe believe that these constraints are as plausible for complexity measuresas for information measures and it is well known from Shannonrsquos originalwork that this set of constraints leaves the entropy as the only possibilitySince we are discussing a time-dependent signal this entropy depends onthe duration of our sample S(T) We know of course that this cannot be theend of the discussion because we need to distinguish between randomness(entropy) and complexity The path to this distinction is to introduce otherconstraints on our measure

First, we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x (see footnote 15); it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.

The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.

If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.


other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations (see footnote 16). So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.

An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number N(N) of walks with N steps, we find that N(N) ∼ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) ≈ γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
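A minimal numerical sketch of this decomposition; the values of A, γ, and z below are illustrative stand-ins rather than lattice calculations, and the variable names are ours:

    import numpy as np

    # Illustrative stand-ins; real values would come from the walk statistics
    # of a particular lattice (de Gennes, 1979).
    A, gamma, z = 2.0, 7.0 / 6.0, 4.68

    for N in (10, 100, 1000, 10000):
        extensive = N * np.log2(z)          # S_0 * N: depends on lattice details
        constant = np.log2(A)               # constant subextensive term: nonuniversal
        divergent = gamma * np.log2(N)      # divergent subextensive term: universal
        S_total = extensive + constant + divergent   # log2 of A * N**gamma * z**N
        print(N, extensive, constant, divergent, S_total)

Working in log space keeps the numbers finite even though z^N itself overflows for large N; only the γ log_2 N column is independent of the (invented) lattice parameters z and A.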

We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\rm cont} = -\int dx(t)\, P[x(t)] \log_2 P[x(t)],     (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy, or KL divergence, between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\rm rel} = -\int dx(t)\, P[x(t)] \log_2 \left( \frac{P[x(t)]}{Q[x(t)]} \right),     (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.


which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language we must make our measure invariant to different choices of the reference distribution Q[x(t)].

The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.
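A minimal numerical sketch of this coding interpretation, with a toy alphabet and invented distributions (none of these numbers come from the paper): the extra coding cost sum_x P(x) log_2[P(x)/Q(x)] is small when the reference Q differs from P only slightly and grows as Q's expectations become more wrong.

    import numpy as np

    # Toy four-symbol alphabet; P is the "true" distribution, Q1 and Q2 are two
    # invented reference distributions.
    P  = np.array([0.40, 0.30, 0.20, 0.10])
    Q1 = np.array([0.35, 0.35, 0.20, 0.10])   # close to P ("local" disagreement)
    Q2 = np.array([0.25, 0.25, 0.25, 0.25])   # strongly different expectations

    def extra_coding_cost(p, q):
        # Extra bits per symbol when coding data from p with a code built for q.
        return np.sum(p * np.log2(p / q))

    print(extra_coding_cost(P, Q1))   # small
    print(extra_coding_cost(P, Q2))   # larger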

Suppose that we leave aside global expectations and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters; again, it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.

To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information. We have


seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about the dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related, almost by definition, to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7 we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?

The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead, there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).

The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically, no compression can preserve all of the predictive information, so the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem), we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
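A minimal sketch of these quantities for a toy case, a stationary binary Markov chain with an invented flip probability, where keeping only the most recent symbol of the past is a sufficient statistic: the compressed description retains just 1 bit about the past yet carries all of the one-step predictive information about the next symbol.

    import numpy as np

    p_flip = 0.2  # invented flip probability of the binary Markov chain

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # Compress the past to its last symbol, xhat = x_t (stationary distribution is uniform).
    # Since xhat is a deterministic function of the past, I(xhat; x_past) = H(xhat) = 1 bit.
    I_xhat_past = entropy([0.5, 0.5])

    # Joint distribution of (x_t, x_{t+1}) and the mutual information I(xhat; x_future).
    joint = 0.5 * np.array([[1 - p_flip, p_flip],
                            [p_flip, 1 - p_flip]])
    I_xhat_future = entropy(joint.sum(axis=0)) + entropy(joint.sum(axis=1)) - entropy(joint)

    # Because the chain is Markov, the last symbol is sufficient: I_xhat_future equals
    # the full one-step predictive information, 1 - H(p_flip).
    print(I_xhat_past, I_xhat_future, 1 - entropy([p_flip, 1 - p_flip]))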


In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek & Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information (see footnote 17). It should be possible to test this directly by

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much


studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals, such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model (see footnote 18), and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods. What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How

remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.


much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning, in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997), and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively, this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ∼ A N^ν and one with K parameters, so that I_pred(N) ∼ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value,

N_c \sim \left[ \frac{K}{2A\nu} \log_2\!\left( \frac{K}{2A} \right) \right]^{1/\nu}.     (6.1)

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm^2 of inferotemporal


cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ∼ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ∼ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
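A small numerical sketch of equation 6.1 and of the crossover invoked here; K and N are the estimates quoted above, while A, ν, and the helper names are invented for illustration (the argument only requires ν < 1):

    import numpy as np

    def I_parametric(N, K):
        # Predictive information of a K-parameter model, I_pred(N) ~ (K/2) log2 N.
        return 0.5 * K * np.log2(N)

    def I_nonparametric(N, A, nu):
        # Power-law divergence of a nonparametric model, I_pred(N) ~ A * N**nu.
        return A * N**nu

    def N_c(K, A, nu):
        # Equation 6.1: below this scale the nonparametric model carries less information.
        return (K / (2 * A * nu) * np.log2(K / (2 * A)))**(1.0 / nu)

    K = 5e9           # synapses in ~10 mm^2 of inferotemporal cortex (text estimate)
    N = 1e9           # ~3 fixations/s over ~10 years of waking life (text estimate)
    A, nu = 1.0, 0.5  # invented values

    print(N_c(K, A, nu))                                   # vastly larger than N = 1e9
    print(I_parametric(N, K) > I_nonparametric(N, A, nu))  # True: we are in the N < N_c regime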

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy of the next letter given the preceding N letters (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter, S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N (see footnote 19).
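A minimal sketch of the guessing-game bound, with an invented list of guess counts standing in for real reader data: the entropy of the distribution of guess numbers upper-bounds the conditional entropy of the next letter given the displayed context.

    import numpy as np
    from collections import Counter

    # Invented guess counts: how many guesses a reader needed for each letter,
    # given the preceding N letters of context (real data would come from experiments).
    guesses = [1, 1, 2, 1, 1, 3, 1, 2, 1, 1, 1, 5, 1, 2, 1, 1, 4, 1, 1, 2]

    counts = Counter(guesses)
    probs = np.array([c / len(guesses) for c in counts.values()])

    # Entropy (bits) of the guess-number distribution: an upper bound on the
    # conditional entropy of the next letter given its context.
    bound = -np.sum(probs * np.log2(probs))
    print(bound)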

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.


learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.


Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison–Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison–Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.


Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy, and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction, and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeny, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Rényi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

Page 43: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as

Predictability Complexity and Learning 2451

other terms In the context of learning a parameterized model most of theterms in this series depend in detail on our prior distribution in parameterspace which might seem odd for a measure of complexity More gener-ally if we consider transformations of the data stream x(t) that mix pointswithin a temporal window of size t then for T Agrave t the entropy S(T)may have subextensive terms that are constant and these are not invariantunder this class of transformations On the other hand if there are diver-gent subextensive terms these are invariant under such temporally localtransformations16 So if we insist that measures of complexity be invariantnot only under instantaneous coordinate transformations but also undertemporally local transformations then we can discard both the extensiveand the nite subextensive terms in the entropy leaving only the divergentsubextensive terms as a possible measure of complexity

An interesting example of these arguments is provided by the statisti-cal mechanics of polymers It is conventional to make models of polymersas random walks on a lattice with various interactions or self-avoidanceconstraints among different elements of the polymer chain If we count thenumber N of walks with N steps we nd that N (N) raquo ANc zN (de Gennes1979) Now the entropy is the logarithm of the number of states and so thereis an extensive entropy S0 D log2 z a constant subextensive entropy log2 Aand a divergent subextensive term S1(N) c log2 N Of these three termsonly the divergent subextensive term (related to the critical exponent c ) isuniversal that is independent of the detailed structure of the lattice Thusas in our general argument it is only the divergent subextensive terms inthe entropy that are invariant to changes in our description of the localsmall-scale dynamics

We can recast the invariance arguments in a slightly different form usingthe relative entropy We recall that entropy is dened cleanly only for dis-crete processes and that in the continuum there are ambiguities We wouldlike to write the continuum generalization of the entropy of a process x(t)distributed according to P[x(t)] as

Scont D iexclZ

dx(t) P[x(t)] log2 P[x(t)] (51)

but this is not well dened because we are taking the logarithm of a dimen-sionful quantity Shannon gave the solution to this problem we use as ameasure of information the relative entropy or KL divergence between thedistribution P[x(t)] and some reference distribution Q[x(t)]

Srel D iexclZ

dx(t) P[x(t)] log2

sup3 P[x(t)]Q[x(t)]

acute (52)

16 Throughout this discussion we assume that the signal x at one point in time is nitedimensional There are subtleties if we allow x to represent the conguration of a spatiallyinnite system

2452 W Bialek I Nemenman and N Tishby

which is invariant under changes of our coordinate system on the space ofsignals The cost of this invariance is that we have introduced an arbitrarydistribution Q[x(t)] and so really we have a family of measures We cannd a unique complexity measure within this family by imposing invari-ance principles as above but in this language we must make our measureinvariant to different choices of the reference distribution Q[x(t)]

The reference distribution Q[x(t)] embodies our expectations for the sig-nal x(t) in particular Srel measures the extra space needed to encode signalsdrawn from the distribution P[x(t)] if we use coding strategies that are op-timized for Q[x(t)] If x(t) is a written text two readers who expect differentnumbers of spelling errors will have different Qs but to the extent thatspelling errors can be corrected by reference to the immediate neighboringletters we insist that any measure of complexity be invariant to these dif-ferences in Q On the other hand readers who differ in their expectationsabout the global subject of the text may well disagree about the richness ofa newspaper article This suggests that complexity is a component of therelative entropy that is invariant under some class of local translations andmisspellings

Suppose that we leave aside global expectations and construct our ref-erence distribution Q[x(t)] by allowing only for short-ranged interactionsmdashcertain letters tend to followone another letters form words and so onmdashbutwe bound the range over which these rules are applied Models of this classcannot embody the full structure of most interesting time series (includ-ing language) but in the present context we are not asking for this On thecontrary we are looking for a measure that is invariant to differences inthis short-ranged structure In the terminology of eld theory or statisticalmechanics we are constructing our reference distribution Q[x(t)] from localoperators Because we are considering a one-dimensional signal (the one di-mension being time) distributions constructed from local operators cannothave any phase transitions as a function of parameters Again it is importantthat the signal x at one point in time is nite dimensional The absence ofcritical points means that the entropy of these distributions (or their contri-bution to the relative entropy) consists of an extensive term (proportionalto the time window T) plus a constant subextensive term plus terms thatvanish as T becomes large Thus if we choose different reference distribu-tions within the class constructible from local operators we can change theextensive component of the relative entropy and we can change constantsubextensive terms but the divergent subextensive terms are invariant

To summarize the usual constraints on information measures in thecontinuum produce a family of allowable complexity measures the relativeentropy to an arbitrary reference distribution If we insist that all observerswho choose reference distributions constructed from local operators arriveat the same measure of complexity or if we follow the rst line of argumentspresented above then this measure must be the divergent subextensive com-ponent of the entropy or equivalently the predictive information We have

Predictability Complexity and Learning 2453

seen that this component is connected to learning in a straightforward wayquantifying the amount that can be learned about dynamics that generatethe signal and to measures of complexity that have arisen in statistics anddynamical systems theory

6 Discussion

We have presented predictive information as a characterization of variousdata streams In the context of learning predictive information is relateddirectly to generalization More generally the structure or order in a timeseries or a sequence is related almost by denition to the fact that there ispredictability along the sequence The predictive information measures theamount of such structure but does not exhibit the structure in a concreteform Having collected a data stream of duration T what are the features ofthese data that carry the predictive information Ipred(T) From equation 37we know that most of what we have seen over the time T must be irrelevantto the problem of prediction so that the predictive information is a smallfraction of the total information Can we separate these predictive bits fromthe vast amount of nonpredictive data

The problem of separating predictive from nonpredictive information isa special case of the problem discussed recently (Tishby Pereira amp Bialek1999 Bialek amp Tishby in preparation) given some data x how do we com-press our description of x while preserving as much information as pos-sible about some other variable y Here we identify x D xpast as the pastdata and y D xfuture as the future When we compress xpast into some re-duced description Oxpast we keep a certain amount of information aboutthe past I( OxpastI xpast) and we also preserve a certain amount of informa-tion about the future I( OxpastI xfuture) There is no single correct compressionxpast Oxpast instead there is a one-parameter family of strategies that traceout an optimal curve in the plane dened by these two mutual informationsI( OxpastI xfuture) versus I( OxpastI xpast)

The predictive information preserved by compression must be less thanthe total so that I( OxpastI xfuture) middot Ipred(T) Generically no compression canpreserve all of the predictive information so that the inequality will be strictbut there are interesting special cases where equality can be achieved Ifprediction proceeds by learning a model with a nite number of parameterswe might have a regression formula that species the best estimate of theparameters given the past data Using the regression formula compressesthe data but preserves all of the predictive power In cases like this (moregenerally if there exist sufcient statistics for the prediction problem) wecan ask for the minimal set of Oxpast such that I( OxpastI xfuture) D Ipred(T) Theentropy of this minimal Oxpast is the true measure complexity dened byGrassberger (1986) or the statistical complexity dened by Crutcheld andYoung (1989) (in the framework of the causal states theory a very similarcomment was made recently by Shalizi amp Crutcheld 2000)

2454 W Bialek I Nemenman and N Tishby

In the context of statistical mechanics long-range correlations are charac-terized by computing the correlation functions of order parameters whichare coarse-grained functions of the systemrsquos microscopic variables Whenwe know something about the nature of the orderparameter (eg whether itis a vector or a scalar) then general principles allow a fairly complete classi-cation and description of long-range ordering and the nature of the criticalpoints at which this order can appear or change On the other hand deningthe order parameter itself remains something of an art For a ferromagnetthe order parameter is obtained by local averaging of the microscopic spinswhile for an antiferromagnet one must average the staggered magnetiza-tion to capture the fact that the ordering involves an alternation from site tosite and so on Since the order parameter carries all the information that con-tributes to long-range correlations in space and time it might be possible todene order parameters more generally as those variables that provide themost efcient compression of the predictive information and this should beespecially interesting for complex or disordered systems where the natureof the order is not obvious intuitively a rst try in this direction was madeby Bruder (1998) At critical points the predictive information will divergewith the size of the system and the coefcients of these divergences shouldbe related to the standard scaling dimensions of the order parameters butthe details of this connection need to be worked out

If we compress or extract the predictive information from a time serieswe are in effect discovering ldquofeaturesrdquo that capture the nature of the or-dering in time Learning itself can be seen as an example of this wherewe discover the parameters of an underlying model by trying to compressthe information that one sample of N points provides about the next andin this way we address directly the problem of generalization (Bialek andTishby in preparation) The fact that nonpredictive information is uselessto the organism suggests that one crucial goal of neural information pro-cessing is to separate predictive information from the background Perhapsrather than providing an efcient representation of the current state of theworldmdashas suggested by Attneave (1954) Barlow (1959 1961) and others(Atick 1992)mdashthe nervous system provides an efcient representation ofthe predictive information17 It should be possible to test this directly by

17 If as seems likely the stream of data reaching our senses has diverging predictiveinformation then the space required to write down our description grows and growsas we observe the world for longer periods of time In particular if we can observe fora very long time then the amount that we know about the future will exceed by anarbitrarily large factor the amount that we know about the present Thus representingthe predictive information may require many more neurons than would be required torepresent the current data If we imagine that the goal of primary sensory cortex is torepresent the current state of the sensory world then it is difcult to understand whythese cortices have so many more neurons than they have sensory inputs In the extremecase the region of primary visual cortex devoted to inputs from the fovea has nearly 30000neurons for each photoreceptor cell in the retina (Hawken amp Parker 1991) although much

Predictability Complexity and Learning 2455

studying the encoding of reasonably natural signals and asking if the infor-mation that neural responses provide about the future of the input is closeto the limit set by the statistics of the input itself given that the neuron onlycaptures a certain number of bits about the past Thus we might ask if un-der natural stimulus conditions a motion-sensitive visual neuron capturesfeatures of the motion trajectory that allow for optimal prediction or extrap-olation of that trajectory by using information-theoretic measures we bothtest the ldquoefcient representationrdquo hypothesis directly and avoid arbitraryassumptions about the metric for errors in prediction For more complexsignals such as communication sounds even identifying the features thatcapture the predictive information is an interesting problem

It is natural to ask if these ideas about predictive information could beused to analyze experiments on learning in animals or humans We have em-phasized the problem of learning probability distributions or probabilisticmodels rather than learning deterministic functions associations or rulesIt is known that the nervous system adapts to the statistics of its inputs andthis adaptation is evident in the responses of single neurons (Smirnakis et al1997 Brenner Bialek amp de Ruyter van Steveninck 2000) these experimentsprovide a simple example of the system learning a parameterized distri-bution When making saccadic eye movements human subjects alter theirdistribution of reaction times in relation to the relative probabilities of differ-ent targets as if they had learned an estimate of the relevant likelihoodratios(Carpenter amp Williams 1995) Humans also can learn to discriminate almostoptimally between random sequences (fair coin tosses) and sequences thatare correlated or anticorrelated according to a Markov process this learningcan be accomplished from examples alone with no other feedback (Lopes ampOden 1987) Acquisition of language may require learning the jointdistribu-tion of successive phonemes syllables orwords and there is direct evidencefor learning of conditional probabilities from articial sound sequences byboth infants and adults (Saffran Aslin amp Newport 1996Saffran et al 1999)These examples which are not exhaustive indicate that the nervous systemcan learn an appropriate probabilistic model18 and this offers the oppor-tunity to analyze the dynamics of this learning using information-theoreticmethods What is the entropy of N successive reaction times following aswitch to a new set of relative probabilities in the saccade experiment How

remains to be learned about these cells it is difcult to imagine that the activity of so manyneurons constitutes an efcient representation of the current sensory inputs But if we livein a world where the predictive information in the movies reaching our retina diverges itis perfectly possible that an efcient representation of the predictive information availableto us at one instant requires thousands of times more space than an efcient representationof the image currently falling on our retina

18 As emphasized above many other learning problems including learning a functionfrom noisy examples can be seen as the learning of a probabilistic model Thus we expectthat this description applies to a much wider range of biological learning tasks

2456 W Bialek I Nemenman and N Tishby

much information does a single reaction time provide about the relevantprobabilities Following the arguments above such analysis could lead toa measurement of the universal learning curve L (N)

The learning curve L (N) exhibited by a human observer is limited bythe predictive information in the time series of stimulus trials itself Com-paring L (N) to this limit denes an efciency of learning in the spirit ofthe discussion by Barlow (1983) While it is known that the nervous systemcan make efcient use of available information in signal processing tasks(cf Chapter 4 of Rieke et al 1997) and that it can represent this informationefciently in the spike trains of individual neurons (cf Chapter 3 of Riekeet al 1997 as well as Berry Warland amp Meister 1997 Strong et al 1998and Reinagel amp Reid 2000) it is not known whether the brain is an efcientlearning machine in the analogous sense Given our classication of learn-ing tasks by their complexity it would be natural to ask if the efciency oflearning were a critical function of task complexity Perhaps we can evenidentify a limitbeyond which efcient learning fails indicating a limit to thecomplexity of the internal model used by the brain during a class of learningtasks We believe that our theoretical discussion here at least frames a clearquestion about the complexity of internal models even if for the present wecan only speculate about the outcome of such experiments

An important result of our analysis is the characterization of time series, or learning problems, beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of I_pred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with I_pred(N) ~ A N^ν and one with K parameters so that I_pred(N) ~ (K/2) log_2 N, the infinite parameter model has less predictive information for all N smaller than some critical value

N_c ≈ [ K/(2Aν) · log_2(K/(2A)) ]^{1/ν},   (6.1)

obtained by setting A N_c^ν equal to (K/2) log_2 N_c and solving to leading order.

In the regime N ≪ N_c, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of the density estimation problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ N_c, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.

It is tempting to suggest that the regime N ≪ N_c is the relevant one for much of biology. If we consider, for example, 10 mm^2 of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ~ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ~ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < N_c. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ N_c is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.
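A minimal numerical check of this argument, using equation 6.1 with the K and N estimates quoted above, is sketched below; the values of A and ν are purely illustrative assumptions (the discussion constrains only ν < 1), and the comparison should be read only as an order-of-magnitude statement.

    # Critical sample size N_c (equation 6.1) below which a K-parameter model,
    # with I_pred ~ (K/2) log2 N, has more predictive information than a
    # nonparametric one with I_pred ~ A * N**nu.
    from math import log2

    def n_critical(K, A, nu):
        return ((K / (2 * A * nu)) * log2(K / (2 * A))) ** (1.0 / nu)

    K = 5e9   # synapses in ~10 mm^2 of inferotemporal cortex (estimate quoted above)
    N = 1e9   # rough upper bound on the number of fixations (estimate quoted above)

    for A in (1.0, 10.0, 100.0):
        for nu in (0.5, 0.9):
            Nc = n_critical(K, A, nu)
            print(f"A={A:5.0f}  nu={nu:.1f}  N_c={Nc:9.2e}  N < N_c: {N < Nc}")

For all of these (arbitrary) parameter choices N_c exceeds N, by factors ranging from a few to many orders of magnitude, consistent with the suggestion that biological learning operates in the N ≪ N_c regime.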

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S_0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S_1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rosé, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.^19
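Shannon's guessing-game bound is straightforward to sketch in code. In the fragment below, a character-level model with an N-letter context, trained on a toy in-memory text, stands in for the native speaker; this stand-in (and the fact that it is scored on its own training text) is our simplifying assumption, and it misses exactly the long-range structure that human readers exploit, but it shows how the recorded guess counts are turned into an entropy bound.

    # Sketch of Shannon's (1951) guessing game. A rank-ordered predictor plays
    # the "native speaker": for each position it guesses letters in order of
    # decreasing conditional probability given the preceding N letters, and we
    # record how many guesses were needed. The entropy of the guess-number
    # distribution then upper-bounds the conditional entropy per letter.
    from collections import Counter, defaultdict
    from math import log2

    def guess_count_entropy(text, N=1):
        context_counts = defaultdict(Counter)
        for i in range(len(text) - N):
            context_counts[text[i:i + N]][text[i + N]] += 1

        guess_numbers = Counter()
        total = len(text) - N
        for i in range(len(text) - N):
            context, actual = text[i:i + N], text[i + N]
            ranking = [c for c, _ in context_counts[context].most_common()]
            guess_numbers[ranking.index(actual) + 1] += 1

        # entropy (bits per letter) of the distribution of guess numbers
        return -sum((k / total) * log2(k / total) for k in guess_numbers.values())

    # toy stand-in for a long text; any string will do
    corpus = ("call me ishmael some years ago never mind how long precisely "
              "having little or no money in my purse and nothing particular "
              "to interest me on shore i thought i would sail about a little ") * 3

    for N in (1, 2, 3):
        print(N, round(guess_count_entropy(corpus, N), 3))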

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

^19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations, and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the US-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the web site http://xxx.lanl.gov/abs/*, where * is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers, there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H. M. Stationery Office.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.
Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.
Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry II, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.


17 If as seems likely the stream of data reaching our senses has diverging predictiveinformation then the space required to write down our description grows and growsas we observe the world for longer periods of time In particular if we can observe fora very long time then the amount that we know about the future will exceed by anarbitrarily large factor the amount that we know about the present Thus representingthe predictive information may require many more neurons than would be required torepresent the current data If we imagine that the goal of primary sensory cortex is torepresent the current state of the sensory world then it is difcult to understand whythese cortices have so many more neurons than they have sensory inputs In the extremecase the region of primary visual cortex devoted to inputs from the fovea has nearly 30000neurons for each photoreceptor cell in the retina (Hawken amp Parker 1991) although much

Predictability Complexity and Learning 2455

studying the encoding of reasonably natural signals and asking if the infor-mation that neural responses provide about the future of the input is closeto the limit set by the statistics of the input itself given that the neuron onlycaptures a certain number of bits about the past Thus we might ask if un-der natural stimulus conditions a motion-sensitive visual neuron capturesfeatures of the motion trajectory that allow for optimal prediction or extrap-olation of that trajectory by using information-theoretic measures we bothtest the ldquoefcient representationrdquo hypothesis directly and avoid arbitraryassumptions about the metric for errors in prediction For more complexsignals such as communication sounds even identifying the features thatcapture the predictive information is an interesting problem

It is natural to ask if these ideas about predictive information could beused to analyze experiments on learning in animals or humans We have em-phasized the problem of learning probability distributions or probabilisticmodels rather than learning deterministic functions associations or rulesIt is known that the nervous system adapts to the statistics of its inputs andthis adaptation is evident in the responses of single neurons (Smirnakis et al1997 Brenner Bialek amp de Ruyter van Steveninck 2000) these experimentsprovide a simple example of the system learning a parameterized distri-bution When making saccadic eye movements human subjects alter theirdistribution of reaction times in relation to the relative probabilities of differ-ent targets as if they had learned an estimate of the relevant likelihoodratios(Carpenter amp Williams 1995) Humans also can learn to discriminate almostoptimally between random sequences (fair coin tosses) and sequences thatare correlated or anticorrelated according to a Markov process this learningcan be accomplished from examples alone with no other feedback (Lopes ampOden 1987) Acquisition of language may require learning the jointdistribu-tion of successive phonemes syllables orwords and there is direct evidencefor learning of conditional probabilities from articial sound sequences byboth infants and adults (Saffran Aslin amp Newport 1996Saffran et al 1999)These examples which are not exhaustive indicate that the nervous systemcan learn an appropriate probabilistic model18 and this offers the oppor-tunity to analyze the dynamics of this learning using information-theoreticmethods What is the entropy of N successive reaction times following aswitch to a new set of relative probabilities in the saccade experiment How

remains to be learned about these cells it is difcult to imagine that the activity of so manyneurons constitutes an efcient representation of the current sensory inputs But if we livein a world where the predictive information in the movies reaching our retina diverges itis perfectly possible that an efcient representation of the predictive information availableto us at one instant requires thousands of times more space than an efcient representationof the image currently falling on our retina

18 As emphasized above many other learning problems including learning a functionfrom noisy examples can be seen as the learning of a probabilistic model Thus we expectthat this description applies to a much wider range of biological learning tasks

2456 W Bialek I Nemenman and N Tishby

much information does a single reaction time provide about the relevantprobabilities Following the arguments above such analysis could lead toa measurement of the universal learning curve L (N)

The learning curve L (N) exhibited by a human observer is limited bythe predictive information in the time series of stimulus trials itself Com-paring L (N) to this limit denes an efciency of learning in the spirit ofthe discussion by Barlow (1983) While it is known that the nervous systemcan make efcient use of available information in signal processing tasks(cf Chapter 4 of Rieke et al 1997) and that it can represent this informationefciently in the spike trains of individual neurons (cf Chapter 3 of Riekeet al 1997 as well as Berry Warland amp Meister 1997 Strong et al 1998and Reinagel amp Reid 2000) it is not known whether the brain is an efcientlearning machine in the analogous sense Given our classication of learn-ing tasks by their complexity it would be natural to ask if the efciency oflearning were a critical function of task complexity Perhaps we can evenidentify a limitbeyond which efcient learning fails indicating a limit to thecomplexity of the internal model used by the brain during a class of learningtasks We believe that our theoretical discussion here at least frames a clearquestion about the complexity of internal models even if for the present wecan only speculate about the outcome of such experiments

An importantresult of our analysis is the characterization of time series orlearning problems beyond the class of nitely parameterizable models thatis the class with power-law-divergent predictive information Qualitativelythis class is more complex than any parametric model no matter how manyparameters there may be because of the more rapid asymptotic growth ofIpred(N) On the other hand with a nite number of observations N theactual amount of predictive information in such a nonparametric problemmay be smaller than in a model with a large but nite number of parametersSpecically if we have two models one with Ipred(N) raquo ANordm and one withK parameters so that Ipred(N) raquo (K2) log2 N the innite parameter modelhas less predictive information for all N smaller than some critical value

Nc raquomicro K

2Aordmlog2

sup3 K2A

acutepara1ordm (61)

In the regime N iquest Nc it is possible to achieve more efcient predictionby trying to learn the (asymptotically) more complex model as illustratedconcretely in simulations of the density estimation problem (Nemenman ampBialek 2001) Even if there are a nite number of parameters such as thenite number of synapses in a small volume of the brain this number maybe so large that we always have N iquest Nc so that it may be more effectiveto think of the many-parameter model as approximating a continuous ornonparametric one

It is tempting to suggest that the regime N iquest Nc is the relevant one formuch of biology If we consider for example 10 mm2 of inferotemporal cor-

Predictability Complexity and Learning 2457

tex devoted to object recognition(Logothetis amp Sheinberg 1996) the numberof synapses is K raquo 5pound109 On the other hand object recognition depends onfoveation and we move our eyes roughly three times per second through-out perhaps 10 years of waking life during which we master the art of objectrecognition This limits us to at most N raquo 109 examples Remembering thatwe must have ordm lt 1 even with large values of A equation 61 suggests thatwe operate with N lt Nc One can make similar arguments about very dif-ferent brains such as the mushroom bodies in insects (Capaldi Robinson ampFahrbach 1999) If this identication of biological learning with the regimeN iquest Nc is correct then the success of learning in animals must depend onstrategies that implement sensible priors over the space of possible models

There is one clear empirical hint that humans can make effective use ofmodels that are beyond nite parameterization (in the sense that predictiveinformation diverges as a power law) and this comes from studies of lan-guage Long ago Shannon (1951) used the knowledge of native speakersto place bounds on the entropy of written English and his strategy madeexplicit use of predictability Shannon showed N-letter sequences to nativespeakers (readers) asked them to guess the next letter and recorded howmany guesses were required before they got the right answer Thus eachletter in the text is turned into a number and the entropy of the distributionof these numbers is an upper bound on the conditional entropy (N) (cfequation 38) Shannon himself thought that the convergence as N becomeslarge was rather quick and quoted an estimate of the extensive entropy perletter S0 Many years later Hilberg (1990) reanalyzed Shannonrsquos data andfound that the approach to extensivity in fact was very slow certainly thereis evidence fora large component S1(N) N12 and this may even dominatethe extensive component for accessible N Ebeling and Poschel (1994 seealso Poschel Ebeling amp Ros Acirce 1995) studied the statistics of letter sequencesin long texts (such as Moby Dick) and found the same strong subextensivecomponent It would be attractive to repeat Shannonrsquos experiments with adesign that emphasizes the detection of subextensive terms at large N19

In summary we believe that our analysis of predictive information solvesthe problem of measuring the complexity of time series This analysis uni-es ideas from learning theory coding theory dynamical systems and sta-tistical mechanics In particular we have focused attention on a class ofprocesses that are qualitatively more complex than those treated in conven-

19 Associated with the slow approach to extensivity is a large mutual informationbetween words or characters separated by long distances and several groups have foundthat this mutual information declines as a power law Cover and King (1978) criticize suchobservations bynoting that this behavior is impossible in Markovchains of arbitraryorderWhile it is possible that existing mutual information data have not reached asymptotia thecriticism of Cover and King misses the possibility that language is not a Markov processOf course it cannot be Markovian if it has a power-law divergence in the predictiveinformation

2458 W Bialek I Nemenman and N Tishby

tional learning theory and there are several reasons to think that this classincludes many examples of relevance to biology and cognition

Acknowledgments

We thank V Balasubramanian A Bell S Bruder C Callan A FairhallG Garcia de Polavieja Embid R Koberle A Libchaber A MelikidzeA Mikhailov O Motrunich M Opper R Rumiati R de Ruyter van Steven-inck N Slonim T Spencer S Still S Strong and A Treves for many helpfuldiscussions We also thank M Nemenman and R Rubin for help with thenumerical simulations and an anonymous referee for pointing out yet moreopportunities to connect our work with the earlier literature Our collabo-ration was aided in part by a grant from the US-Israel Binational ScienceFoundation to the Hebrew University of Jerusalem and work at Princetonwas supported in part by funds from NEC

References

Where available we give references to the Los Alamos endashprint archive Pa-pers may be retrieved online from the web site httpxxxlanlgovabswhere is the reference for example Adami and Cerf (2000) is found athttpxxxlanlgovabsadap-org9605002For preprints this is a primary ref-erence for published papers there may be differences between the publishedand electronic versions

Abarbanel H D I Brown R Sidorowich J J amp Tsimring L S (1993) Theanalysis of observed chaotic data in physical systems Revs Mod Phys 651331ndash1392

Adami C amp Cerf N J (2000) Physical complexity of symbolic sequencesPhysica D 137 62ndash69 See also adap-org9605002

Aida T (1999) Field theoretical analysis of onndashline learning of probability dis-tributions Phys Rev Lett 83 3554ndash3557 See also cond-mat9911474

Akaike H (1974a) Information theory and an extension of the maximum likeli-hood principle In B Petrov amp F Csaki (Eds) Second International Symposiumof Information Theory Budapest Akademia Kiedo

Akaike H (1974b)A new look at the statistical model identication IEEE TransAutomatic Control 19 716ndash723

Atick J J (1992) Could information theory provide an ecological theory ofsensory processing InW Bialek (Ed)Princeton lecturesonbiophysics(pp223ndash289) Singapore World Scientic

Attneave F (1954)Some informational aspects of visual perception Psych Rev61 183ndash193

Balasubramanian V (1997) Statistical inference Occamrsquos razor and statisticalmechanics on the space of probability distributions Neural Comp 9 349ndash368See also cond-mat9601030

Barlow H B (1959) Sensory mechanisms the reduction of redundancy andintelligence In D V Blake amp A M Uttley (Eds) Proceedings of the Sympo-

Predictability Complexity and Learning 2459

sium on the Mechanization of Thought Processes (Vol 2 pp 537ndash574) LondonH M Stationery Ofce

Barlow H B (1961) Possible principles underlying the transformation of sen-sory messages In W Rosenblith (Ed) Sensory Communication (pp 217ndash234)Cambridge MA MIT Press

Barlow H B (1983) Intelligence guesswork language Nature 304 207ndash209Barron A amp Cover T (1991) Minimum complexity density estimation IEEE

Trans Inf Thy 37 1034ndash1054Bennett C H (1985) Dissipation information computational complexity and

the denition of organization In D Pines (Ed)Emerging syntheses in science(pp 215ndash233) Reading MA AddisonndashWesley

Bennett C H (1990) How to dene complexity in physics and why InW H Zurek (Ed) Complexity entropy and the physics of information (pp 137ndash148) Redwood City CA AddisonndashWesley

Berry II M J Warland D K amp Meister M (1997) The structure and precisionof retinal spike trains Proc Natl Acad Sci (USA) 94 5411ndash5416

Bialek W (1995) Predictive information and the complexity of time series (TechNote) Princeton NJ NEC Research Institute

Bialek W Callan C G amp Strong S P (1996) Field theories for learningprobability distributions Phys Rev Lett 77 4693ndash4697 See also cond-mat9607180

Bialek W amp Tishby N (1999)Predictive information Preprint Available at cond-mat9902341

Bialek W amp Tishby N (in preparation) Extracting relevant informationBrenner N Bialek W amp de Ruyter vanSteveninck R (2000)Adaptive rescaling

optimizes information transmission Neuron 26 695ndash702Bruder S D (1998)Predictiveinformationin spiketrains fromtheblowy and monkey

visual systems Unpublished doctoral dissertation Princeton UniversityBrunel N amp Nadal P (1998) Mutual information Fisher information and

population coding Neural Comp 10 1731ndash1757Capaldi E A Robinson G E amp Fahrbach S E (1999) Neuroethology

of spatial learning The birds and the bees Annu Rev Psychol 50 651ndash682

Carpenter R H S amp Williams M L L (1995) Neural computation of loglikelihood in control of saccadic eye movements Nature 377 59ndash62

Cesa-Bianchi N amp Lugosi G (in press) Worst-case bounds for the logarithmicloss of predictors Machine Learning

Chaitin G J (1975) A theory of program size formally identical to informationtheory J Assoc Comp Mach 22 329ndash340

Clarke B S amp Barron A R (1990) Information-theoretic asymptotics of Bayesmethods IEEE Trans Inf Thy 36 453ndash471

Cover T M amp King R C (1978)A convergent gambling estimate of the entropyof English IEEE Trans Inf Thy 24 413ndash421

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Crutcheld J P amp Feldman D P (1997) Statistical complexity of simple 1-Dspin systems Phys Rev E 55 1239ndash1243R See also cond-mat9702191

2460 W Bialek I Nemenman and N Tishby

Crutcheld J P Feldman D P amp Shalizi C R (2000) Comment I onldquoSimple measure for complexityrdquo Phys Rev E 62 2996ndash2997 See also chao-dyn9907001

Crutcheld J P amp Shalizi C R (1999)Thermodynamic depth of causal statesObjective complexity via minimal representation Phys Rev E 59 275ndash283See also cond-mat9808147

Crutcheld J P amp Young K (1989) Inferring statistical complexity Phys RevLett 63 105ndash108

Eagleman D M amp Sejnowski T J (2000) Motion integration and postdictionin visual awareness Science 287 2036ndash2038

Ebeling W amp Poschel T (1994)Entropy and long-range correlations in literaryEnglish Europhys Lett 26 241ndash246

Feldman D P amp Crutcheld J P (1998) Measures of statistical complexityWhy Phys Lett A 238 244ndash252 See also cond-mat9708186

Gell-Mann M amp Lloyd S (1996) Information measures effective complexityand total information Complexity 2 44ndash52

de Gennes PndashG (1979) Scaling concepts in polymer physics Ithaca NY CornellUniversity Press

Grassberger P (1986) Toward a quantitative theory of self-generated complex-ity Int J Theor Phys 25 907ndash938

Grassberger P (1991) Information and complexity measures in dynamical sys-tems In H Atmanspacher amp H Schreingraber (Eds) Information dynamics(pp 15ndash33) New York Plenum Press

Hall P amp Hannan E (1988)On stochastic complexity and nonparametric den-sity estimation Biometrica 75 705ndash714

Haussler D Kearns M amp Schapire R (1994)Bounds on the sample complex-ity of Bayesian learning using information theory and the VC dimensionMachine Learning 14 83ndash114

Haussler D Kearns M Seung S amp Tishby N (1996)Rigorous learning curvebounds from statistical mechanics Machine Learning 25 195ndash236

Haussler D amp Opper M (1995) General bounds on the mutual informationbetween a parameter and n conditionally independent events In Proceedingsof the Eighth Annual Conference on ComputationalLearning Theory (pp 402ndash411)New York ACM Press

Haussler D amp Opper M (1997) Mutual information metric entropy and cu-mulative relative entropy risk Ann Statist 25 2451ndash2492

Haussler D amp Opper M (1998) Worst case prediction over sequences un-der log loss In G Cybenko D OrsquoLeary amp J Rissanen (Eds) The mathe-matics of information coding extraction and distribution New York Springer-Verlag

Hawken M J amp Parker A J (1991)Spatial receptive eld organization in mon-key V1 and its relationship to the cone mosaic InM S Landy amp J A Movshon(Eds) Computational models of visual processing (pp 83ndash93) Cambridge MAMIT Press

Herschkowitz D amp Nadal JndashP (1999)Unsupervised and supervised learningMutual information between parameters and observations Phys Rev E 593344ndash3360

Predictability Complexity and Learning 2461

Hilberg W (1990) The well-known lower bound of information in written lan-guage Is it a misinterpretation of Shannon experiments Frequenz 44 243ndash248 (in German)

Holy T E (1997) Analysis of data from continuous probability distributionsPhys Rev Lett 79 3545ndash3548 See also physics9706015

Ibragimov I amp Hasminskii R (1972) On an information in a sample about aparameter In Second International Symposium on Information Theory (pp 295ndash309) New York IEEE Press

Kang K amp Sompolinsky H (2001) Mutual information of population codes anddistancemeasures in probabilityspace Preprint Available at cond-mat0101161

Kemeney J G (1953)The use of simplicity in induction Philos Rev 62 391ndash315Kolmogoroff A N (1939) Sur lrsquointerpolation et extrapolations des suites sta-

tionnaires C R Acad Sci Paris 208 2043ndash2045Kolmogorov A N (1941) Interpolation and extrapolation of stationary random

sequences Izv Akad Nauk SSSR Ser Mat 5 3ndash14 (in Russian) Translation inA N Shiryagev (Ed) Selectedworks of A N Kolmogorov (Vol 23 pp 272ndash280)Norwell MA Kluwer

Kolmogorov A N (1965) Three approaches to the quantitative denition ofinformation Prob Inf Trans 1 4ndash7

Li M amp Vit Acircanyi P (1993) An Introduction to Kolmogorov complexity and its appli-cations New York Springer-Verlag

Lloyd S amp Pagels H (1988)Complexity as thermodynamic depth Ann Phys188 186ndash213

Logothetis N K amp Sheinberg D L (1996) Visual object recognition AnnuRev Neurosci 19 577ndash621

Lopes L L amp Oden G C (1987) Distinguishing between random and non-random events J Exp Psych Learning Memory and Cognition 13 392ndash400

Lopez-Ruiz R Mancini H L amp Calbet X (1995) A statistical measure ofcomplexity Phys Lett A 209 321ndash326

MacKay D J C (1992) Bayesian interpolation Neural Comp 4 415ndash447Nemenman I (2000) Informationtheory and learningA physicalapproach Unpub-

lished doctoral dissertation Princeton University See also physics0009032Nemenman I amp Bialek W (2001) Learning continuous distributions Simu-

lations with eld theoretic priors In T K Leen T G Dietterich amp V Tresp(Eds) Advances in neural information processing systems 13 (pp 287ndash295)MITPress Cambridge MA MIT Press See also cond-mat0009165

Opper M (1994) Learning and generalization in a two-layer neural networkThe role of the VapnikndashChervonenkis dimension Phys Rev Lett 72 2113ndash2116

Opper M amp Haussler D (1995) Bounds for predictive errors in the statisticalmechanics of supervised learning Phys Rev Lett 75 3772ndash3775

Periwal V (1997)Reparametrization invariant statistical inference and gravityPhys Rev Lett 78 4671ndash4674 See also hep-th9703135

Periwal V (1998) Geometrical statistical inference Nucl Phys B 554[FS] 719ndash730 See also adap-org9801001

Poschel T Ebeling W amp Ros Acirce H (1995) Guessing probability distributionsfrom small samples J Stat Phys 80 1443ndash1452

2462 W Bialek I Nemenman and N Tishby

Reinagel P amp Reid R C (2000) Temporal coding of visual information in thethalamus J Neurosci 20 5392ndash5400

Renyi A (1964) On the amount of information concerning an unknown pa-rameter in a sequence of observations Publ Math Inst Hungar Acad Sci 9617ndash625

Rieke F Warland D de Ruyter van Steveninck R amp Bialek W (1997) SpikesExploring the Neural Code (MIT Press Cambridge)

Rissanen J (1978) Modeling by shortest data description Automatica 14 465ndash471

Rissanen J (1984) Universal coding information prediction and estimationIEEE Trans Inf Thy 30 629ndash636

Rissanen J (1986) Stochastic complexity and modeling Ann Statist 14 1080ndash1100

Rissanen J (1987)Stochastic complexity J Roy Stat Soc B 49 223ndash239253ndash265Rissanen J (1989) Stochastic complexity and statistical inquiry Singapore World

ScienticRissanen J (1996) Fisher information and stochastic complexity IEEE Trans

Inf Thy 42 40ndash47Rissanen J Speed T amp Yu B (1992) Density estimation by stochastic com-

plexity IEEE Trans Inf Thy 38 315ndash323Saffran J R Aslin R N amp Newport E L (1996) Statistical learning by 8-

month-old infants Science 274 1926ndash1928Saffran J R Johnson E K Aslin R H amp Newport E L (1999) Statistical

learning of tone sequences by human infants and adults Cognition 70 27ndash52

Schmidt D M (2000) Continuous probability distributions from nite dataPhys Rev E 61 1052ndash1055

Seung H S Sompolinsky H amp Tishby N (1992)Statistical mechanics of learn-ing from examples Phys Rev A 45 6056ndash6091

Shalizi C R amp Crutcheld J P (1999) Computational mechanics Pattern andprediction structure and simplicity Preprint Available at cond-mat9907176

Shalizi C R amp Crutcheld J P (2000) Information bottleneck causal states andstatistical relevance bases How to represent relevant information in memorylesstransduction Available at nlinAO0006025

Shannon C E (1948) A mathematical theory of communication Bell Sys TechJ 27 379ndash423 623ndash656 Reprinted in C E Shannon amp W Weaver The mathe-matical theory of communication Urbana University of Illinois Press 1949

Shannon C E (1951) Prediction and entropy of printed English Bell Sys TechJ 30 50ndash64 Reprinted in N J A Sloane amp A D Wyner (Eds) Claude ElwoodShannon Collected papers New York IEEE Press 1993

Shiner J Davison M amp Landsberger P (1999) Simple measure for complexityPhys Rev E 59 1459ndash1464

Smirnakis S Berry III M J Warland D K Bialek W amp Meister M (1997)Adaptation of retinal processing to image contrast and spatial scale Nature386 69ndash73

Sole R V amp Luque B (1999) Statistical measures of complexity for strongly inter-acting systems Preprint Available at adap-org9909002

Predictability Complexity and Learning 2463

Solomonoff R J (1964) A formal theory of inductive inference Inform andControl 7 1ndash22 224ndash254

Strong S P Koberle R de Ruyter van Steveninck R amp Bialek W (1998)Entropy and information in neural spike trains Phys Rev Lett 80 197ndash200

Tishby N Pereira F amp Bialek W (1999) The information bottleneck methodIn B Hajek amp R S Sreenivas (Eds) Proceedings of the 37th Annual AllertonConference on Communication Control and Computing (pp 368ndash377) UrbanaUniversity of Illinois See also physics0004057

Vapnik V (1998) Statistical learning theory New York WileyVit Acircanyi P amp Li M (2000)Minimum description length induction Bayesianism

and Kolmogorov complexity IEEE Trans Inf Thy 46 446ndash464 See alsocsLG9901014

WallaceC Samp Boulton D M (1968)An information measure forclassicationComp J 11 185ndash195

Weigend A Samp Gershenfeld NA eds (1994)Time seriespredictionForecastingthe future and understanding the past Reading MA AddisonndashWesley

Wiener N (1949) Extrapolation interpolation and smoothing of time series NewYork Wiley

Wolfram S (1984) Computation theory of cellular automata Commun MathPhysics 96 15ndash57

Wong W amp Shen X (1995) Probability inequalities for likelihood ratios andconvergence rates of sieve MLErsquos Ann Statist 23 339ndash362

Received October 11 2000 accepted January 31 2001

Page 46: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as

2454 W Bialek I Nemenman and N Tishby

In the context of statistical mechanics long-range correlations are charac-terized by computing the correlation functions of order parameters whichare coarse-grained functions of the systemrsquos microscopic variables Whenwe know something about the nature of the orderparameter (eg whether itis a vector or a scalar) then general principles allow a fairly complete classi-cation and description of long-range ordering and the nature of the criticalpoints at which this order can appear or change On the other hand deningthe order parameter itself remains something of an art For a ferromagnetthe order parameter is obtained by local averaging of the microscopic spinswhile for an antiferromagnet one must average the staggered magnetiza-tion to capture the fact that the ordering involves an alternation from site tosite and so on Since the order parameter carries all the information that con-tributes to long-range correlations in space and time it might be possible todene order parameters more generally as those variables that provide themost efcient compression of the predictive information and this should beespecially interesting for complex or disordered systems where the natureof the order is not obvious intuitively a rst try in this direction was madeby Bruder (1998) At critical points the predictive information will divergewith the size of the system and the coefcients of these divergences shouldbe related to the standard scaling dimensions of the order parameters butthe details of this connection need to be worked out

If we compress or extract the predictive information from a time serieswe are in effect discovering ldquofeaturesrdquo that capture the nature of the or-dering in time Learning itself can be seen as an example of this wherewe discover the parameters of an underlying model by trying to compressthe information that one sample of N points provides about the next andin this way we address directly the problem of generalization (Bialek andTishby in preparation) The fact that nonpredictive information is uselessto the organism suggests that one crucial goal of neural information pro-cessing is to separate predictive information from the background Perhapsrather than providing an efcient representation of the current state of theworldmdashas suggested by Attneave (1954) Barlow (1959 1961) and others(Atick 1992)mdashthe nervous system provides an efcient representation ofthe predictive information17 It should be possible to test this directly by

17 If as seems likely the stream of data reaching our senses has diverging predictiveinformation then the space required to write down our description grows and growsas we observe the world for longer periods of time In particular if we can observe fora very long time then the amount that we know about the future will exceed by anarbitrarily large factor the amount that we know about the present Thus representingthe predictive information may require many more neurons than would be required torepresent the current data If we imagine that the goal of primary sensory cortex is torepresent the current state of the sensory world then it is difcult to understand whythese cortices have so many more neurons than they have sensory inputs In the extremecase the region of primary visual cortex devoted to inputs from the fovea has nearly 30000neurons for each photoreceptor cell in the retina (Hawken amp Parker 1991) although much

Predictability Complexity and Learning 2455

studying the encoding of reasonably natural signals and asking if the infor-mation that neural responses provide about the future of the input is closeto the limit set by the statistics of the input itself given that the neuron onlycaptures a certain number of bits about the past Thus we might ask if un-der natural stimulus conditions a motion-sensitive visual neuron capturesfeatures of the motion trajectory that allow for optimal prediction or extrap-olation of that trajectory by using information-theoretic measures we bothtest the ldquoefcient representationrdquo hypothesis directly and avoid arbitraryassumptions about the metric for errors in prediction For more complexsignals such as communication sounds even identifying the features thatcapture the predictive information is an interesting problem

It is natural to ask if these ideas about predictive information could beused to analyze experiments on learning in animals or humans We have em-phasized the problem of learning probability distributions or probabilisticmodels rather than learning deterministic functions associations or rulesIt is known that the nervous system adapts to the statistics of its inputs andthis adaptation is evident in the responses of single neurons (Smirnakis et al1997 Brenner Bialek amp de Ruyter van Steveninck 2000) these experimentsprovide a simple example of the system learning a parameterized distri-bution When making saccadic eye movements human subjects alter theirdistribution of reaction times in relation to the relative probabilities of differ-ent targets as if they had learned an estimate of the relevant likelihoodratios(Carpenter amp Williams 1995) Humans also can learn to discriminate almostoptimally between random sequences (fair coin tosses) and sequences thatare correlated or anticorrelated according to a Markov process this learningcan be accomplished from examples alone with no other feedback (Lopes ampOden 1987) Acquisition of language may require learning the jointdistribu-tion of successive phonemes syllables orwords and there is direct evidencefor learning of conditional probabilities from articial sound sequences byboth infants and adults (Saffran Aslin amp Newport 1996Saffran et al 1999)These examples which are not exhaustive indicate that the nervous systemcan learn an appropriate probabilistic model18 and this offers the oppor-tunity to analyze the dynamics of this learning using information-theoreticmethods What is the entropy of N successive reaction times following aswitch to a new set of relative probabilities in the saccade experiment How

remains to be learned about these cells it is difcult to imagine that the activity of so manyneurons constitutes an efcient representation of the current sensory inputs But if we livein a world where the predictive information in the movies reaching our retina diverges itis perfectly possible that an efcient representation of the predictive information availableto us at one instant requires thousands of times more space than an efcient representationof the image currently falling on our retina

18 As emphasized above many other learning problems including learning a functionfrom noisy examples can be seen as the learning of a probabilistic model Thus we expectthat this description applies to a much wider range of biological learning tasks

2456 W Bialek I Nemenman and N Tishby

much information does a single reaction time provide about the relevantprobabilities Following the arguments above such analysis could lead toa measurement of the universal learning curve L (N)

The learning curve L (N) exhibited by a human observer is limited bythe predictive information in the time series of stimulus trials itself Com-paring L (N) to this limit denes an efciency of learning in the spirit ofthe discussion by Barlow (1983) While it is known that the nervous systemcan make efcient use of available information in signal processing tasks(cf Chapter 4 of Rieke et al 1997) and that it can represent this informationefciently in the spike trains of individual neurons (cf Chapter 3 of Riekeet al 1997 as well as Berry Warland amp Meister 1997 Strong et al 1998and Reinagel amp Reid 2000) it is not known whether the brain is an efcientlearning machine in the analogous sense Given our classication of learn-ing tasks by their complexity it would be natural to ask if the efciency oflearning were a critical function of task complexity Perhaps we can evenidentify a limitbeyond which efcient learning fails indicating a limit to thecomplexity of the internal model used by the brain during a class of learningtasks We believe that our theoretical discussion here at least frames a clearquestion about the complexity of internal models even if for the present wecan only speculate about the outcome of such experiments

An importantresult of our analysis is the characterization of time series orlearning problems beyond the class of nitely parameterizable models thatis the class with power-law-divergent predictive information Qualitativelythis class is more complex than any parametric model no matter how manyparameters there may be because of the more rapid asymptotic growth ofIpred(N) On the other hand with a nite number of observations N theactual amount of predictive information in such a nonparametric problemmay be smaller than in a model with a large but nite number of parametersSpecically if we have two models one with Ipred(N) raquo ANordm and one withK parameters so that Ipred(N) raquo (K2) log2 N the innite parameter modelhas less predictive information for all N smaller than some critical value

Nc raquomicro K

2Aordmlog2

sup3 K2A

acutepara1ordm (61)

In the regime N iquest Nc it is possible to achieve more efcient predictionby trying to learn the (asymptotically) more complex model as illustratedconcretely in simulations of the density estimation problem (Nemenman ampBialek 2001) Even if there are a nite number of parameters such as thenite number of synapses in a small volume of the brain this number maybe so large that we always have N iquest Nc so that it may be more effectiveto think of the many-parameter model as approximating a continuous ornonparametric one

It is tempting to suggest that the regime N iquest Nc is the relevant one formuch of biology If we consider for example 10 mm2 of inferotemporal cor-

Predictability Complexity and Learning 2457

tex devoted to object recognition(Logothetis amp Sheinberg 1996) the numberof synapses is K raquo 5pound109 On the other hand object recognition depends onfoveation and we move our eyes roughly three times per second through-out perhaps 10 years of waking life during which we master the art of objectrecognition This limits us to at most N raquo 109 examples Remembering thatwe must have ordm lt 1 even with large values of A equation 61 suggests thatwe operate with N lt Nc One can make similar arguments about very dif-ferent brains such as the mushroom bodies in insects (Capaldi Robinson ampFahrbach 1999) If this identication of biological learning with the regimeN iquest Nc is correct then the success of learning in animals must depend onstrategies that implement sensible priors over the space of possible models

There is one clear empirical hint that humans can make effective use ofmodels that are beyond nite parameterization (in the sense that predictiveinformation diverges as a power law) and this comes from studies of lan-guage Long ago Shannon (1951) used the knowledge of native speakersto place bounds on the entropy of written English and his strategy madeexplicit use of predictability Shannon showed N-letter sequences to nativespeakers (readers) asked them to guess the next letter and recorded howmany guesses were required before they got the right answer Thus eachletter in the text is turned into a number and the entropy of the distributionof these numbers is an upper bound on the conditional entropy (N) (cfequation 38) Shannon himself thought that the convergence as N becomeslarge was rather quick and quoted an estimate of the extensive entropy perletter S0 Many years later Hilberg (1990) reanalyzed Shannonrsquos data andfound that the approach to extensivity in fact was very slow certainly thereis evidence fora large component S1(N) N12 and this may even dominatethe extensive component for accessible N Ebeling and Poschel (1994 seealso Poschel Ebeling amp Ros Acirce 1995) studied the statistics of letter sequencesin long texts (such as Moby Dick) and found the same strong subextensivecomponent It would be attractive to repeat Shannonrsquos experiments with adesign that emphasizes the detection of subextensive terms at large N19

In summary we believe that our analysis of predictive information solvesthe problem of measuring the complexity of time series This analysis uni-es ideas from learning theory coding theory dynamical systems and sta-tistical mechanics In particular we have focused attention on a class ofprocesses that are qualitatively more complex than those treated in conven-

19 Associated with the slow approach to extensivity is a large mutual informationbetween words or characters separated by long distances and several groups have foundthat this mutual information declines as a power law Cover and King (1978) criticize suchobservations bynoting that this behavior is impossible in Markovchains of arbitraryorderWhile it is possible that existing mutual information data have not reached asymptotia thecriticism of Cover and King misses the possibility that language is not a Markov processOf course it cannot be Markovian if it has a power-law divergence in the predictiveinformation

2458 W Bialek I Nemenman and N Tishby

tional learning theory and there are several reasons to think that this classincludes many examples of relevance to biology and cognition

Acknowledgments

We thank V Balasubramanian A Bell S Bruder C Callan A FairhallG Garcia de Polavieja Embid R Koberle A Libchaber A MelikidzeA Mikhailov O Motrunich M Opper R Rumiati R de Ruyter van Steven-inck N Slonim T Spencer S Still S Strong and A Treves for many helpfuldiscussions We also thank M Nemenman and R Rubin for help with thenumerical simulations and an anonymous referee for pointing out yet moreopportunities to connect our work with the earlier literature Our collabo-ration was aided in part by a grant from the US-Israel Binational ScienceFoundation to the Hebrew University of Jerusalem and work at Princetonwas supported in part by funds from NEC

References

Where available we give references to the Los Alamos endashprint archive Pa-pers may be retrieved online from the web site httpxxxlanlgovabswhere is the reference for example Adami and Cerf (2000) is found athttpxxxlanlgovabsadap-org9605002For preprints this is a primary ref-erence for published papers there may be differences between the publishedand electronic versions

Abarbanel H D I Brown R Sidorowich J J amp Tsimring L S (1993) Theanalysis of observed chaotic data in physical systems Revs Mod Phys 651331ndash1392

Adami C amp Cerf N J (2000) Physical complexity of symbolic sequencesPhysica D 137 62ndash69 See also adap-org9605002

Aida T (1999) Field theoretical analysis of onndashline learning of probability dis-tributions Phys Rev Lett 83 3554ndash3557 See also cond-mat9911474

Akaike H (1974a) Information theory and an extension of the maximum likeli-hood principle In B Petrov amp F Csaki (Eds) Second International Symposiumof Information Theory Budapest Akademia Kiedo

Akaike H (1974b)A new look at the statistical model identication IEEE TransAutomatic Control 19 716ndash723

Atick J J (1992) Could information theory provide an ecological theory ofsensory processing InW Bialek (Ed)Princeton lecturesonbiophysics(pp223ndash289) Singapore World Scientic

Attneave F (1954)Some informational aspects of visual perception Psych Rev61 183ndash193

Balasubramanian V (1997) Statistical inference Occamrsquos razor and statisticalmechanics on the space of probability distributions Neural Comp 9 349ndash368See also cond-mat9601030

Barlow H B (1959) Sensory mechanisms the reduction of redundancy andintelligence In D V Blake amp A M Uttley (Eds) Proceedings of the Sympo-

Predictability Complexity and Learning 2459

sium on the Mechanization of Thought Processes (Vol 2 pp 537ndash574) LondonH M Stationery Ofce

Barlow H B (1961) Possible principles underlying the transformation of sen-sory messages In W Rosenblith (Ed) Sensory Communication (pp 217ndash234)Cambridge MA MIT Press

Barlow H B (1983) Intelligence guesswork language Nature 304 207ndash209Barron A amp Cover T (1991) Minimum complexity density estimation IEEE

Trans Inf Thy 37 1034ndash1054Bennett C H (1985) Dissipation information computational complexity and

the denition of organization In D Pines (Ed)Emerging syntheses in science(pp 215ndash233) Reading MA AddisonndashWesley

Bennett C H (1990) How to dene complexity in physics and why InW H Zurek (Ed) Complexity entropy and the physics of information (pp 137ndash148) Redwood City CA AddisonndashWesley

Berry II M J Warland D K amp Meister M (1997) The structure and precisionof retinal spike trains Proc Natl Acad Sci (USA) 94 5411ndash5416

Bialek W (1995) Predictive information and the complexity of time series (TechNote) Princeton NJ NEC Research Institute

Bialek W Callan C G amp Strong S P (1996) Field theories for learningprobability distributions Phys Rev Lett 77 4693ndash4697 See also cond-mat9607180

Bialek W amp Tishby N (1999)Predictive information Preprint Available at cond-mat9902341

Bialek W amp Tishby N (in preparation) Extracting relevant informationBrenner N Bialek W amp de Ruyter vanSteveninck R (2000)Adaptive rescaling

optimizes information transmission Neuron 26 695ndash702Bruder S D (1998)Predictiveinformationin spiketrains fromtheblowy and monkey

visual systems Unpublished doctoral dissertation Princeton UniversityBrunel N amp Nadal P (1998) Mutual information Fisher information and

population coding Neural Comp 10 1731ndash1757Capaldi E A Robinson G E amp Fahrbach S E (1999) Neuroethology

of spatial learning The birds and the bees Annu Rev Psychol 50 651ndash682

Carpenter R H S amp Williams M L L (1995) Neural computation of loglikelihood in control of saccadic eye movements Nature 377 59ndash62

Cesa-Bianchi N amp Lugosi G (in press) Worst-case bounds for the logarithmicloss of predictors Machine Learning

Chaitin G J (1975) A theory of program size formally identical to informationtheory J Assoc Comp Mach 22 329ndash340

Clarke B S amp Barron A R (1990) Information-theoretic asymptotics of Bayesmethods IEEE Trans Inf Thy 36 453ndash471

Cover T M amp King R C (1978)A convergent gambling estimate of the entropyof English IEEE Trans Inf Thy 24 413ndash421

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Crutcheld J P amp Feldman D P (1997) Statistical complexity of simple 1-Dspin systems Phys Rev E 55 1239ndash1243R See also cond-mat9702191

2460 W Bialek I Nemenman and N Tishby

Crutcheld J P Feldman D P amp Shalizi C R (2000) Comment I onldquoSimple measure for complexityrdquo Phys Rev E 62 2996ndash2997 See also chao-dyn9907001

Crutcheld J P amp Shalizi C R (1999)Thermodynamic depth of causal statesObjective complexity via minimal representation Phys Rev E 59 275ndash283See also cond-mat9808147

Crutcheld J P amp Young K (1989) Inferring statistical complexity Phys RevLett 63 105ndash108

Eagleman D M amp Sejnowski T J (2000) Motion integration and postdictionin visual awareness Science 287 2036ndash2038

Ebeling W amp Poschel T (1994)Entropy and long-range correlations in literaryEnglish Europhys Lett 26 241ndash246

Feldman D P amp Crutcheld J P (1998) Measures of statistical complexityWhy Phys Lett A 238 244ndash252 See also cond-mat9708186

Gell-Mann M amp Lloyd S (1996) Information measures effective complexityand total information Complexity 2 44ndash52

de Gennes PndashG (1979) Scaling concepts in polymer physics Ithaca NY CornellUniversity Press

Grassberger P (1986) Toward a quantitative theory of self-generated complex-ity Int J Theor Phys 25 907ndash938

Grassberger P (1991) Information and complexity measures in dynamical sys-tems In H Atmanspacher amp H Schreingraber (Eds) Information dynamics(pp 15ndash33) New York Plenum Press

Hall P amp Hannan E (1988)On stochastic complexity and nonparametric den-sity estimation Biometrica 75 705ndash714

Haussler D Kearns M amp Schapire R (1994)Bounds on the sample complex-ity of Bayesian learning using information theory and the VC dimensionMachine Learning 14 83ndash114

Haussler D Kearns M Seung S amp Tishby N (1996)Rigorous learning curvebounds from statistical mechanics Machine Learning 25 195ndash236

Haussler D amp Opper M (1995) General bounds on the mutual informationbetween a parameter and n conditionally independent events In Proceedingsof the Eighth Annual Conference on ComputationalLearning Theory (pp 402ndash411)New York ACM Press

Haussler D amp Opper M (1997) Mutual information metric entropy and cu-mulative relative entropy risk Ann Statist 25 2451ndash2492

Haussler D amp Opper M (1998) Worst case prediction over sequences un-der log loss In G Cybenko D OrsquoLeary amp J Rissanen (Eds) The mathe-matics of information coding extraction and distribution New York Springer-Verlag

Hawken M J amp Parker A J (1991)Spatial receptive eld organization in mon-key V1 and its relationship to the cone mosaic InM S Landy amp J A Movshon(Eds) Computational models of visual processing (pp 83ndash93) Cambridge MAMIT Press

Herschkowitz D amp Nadal JndashP (1999)Unsupervised and supervised learningMutual information between parameters and observations Phys Rev E 593344ndash3360

Predictability Complexity and Learning 2461

Hilberg W (1990) The well-known lower bound of information in written lan-guage Is it a misinterpretation of Shannon experiments Frequenz 44 243ndash248 (in German)

Holy T E (1997) Analysis of data from continuous probability distributionsPhys Rev Lett 79 3545ndash3548 See also physics9706015

Ibragimov I amp Hasminskii R (1972) On an information in a sample about aparameter In Second International Symposium on Information Theory (pp 295ndash309) New York IEEE Press

Kang K amp Sompolinsky H (2001) Mutual information of population codes anddistancemeasures in probabilityspace Preprint Available at cond-mat0101161

Kemeney J G (1953)The use of simplicity in induction Philos Rev 62 391ndash315Kolmogoroff A N (1939) Sur lrsquointerpolation et extrapolations des suites sta-

tionnaires C R Acad Sci Paris 208 2043ndash2045Kolmogorov A N (1941) Interpolation and extrapolation of stationary random

sequences Izv Akad Nauk SSSR Ser Mat 5 3ndash14 (in Russian) Translation inA N Shiryagev (Ed) Selectedworks of A N Kolmogorov (Vol 23 pp 272ndash280)Norwell MA Kluwer

Kolmogorov A N (1965) Three approaches to the quantitative denition ofinformation Prob Inf Trans 1 4ndash7

Li M amp Vit Acircanyi P (1993) An Introduction to Kolmogorov complexity and its appli-cations New York Springer-Verlag

Lloyd S amp Pagels H (1988)Complexity as thermodynamic depth Ann Phys188 186ndash213

Logothetis N K amp Sheinberg D L (1996) Visual object recognition AnnuRev Neurosci 19 577ndash621

Lopes L L amp Oden G C (1987) Distinguishing between random and non-random events J Exp Psych Learning Memory and Cognition 13 392ndash400

Lopez-Ruiz R Mancini H L amp Calbet X (1995) A statistical measure ofcomplexity Phys Lett A 209 321ndash326

MacKay D J C (1992) Bayesian interpolation Neural Comp 4 415ndash447Nemenman I (2000) Informationtheory and learningA physicalapproach Unpub-

lished doctoral dissertation Princeton University See also physics0009032Nemenman I amp Bialek W (2001) Learning continuous distributions Simu-

lations with eld theoretic priors In T K Leen T G Dietterich amp V Tresp(Eds) Advances in neural information processing systems 13 (pp 287ndash295)MITPress Cambridge MA MIT Press See also cond-mat0009165

Opper M (1994) Learning and generalization in a two-layer neural networkThe role of the VapnikndashChervonenkis dimension Phys Rev Lett 72 2113ndash2116

Opper M amp Haussler D (1995) Bounds for predictive errors in the statisticalmechanics of supervised learning Phys Rev Lett 75 3772ndash3775

Periwal V (1997)Reparametrization invariant statistical inference and gravityPhys Rev Lett 78 4671ndash4674 See also hep-th9703135

Periwal V (1998) Geometrical statistical inference Nucl Phys B 554[FS] 719ndash730 See also adap-org9801001

Poschel T Ebeling W amp Ros Acirce H (1995) Guessing probability distributionsfrom small samples J Stat Phys 80 1443ndash1452

2462 W Bialek I Nemenman and N Tishby

Reinagel P amp Reid R C (2000) Temporal coding of visual information in thethalamus J Neurosci 20 5392ndash5400

Renyi A (1964) On the amount of information concerning an unknown pa-rameter in a sequence of observations Publ Math Inst Hungar Acad Sci 9617ndash625

Rieke F Warland D de Ruyter van Steveninck R amp Bialek W (1997) SpikesExploring the Neural Code (MIT Press Cambridge)

Rissanen J (1978) Modeling by shortest data description Automatica 14 465ndash471

Rissanen J (1984) Universal coding information prediction and estimationIEEE Trans Inf Thy 30 629ndash636

Rissanen J (1986) Stochastic complexity and modeling Ann Statist 14 1080ndash1100

Rissanen J (1987)Stochastic complexity J Roy Stat Soc B 49 223ndash239253ndash265Rissanen J (1989) Stochastic complexity and statistical inquiry Singapore World

ScienticRissanen J (1996) Fisher information and stochastic complexity IEEE Trans

Inf Thy 42 40ndash47Rissanen J Speed T amp Yu B (1992) Density estimation by stochastic com-

plexity IEEE Trans Inf Thy 38 315ndash323Saffran J R Aslin R N amp Newport E L (1996) Statistical learning by 8-

month-old infants Science 274 1926ndash1928Saffran J R Johnson E K Aslin R H amp Newport E L (1999) Statistical

learning of tone sequences by human infants and adults Cognition 70 27ndash52

Schmidt D M (2000) Continuous probability distributions from nite dataPhys Rev E 61 1052ndash1055

Seung H S Sompolinsky H amp Tishby N (1992)Statistical mechanics of learn-ing from examples Phys Rev A 45 6056ndash6091

Shalizi C R amp Crutcheld J P (1999) Computational mechanics Pattern andprediction structure and simplicity Preprint Available at cond-mat9907176

Shalizi C R amp Crutcheld J P (2000) Information bottleneck causal states andstatistical relevance bases How to represent relevant information in memorylesstransduction Available at nlinAO0006025

Shannon C E (1948) A mathematical theory of communication Bell Sys TechJ 27 379ndash423 623ndash656 Reprinted in C E Shannon amp W Weaver The mathe-matical theory of communication Urbana University of Illinois Press 1949

Shannon C E (1951) Prediction and entropy of printed English Bell Sys TechJ 30 50ndash64 Reprinted in N J A Sloane amp A D Wyner (Eds) Claude ElwoodShannon Collected papers New York IEEE Press 1993

Shiner J Davison M amp Landsberger P (1999) Simple measure for complexityPhys Rev E 59 1459ndash1464

Smirnakis S Berry III M J Warland D K Bialek W amp Meister M (1997)Adaptation of retinal processing to image contrast and spatial scale Nature386 69ndash73

Sole R V amp Luque B (1999) Statistical measures of complexity for strongly inter-acting systems Preprint Available at adap-org9909002

Predictability Complexity and Learning 2463

Solomonoff R J (1964) A formal theory of inductive inference Inform andControl 7 1ndash22 224ndash254

Strong S P Koberle R de Ruyter van Steveninck R amp Bialek W (1998)Entropy and information in neural spike trains Phys Rev Lett 80 197ndash200

Tishby N Pereira F amp Bialek W (1999) The information bottleneck methodIn B Hajek amp R S Sreenivas (Eds) Proceedings of the 37th Annual AllertonConference on Communication Control and Computing (pp 368ndash377) UrbanaUniversity of Illinois See also physics0004057

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.

Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.

Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison–Wesley.

Wiener, N. (1949). Extrapolation, interpolation, and smoothing of time series. New York: Wiley.

Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.

Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

Page 49: Predictability, Complexity, and Learningwbialek/our_papers/bnt_01a.pdfPredictability, Complexity, and Learning 2411 Finally, learning a model to describe a data set can be seen as

Predictability Complexity and Learning 2457

tex devoted to object recognition(Logothetis amp Sheinberg 1996) the numberof synapses is K raquo 5pound109 On the other hand object recognition depends onfoveation and we move our eyes roughly three times per second through-out perhaps 10 years of waking life during which we master the art of objectrecognition This limits us to at most N raquo 109 examples Remembering thatwe must have ordm lt 1 even with large values of A equation 61 suggests thatwe operate with N lt Nc One can make similar arguments about very dif-ferent brains such as the mushroom bodies in insects (Capaldi Robinson ampFahrbach 1999) If this identication of biological learning with the regimeN iquest Nc is correct then the success of learning in animals must depend onstrategies that implement sensible priors over the space of possible models

There is one clear empirical hint that humans can make effective use ofmodels that are beyond nite parameterization (in the sense that predictiveinformation diverges as a power law) and this comes from studies of lan-guage Long ago Shannon (1951) used the knowledge of native speakersto place bounds on the entropy of written English and his strategy madeexplicit use of predictability Shannon showed N-letter sequences to nativespeakers (readers) asked them to guess the next letter and recorded howmany guesses were required before they got the right answer Thus eachletter in the text is turned into a number and the entropy of the distributionof these numbers is an upper bound on the conditional entropy (N) (cfequation 38) Shannon himself thought that the convergence as N becomeslarge was rather quick and quoted an estimate of the extensive entropy perletter S0 Many years later Hilberg (1990) reanalyzed Shannonrsquos data andfound that the approach to extensivity in fact was very slow certainly thereis evidence fora large component S1(N) N12 and this may even dominatethe extensive component for accessible N Ebeling and Poschel (1994 seealso Poschel Ebeling amp Ros Acirce 1995) studied the statistics of letter sequencesin long texts (such as Moby Dick) and found the same strong subextensivecomponent It would be attractive to repeat Shannonrsquos experiments with adesign that emphasizes the detection of subextensive terms at large N19

In summary we believe that our analysis of predictive information solvesthe problem of measuring the complexity of time series This analysis uni-es ideas from learning theory coding theory dynamical systems and sta-tistical mechanics In particular we have focused attention on a class ofprocesses that are qualitatively more complex than those treated in conven-

19 Associated with the slow approach to extensivity is a large mutual informationbetween words or characters separated by long distances and several groups have foundthat this mutual information declines as a power law Cover and King (1978) criticize suchobservations bynoting that this behavior is impossible in Markovchains of arbitraryorderWhile it is possible that existing mutual information data have not reached asymptotia thecriticism of Cover and King misses the possibility that language is not a Markov processOf course it cannot be Markovian if it has a power-law divergence in the predictiveinformation

2458 W Bialek I Nemenman and N Tishby

tional learning theory and there are several reasons to think that this classincludes many examples of relevance to biology and cognition

Acknowledgments

We thank V Balasubramanian A Bell S Bruder C Callan A FairhallG Garcia de Polavieja Embid R Koberle A Libchaber A MelikidzeA Mikhailov O Motrunich M Opper R Rumiati R de Ruyter van Steven-inck N Slonim T Spencer S Still S Strong and A Treves for many helpfuldiscussions We also thank M Nemenman and R Rubin for help with thenumerical simulations and an anonymous referee for pointing out yet moreopportunities to connect our work with the earlier literature Our collabo-ration was aided in part by a grant from the US-Israel Binational ScienceFoundation to the Hebrew University of Jerusalem and work at Princetonwas supported in part by funds from NEC

References

Where available we give references to the Los Alamos endashprint archive Pa-pers may be retrieved online from the web site httpxxxlanlgovabswhere is the reference for example Adami and Cerf (2000) is found athttpxxxlanlgovabsadap-org9605002For preprints this is a primary ref-erence for published papers there may be differences between the publishedand electronic versions

Abarbanel H D I Brown R Sidorowich J J amp Tsimring L S (1993) Theanalysis of observed chaotic data in physical systems Revs Mod Phys 651331ndash1392

Adami C amp Cerf N J (2000) Physical complexity of symbolic sequencesPhysica D 137 62ndash69 See also adap-org9605002

Aida T (1999) Field theoretical analysis of onndashline learning of probability dis-tributions Phys Rev Lett 83 3554ndash3557 See also cond-mat9911474

Akaike H (1974a) Information theory and an extension of the maximum likeli-hood principle In B Petrov amp F Csaki (Eds) Second International Symposiumof Information Theory Budapest Akademia Kiedo

Akaike H (1974b)A new look at the statistical model identication IEEE TransAutomatic Control 19 716ndash723

Atick J J (1992) Could information theory provide an ecological theory ofsensory processing InW Bialek (Ed)Princeton lecturesonbiophysics(pp223ndash289) Singapore World Scientic

Attneave F (1954)Some informational aspects of visual perception Psych Rev61 183ndash193

Balasubramanian V (1997) Statistical inference Occamrsquos razor and statisticalmechanics on the space of probability distributions Neural Comp 9 349ndash368See also cond-mat9601030

Barlow H B (1959) Sensory mechanisms the reduction of redundancy andintelligence In D V Blake amp A M Uttley (Eds) Proceedings of the Sympo-

Predictability Complexity and Learning 2459

sium on the Mechanization of Thought Processes (Vol 2 pp 537ndash574) LondonH M Stationery Ofce

Barlow H B (1961) Possible principles underlying the transformation of sen-sory messages In W Rosenblith (Ed) Sensory Communication (pp 217ndash234)Cambridge MA MIT Press

Barlow H B (1983) Intelligence guesswork language Nature 304 207ndash209Barron A amp Cover T (1991) Minimum complexity density estimation IEEE

Trans Inf Thy 37 1034ndash1054Bennett C H (1985) Dissipation information computational complexity and

the denition of organization In D Pines (Ed)Emerging syntheses in science(pp 215ndash233) Reading MA AddisonndashWesley

Bennett C H (1990) How to dene complexity in physics and why InW H Zurek (Ed) Complexity entropy and the physics of information (pp 137ndash148) Redwood City CA AddisonndashWesley

Berry II M J Warland D K amp Meister M (1997) The structure and precisionof retinal spike trains Proc Natl Acad Sci (USA) 94 5411ndash5416

Bialek W (1995) Predictive information and the complexity of time series (TechNote) Princeton NJ NEC Research Institute

Bialek W Callan C G amp Strong S P (1996) Field theories for learningprobability distributions Phys Rev Lett 77 4693ndash4697 See also cond-mat9607180

Bialek W amp Tishby N (1999)Predictive information Preprint Available at cond-mat9902341

Bialek W amp Tishby N (in preparation) Extracting relevant informationBrenner N Bialek W amp de Ruyter vanSteveninck R (2000)Adaptive rescaling

optimizes information transmission Neuron 26 695ndash702Bruder S D (1998)Predictiveinformationin spiketrains fromtheblowy and monkey

visual systems Unpublished doctoral dissertation Princeton UniversityBrunel N amp Nadal P (1998) Mutual information Fisher information and

population coding Neural Comp 10 1731ndash1757Capaldi E A Robinson G E amp Fahrbach S E (1999) Neuroethology

of spatial learning The birds and the bees Annu Rev Psychol 50 651ndash682

Carpenter R H S amp Williams M L L (1995) Neural computation of loglikelihood in control of saccadic eye movements Nature 377 59ndash62

Cesa-Bianchi N amp Lugosi G (in press) Worst-case bounds for the logarithmicloss of predictors Machine Learning

Chaitin G J (1975) A theory of program size formally identical to informationtheory J Assoc Comp Mach 22 329ndash340

Clarke B S amp Barron A R (1990) Information-theoretic asymptotics of Bayesmethods IEEE Trans Inf Thy 36 453ndash471

Cover T M amp King R C (1978)A convergent gambling estimate of the entropyof English IEEE Trans Inf Thy 24 413ndash421

Cover T M amp Thomas J A (1991) Elements of information theory New YorkWiley

Crutcheld J P amp Feldman D P (1997) Statistical complexity of simple 1-Dspin systems Phys Rev E 55 1239ndash1243R See also cond-mat9702191

2460 W Bialek I Nemenman and N Tishby

Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.


Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248 (in German).
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk SSSR Ser. Mat., 5, 3–14 (in Russian). Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rosé, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.


Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.


Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitányi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A. (Eds.). (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.

