arXiv:1606.06737v2 [cond-mat.dis-nn] 11 Jul 2016
Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language

Henry W. Lin and Max Tegmark
Dept. of Physics, Harvard University, Cambridge, MA 02138 and

Dept. of Physics & MIT Kavli Institute, Massachusetts Institute of Technology, Cambridge, MA 02139

(Dated: July 12, 2016)

We show that in many data sequences — from texts in different languages to melodies and genomes — the mutual information between two symbols decays roughly like a power law with the number of symbols in between the two. In contrast, we prove that Markov/hidden Markov processes generically exhibit exponential decay in their mutual information, which explains why natural languages are poorly approximated by Markov processes. We present a broad class of models that naturally reproduce this critical behavior. They all involve deep dynamics of a recursive nature, as can be approximately implemented by tree-like or recurrent deep neural networks. This model class captures the essence of probabilistic context-free grammars as well as recursive self-reproduction in physical phenomena such as turbulence and cosmological inflation. We derive an analytic formula for the asymptotic power law and elucidate our results in a statistical physics context: 1-dimensional “shallow” models (such as Markov models or regular grammars) will fail to model natural language, because they cannot exhibit criticality, whereas “deep” models with one or more “hidden” dimensions representing levels of abstraction or scale can potentially succeed.

I. INTRODUCTION

Critical behavior, where long-range correlations decay as a power law with distance, has many important physics applications ranging from phase transitions in condensed matter experiments to turbulence and inflationary fluctuations in our early Universe. It has important applications beyond the traditional purview of physics as well [1–5], including new results which we report in Figure 1: the number of bits of information provided by a symbol about another drops roughly as a power law¹ with distance in sequences as diverse as the human genome, music by Bach, and text in English and French. Why is this, when so many other correlations in nature instead drop exponentially [9]?

Better understanding such statistical properties of natural languages (in the broad sense of information-transmitting sequences) is interesting not only for geneticists, musicologists and linguists, but also for the machine learning community. Consider how your phone can auto-correct your typing, how you can free up disk space with data compression software and how speech-to-text conversion enables you to talk to digital personal assistants such as Siri, Cortana and Google Now: these technologies all exploit statistical properties of language, and can all be further improved if we can better understand these properties.

¹ The power law discussed here should not be confused with another famous power law that occurs in natural languages: Zipf’s law [6]. Zipf’s law implies power law behavior in one-point statistics (in the histogram of word frequencies), whereas we are interested in two-point statistics. In the former case, the power law is in the frequency of words; in the latter case, the power law is in the separation between characters. One can easily cook up sequences which obey Zipf’s law but are not critical and do not exhibit a power law in the mutual information. However, there are models of certain physical systems where Zipf’s law follows from criticality [7, 8].

Such deepened understanding is the goal of the present paper, focusing on critical behavior.

Natural languages are difficult for machines to understand. This has been known at least as far back as Turing, whose eponymous test [14] relies upon this key fact. A tempting explanation is that natural language is something uniquely human. But this is far from a satisfactory one, especially given the recent successes of machines at performing tasks as complex and as “human” as playing Jeopardy! [15], chess [16], Atari games [17] and Go [18]. We will show that computer descriptions of language tend to suffer from a much simpler problem that has nothing to do with meaning, understanding or being non-human: they tend to get the basic statistical properties wrong. We will prove that Markov processes, the workhorse of modeling any sequential data with translational symmetry and one of the handful of models that are analytically tractable, fail epically by predicting exponentially decaying mutual information. On the other hand, impressive progress has been made by using deep neural networks for natural language processing (see, e.g., [19–22]); for recent reviews of deep neural networks, see [23, 24]. Unfortunately, unlike Markov and related n-gram models, these deep networks are often treated like inscrutable black boxes, given their enormous complexity. This has triggered many recent efforts to understand their advantages analytically [25–28], from a functional [29], topological [30], and geometric [31, 32] perspective; this paper explores the advantages from a statistical physics perspective [33].

We will see that a key reason that currently popular recurrent neural networks with long short-term memory (LSTM) [34] do much better is that they can replicate critical behavior, but that even they can be further improved, since they can under-predict long-range mutual information.


[Figure 1: log-log plot of mutual information I(X,Y) in bits versus distance between symbols d(X,Y), with curves for a critical 2D Ising model, English Wikipedia, French text, Bach, the human genome, English text, and a Markov process.]

FIG. 1: Decay of mutual information with separation. Here the mutual information in bits per symbol is shown as a function of separation d(X,Y) = |i − j|, where the symbols X and Y are located at positions i and j in the sequence in question, and shaded bands correspond to 1σ error bars. All measured curves are seen to decay roughly as power laws, explaining why they cannot be accurately modeled as Markov processes — for which the mutual information instead plummets exponentially (the example shown has I ∝ e^{−d/6}). The measured curves are seen to be qualitatively similar to that of a famous critical system in physics: a 1D slice through a critical 2D Ising model, where the slope is −1/2. The human genome data consists of 177,696,512 base pairs {A, C, T, G} from chromosome 5 from the National Center for Biotechnology Information [10], with unknown base pairs omitted. The Bach data consists of 5727 notes from Partita No. 2 [11], with all notes mapped into a 12-symbol alphabet consisting of the 12 half-tones {C, C#, D, D#, E, F, F#, G, G#, A, A#, B}, with all timing, volume and octave information discarded. The three text corpuses are 100 MB from Wikipedia [12] (206 symbols), the first 114 MB of a French corpus [13] (185 symbols) and 27 MB of English articles from slate.com (143 symbols). The large long-range information appears to be dominated by poems in the French sample and by html-like syntax in the Wikipedia sample.


A final goal of this paper is to ameliorate the following problem: machine learning typically involves using something we do not fully understand (neural nets, etc.) to study something we also do not fully understand (English, etc.). If we are ever to understand how some learning algorithm works, we must first understand what we are trying to learn. For this reason, we will construct a simple class of analytically tractable models which qualitatively reproduce (some of) the statistics of natural languages — specifically, critical behavior.

This paper is organized as follows. In Section II, we show how Markov processes exhibit exponential decay in mutual information with scale; we give a rigorous proof of this and other results in a series of appendices. To enable such proofs, we introduce a convenient quantity that we term rational mutual information, which bounds the mutual information and converges to it in the near-independence limit. In Section III, we define a subclass of generative grammars and show that they exhibit critical behavior with power law decays. We then generalize our discussion using Bayesian nets and relate our findings to theorems in statistical physics. In Section IV, we discuss our results and explain how LSTM RNNs can reproduce critical behavior by emulating our generative grammar model.


II. MARKOV IMPLIES EXPONENTIAL DECAY

For two random variables X and Y, the following definitions of mutual information are all equivalent:

$$I(X,Y) \equiv S(X) + S(Y) - S(X,Y) = D\big(p(X,Y)\,||\,p(X)p(Y)\big) = \left\langle \log_B \frac{P(a,b)}{P(a)P(b)} \right\rangle = \sum_{ab} P(a,b)\, \log_B \frac{P(a,b)}{P(a)P(b)}, \qquad (1)$$

where S ≡ ⟨− log_B P⟩ is the Shannon entropy [35] and D(p(X,Y)||p(X)p(Y)) is the Kullback-Leibler divergence [36] between the joint probability distribution and the product of the individual marginals. If the base of the logarithm is taken to be B = 2, then I(X,Y) is measured in bits. The mutual information can be interpreted as how much one variable knows about the other: I(X,Y) is the reduction in the number of bits needed to specify X once Y is specified. Equivalently, it is the number of encoding bits saved by using the true joint probability P(X,Y) instead of approximating X and Y as independent. It is thus a measure of statistical dependencies between X and Y. Although it is more conventional to measure quantities such as the correlation coefficient ρ in statistics and statistical physics, the mutual information is more suitable for generic data, since it does not require that the variables X and Y are numbers or have any algebraic structure, whereas ρ requires that we are able to multiply X · Y and average. Whereas it makes sense to multiply numbers, it is meaningless to multiply or average two characters such as “!” and “?”.
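To make this concrete, here is a minimal Python sketch (ours, not the authors' published code) of the plug-in estimate of I(X,Y) in bits for symbol pairs at a fixed separation d, of the kind used to produce the curves in Figure 1; the toy sequence at the bottom is an arbitrary stand-in for a real corpus.

```python
# Plug-in estimator of I(X,Y) in bits for symbols separated by d positions.
# A minimal sketch: empirical frequencies stand in for true probabilities,
# so the estimate is biased upward for small samples.
from collections import Counter
from math import log2

def mutual_information(seq, d):
    """Estimate I(X,Y) in bits between symbols d apart in seq."""
    pairs = Counter(zip(seq[:-d], seq[d:]))   # joint counts of (X_i, X_{i+d})
    n = sum(pairs.values())
    px, py = Counter(), Counter()
    for (a, b), c in pairs.items():
        px[a] += c
        py[b] += c
    # I = sum_ab P(a,b) * log2[ P(a,b) / (P(a) P(b)) ]
    return sum(c / n * log2(c * n / (px[a] * py[b])) for (a, b), c in pairs.items())

if __name__ == "__main__":
    seq = "abracadabra" * 2000                # toy stand-in for a real corpus
    for d in (1, 2, 4, 8, 16):
        print(d, mutual_information(seq, d))
```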

The rest of this paper is largely a study of the mutual information between two random variables that are realizations of a discrete stochastic process, with some separation τ in time. More concretely, we can think of sequences {X_1, X_2, X_3, ...} of random variables, where each one might take values from some finite alphabet. For example, if we model English as a discrete stochastic process and take τ = 2, X could represent the first character (“F”) in this sentence, whereas Y could represent the third character (“r”) in this sentence.

In particular, we start by studying the mutual information function of a Markov process, which is analytically tractable. Let us briefly recapitulate some basic facts about Markov processes (see, e.g., [37] for a pedagogical review). A Markov process is defined by a matrix M of conditional probabilities M_{ab} = P(X_{t+1} = a | X_t = b). Such Markov matrices (also known as stochastic matrices) thus have the properties M_{ab} ≥ 0 and Σ_a M_{ab} = 1.

They fully specify the dynamics of the model:

$$p_{t+1} = M p_t, \qquad (2)$$

where p_t is a vector with components P(X_t = a) that specifies the probability distribution at time t. Let λ_i denote the eigenvalues of M, sorted by decreasing magnitude: |λ_1| ≥ |λ_2| ≥ |λ_3| ≥ ... All Markov matrices have |λ_i| ≤ 1, which is why blowup is avoided when equation (2) is iterated, and λ_1 = 1, with the corresponding eigenvector giving a stationary probability distribution µ satisfying Mµ = µ.

In addition, two mild conditions are usually imposed on Markov matrices. First, M is irreducible, meaning that every state is accessible from every other state (otherwise, we could decompose the Markov process into separate Markov processes). Second, to avoid processes like 1 → 2 → 1 → 2 → ... that will never converge, we take the Markov process to be aperiodic. It is easy to show using the Perron-Frobenius theorem that being irreducible and aperiodic implies |λ_2| < 1.

This section is devoted to the intuition behind the following theorem, whose full proof is given in Appendices A and B. The theorem states roughly that for a Markov process, the mutual information between two points in time t_1 and t_2 decays exponentially for large separation |t_2 − t_1|:

Theorem 1: Let M be a Markov matrix that generates a Markov process. If M is irreducible and aperiodic, then the asymptotic behavior of the mutual information I(t_1, t_2) is exponential decay toward zero for |t_2 − t_1| ≫ 1, with decay timescale $\log \frac{1}{|\lambda_2|}$, where λ_2 is the second largest eigenvalue of M. If M is reducible or periodic, I can instead decay to a constant; no Markov process whatsoever can produce power-law decay.

Suppose M is irreducible and aperiodic, so that p_t → µ as t → ∞ as mentioned above. This convergence of one-point statistics, e.g., p_t, has been well-studied [37]. However, one can also study higher order statistics such as the joint probability distribution for two points in time. For succinctness, let us write P(a, b) ≡ P(X = a, Y = b), where X = X_{t_1} and Y = X_{t_2} and τ ≡ |t_2 − t_1|. We are interested in the asymptotic situation where the Markov process has converged to its steady state, so the marginal distribution $P(a) \equiv \sum_b P(a,b) = \mu_a$, independently of time.

If the joint probability distribution approximately factorizes as P(a, b) ≈ µ_a µ_b for sufficiently large and well-separated times t_1 and t_2 (as we will soon prove), the mutual information will be small. We can therefore Taylor expand the logarithm from equation (1) around the point P(a, b) = P(a)P(b), giving

$$I(X,Y) = \left\langle \log_B \frac{P(a,b)}{P(a)P(b)} \right\rangle = \left\langle \log_B \left[ 1 + \left( \frac{P(a,b)}{P(a)P(b)} - 1 \right) \right] \right\rangle \approx \left\langle \frac{P(a,b)}{P(a)P(b)} - 1 \right\rangle \frac{1}{\ln B} = \frac{I_R(X,Y)}{\ln B}, \qquad (3)$$


where we have defined the rational mutual information

$$I_R \equiv \left\langle \frac{P(a,b)}{P(a)P(b)} - 1 \right\rangle. \qquad (4)$$

For comparing the rational mutual information with the usual mutual information, it will be convenient to take e as the base B of the logarithm. We derive useful properties of the rational mutual information in Appendix A. To mention just one, we note that the rational mutual information is not just asymptotically equal to the mutual information in the limit of near-independence, but it also provides a strict upper bound on it: 0 ≤ I ≤ I_R.
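As a quick numerical illustration of this bound (our own sketch, not from the paper), one can draw an arbitrary random joint distribution and verify that 0 ≤ I ≤ I_R in nats (base B = e):

```python
# Check 0 <= I <= I_R for a random joint distribution (nats, B = e).
import numpy as np

rng = np.random.default_rng(1)
joint = rng.random((4, 4))
joint /= joint.sum()                   # arbitrary joint distribution P(a,b)

px = joint.sum(axis=1, keepdims=True)  # marginal P(a)
py = joint.sum(axis=0, keepdims=True)  # marginal P(b)
ratio = joint / (px * py)              # P(a,b) / (P(a) P(b))

I  = np.sum(joint * np.log(ratio))     # Shannon mutual information, eq. (1)
IR = np.sum(joint * ratio) - 1         # rational mutual information, eq. (4)
print(I, IR)                           # always satisfies 0 <= I <= IR
```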

Let us without loss of generality take t_2 > t_1. Then iterating equation (2) τ times gives P(b|a) = (M^τ)_{ba}. Since P(a, b) = P(a)P(b|a), we obtain

$$I_R + 1 = \left\langle \frac{P(a,b)}{P(a)P(b)} \right\rangle = \sum_{ab} P(a,b)\, \frac{P(a,b)}{P(a)P(b)} = \sum_{ab} \frac{P(b|a)^2 P(a)^2}{P(a)P(b)} = \sum_{ab} \frac{\mu_a}{\mu_b} \left[ (M^\tau)_{ba} \right]^2.$$

We will continue the proof by considering the typical case where the eigenvalues of M are all distinct (non-degenerate) and the Markov matrix is irreducible and aperiodic; we will generalize to the other cases (which form a set of measure zero) in Appendix B. Since the eigenvalues are distinct, we can diagonalize M by writing

$$M = B D B^{-1} \qquad (5)$$

for some invertible matrix B and some diagonal matrix D whose diagonal elements are the eigenvalues: D_{ii} = λ_i. Raising equation (5) to the power τ gives M^τ = B D^τ B^{-1}, i.e.,

$$(M^\tau)_{ba} = \sum_c \lambda_c^\tau B_{bc} (B^{-1})_{ca}. \qquad (6)$$

Since M is non-degenerate, irreducible and aperiodic, 1 = λ_1 > |λ_2| > ... > |λ_n|, so all terms except the first in the sum of equation (6) decay exponentially with τ, at a decay rate that grows with c. Defining r = λ_3/λ_2, we have

$$(M^\tau)_{ba} = B_{b1} B^{-1}_{1a} + \lambda_2^\tau \left[ B_{b2} B^{-1}_{2a} + O(r^\tau) \right] = \mu_b + \lambda_2^\tau A_{ba}, \qquad (7)$$

where we have made use of the fact that an irreducible and aperiodic Markov process must converge to its stationary distribution for large τ, and we have defined A as the expression in square brackets above, satisfying $\lim_{\tau\to\infty} A_{ba} = B_{b2} B^{-1}_{2a}$. Note that $\sum_b A_{ba} = 0$ in order for M to be properly normalized.

Substituting equation (7) into our expression for I_R + 1 above and using the facts that $\sum_a \mu_a = 1$ and $\sum_b A_{ba} = 0$, we obtain

$$I_R = \sum_{ab} \frac{\mu_a}{\mu_b} \left[ (M^\tau)_{ba} \right]^2 - 1 = \sum_{ab} \frac{\mu_a}{\mu_b} \left( \mu_b^2 + 2 \mu_b \lambda_2^\tau A_{ba} + \lambda_2^{2\tau} A_{ba}^2 \right) - 1 = \sum_{ab} \lambda_2^{2\tau} \left( \mu_b^{-1} A_{ba}^2 \mu_a \right) = C \lambda_2^{2\tau}, \qquad (8)$$

where the term in the last parentheses is of the form C = C_0 + O(r^\tau).

In summary, we have shown that an irreducible and aperiodic Markov process with non-degenerate eigenvalues cannot produce critical behavior, because the mutual information decays exponentially. In fact, no Markov process can, as we show in Appendix B.
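This prediction is easy to verify numerically. The sketch below (ours, with an arbitrary 3-state chain) computes I(X_t, X_{t+τ}) exactly from the joint distribution P(a,b) = µ_a (M^τ)_{ba} and compares it with the λ_2^{2τ} scaling of equation (8):

```python
# Exact mutual information I(X_t, X_{t+tau}) for a small Markov chain,
# compared with the C * lambda_2^(2 tau) prediction of eq. (8).
# Our numerical check, not code from the paper.
import numpy as np

M = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.3],
              [0.1, 0.2, 0.6]])    # M[a,b] = P(X_{t+1}=a | X_t=b); columns sum to 1

evals, evecs = np.linalg.eig(M)
order = np.argsort(-np.abs(evals))
lam2 = abs(evals[order[1]])         # second-largest eigenvalue magnitude
mu = np.real(evecs[:, order[0]])
mu /= mu.sum()                      # stationary distribution: M mu = mu

def mutual_info(tau):
    cond = np.linalg.matrix_power(M, tau)  # cond[b,a] = P(X_{t+tau}=b | X_t=a)
    joint = cond * mu                      # joint[b,a] = P(X_{t+tau}=b, X_t=a)
    pa, pb = joint.sum(axis=0), joint.sum(axis=1)
    return float(np.sum(joint * np.log2(joint / np.outer(pb, pa))))

for tau in (1, 2, 5, 10, 20):
    # the ratio I / lambda_2^(2 tau) approaches a constant, here C / (2 ln 2)
    print(tau, mutual_info(tau), mutual_info(tau) / lam2 ** (2 * tau))
```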

To hammer the final nail into the coffin of Markov processes as models of critical behavior, we need to close a final loophole. Their fundamental problem is lack of long-term memory, which can be superficially overcome by redefining the state space to include symbols from the past. For example, if the current state is one of n and we wish the process to depend on the last τ symbols, we can define an expanded state space consisting of the n^τ possible sequences of length τ, and a corresponding n^τ × n^τ Markov matrix (or an n^τ × n table of conditional probabilities for the next symbol given the last τ symbols). Although such a model could fit the curves in Figure 1 in theory, it cannot in practice, because M requires far more parameters than there are atoms in our observable universe (∼ 10^78): even for as few as n = 4 symbols and τ = 1000, the Markov process involves over 4^1000 ∼ 10^602 parameters. Scale-invariance aside, we can also see how Markov processes fail simply by considering the structure of text. To model English well, M would need to correctly close parentheses even if they were opened more than τ = 100 characters ago, requiring an M-matrix with more than n^100 parameters, where n > 26 is the number of characters used.

We can significantly generalize Theorem 1 into a theorem about hidden Markov models (HMMs). In an HMM, the observed sequence X_1, ..., X_n is only part of the picture: there are hidden variables Y_1, ..., Y_n that themselves form a Markov chain. We can think of an HMM as follows: imagine a machine with an internal state space Y that updates itself according to some Markovian dynamics. The internal dynamics are never observed, but at each time-step the machine also produces some output Y_i → X_i, and these outputs form the sequence that we can observe. These models are quite general and are used to model a wealth of empirical data (see, e.g., [38]).

Theorem 2: Let M be a Markov matrix that generates the transitions between hidden states Y_i in an HMM. If M is irreducible and aperiodic, then the asymptotic behavior of the mutual information I(t_1, t_2) is exponential decay toward zero for |t_2 − t_1| ≫ 1, with decay timescale $\log \frac{1}{|\lambda_2|}$, where λ_2 is the second largest eigenvalue of M.

This theorem is a strict generalization of Theorem 1, since given any Markov process M with corresponding matrix M, we can construct an HMM that reproduces the exact statistics of M by using M as the transition matrix between the Y’s and generating X_i from Y_i by simply setting x_i = y_i with probability 1.

The proof is very similar in spirit to the proof of Theorem 1, so we will just present a sketch here, leaving a full proof to Appendix B. Let G be the Markov matrix that governs the emission Y_i → X_i. To compute the joint probability between two random variables X_{t_1} and X_{t_2}, we simply compute the joint probability distribution between Y_{t_1} and Y_{t_2}, which again involves a factor of M^τ, and then use two factors of G to convert the joint probability on Y_{t_1}, Y_{t_2} to a joint probability on X_{t_1}, X_{t_2}. These additional two factors of G will not change the fact that there is an exponential decay given by M^τ.

A simple, intuitive bound from information theory (namely the data processing inequality [37]) gives I(Y_{t_1}, Y_{t_2}) ≥ I(Y_{t_1}, X_{t_2}) ≥ I(X_{t_1}, X_{t_2}). However, Theorem 1 implies that I(Y_{t_1}, Y_{t_2}) decays exponentially. Hence I(X_{t_1}, X_{t_2}) must also decay at least exponentially fast.

There is a well-known correspondence between so-called probabilistic regular grammars [39] (sometimes referred to as stochastic regular grammars) and HMMs. Given a probabilistic regular grammar, one can generate an HMM that reproduces all of its statistics, and vice versa. Hence, we can also state Theorem 2 as follows:

Corollary: No probabilistic regular grammar exhibits criticality.

In the next section, we will show that this statement is not true for context-free grammars.

III. POWER LAWS FROM GENERATIVE GRAMMAR

If computationally feasible Markov processes cannot produce genomes, melodies, texts or other sequences with roughly critical behavior, then how do such sequences arise? What sort of alternative processes can generate them? This question is not only interesting theoretically, but also practically important: models which can approximate such critical sequences could explain how some machine learning algorithms perform better than others in tasks such as language processing, and might suggest ways to improve existing algorithms. In the best case scenario, theoretical models may even shed light on how human brains can efficiently generate English sentences without storing googols of parameters.

One answer, advanced by Chomsky [40] and others, which will loosely inspire our work, is the idea of deep structure.

Roughly speaking, the idea is that language is generated in a hierarchical rather than linear fashion. When we write an essay, we do so by thinking about some big idea, and then breaking it into sub-ideas, and sub-sub-ideas, etc. Similarly, we generate a sentence by first choosing its basic structure and then fleshing out each part of the sentence with more modifiers, etc. For example, the two sentences “Bob loves Alice” and “Alice is loved by Bob” are “close” in meaning, but there could be nothing similar about them if English were Markov. On the other hand, if English is generated hierarchically, these sentences might be close in the sense that they diverged close to the leaves of the generative tree.

A. A simple recursive grammar model

We can formalize the above considerations by giving production rules for a toy language L over an alphabet A. In the parlance of theoretical linguistics, our language is generated by a stochastic or probabilistic context-free grammar (PCFG) [41–44]. We will discuss the relationship between our model and a generic PCFG in Section III C. The language is defined by how a native speaker of L produces sentences: first, she draws one of the |A| characters from some probability distribution µ on A. She then takes this character x_0 and replaces it with q new symbols, drawn from a probability distribution P(b|a), where a ∈ A is the first symbol and b ∈ A is any of the second symbols. This is repeated over and over. After u steps, she has a sentence of length q^u.²

One can ask for the character statistics of the sentence at production step u given the statistics of the sentence at production step u − 1. The character distribution is simply

$$P_u(b) = \sum_a P(b|a)\, P_{u-1}(a). \qquad (9)$$

Of course this equation does not imply that the process is a Markov process when the sentences are read left to right. To characterize the statistics as read from left to right, we really want to compute the statistical dependencies within a given sequence, e.g., at fixed u.

To see that the mutual information decays like a power law rather than exponentially with separation, consider two random variables X and Y separated by τ. One can ask how many generations took place between X and the nearest common ancestor of X and Y. Typically, this will be about log_q τ generations. Hence in the tree graph shown in Figure 2, which illustrates the special case q = 2, the number of edges ∆ between X and Y is about 2 log_q τ.

² This exponential blow-up is reminiscent of de Sitter space in cosmic inflation. There is actually a much deeper mathematical analogy involving conformal symmetry and p-adic numbers that has been discussed by Harlow et al. [45].


[Figure 2: two Bayesian-network diagrams, labeled “Deep dynamics” (a binary tree with axes “Time” and “Abstraction level”) and “Shallow dynamics” (a Markov chain along a time axis), with nodes labeled by their geodesic distance to the leftmost node.]

FIG. 2: Both a traditional Markov process (top) and our recursive generative grammar process (bottom) can be represented as Bayesian networks, where the random variable at each node depends only on the node pointing to it with an arrow. The numbers show the geodesic distance ∆ to the leftmost node, defined as the smallest number of edges that must be traversed to get there. Our results show that the mutual information decays exponentially with ∆. Since this geodesic distance ∆ grows only logarithmically with the separation in time in a hierarchical generative grammar (the hierarchy creates very efficient shortcuts), the exponential kills the logarithm and we are left with power-law decays of mutual information in such languages.

By the previous result for Markov processes, we then expect an exponential decay of the mutual information in the variable ∆ ∼ 2 log_q τ. This means that I(X,Y) should be of the form

$$I(X,Y) \sim q^{-\gamma \Delta} = q^{-2\gamma \log_q \tau} = \tau^{-2\gamma}, \qquad (10)$$

where γ is controlled by the second-largest eigenvalue of G, the matrix of conditional probabilities P(b|a). But this exponential decay in ∆ is exactly a power-law decay in τ! This intuitive argument is transformed into a rigorous proof in Appendix C.
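The power law of equation (10) can be seen directly in simulation. Below is a minimal sketch (ours; the alphabet size, the sharpening exponent applied to G, and the depth are arbitrary choices) that grows a sequence by the recursive production rule and then estimates the mutual information between leaves:

```python
# Simulate the recursive grammar of Sec. III A (branching factor q = 2) and
# measure the decay of mutual information between leaves with separation.
import numpy as np

rng = np.random.default_rng(0)
k, q, depth = 3, 2, 16                 # alphabet size, branching factor, generations
G = rng.random((k, k)) ** 4            # exponent sharpens G to strengthen correlations
G /= G.sum(axis=0)                     # G[b,a] = P(child = b | parent = a)

seq = np.array([0])
for _ in range(depth):                 # one production step: each symbol -> q children
    seq = np.concatenate([rng.choice(k, size=q, p=G[:, a]) for a in seq])

def mi_bits(seq, d):                   # plug-in estimate of I(X,Y) in bits
    joint = np.zeros((k, k))
    np.add.at(joint, (seq[:-d], seq[d:]), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px * py)[nz])))

for d in (1, 2, 4, 8, 16, 32, 64, 128):
    print(d, mi_bits(seq, d))          # roughly a straight line on a log-log plot
```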

B. Further Generalization: strongly correlated characters in words

In the model we have been describing so far, all nodes emanating from the same parent can be freely permuted, since they are conditionally independent. In this sense, characters within a newly generated word are uncorrelated. We call models with this property weakly correlated. There are still arbitrarily large correlations between words, but not inside of words. If a weakly correlated grammar allows a → ab, it must allow a → ba with the same probability. We now wish to relax this property to allow for the strongly-correlated case where variables may not be conditionally independent given the parents. This allows us to take a big step towards modeling realistic languages: in English, god significantly differs in meaning and usage from dog.

In the previous computation, the crucial ingredient was the joint probability P(a, b) = P(X = a, Y = b). Let us start with a seemingly trivial remark. This joint probability can be re-interpreted as a conditional joint probability. Instead of X and Y being random variables at specified sites t_1 and t_2, we can view them as random variables at randomly chosen locations, conditioned on their locations being t_1 and t_2. Somewhat pedantically, we write P(a, b) = P(a, b|t_1, t_2). This clarifies the important fact that the only way that P(a, b|t_1, t_2) depends on t_1 and t_2 is via a dependence on ∆(t_1, t_2). Hence

$$P(a, b | t_1, t_2) = P(a, b | \Delta). \qquad (11)$$

This equation is specific to weakly correlated models and does not hold for generic strongly correlated models.

In computing the mutual information as a function of separation, the relevant quantity is the right hand side of equation (11). The reason is that in practical scenarios, we estimate probabilities by sampling a sequence at fixed separation t_2 − t_1, corresponding to ∆ ≈ 2 log_q |t_2 − t_1| + O(1), but varying t_1 and t_2. (The O(1) term is discussed in Appendix E.)

Now whereas P(a, b|t_1, t_2) will change when strong correlations are introduced, P(a, b|∆) will retain a very similar form. This can be seen as follows: knowledge of the geodesic distance corresponds to knowledge of how high up the closest parent node is in the hierarchy (see Figure 2). Imagine flowing down from the parent node to the leaves. We start with the stationary distribution µ_i at the parent node. At the first layer below the parent node (corresponding to a causal distance ∆ − 2), we get $Q_{rr'} \equiv P(rr') = \sum_i P_S(rr'|i) P(i)$, where the symmetrized probability $P_S(rr'|i) = \frac{1}{2}\left[P(rr'|i) + P(r'r|i)\right]$ comes into play because knowledge of the fact that r, r' are separated by ∆ − 2 gives no information about their order. To continue this process to the second stage and beyond, we only need the matrix $G_{sr} = P(s|r) = \sum_{s'} P_S(ss'|r)$. The reason is that since we only wish to compute the two-point function at the bottom of the tree, the only place where a three-point function is ever needed is at the very top of the tree, where we need to take a single parent into two children nodes. After that, the computation only involves evolving a child node into a grand-child node, and so forth. Hence the overall two-point probability matrix P(ab|∆) is given by the simple equation

$$P(\Delta) = \left( G^{\Delta/2 - 1} \right) Q \left( G^{\Delta/2 - 1} \right)^t. \qquad (12)$$

As we can see from the above formula, changing to the strongly correlated case essentially reduces to the weakly correlated case, where

$$P(\Delta) = \left( G^{\Delta/2} \right) \mathrm{diag}(\mu) \left( G^{\Delta/2} \right)^t, \qquad (13)$$


except for a perturbation near the top of the tree. We can think of the generalization as equivalent to the old model except for a different initial condition. We thus expect on intuitive grounds that the model will still exhibit power law decay. This intuition is correct, as we will prove rigorously in Appendix C.
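Equation (12) is also easy to evaluate directly. In the sketch below (ours; G and Q are random stand-ins), the mutual information of P(∆) falls exponentially in ∆, which via ∆ ≈ 2 log_q τ is a power law in τ:

```python
# Evaluate eq. (12): P(Delta) = G^(Delta/2 - 1) Q G^(Delta/2 - 1)^T, and show
# that its mutual information decays exponentially in the causal distance Delta.
import numpy as np

rng = np.random.default_rng(3)
k = 3
G = rng.random((k, k))
G /= G.sum(axis=0)                      # G[s,r] = P(s|r); columns sum to 1
Q = rng.random((k, k))
Q = (Q + Q.T) / 2
Q /= Q.sum()                            # symmetrized two-point distribution at the top

def mi_bits(joint):
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return float(np.sum(joint * np.log2(joint / (px * py))))

for delta in (2, 4, 6, 8, 10, 12):
    Gp = np.linalg.matrix_power(G, delta // 2 - 1)
    print(delta, mi_bits(Gp @ Q @ Gp.T))  # log MI drops linearly with Delta
```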

C. Further Generalization: Bayesian networks and context-free grammars

Just how generic is the scaling behavior of our model? What if the length of the words is not constant? What about more complex dependencies between layers? If we retrace the derivation in the above arguments, it becomes clear that the only key feature of all of the models considered so far is that the rational mutual information decays exponentially with the causal distance ∆:

$$I_R \propto e^{-\gamma \Delta}. \qquad (14)$$

This is true for (hidden) Markov processes and the hierarchical grammar models that we have considered above. So far we have defined ∆ in terms of quantities specific to these models; for a Markov process, ∆ is simply the time separation. Can we define ∆ more generically? In order to do so, let us make a brief aside about Bayesian networks. Formally, a Bayesian net is a directed acyclic graph (DAG), where the vertices are random variables and conditional dependencies are represented by the arrows. Now instead of thinking of X and Y as living at certain times (t_1, t_2), we can think of them as living at vertices (i, j) of the graph.

We define ∆(i, j) as follows. Since the Bayesian net is a DAG, it is equipped with a partial order ≤ on vertices: we write k ≤ l iff there is a directed path from k to l, in which case we say that k is an ancestor of l. We define L(k, l) to be the number of edges on the shortest directed path from k to l. Finally, we define the causal distance ∆(i, j) to be

$$\Delta(i, j) \equiv \min_{x \le i,\; x \le j} \left[ L(x, i) + L(x, j) \right]. \qquad (15)$$

It is easy to see that this reduces to our previous definition of ∆ for Markov processes and recursive generative trees (see Figure 2).
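For concreteness, here is a small self-contained sketch (ours) that evaluates equation (15) by brute force on a toy DAG, trying every vertex as the common ancestor x; the depth-2 binary tree mimics Figure 2:

```python
# Brute-force causal distance of eq. (15) on a small DAG.
from collections import defaultdict, deque

def directed_dists(edges, src):
    """Shortest directed path lengths (in edges) from src to all reachable nodes."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def causal_distance(edges, i, j):
    nodes = {u for e in edges for u in e}
    best = float("inf")
    for x in nodes:                    # try every vertex as a common ancestor
        d = directed_dists(edges, x)
        if i in d and j in d:
            best = min(best, d[i] + d[j])
    return best

# depth-2 binary tree: root r -> a, b; a -> leaves 0, 1; b -> leaves 2, 3
edges = [("r", "a"), ("r", "b"), ("a", 0), ("a", 1), ("b", 2), ("b", 3)]
print(causal_distance(edges, 0, 1))    # 2: siblings share parent a
print(causal_distance(edges, 0, 3))    # 4: nearest common ancestor is the root
```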

Is it true that our exponential decay result from equation (14) holds even for a generic Bayesian net? The answer is yes, under a suitable approximation. The approximation is to ignore long paths in the network when computing the mutual information. In other words, the mutual information tends to be dominated by the shortest paths via a common ancestor, whose length is ∆. This is generally a reasonable approximation, because these longer paths will give exponentially weaker correlations, so unless the number of paths increases exponentially (or faster) with length, the overall scaling will not change.

With this approximation, we can state a key finding of our theoretical work. Deep models are important because without the extra “dimension” of depth/abstraction, there is no way to use short-range interactions to construct “shortcuts” between random variables that are separated by large amounts of time; 1D models will be doomed to exponential decay. Hence the ubiquity of power laws explains the success of deep learning. In fact, this can be seen as the Bayesian net version of the important result in statistical physics that there are no phase transitions in 1D [46, 47].

There are close analogies between our deep recursive grammar and more conventional physical systems. For example, according to the emerging standard model of cosmology, there was an early period of cosmological inflation when density fluctuations kept getting added on a fixed scale as space itself underwent repeated doublings, combining to produce an excellent approximation to a power-law correlation function. This inflationary process is simply a special case of our deep recursive model (generalized from 1 to 3 dimensions). In this case, the hidden “depth” dimension in our model corresponds to cosmic time, and the time parameter which labels the place in the sequence of interest corresponds to space. A similar physical analogy is turbulence in a fluid, where energy in the form of vortices cascades from large scales to ever smaller scales through a recursive process in which larger vortices create smaller ones, leading to a scale-invariant power spectrum. There is also a close analogy to quantum mechanics: equation (13) expresses the exponential decay of the mutual information with geodesic distance through the Bayesian network; in quantum mechanics, the correlation function of a many-body system decays exponentially with the geodesic distance defined by the tensor network which represents the wavefunction [48].

It is also worth examining our model using techniques from linguistics. A generic PCFG G consists of three ingredients:

1. An alphabet Ā = A ∪ T, which consists of non-terminal symbols A and terminal symbols T.

2. A set of production rules of the form a → B, where the left hand side a ∈ A is always a single non-terminal character and B is a string consisting of symbols in Ā.

3. Probabilities associated with each production rule, P(a → B), such that for each a ∈ A, $\sum_B P(a \to B) = 1$.

It is a remarkable fact that any stochastic context-free grammar can be put in Chomsky normal form [43, 49]. This means that given G, there exists some other grammar G̃ such that all the production rules are either of


the form a → bc or a → α, where a, b, c ∈ A and α ∈ T, and the corresponding languages are identical: L(G̃) = L(G). In other words, given some complicated grammar G, we can always find a grammar G̃ such that the corresponding statistics of the languages are identical and all the production rules replace a symbol by at most two symbols (at the cost of increasing the number of production rules in G̃).

This formalism allows us to strengthen our claims. Our model with a branching factor q = 2 is precisely the class of all context-free grammars that are generated by production rules of the form a → bc. While this might naively seem like a very small subset of all possible context-free grammars, the fact that any context-free grammar can be converted into Chomsky normal form shows that our theory deals with a generic context-free grammar, except for the additional step of producing terminal symbols from non-terminal symbols. Starting from a single symbol, the deep dynamics of the PCFG in normal form are given by a strongly-correlated branching process with q = 2 which proceeds for a characteristic number of productions before terminal symbols are produced. Before most symbols have been converted to terminal symbols, our theory applies, and power-law correlations will exist amongst the non-terminal symbols. To the extent that the terminal symbols that are then produced from non-terminal symbols reflect the correlations of the non-terminal symbols, we expect context-free grammars to be able to produce power law correlations.

From our corollary to Theorem 2, we know that regular grammars cannot exhibit power-law decays in mutual information. Hence context-free grammars are the simplest grammars which support criticality, i.e., they are the lowest level of the Chomsky hierarchy that supports criticality. Note that our corollary to Theorem 2 also implies that not all context-free grammars exhibit criticality, since regular grammars are a strict subset of context-free grammars. Whether one can formulate an even sharper criterion should be the subject of future work.

IV. DISCUSSION

We have shown that many data sequences generated for the purposes of communication — from English and French text to Bach and the human genome — exhibit critical behavior, where the mutual information between symbols decays roughly like a power law with separation. By introducing a quantity we term rational mutual information, we have proved that (hidden) Markov processes generically exhibit exponential decay, whereas deep generative grammars exhibit power law decays. This explains why natural languages are poorly approximated by Markov processes, but better approximated by the deep recurrent neural networks now widely used in machine learning for natural language processing, as we will discuss in detail below. Furthermore, we have identified a crucial ingredient of any successful natural language model: it must have one or more “hidden” dimensions, which can be used to provide shortcuts between distant parts of the network, leading to the longer-range correlations that are crucial for critical behavior.

Let us now explore some useful implications of these results, both for understanding the success of certain neural network architectures and for using the mutual information function as a tool for validating machine learning algorithms.

A. Connection to Recurrent Neural Networks

While the generative grammar model is appealing from a linguistic perspective, it may superficially appear to have little to do with machine learning algorithms that are implemented in practice. However, as we will now see, this model can in fact be viewed as an idealized version of a long short-term memory (LSTM) recurrent neural network (RNN) that is generating (“hallucinating”) a sequence.

First of all, Figure 4 shows that an LSTM RNN can in fact reproduce critical behavior. In this example, we trained an RNN (consisting of three hidden LSTM layers of size 256, as described in [20]) to predict the next character in the 100 MB Wikipedia sample known as enwik8 [12]. We then used the LSTM to hallucinate 1 MB of text and measured the mutual information as a function of distance. Figure 4 shows that not only is the resulting mutual information function a rough power law, but it also has a slope that is relatively similar to the original.

We can understand this success by considering a simplified model that is less powerful and complex than a full LSTM, but retains some of its core features — such an approach to studying deep neural nets has proved fruitful in the past (e.g., [27]).

The usual implementation of LSTMs consists of multiple cells stacked one on top of another. Each cell of the LSTM (depicted as a yellow circle in Figure 3) has a state that is characterized by a matrix of numbers C_t and is updated according to the following rule:

$$C_t = f_t \circ C_{t-1} + i_t \circ D_t, \qquad (16)$$

where ◦ denotes element-wise multiplication and D_t = D_t(C_{t-1}, x_t) is some function of the input x_t from the cell in the layer above (denoted by downward arrows in Figure 3), the details of which do not concern us. Generically, a graph of this picture would look like a rectangular lattice, with each node having an arrow to its right (corresponding to the first term in the above equation) and an arrow from above (corresponding to the second term in the equation). However, if the forget weights f_t decay rapidly with depth (i.e., as we go from the bottom cell towards the top), so that the timescales for forgetting grow exponentially with depth, we will show that a reasonable approximation to the dynamics is given by Figure 3.

If we neglect the dependency of D_t on C_{t-1}, the forget gate f_t leads to exponential decay of the cell state, e.g., C_t = f^t ◦ C_0; this is how LSTMs forget their past. Note that all operations, including exponentiation, are performed element-wise in this section only.

In general, a cell will smoothly forget its past over a timescale of ∼ log(1/f) ≡ τ_f. On timescales ≳ τ_f, the cells are weakly correlated; on timescales ≲ τ_f, the cells are strongly correlated. Hence a discrete approximation to the above equation is the following:

$$C_t = \begin{cases} C_{t-1} & \text{for } \tau_f \text{ timesteps,} \\ D_t(x_t) & \text{on every } (\tau_f + 1)\text{th timestep.} \end{cases} \qquad (17)$$

This simple approximation leads us right back to the hierarchical grammar! The first line of the above equation is labeled “remember” in Figure 3, and the second line is what we refer to as “Markov,” since the next state depends only on the previous one. Since each cell perfectly remembers its previous state for τ_f time-steps, the tree can be reorganized so that it is exactly of the form shown in Figure 3, by omitting nodes which simply copy the previous state. Now supposing that τ_f grows exponentially with depth, τ_f(layer i) ∝ q · τ_f(layer i + 1), we see that the successive layers become exponentially sparse, which is exactly what happens in our deep grammar model, identifying the parameter q, governing the growth of the forget timescale, with the branching parameter in the deep grammar model. (Compare Figure 2 and Figure 3.)
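To make this idealization concrete, the following toy sketch (ours; the layer count, alphabet and transition matrix are arbitrary choices) generates a sequence from a stack of cells in which each cell refreshes its state on a timescale q times longer than the cell below it, per equation (17), and otherwise remembers it perfectly:

```python
# A stack of idealized LSTM cells implementing eq. (17): each cell either
# copies its previous state ("remember") or redraws it from the cell above
# ("Markov"), with forget timescales growing exponentially with height.
import numpy as np

rng = np.random.default_rng(0)
k, q, layers, T = 3, 2, 8, 256     # alphabet, timescale ratio, stack height, steps
G = rng.random((k, k))
G /= G.sum(axis=0)                 # G[b,a] = P(new state = b | state above = a)

state = [0] * layers               # state[0] is the top cell (a fixed "big idea")
seq = []
for t in range(T):
    for i in range(1, layers):
        if t % q ** (layers - 1 - i) == 0:                  # forget timescale elapsed
            state[i] = rng.choice(k, p=G[:, state[i - 1]])  # "Markov" update
        # otherwise: "remember" (keep the previous state unchanged)
    seq.append(state[-1])
print(seq[:32])                    # emitted symbols inherit the tree correlations
```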

B. A new diagnostic for machine learning

How can one tell whether a neural network can be further improved? For example, an LSTM RNN similar to the one we used in Figure 4 can predict Wikipedia text with a residual entropy of ∼ 1.4 bits/character [20], which is very close to the performance of current state-of-the-art custom compression software, which achieves ∼ 1.3 bits/character [50]. Is that essentially the best compression possible, or can significant improvements be made?

Our results provide a powerful diagnostic for shedding further light on this question: measuring the mutual information as a function of separation between symbols is a computationally efficient way of extracting much more meaningful information about the performance of a model than simply evaluating the loss function, usually given by the conditional entropy H(X_t | X_{t−1}, X_{t−2}, ...).

Figure 4 shows that even with just three layers, the LSTM-RNN is able to learn long-range correlations; the slope of the mutual information of hallucinated text is comparable to that of the training set. However, the figure also shows that the predictions of our LSTM-RNN are far from optimal.

[Figure 3: schematic of the idealized LSTM stack along a time axis, with horizontal arrows labeled “Remember”/“Forget” and vertical arrows labeled “Markov”.]

FIG. 3: Our deep generative grammar model can be viewed as an idealization of a long short-term memory (LSTM) recurrent neural net, where the “forget weights” drop with depth so that the forget timescales grow exponentially with depth. The graph drawn here is clearly isomorphic to the graph drawn in Figure 2. For each cell, we approximate the usual incremental updating rule by either perfectly remembering the previous state (horizontal arrows) or by ignoring the previous state and determining the cell state by a random rule depending on the node above (vertical arrows).

Interestingly, the hallucinated text shows about the same mutual information for distances ∼ O(1), but significantly less mutual information at large separation. Without requiring any knowledge about the true entropy of the input text (which is famously NP-hard to compute), this figure immediately shows that the LSTM-RNN we trained is performing sub-optimally: it is not able to capture all the long-term dependencies found in the training data.

As a comparison, we also calculated the bigram transition matrix P(X3X4|X1X2) from the data and used it to hallucinate 1 MB of text. Despite the fact that this higher-order Markov model needs ∼ 10³ more parameters than our LSTM-RNN, it captures less than a fifth of the mutual information captured by the LSTM-RNN, even at modest separations ≳ 5.
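For reference, such a bigram-context baseline is simple to reproduce in a few lines (our sketch; the toy corpus is a stand-in for enwik8): count P(X3|X1X2) and sample from it.

```python
# Fit an order-2 (bigram-context) Markov model by counting and "hallucinate"
# text from it; long-range structure is absent by construction.
import random
from collections import Counter, defaultdict

def fit(text):
    counts = defaultdict(Counter)      # counts["ab"]["c"] = #occurrences of "abc"
    for i in range(len(text) - 2):
        counts[text[i:i + 2]][text[i + 2]] += 1
    return counts

def hallucinate(counts, seed, n):
    out = list(seed)                   # seed must be a 2-character context
    for _ in range(n):
        nxt = counts["".join(out[-2:])]
        out.append(random.choices(list(nxt), weights=list(nxt.values()))[0])
    return "".join(out)

corpus = "the cat sat on the mat. the dog sat on the log. " * 500  # toy stand-in
print(hallucinate(fit(corpus), "th", 200))
```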

In summary, Figure 4 shows both the successes and shortcomings of machine learning. On the one hand, LSTM-RNNs can capture long-range correlations much more efficiently than Markovian models; on the other hand, they cannot match the two-point functions of the training data, never mind its higher order statistics!

One might wonder how the lack of mutual information at large scales for the bigram Markov model is manifested in the hallucinated text.


[Figure 4: log-log plot of mutual information I(X,Y) in bits versus distance between symbols d(X,Y), comparing actual Wikipedia, LSTM-RNN-hallucinated Wikipedia, and Markov-hallucinated Wikipedia.]

FIG. 4: Diagnosing different models by hallucinating text and then measuring the mutual information as a function of separation. The red line is the mutual information of enwik8, a 100 MB sample of English Wikipedia. In shaded blue is the mutual information of hallucinated Wikipedia from a trained LSTM with 3 layers of size 256. We plot in solid black the mutual information of a Markov process on single characters, which we compute exactly. (This would correspond to the mutual information of hallucinations in the limit where the length of the hallucinations goes to infinity.) This curve shows a sharp exponential decay after a distance of ∼ 10, in agreement with our theoretical predictions. We also measured the mutual information for text hallucinated by a Markov process on bigrams, which still underperforms the LSTM in long-range correlations, despite having ∼ 10³ more parameters than the LSTM.

Below we give a line from the Markov hallucinations:

[[computhourgist, Flagesernmenserved whirequotesor thand dy excommentaligmaktophy asits:Fran at ||&lt;If ISBN 088;&ampategorand on of to [[Prefung]]’ and at them rector>

This can be compared with an example from the LSTM RNN:

Proudknow pop groups at Oxford- [http://ccw.com/faqsisdaler/cardiffstwander--helgar.jpg] and Cape Normans’s firstattacks Cup rigid (AM).

Despite using many fewer parameters, the LSTM manages to produce a realistic-looking URL and is able to close brackets correctly [51], something that the Markov model struggles with.

C. Outlook

Although great challenges remain to accurately model natural languages, our results at least allow us to improve on some earlier answers to key questions we sought to address:

1. Why is natural language so hard? The old answer was that language is uniquely human. The new answer is that at least part of the difficulty is that natural language is a critical system, with long-ranged correlations that are difficult for machines to learn.

2. Why are machines bad at natural languages, and why are they good? The old answer is that Markov models are simply not brain/human-like, whereas neural nets are more brain-like and hence better. The new answer is that Markov models or other 1-dimensional models cannot exhibit critical behavior, whereas neural nets and other deep models (where an extra hidden dimension is formed by the layers of the network) are able to exhibit critical behavior.

3. How can we know when machines are bad or good? The old answer is to compute the loss function. The new answer is to also compute the mutual information as a function of separation, which can immediately show how well the model is doing at capturing correlations on different scales.

Future studies could include more comprehensive measurements from multiple language and music corpuses and other one-dimensional data. In addition, more theoretical work is required to explain why natural languages have slopes that are all comparable, ∼ 1/2. In statistical physics, “coincidences” of this sort are usually signs of universality: many seemingly unrelated systems have the same long-wavelength effective field theory at the critical point and hence display the same power law slopes. Perhaps something similar is happening here, though this is unexpected from the deep generative grammar model, where any power law slope is possible.

Acknowledgments: This work was supported by the Foundational Questions Institute http://fqxi.org. The authors wish to thank Noam Chomsky and Greg Lessard for valuable comments on the linguistic aspects of this work; Taiga Abe, Meia Chita-Tegmark, Hanna Field, Esther Goldberg, Emily Mu, John Peurifoy, Tomaso Poggio, Luis Seoane, Leon Shen and David Theurel for helpful discussions and encouragement; Michelle Xu for help acquiring genome data; and the Center for Brains, Minds and Machines (CBMM) for hospitality.

Author contributions: HL proposed the project idea in collaboration with MT. HL and MT collaboratively formulated the proofs, performed the numerical experiments, analyzed the data, and wrote the manuscript.


Appendix A: Properties of rational mutual information

In this appendix, we prove the following elementary properties of rational mutual information:

1. Symmetry: for any two random variables X and Y, I_R(X,Y) = I_R(Y,X). The proof is straightforward:

$$I_R(X,Y) = \sum_{ab} \frac{P(X=a, Y=b)^2}{P(X=a)\, P(Y=b)} - 1 = \sum_{ba} \frac{P(Y=b, X=a)^2}{P(Y=b)\, P(X=a)} - 1 = I_R(Y,X). \qquad (A1)$$

2. Upper bound to mutual information: The logarithm function satisfies ln(1 + x) ≤ x, with equality if and only if (iff) x = 0. Therefore setting $x = \frac{P(a,b)}{P(a)P(b)} - 1$ gives

$$I(X,Y) = \left\langle \log_B \frac{P(a,b)}{P(a)P(b)} \right\rangle = \frac{1}{\ln B} \left\langle \ln \left[ 1 + \left( \frac{P(a,b)}{P(a)P(b)} - 1 \right) \right] \right\rangle \le \frac{1}{\ln B} \left\langle \frac{P(a,b)}{P(a)P(b)} - 1 \right\rangle = \frac{I_R(X,Y)}{\ln B}. \qquad (A2)$$

Hence the rational mutual information satisfies I_R ≥ I ln B, with equality iff I = 0 (or simply I_R ≥ I if we use the natural logarithm base B = e).

3. Non-negativity: It follows from the above inequality that I_R(X,Y) ≥ 0, with equality iff P(a,b) = P(a)P(b), since I_R = I = 0 iff P(a,b) = P(a)P(b). Note that this short proof is only possible because of the information inequality I ≥ 0. From the definition of I_R, it is only obvious that I_R ≥ −1; information theory gives a much tighter bound. Our findings 1–3 can be summarized as follows:

$$I_R(X,Y) = I_R(Y,X) \ge I(X,Y) \ge 0, \qquad (A3)$$

where both equalities occur iff p(X,Y) = p(X)p(Y). It is impossible for one of the last two relations to be an equality while the other is an inequality.

4. Generalization. Note that if we view the mutual information as the divergence between two joint probability distributions, we can generalize the notion of rational mutual information to that of rational divergence:

$$D_R(p\,\|\,q) = \left\langle \frac{p}{q} \right\rangle - 1, \quad (A4)$$

where the expectation value is taken with respect to the "true" probability distribution p. Note that as written, p could be any probability measure on either a discrete or continuous space. The above results can be trivially modified to show that $D_R(p\|q) \ge D_{KL}(p\|q)$ and hence $D_R(p\|q) \ge 0$, with equality iff $p = q$.
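These inequalities are easy to check numerically. The following minimal sketch (our own illustration, not part of the original derivations) draws a random joint distribution and verifies the chain $I_R \ge I \ge 0$ of equation (A3) in natural-log units:

```python
# Numerical sanity check of eq. (A3): I_R(X,Y) >= I(X,Y) >= 0 for B = e.
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((4, 4))
P /= P.sum()                                 # random joint distribution P(a,b)
px, py = P.sum(axis=1), P.sum(axis=0)        # marginals P(a), P(b)
ratio = P / np.outer(px, py)                 # P(a,b) / (P(a)P(b))
I = np.sum(P * np.log(ratio))                # Shannon mutual information (nats)
IR = np.sum(P * ratio) - 1                   # rational mutual information
assert IR >= I >= 0                          # eq. (A3) with natural logarithms
print(IR, I)
```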

Appendix B: General proof for Markov processes

In this appendix, we drop the assumptions of non-degeneracy, irreducibility and non-periodicity made in the main body of the paper, where we proved that Markov processes lead to exponential decay.

1. The degenerate case

First, we consider the case where the Markov matrix M has degenerate eigenvalues. In this case, we cannot guarantee that M can be diagonalized. However, any complex matrix can be put into Jordan normal form. In Jordan normal form, a matrix is block diagonal, with each d × d block corresponding to an eigenvalue with degeneracy d. These blocks have a particularly simple form, with block i having $\lambda_i$ on the diagonal and ones just above the diagonal. For example, if there are only three distinct eigenvalues and $\lambda_2$ is threefold degenerate, the Jordan form of M would be

$$B^{-1}MB = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & \lambda_2 & 1 & 0 & 0 \\ 0 & 0 & \lambda_2 & 1 & 0 \\ 0 & 0 & 0 & \lambda_2 & 0 \\ 0 & 0 & 0 & 0 & \lambda_3 \end{pmatrix}. \quad (B1)$$

Note that the largest eigenvalue is unique and equal to 1 for all irreducible and aperiodic M. In this example, the matrix power $M^\tau$ is

$$B^{-1}M^\tau B = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & \lambda_2^\tau & \binom{\tau}{1}\lambda_2^{\tau-1} & \binom{\tau}{2}\lambda_2^{\tau-2} & 0 \\ 0 & 0 & \lambda_2^\tau & \binom{\tau}{1}\lambda_2^{\tau-1} & 0 \\ 0 & 0 & 0 & \lambda_2^\tau & 0 \\ 0 & 0 & 0 & 0 & \lambda_3^\tau \end{pmatrix}. \quad (B2)$$

In the general case, raising a matrix to an arbitrary power yields a matrix that is still block diagonal, with each block being an upper triangular matrix. The important point is that in block i, every entry scales $\propto \lambda_i^\tau$, up to a combinatorial factor. Each combinatorial factor grows only polynomially with τ, with the degree of the polynomial in the ith block bounded by the multiplicity of $\lambda_i$ minus one.


Using this Jordan decomposition, we can replicate equation (7) and write

$$M^\tau_{ij} = \mu_i + \lambda_2^\tau A_{ij}. \quad (B3)$$

There are two cases, depending on whether the second eigenvalue $\lambda_2$ is degenerate or not. If not, then the equation

$$\lim_{\tau\to\infty} A_{ij} = B_{i2}B^{-1}_{2j} \quad (B4)$$

still holds, since for $i \ge 3$, $(\lambda_i/\lambda_2)^\tau$ decays faster than any polynomial of finite degree. On the other hand, if the second eigenvalue is degenerate with multiplicity $m_2$, we instead define A with the combinatorial factor removed:

$$M^\tau_{ij} = \mu_i + \binom{\tau}{m_2}\lambda_2^\tau A_{ij}. \quad (B5)$$

If $m_2 = 1$, this definition simply reduces to the previous definition of A. With this definition,

$$\lim_{\tau\to\infty} A_{ij} = \lambda_2^{-m_2}\, B_{i2}\, B^{-1}_{(2+m_2)j}. \quad (B6)$$

Hence in the most general case, the mutual information decays like a polynomial times an exponential, $\mathcal{P}(\tau)e^{-\gamma\tau}$, where $\gamma = 2\ln\frac{1}{\lambda_2}$. The polynomial is non-constant if and only if the second largest eigenvalue is degenerate. Note that even in this case, the mutual information decays exponentially in the sense that it is possible to bound the mutual information by an exponential.

2. The reducible case

Now let us generalize to the case where the Markov process is reducible, i.e., decomposable into a set of m non-interacting Markov processes for some integer m > 1. This means that the state space can be decomposed as

$$S = \bigcup_{i=1}^m S_i, \quad (B7)$$

where restricting the Markov process to each $S_i$ results in a well-defined Markov process on $S_i$ that does not interact with any other $S_j$. If the system starts off in $S_1$, it can never transition to $S_2$, and so forth. Now if we restrict our model to the subset of interest, the mutual information will exponentially decay; however, for a generic initial state, the total probability within each set $S_i$ remains constant, so the mutual information will asymptotically approach the entropy of the probability distribution across sets, which can be at most $I = \log_2 m$. In the language of statistical physics, this is an example of topological order, which leads to constant terms in the correlation functions; here, the Markov graph of M is disconnected, so there are m degenerate equilibrium states.

3. The periodic case

If a Markov process is periodic, with a sequence that repeats forever, then the mutual information is a constant and never decays to zero, so the only power law that can be attained is the case of slope zero, which does not correspond to critical behavior.

4. The n > 1 case

The preceding proof holds only for order n = 1 Markov processes, but we can easily extend the results to arbitrary n. Any n = 2 Markov process can be converted into an n = 1 Markov process on pairs of letters $X_1X_2$. Hence our proof shows that $I(X_1X_2, Y_1Y_2)$ decays exponentially. But for any random variables X, Y, the data processing inequality [37] states that $I(X, g(Y)) \le I(X,Y)$, where g is an arbitrary function of Y. Letting $g(Y_1Y_2) = Y_1$, and then permuting and applying $g(X_1X_2) = X_1$, gives

$$I(X_1X_2, Y_1Y_2) \ge I(X_1X_2, Y_1) \ge I(X_1, Y_1). \quad (B8)$$

Hence we see that $I(X_1, Y_1)$ must decay exponentially. The preceding remarks can easily be formalized into a proof for an arbitrary Markov process by induction on n.
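For concreteness, here is a minimal sketch (ours; the order-2 conditional probabilities are an arbitrary assumption) of the reduction used above: an order-2 chain on an alphabet becomes an order-1 chain on overlapping pairs, to which the n = 1 proof applies.

```python
# Convert an order-2 Markov chain into an order-1 chain on letter pairs.
import itertools
import numpy as np

A = [0, 1]
# Hypothetical order-2 conditionals P(next | two preceding letters):
P2 = {(x1, x2): np.array([0.6, 0.4]) if x1 == x2 else np.array([0.3, 0.7])
      for x1 in A for x2 in A}

pairs = list(itertools.product(A, A))
M = np.zeros((len(pairs), len(pairs)))      # M[(y1,y2),(x1,x2)], columns sum to 1
for c, (x1, x2) in enumerate(pairs):
    for r, (y1, y2) in enumerate(pairs):
        if y1 == x2:                        # consecutive pairs overlap by one letter
            M[r, c] = P2[(x1, x2)][y2]
print(M.sum(axis=0))                        # all ones: a valid n = 1 Markov matrix
```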

5. The detailed balance case

This asymptotic relation can be strengthened for a subclass of Markov processes which obey a condition known as detailed balance. This subclass arises naturally in the study of statistical physics [52]. For our purposes, this simply means that there exist some real numbers $K_a$ and a symmetric matrix $S_{ab} = S_{ba}$ such that

$$M_{ab} = e^{K_a/2}\,S_{ab}\,e^{-K_b/2}. \quad (B9)$$

Let us note the following facts. (1) The matrix power is simply $(M^\tau)_{ab} = e^{K_a/2}(S^\tau)_{ab}e^{-K_b/2}$. (2) By the spectral theorem, we can diagonalize S into an orthonormal basis of eigenvectors, which we label as v (or sometimes w), e.g., $Sv = \lambda_v v$ and $v \cdot w = \delta_{vw}$. Notice that

$$\sum_n M_{mn}e^{K_n/2}v_n = \sum_n e^{K_m/2}S_{mn}v_n = \lambda_v\, e^{K_m/2}v_m.$$

Hence we have found an eigenvector of M for every eigenvector of S. Conversely, the set of eigenvectors of S forms a basis, so there cannot be any more eigenvectors of M. This implies that all the eigenvectors of M are given by $P^v_m = e^{K_m/2}v_m$, with eigenvalue $\lambda_v$. In other words, M and S share the same eigenvalues.


(3) $\mu_a = \frac{1}{Z}e^{K_a}$ is an eigenvector with eigenvalue 1, and hence is the stationary state:

$$\sum_b M_{ab}\mu_b = \frac{1}{Z}\sum_b e^{(K_a+K_b)/2}S_{ab} = \frac{1}{Z}e^{K_a}\sum_b e^{K_b/2}S_{ba}e^{-K_a/2} = \mu_a\sum_b M_{ba} = \mu_a. \quad (B10)$$

The previous facts then let us finish the calculation:

$$\left\langle\frac{P(a,b)}{P(a)P(b)}\right\rangle = \sum_{ab}\left(e^{K_a}(S^\tau)^2_{ab}\,e^{-K_b}\right)\left(e^{K_b-K_a}\right) = \sum_{ab}(S^\tau)^2_{ab} = \|S^\tau\|^2. \quad (B11)$$

Now using the fact that $\|A\|^2 = \mathrm{tr}\left(A^TA\right)$ and is therefore invariant under an orthogonal change of basis, we find that

$$\left\langle\frac{P(a,b)}{P(a)P(b)}\right\rangle = \sum_i|\lambda_i|^{2\tau}. \quad (B12)$$

Since the $\lambda_i$'s are both the eigenvalues of M and S, and since M is irreducible and aperiodic, there is exactly one eigenvalue $\lambda_1 = 1$, and all other eigenvalues are less than one. Altogether,

$$I_R(t_1, t_2) = \left\langle\frac{P(a,b)}{P(a)P(b)}\right\rangle - 1 = \sum_{i\ge 2}|\lambda_i|^{2\tau}. \quad (B13)$$

Hence one can easily estimate the asymptotic behavior of the mutual information if one has knowledge of the spectrum of M. We see that the mutual information decays exponentially, with a decay time-scale given by the second largest eigenvalue $\lambda_2$:

$$\tau_{\rm decay}^{-1} = 2\log\frac{1}{\lambda_2}. \quad (B14)$$

6. Hidden Markov Model

In this subsection, we generalize our findings to hidden Markov models and present a proof of Theorem 2. Based on the considerations in the main body of the text, the joint probability distribution between two visible states $X_{t_1}, X_{t_2}$ is given by

$$P(a,b) = \sum_{cd} G_{bd}\left[(M^\tau)_{dc}\,\mu_c\right]G_{ac}, \quad (B15)$$

where the term in brackets would have been there in an ordinary Markov model and the two new factors of G are the result of the generalization. Note that as before, µ is the stationary state corresponding to M. We will only consider the typical case where M is aperiodic, irreducible, and non-degenerate; once we have this case, the other cases can be easily treated by mimicking our above proof for ordinary Markov processes. Using equation (7) and defining $g = G\mu$ gives

$$P(a,b) = \sum_{cd} G_{bd}\left[(M^\tau)_{dc}\,\mu_c\right]G_{ac} = g_a g_b + \lambda_2^\tau \sum_{cd} G_{bd}A_{dc}\mu_c G_{ac}. \quad (B16)$$

Plugging this into our definition of rational mutual information gives

$$I_R + 1 = \sum_{ab}\frac{P(a,b)^2}{g_a g_b} = \sum_{ab}\left(g_a g_b + 2\lambda_2^\tau\sum_{cd}G_{bd}A_{dc}\mu_c G_{ac}\right) + \lambda_2^{2\tau}C = 1 + 2\lambda_2^\tau\sum_{cd}A_{dc}\mu_c + \lambda_2^{2\tau}C = 1 + \lambda_2^{2\tau}C, \quad (B17)$$

where we have used the facts that $\sum_i G_{ij} = 1$, $\sum_i A_{ij} = 0$, and, as before, that C is asymptotically constant. This shows that $I_R \propto \lambda_2^{2\tau}$ decays exponentially.

Appendix C: Power laws for generative grammars

In this appendix, we prove that the rational mutual information decays like a power law for a sub-class of generative grammars. We proceed by mimicking the strategy employed in the above appendix. Let G be the linear operator associated with the matrix $P_{b|a}$, the probability that a node takes the value b given that the parent node has value a. We will assume that G is irreducible and aperiodic, with no degeneracies. From the above discussion, we see that removing the degeneracy assumption does not qualitatively change things; one simply replaces the procedure of diagonalizing G with putting G in Jordan normal form.

Let us start with the weakly correlated case. In this case,

$$P(a,b) = \sum_r \mu_r\left(G^{\Delta/2}\right)_{ar}\left(G^{\Delta/2}\right)_{br}, \quad (C1)$$

since, as we have discussed in the main text, the parent node has the stationary distribution µ and $G^{\Delta/2}$ gives the conditional probabilities for transitioning from the parent node to the nodes at the bottom of the tree that we are interested in. We now employ our favorite trick of diagonalizing G and then writing

$$(G^{\Delta/2})_{ij} = \mu_i + \lambda_2^{\Delta/2}A_{ij}, \quad (C2)$$


which gives

$$P(a,b) = \sum_r \mu_r\left(\mu_a + \lambda_2^{\Delta/2}A_{ar}\right)\left(\mu_b + \lambda_2^{\Delta/2}A_{br}\right) = \sum_r \mu_r\left(\mu_a\mu_b + \mu_a\varepsilon A_{br} + \mu_b\varepsilon A_{ar} + \varepsilon^2 A_{ar}A_{br}\right), \quad (C3)$$

where we have defined $\varepsilon = \lambda_2^{\Delta/2}$. Now note that $\sum_r A_{ar}\mu_r = 0$, since µ is an eigenvector of $G^{\Delta/2}$ with eigenvalue 1. Hence the above simplifies to just

$$P(a,b) = \mu_a\mu_b + \varepsilon^2\sum_r \mu_r A_{ar}A_{br}. \quad (C4)$$

From the definition of rational mutual information, employing the fact that $\sum_i A_{ij} = 0$ gives

$$I_R + 1 \approx \sum_{ab}\frac{\left(\mu_a\mu_b + \varepsilon^2\sum_r \mu_r A_{ar}A_{br}\right)^2}{\mu_a\mu_b} = \sum_{ab}\left[\mu_a\mu_b + \varepsilon^4 N_{ab}^2\right] = 1 + \varepsilon^4\|N\|^2, \quad (C5)$$

where $N_{ab} \equiv (\mu_a\mu_b)^{-1/2}\sum_r \mu_r A_{ar}A_{br}$ is a symmetric matrix and $\|\cdot\|$ denotes the Frobenius norm. Hence

$$I_R = \lambda_2^{2\Delta}\,\|N\|^2. \quad (C6)$$

Let us now generalize to the strongly correlated case. As discussed in the text, the joint probability is modified to

$$P(a,b) = \sum_{rs} Q_{rs}\left(G^{\Delta/2-1}\right)_{ar}\left(G^{\Delta/2-1}\right)_{bs}, \quad (C7)$$

where Q is some symmetric matrix which satisfies $\sum_r Q_{rs} = \mu_s$. We again employ our favorite trick of diagonalizing G and then writing

$$(G^{\Delta/2-1})_{ij} = \mu_i + \varepsilon A_{ij}, \quad (C8)$$

where $\varepsilon \equiv \lambda_2^{\Delta/2-1}$. This gives

$$\begin{aligned} P(a,b) &= \sum_{rs} Q_{rs}\left(\mu_a + \varepsilon A_{ar}\right)\left(\mu_b + \varepsilon A_{bs}\right) \\ &= \mu_a\mu_b + \sum_{rs} Q_{rs}\left(\mu_a\varepsilon A_{bs} + \mu_b\varepsilon A_{ar} + \varepsilon^2 A_{ar}A_{bs}\right) \\ &= \mu_a\mu_b + \sum_s \mu_a\varepsilon A_{bs}\mu_s + \sum_r \mu_b\varepsilon A_{ar}\mu_r + \varepsilon^2\sum_{rs}Q_{rs}A_{ar}A_{bs} \\ &= \mu_a\mu_b + \varepsilon^2\sum_{rs}Q_{rs}A_{ar}A_{bs}. \end{aligned} \quad (C9)$$

Now defining the symmetric matrices $R_{ab} \equiv \sum_{rs}Q_{rs}A_{ar}A_{bs} \equiv (\mu_a\mu_b)^{1/2}N_{ab}$, and noting that $\sum_a R_{ab} = 0$, we have

$$I_R + 1 = \sum_{ab}\frac{\left(\mu_a\mu_b + \varepsilon^2 R_{ab}\right)^2}{\mu_a\mu_b} = \sum_{ab}\left[\mu_a\mu_b + \varepsilon^4 N_{ab}^2\right] = 1 + \varepsilon^4\|N\|^2, \quad (C10)$$

which gives

$$I_R = \lambda_2^{2\Delta-4}\,\|N\|^2. \quad (C11)$$

In either the strongly or the weakly correlated case, note that N is asymptotically constant. Writing the second largest eigenvalue as $|\lambda_2|^2 = q^{-k_2/2}$, where q is the branching factor, we obtain

$$I_R \propto q^{-\Delta k_2/2} \sim q^{-k_2\log_q|i-j|} = C\,|i-j|^{-k_2}. \quad (C12)$$

Behold the glorious power law! We note that the normalization C must be a function of the form $C = m_2 f(\lambda_2, q)$, where $m_2$ is the multiplicity of the eigenvalue $\lambda_2$. We evaluate this normalization in the next section.
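The power law is easy to see in simulation. The sketch below (our own; it mirrors the setup of Figure 5, with $p(0|0) = p(1|1) = 0.9$ and $q = 2$) grows a binary sequence by repeated branching and prints the squared covariance, which is proportional to $I_R$ by equation (D4), against separation:

```python
# Branching-process simulation: correlations should fall off as a power law.
import numpy as np

rng = np.random.default_rng(1)
p_same, levels = 0.9, 20                    # q = 2 children per node, 2^20 leaves

def branch(seq):
    """One generation: each parent symbol emits two children with P(b|a)."""
    keep = rng.random((len(seq), 2)) < p_same
    return np.where(keep, seq[:, None], 1 - seq[:, None]).ravel()

seq = np.array([rng.integers(2)])
for _ in range(levels):
    seq = branch(seq)

x = seq - seq.mean()
for d in [1, 4, 16, 64, 256, 1024]:
    rho = np.mean(x[:-d] * x[d:])           # covariance at separation d, cf. eq. (D5)
    print(d, rho**2)                        # IR ∝ rho^2 ~ d^(-k2): straight on log-log
```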

As before, this result can be sharpened if we assume that G satisfies detailed balance, $G_{mn} = e^{K_m/2}S_{mn}e^{-K_n/2}$, where S is a symmetric matrix and the $K_n$ are just numbers. Let us only consider the weakly correlated case. By the spectral theorem, we diagonalize S into an orthonormal basis of eigenvectors v. As before, G and S share the same eigenvalues. Proceeding,

$$P(a,b) = \frac{1}{Z}\sum_v \lambda_v^\Delta\, v_a v_b\, e^{(K_a+K_b)/2}, \quad (C13)$$

where Z is a constant that ensures that P is properly normalized. Let us move full steam ahead and compute the rational mutual information:

$$\sum_{ab}\frac{P(a,b)^2}{P(a)P(b)} = \sum_{ab}e^{-(K_a+K_b)}\left(\sum_v \lambda_v^\Delta v_a v_b\, e^{(K_a+K_b)/2}\right)^2 = \sum_{ab}\left(\sum_v \lambda_v^\Delta v_a v_b\right)^2. \quad (C14)$$

This is just the squared Frobenius norm of the symmetric matrix $H_{ab} \equiv \sum_v \lambda_v^\Delta v_a v_b$! The eigenvalues of this matrix can be read off, so we have

$$I_R(a,b) = \sum_{i\ge 2}|\lambda_i|^{2\Delta}. \quad (C15)$$


Hence we have computed the rational mutual information exactly as a function of ∆. In the next section, we use this result to compute the mutual information as a function of separation |i − j|, which will lead to a precise evaluation of the normalization constant C in the equation

$$I(a,b) \approx C\,|i-j|^{-k_2}. \quad (C16)$$

1. Detailed evaluation of the normalization

For simplicity, we specialize to the case q = 2, although our results can surely be extended to q > 2. Define $\delta = \Delta/2$ and $d = |i-j|$. We wish to compute the expected value of $I_R$ conditioned on knowledge of d. By Bayes' rule, $p(\delta|d) \propto p(d|\delta)p(\delta)$. Now $p(d|\delta)$ is given by a triangle distribution with mean $2^{\delta-1}$ and compact support $(0, 2^\delta)$. On the other hand, $p(\delta) \propto 2^\delta$ for $\delta \le \delta_{\max}$, and $p(\delta) = 0$ for $\delta \le 0$ or $\delta > \delta_{\max}$. This new constant $\delta_{\max}$ serves two purposes. First, it can be thought of as a way to regulate the probability distribution p(δ) so that it is normalizable; at the end of the calculation we formally take $\delta_{\max} \to \infty$ without obstruction. Second, if we are interested in empirically sampling the mutual information, we cannot generate an infinite string, so setting $\delta_{\max}$ to a finite value accounts for the fact that our generated string may be finite.

We now assume $d \gg 1$ so that we can swap discrete sums with integrals. We can then compute the conditional expectation value of $2^{-k_2\delta}$. This yields

$$I_R \approx \int_0^\infty 2^{-k_2\delta}\,P(d|\delta)\,d\delta = \frac{\left(1-2^{-k_2}\right)d^{-k_2}}{k_2(k_2+1)\log 2}, \quad (C17)$$

or equivalently,

$$C_{q=2} = \frac{1-|\lambda_2|^4}{k_2(k_2+1)}\,\frac{1}{\log 2}. \quad (C18)$$

It turns out that it is also possible to compute the answer without making any approximations with integrals:

$$I_R = \frac{2^{-(k_2+1)\lceil\log_2 d\rceil}\left(\left(2^{k_2+1}-1\right)2^{\lceil\log_2 d\rceil} - 2d\left(2^{k_2}-1\right)\right)}{2^{k_2+1}-1}. \quad (C19)$$

The resulting predictions are compared in Figure 5.

Appendix D: Estimating (rational) mutual information from empirical data

Estimating mutual information or rational mutual information from empirical data is fraught with subtleties.

[Figure 5 appears here. Axes: distance between symbols d(X,Y) versus the residual of $I_R^{1/2}(X,Y)$ from a power law.]

FIG. 5: Decay of rational mutual information with separation for a binary sequence from a numerical simulation with probabilities p(0|0) = p(1|1) = 0.9 and a branching factor q = 2. The blue curve is not a fit to the simulated data but rather an analytic calculation. The smooth power law displayed on the left is what is predicted by our "continuum" approximation. The very small discrepancies (right) are not random but are fully accounted for by more involved exact calculations with discrete sums.

It is well known that the naive estimate of the Shannon entropy, $S = -\sum_{i=1}^K \frac{N_i}{N}\log\frac{N_i}{N}$, is biased, generally underestimating the true entropy from finite samples. We therefore use the estimator advocated by Grassberger [53]:

$$S = \log N - \frac{1}{N}\sum_{i=1}^K N_i\,\psi(N_i), \quad (D1)$$

where $\psi(x)$ is the digamma function, $N = \sum_i N_i$, and K is the number of characters in the alphabet. The mutual information can then be estimated as $I(X,Y) = S(X) + S(Y) - S(X,Y)$. The variance of this estimator is then the sum of the variances

$$\mathrm{var}(I) = \mathrm{varEnt}(X) + \mathrm{varEnt}(Y) + \mathrm{varEnt}(X,Y), \quad (D2)$$

where the varentropy is defined as

$$\mathrm{varEnt}(X) = \mathrm{var}\left(-\log p(X)\right), \quad (D3)$$

where we can again replace logarithms with the digamma function ψ. The uncertainty after N measurements is then $\approx\sqrt{\mathrm{var}(I)/N}$.

To compare our theoretical results with experiment inFig. 4, we must measure the rational mutual information

Page 16: arXiv:1606.06737v2 [cond-mat.dis-nn] 11 Jul 2016 · ab = P(X t+1 = ajX t = b). Such Markov matrices (also known as stochastic matri-ces) thus have the properties M ab 0 and P a M

16

for a binary sequence from (simulated) data. For a binary sequence with covariance coefficient $\rho(X,Y) = P(1,1) - P(1)^2$, the rational mutual information is

$$I_R(X,Y) = \left(\frac{\rho(X,Y)}{P(0)P(1)}\right)^2. \quad (D4)$$

This was essentially calculated in [54] by considering the limit where the covariance coefficient is small, $\rho \ll 1$. In their paper, there is an erroneous factor of 2. To estimate the covariance ρ(d) as a function of d (sometimes confusingly referred to as the correlation function), we use the unbiased estimator for a data sequence $\{x_1, x_2, \cdots, x_n\}$:

$$\rho(d) = \frac{1}{n-d-1}\sum_{i=1}^{n-d}\left(x_i - \bar{x}\right)\left(x_{i+d} - \bar{x}\right). \quad (D5)$$

However, it is important to note that estimating the covariance function ρ by averaging and then squaring will generically yield a biased estimate; we circumvent this by simply estimating $I_R(X,Y)^{1/2} \propto \rho(X,Y)$.

[1] P. Bak, Physical Review Letters 59, 381 (1987).
[2] P. Bak, C. Tang, and K. Wiesenfeld, Physical Review A 38, 364 (1988).
[3] K. Linkenkaer-Hansen, V. V. Nikouline, J. M. Palva, and R. J. Ilmoniemi, The Journal of Neuroscience 21, 1370 (2001), URL http://www.jneurosci.org/content/21/4/1370.abstract.
[4] D. J. Levitin, P. Chordia, and V. Menon, Proceedings of the National Academy of Sciences 109, 3716 (2012).
[5] M. Tegmark, ArXiv e-prints (2014), 1401.1219.
[6] G. K. Zipf, Human Behavior and the Principle of Least Effort (Addison-Wesley Press, 1949).
[7] H. W. Lin and A. Loeb, Physical Review E 93, 032306 (2016).
[8] L. Pietronero, E. Tosatti, V. Tosatti, and A. Vespignani, Physica A: Statistical Mechanics and its Applications 293, 297 (2001), ISSN 0378-4371, URL http://www.sciencedirect.com/science/article/pii/S0378437100006336.
[9] M. Kardar, Statistical Physics of Fields (Cambridge University Press, 2007).
[10] URL ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/.
[11] URL http://www.jsbach.net/midi/midi_solo_violin.html.
[12] URL http://prize.hutter1.net/.
[13] URL http://www.lexique.org/public/lisezmoi.corpatext.htm.
[14] A. M. Turing, Mind 59, 433 (1950).
[15] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al., AI Magazine 31, 59 (2010).
[16] M. Campbell, A. J. Hoane, and F.-h. Hsu, Artificial Intelligence 134, 57 (2002).
[17] V. Mnih, Nature 518, 529 (2015).
[18] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Nature 529, 484 (2016), URL http://dx.doi.org/10.1038/nature16961.
[19] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush (2015), 1508.06615, URL https://arxiv.org/abs/1508.06615.
[20] A. Graves, ArXiv e-prints (2013), 1308.0850.
[21] A. Graves, A.-r. Mohamed, and G. Hinton, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 2013), pp. 6645–6649.
[22] R. Collobert and J. Weston, in Proceedings of the 25th International Conference on Machine Learning (ACM, 2008), pp. 160–167.
[23] J. Schmidhuber, Neural Networks 61, 85 (2015).
[24] Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
[25] R. Meir and E. Domany, Phys. Rev. Lett. 59, 359 (1987), URL http://link.aps.org/doi/10.1103/PhysRevLett.59.359.
[26] K. Hornik, M. Stinchcombe, and H. White, Neural Networks 2, 359 (1989).
[27] A. M. Saxe, J. L. McClelland, and S. Ganguli, arXiv preprint arXiv:1312.6120 (2013).
[28] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, in Advances in Neural Information Processing Systems (2014), pp. 2924–2932.
[29] H. Mhaskar, Q. Liao, and T. Poggio, ArXiv e-prints (2016), 1603.00988.
[30] M. Bianchini and F. Scarselli, IEEE Transactions on Neural Networks and Learning Systems 25, 1553 (2014).
[31] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, ArXiv e-prints (2016), 1606.05340.
[32] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, ArXiv e-prints (2016), 1606.05336.
[33] P. Mehta and D. J. Schwab, ArXiv e-prints (2014), 1410.3831.
[34] S. Hochreiter and J. Schmidhuber, Neural Computation 9, 1735 (1997).
[35] C. E. Shannon, ACM SIGMOBILE Mobile Computing and Communications Review 5, 3 (1948).
[36] S. Kullback and R. A. Leibler, Ann. Math. Statist. 22, 79 (1951), URL http://dx.doi.org/10.1214/aoms/1177729694.
[37] T. M. Cover and J. A. Thomas, Elements of Information Theory (John Wiley & Sons, 2012).
[38] L. R. Rabiner, Proceedings of the IEEE 77, 257 (1989).
[39] R. C. Carrasco and J. Oncina, in International Colloquium on Grammatical Inference (Springer, 1994), pp. 139–152.
[40] N. Chomsky, Aspects of the Theory of Syntax, vol. 11 (MIT Press, 1965).
[41] S. Ginsburg, The Mathematical Theory of Context-Free Languages (McGraw-Hill Book Company, 1966).
[42] T. L. Booth, in Switching and Automata Theory, 1969, IEEE Conference Record of 10th Annual Symposium on (IEEE, 1969), pp. 74–81.
[43] T. Huang and K. Fu, Information Sciences 3, 201 (1971), ISSN 0020-0255, URL http://www.sciencedirect.com/science/article/pii/S0020025571800075.
[44] K. Lari and S. J. Young, Computer Speech & Language 4, 35 (1990).
[45] D. Harlow, S. H. Shenker, D. Stanford, and L. Susskind, Physical Review D 85, 063516 (2012).
[46] L. Van Hove, Physica 16, 137 (1950).
[47] J. A. Cuesta and A. Sanchez, Journal of Statistical Physics 115, 869 (2004), cond-mat/0306354.
[48] G. Evenbly and G. Vidal, Journal of Statistical Physics 145, 891 (2011).
[49] N. Chomsky, Information and Control 2, 137 (1959).
[50] M. Mahoney, Large text compression benchmark.
[51] A. Karpathy, J. Johnson, and L. Fei-Fei, ArXiv e-prints (2015), 1506.02078.
[52] C. W. Gardiner et al., Handbook of Stochastic Methods, vol. 3 (Springer Berlin, 1985).
[53] P. Grassberger, ArXiv Physics e-prints (2003), physics/0307138.
[54] W. Li, Journal of Statistical Physics 60, 823 (1990), ISSN 1572-9613, URL http://dx.doi.org/10.1007/BF01025996.