Hidden Gibbs Models: Theory and Applications
– DRAFT –
Evgeny Verbitskiy
Mathematical Institute, Leiden University, The Netherlands
Johann Bernoulli Institute, Groningen University, The Netherlands
November 21, 2015
Contents
1 Introduction
  1.1 Hidden Models / Hidden Processes
    1.1.1 Equivalence of “random” and “deterministic” settings
    1.1.2 Deterministic chaos
  1.2 Hidden Gibbs Models
    1.2.1 How realistic is the Gibbs assumption?

I Theory

2 Gibbs states
  2.1 Gibbs-Boltzmann Ansatz
  2.2 Gibbs states for Lattice Systems
    2.2.1 Translation Invariant Gibbs states
    2.2.2 Regularity of Gibbs states
  2.3 Gibbs and other regular stochastic processes
    2.3.1 Markov Chains
    2.3.2 k-step Markov chains
    2.3.3 Variable Length Markov Chains
    2.3.4 Countable Mixtures of Markov Chains (CMMC’s)
    2.3.5 g-measures (chains with complete connections)
  2.4 Gibbs measures in Dynamical Systems
    2.4.1 Equilibrium states

3 Functions of Markov processes
  3.1 Markov Chains (recap)
    3.1.1 Regular measures on Subshifts of Finite Type
  3.2 Functions of Markov Processes
    3.2.1 Functions of Markov Chains in Dynamical Systems
  3.3 Hidden Markov Models
  3.4 Functions of Markov Chains from the Thermodynamic point of view

4 Hidden Gibbs Processes
  4.1 Functions of Gibbs Processes
  4.2 Yayama
  4.3 Review: renormalization of g-measures
  4.4 Cluster Expansions
  4.5 Cluster expansion

5 Hidden Gibbs Fields
  5.1 Renormalization Transformations in Statistical Mechanics
  5.2 Examples of pathologies under renormalization
  5.3 General Properties of Renormalised Gibbs Fields
  5.4 Dobrushin’s reconstruction program
    5.4.1 Generalized Gibbs states / “Classification” of singularities
    5.4.2 Variational Principles
  5.5 Criteria for Preservation of Gibbs property under renormalization

II Applications

6 Hidden Markov Models
  6.1 Three basic problems for HMM’s
    6.1.1 The Evaluation Problem
    6.1.2 The Decoding or State Estimation Problem
  6.2 Gibbs property of Hidden Markov Chains

7 Denoising

A Prerequisites
  A.1 Notation
  A.2 Measure theory
  A.3 Stochastic processes
  A.4 Ergodic theory
  A.5 Entropy
    A.5.1 Shannon’s entropy rate per symbol
    A.5.2 Kolmogorov-Sinai entropy of measure-preserving systems
Chapter 1
Introduction
Last modified on April 3, 2015
Very often the true dynamics of a physical system is hidden from us: we are able to observe only a certain measurement of the state of the underlying system. For example, air temperature and the speed of wind are easily observable functions of the climate dynamics, while an electroencephalogram provides insight into the functioning of the brain.
Figure: Partial Observability, Noise, Coarse Graining.
Observing only partial information naturally limits our ability to describe or model the underlying system, detect changes in dynamics, make predictions, etc. Nevertheless, these problems must be addressed in practical situations, and are extremely challenging from the mathematical point of view.
Mathematics has proved to be extremely useful in dealing with imperfect data. There are a number of methods developed within various mathematical disciplines such as
• Dynamical Systems
• Control Theory
• Decision Theory
• Information Theory
• Machine Learning
• Statistics
to address particular instances of this general problem – dealing with imperfect knowledge.
Remarkably, many problems have a common nature. For example, correcting signals corrupted by noisy channels during transmission, analyzing neuronal spikes, analyzing genomic sequences, and validating the so-called renormalization group methods of theoretical physics, all have the same underlying mathematical structure: a hidden Gibbs model. In this model, the observable process is a function (either deterministic or random) of a process whose underlying probabilistic law is Gibbs. Gibbs models have been introduced in Statistical Mechanics, and have been highly successful in modeling physical systems (ferromagnets, dilute gases, polymer chains). Nowadays Gibbs models are ubiquitous and are used by researchers from different fields, often implicitly, i.e., without knowledge that a given model belongs to a much wider class.
A well-established Gibbs theory provides an excellent starting point to develop Hidden Gibbs theory. A subclass of Hidden Gibbs models is formed by the well-known Hidden Markov Models. Several examples have been considered in Statistical Mechanics and the theory of Dynamical Systems. However, many basic questions are still unanswered. Moreover, the relevance of Hidden Gibbs models to other areas, such as Information Theory, Bioinformatics, and Neuronal Dynamics, has never been made explicit and exploited.
Let us now formalise the problem, first by describing the mathematical paradigm used to model partial observability, effects of noise, and coarse graining, and then by introducing the class of Hidden Gibbs Models.
1.1 Hidden Models / Hidden Processes
Let us start with the following basic model, which we will specify further in the subsequent chapters.
The observable process Yt is a function of the hidden time-dependent process Xt.
We allow the process Xt to be either
• a stationary random (stochastic) process,
• a deterministic dynamical process
Xt+1 = f (Xt).
For simplicity, we will only consider processes with discrete time, i.e., t ∈ Z or Z_+. However, much of what we discuss applies to continuous-time processes as well.
The process Yt is a function of Xt. The function can be
• deterministic Yt = f (Xt) for all t,
• or random: Yt ∼ P_{Xt}(·), i.e., Yt is chosen according to some probability distribution which depends on Xt.
Remark 1.1. If not stated otherwise, we will implicitly assume that in case Yt is a random function of the underlying hidden process Xt, then Yt is chosen independently for every t. For example,
Yt = Xt + Zt,
where the “noise” Zt is a sequence of independent identically distributed random variables. In many practical situations discussed below one can easily allow for “dependence”, e.g., Zt being a Markov process. Moreover, as we will see later, the case of a random function can be reduced to the deterministic case; hence, any stationary process Zt can be used to model noise. However, despite the fact that the models are equivalent (see next subsection), it is often convenient to keep implicit dependence on the noise parameters.
1.1.1 Equivalence of “random” and “deterministic” settings.
There is a one-to-one correspondence between stationary processes and measure-preserving dynamical systems.
Proposition 1.2. Let (Ω, A, µ, T) be a measure-preserving dynamical system. Then for any measurable ϕ : Ω → R,
Y_t(ω) = ϕ(T^t ω),  ω ∈ Ω,  t ∈ Z_+ or Z,   (1.1.1)
is a stationary process. In the opposite direction, any real-valued stationary process Yt gives rise to a measure-preserving dynamical system (Ω, A, µ, T) and a measurable function ϕ : Ω → R such that (1.1.1) holds.
- Exercise 1.3. Prove Proposition 1.2.
1.1.2 Deterministic chaos
Chaotic systems are typically defined as systems where trajectories depend sensitively on the initial conditions, that is, where small differences in initial conditions result, over time, in substantial differences in the states: small causes can produce large effects. Simple observables (functions) Yt of trajectories of chaotic dynamical systems Xt, Xt+1 = f(Xt), can be indistinguishable from “random” processes.
Example 1.4. Consider a piece-wise expanding map f : [0, 1]→ [0, 1],
f (x) = 2x mod 1,
and the following observable φ:

φ(x) = 0 for x ∈ [0, 1/2] =: I0,
φ(x) = 1 for x ∈ (1/2, 1] =: I1.
Since the map f is expanding, i.e., for x, x′ close,

d(f(x), f(x′)) = 2 d(x, x′),

the dynamical system is chaotic. Furthermore, one can easily show that the Lebesgue measure on [0, 1] is f-invariant, and the process Y_n = φ(f^n(x)) is actually a Bernoulli process.
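This example is easy to explore numerically; the following sketch (plain Python, with helper names of our own choosing) iterates the doubling map and records the binary itinerary Y_t = φ(X_t):

```python
import random

def f(x):
    """The doubling map f(x) = 2x mod 1."""
    return (2.0 * x) % 1.0

def phi(x):
    """The observable: 0 on I0 = [0, 1/2], 1 on I1 = (1/2, 1]."""
    return 0 if x <= 0.5 else 1

def itinerary(x0, n):
    """Record Y_t = phi(X_t) along the deterministic orbit X_{t+1} = f(X_t)."""
    ys, x = [], x0
    for _ in range(n):
        ys.append(phi(x))
        x = f(x)
    return ys

random.seed(1)
# 40 iterations stay within double-precision resolution of x0;
# longer orbits of this map collapse to 0 in floating point.
y = itinerary(random.random(), 40)
```

The resulting 0/1 sequence reproduces the binary expansion of the initial point, which is why it is statistically indistinguishable from a fair coin even though the dynamics is fully deterministic.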
Theorem 1.5 (Sinai’s Theorem on Bernoulli factors).
At the same time, often a time series looks “completely random”, while it has a very simple underlying dynamical structure. In the figure below, Gaussian white noise is compared visually with the so-called deterministic Gaussian white noise. The reconstruction techniques (cf. Chapter ??) easily allow one to distinguish the time series, as well as to identify the underlying dynamics.
Figure 1.1: (a) Time series of the Gaussian and deterministic Gaussian white noise. (b) Corresponding probability densities of single observations. (c) Reconstruction plots X_{n+1} vs X_n.
Many vivid examples of close resemblance between “random” and “deterministic” processes can be found in the literature. For example, the following figure, taken from Daniel T. Kaplan and Leon Glass, Physica D 64, 431-453 (1993), demonstrates visual similarity between several natural and surrogate models.
1.2 Hidden Gibbs Models
The key assumption of the model is that the observable process is a function, either deterministic or random, of the underlying process, whose governing probability distribution belongs to the class of Gibbs distributions.
(i) The most famous example covered by the proposed dynamic probabilistic models are the so-called Hidden Markov Models (HMM). Suppose Xn is a Markov process, assuming values in a finite state space A. In HMM’s, the observable output process Yn, with values in B, depends only on Xn, and the value Yn ∈ B is chosen according to the so-called emission distributions:
Πij = P(Yn = j |Xn = i), i ∈ A, j ∈ B, (1.2.1)
for each n independently. Hidden Markov Models can be found in a broad range of applications: from speech and handwriting recognition to bioinformatics and signal processing.
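The sampling mechanism of an HMM follows (1.2.1) directly: step the hidden chain, then draw the emission from the row of the current state. A minimal sketch (the two-state transition and emission matrices below are illustrative, not taken from the text):

```python
import random

# Hypothetical parameters: hidden states A = {0, 1}, outputs B = {0, 1}.
p = [[0.9, 0.1],       # p[i][j] = P(X_{n+1} = j | X_n = i)
     [0.2, 0.8]]
emit = [[0.95, 0.05],  # Pi[i][j] = P(Y_n = j | X_n = i), cf. (1.2.1)
        [0.10, 0.90]]

def draw(dist, rng):
    """Sample an index from a discrete distribution."""
    u, acc = rng.random(), 0.0
    for i, w in enumerate(dist):
        acc += w
        if u < acc:
            return i
    return len(dist) - 1

def sample_hmm(n, rng, x0=0):
    """Run the hidden chain X_n and emit the observable chain Y_n."""
    xs, ys, x = [], [], x0
    for _ in range(n):
        ys.append(draw(emit[x], rng))  # Y_n depends only on the current X_n
        xs.append(x)
        x = draw(p[x], rng)            # step the hidden Markov chain
    return xs, ys

xs, ys = sample_hmm(1000, random.Random(0))
```

Only `ys` is available to the observer; recovering information about `xs` from it is exactly the hidden-model problem of this chapter.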
(ii) A Binary Symmetric Channel (BSC) is a widely used model of a communication channel in coding theory and information theory. In this model, a transmitter sends a bit (a zero or a one), and the receiver receives a bit. Typically, the bit is transmitted correctly, but with a small probability ε the bit will be inverted, or flipped: in the standard transition diagram, 0 → 0 and 1 → 1 have probability 1 − ε, while 0 → 1 and 1 → 0 have probability ε. We can represent the action of the channel as
Yn = Xn ⊕ Zn,
where ⊕ is the exclusive OR operation, and Zn is the state of the channel: 0 – transmit “as is”, 1 – flip. The channel can be memoryless: in this case, Zn is a sequence of independent Bernoulli random variables with P(Zn = 1) = ε; or the channel can have memory, e.g., if Zn is Markov (the Gilbert-Elliott channel, suitable to model burst errors). If Xn is a binary Markov process, then this model is a particular example of a Hidden Markov model.
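A simulation of the memoryless BSC is a one-liner per bit (ε = 0.05 below is an illustrative value):

```python
import random

def bsc(bits, eps, rng):
    """Y_n = X_n xor Z_n with Z_n i.i.d. Bernoulli(eps): flip each
    input bit independently with probability eps."""
    return [x ^ (1 if rng.random() < eps else 0) for x in bits]

rng = random.Random(42)
x = [rng.randint(0, 1) for _ in range(10_000)]  # transmitted bits X_n
y = bsc(x, eps=0.05, rng=rng)                   # received bits Y_n

# The empirical crossover rate should be close to eps.
err = sum(a != b for a, b in zip(x, y)) / len(x)
```

Replacing the i.i.d. draw of Z_n by a two-state Markov chain turns this into a Gilbert-Elliott channel with bursty errors, with no other change to the code.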
(iii) Neuronal spikes are sharp pronounced deviations from the baseline in recordings of neuronal activity. Often, the electrical neuronal activity is transformed into 0/1 processes, indicating the presence or absence of spikes. These temporal binary processes – spike trains – are clearly functions of the underlying neuronal dynamics. In the field of Neural Coding/Decoding, the relationship between the stimulus and the resulting spike responses is investigated. Another, particularly interesting, type of spike process originates from the so-called interictal (between the seizures) spikes. The presence of interictal spikes is used diagnostically as a sign of epilepsy.
Figure 1.2: (a) The spike trains emitted by 30 neurons in the visual area of a monkey brain are shown for a period of 4 seconds. The spike train emitted by each of the neurons is recorded in a separate row (source: W. Maass, http://www.igi.tugraz.at/maass/). (b) EEG recordings of a patient with Rolandic epilepsy provided by the Epilepsy Centre Kempenhaeghe. Epileptic interictal spikes are clearly visible in the last 3 epochs.
(iv) The decimation is a classical example of a renormalization transformation in Statistical Mechanics. Consider a d-dimensional lattice system Ω = A^{Z^d}, with single-spin space A. The image of a spin configuration x ∈ Ω under the decimation transformation with spacing b is again a spin configuration y ∈ Ω, y = T_b(x), given by

y_n = x_{bn} for all n ∈ Z^d.

Under the decimation transformation, only every b^d-th spin survives. The figure on the right depicts the action of the decimation transformation T_2 on the two-dimensional integer lattice Z^2.
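On a finite window of a two-dimensional configuration, decimation is plain subsampling of every b-th site in each direction. A minimal sketch (the checkerboard configuration is our own illustrative choice):

```python
def decimate(config, b):
    """y[n] = x[b*n] on a finite 2-d window: keep only sites whose
    coordinates are both divisible by b, so every b**2-th spin survives."""
    return [row[::b] for row in config[::b]]

# An 8x8 window of a {-1, +1} spin configuration: a checkerboard.
x = [[(-1) ** (i + j) for j in range(8)] for i in range(8)]

# Spacing b = 2 keeps the even sublattice; on a checkerboard all
# surviving spins agree, illustrating how decimation can drastically
# change the apparent statistics of the configuration.
y = decimate(x, 2)
```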
1.2.1 How realistic is the Gibbs assumption?
ADVANTAGES OF GIBBS ASSUMPTION
Part I
Theory
Chapter 2
Gibbs states
Last modified on September 23, 2015
Existing literature on the rigorous mathematical theory of Gibbs states is quite extensive. The classical references are the books of Ruelle [18] and Georgii [7, 8]. In the context of dynamical systems, the book of Bowen [11, 12] provides a nice introduction to the subject.
2.1 Gibbs-Boltzmann Ansatz
The Gibbs distribution prescribes the probability of the event that the system with finite state space S is in state x ∈ S to be equal to

P(x) = (1/Z) exp(−H(x)/kT),   (2.1.1)

where H(x) is the energy of the state x, k is the Boltzmann constant, T is the temperature of the system, and

Z = ∑_{y∈S} exp(−H(y)/kT)

is a normalising factor, known as the partition function.
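For a finite state space, (2.1.1) can be evaluated directly. A minimal sketch, with a toy list of energy levels of our own choosing and β = 1/kT as a single parameter:

```python
import math

def gibbs(energies, beta):
    """The Gibbs-Boltzmann distribution P(x) = exp(-beta * H(x)) / Z
    over a finite list of state energies, beta = 1/kT."""
    weights = [math.exp(-beta * h) for h in energies]
    Z = sum(weights)               # the partition function
    return [w / Z for w in weights]

H = [0.0, 1.0, 1.0, 2.0]           # toy energy levels (illustrative)
P = gibbs(H, beta=1.0)
```

By construction the probabilities sum to 1, states of equal energy are equally likely, and lower-energy states receive exponentially larger weight.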
Exercise. Show that the probability measure P on S given by (2.1.1) is precisely the unique probability measure on S which maximizes the following functional, defined on probability measures on S by

P ↦ −∑_{x∈S} P(x) log P(x) − β ∑_{x∈S} H(x) P(x) = Entropy − β · Energy,

where β = 1/kT is the inverse temperature.
2.2 Gibbs states for Lattice Systems
Equation (2.1.1) – known as the Gibbs-Boltzmann Ansatz – makes perfect sense for systems where the number of possible states is finite. Typically, for systems with infinitely many states, each state occurs with probability zero, and a new interpretation of (2.1.1) is required. This problem was successfully solved in the late 1960s by R.L. Dobrushin [2], O. Lanford and D. Ruelle [3]. The key is to identify Gibbs states as those probability measures on the infinite state space which have very specific finite-volume conditional probabilities, parametrized by boundary conditions, and given by expressions similar to (2.1.1).
The definition of Gibbs measures is applicable to rather general lattice systems Ω = A^L, where A is a finite or a compact spin space, and L is a lattice. For simplicity, let us assume that A is finite and L is the integer lattice Z^d, d ≥ 1. Elements ω of Ω are functions on Z^d with values in A; equivalently, at each site n in Z^d “sits” a spin ω_n with a value in A. If ω ∈ Ω and Λ is a subset of Z^d, ω_Λ will denote the restriction of ω to Λ, i.e., ω_Λ ∈ A^Λ. Fix a finite volume Λ in Z^d and some configuration ω_{Λ^c} outside of Λ; we will refer to ω_{Λ^c} as the boundary conditions. Gibbs measures are defined by prescribing the conditional probability of observing the configuration σ_Λ in the finite volume Λ, given the boundary conditions ω_{Λ^c}:
P(σ_Λ | ω_{Λ^c}) = (1/Z(ω_{Λ^c})) e^{−βH_Λ(σ_Λ ω_{Λ^c})} =: γ_Λ(σ_Λ | ω_{Λ^c}),   (2.2.1)

where β = 1/kT is the inverse temperature, H_Λ is the corresponding energy, and the normalising factor – the partition function Z – now depends on the boundary conditions ω_{Λ^c}. The analogy with (2.1.1) is clear.
Naturally, the family of conditional probabilities Γ = {γ_Λ(·|σ_{Λ^c})}, called a specification, indexed by finite sets Λ and boundary conditions σ_{Λ^c}, cannot be arbitrary, and must satisfy some consistency conditions.
In order to ensure the consistency of the family of probability kernels γ_Λ, the energies H_Λ must be of a very special form, namely:

H_Λ(σ_Λ ω_{Λ^c}) = ∑_{V∩Λ ≠ ∅} Ψ(V, σ_Λ ω_{Λ^c}),   (2.2.2)

where the sum is taken over all finite subsets V of Z^d, and the collection of functions

Ψ = {Ψ(V, ·) : Ω → R : V ⊂ Z^d, |V| < ∞},
called the interaction, has the following properties:

• (locality): for all η ∈ Ω, Ψ(V, η) = Ψ(V, η_V), meaning that the contribution of volume V to the total energy depends only on the spins in V.

• (uniform absolute convergence):

|||Ψ||| = sup_{n∈Z^d} ∑_{V∋n} sup_{η∈Ω} |Ψ(V, η_V)| < +∞.
Definition. A probability measure P on Ω = A^{Z^d} is called a Gibbs state for the interaction Ψ = {Ψ(Λ, ·) | Λ ⋐ Z^d} (denoted by P ∈ G(Ψ)) if (2.2.1) holds P-a.s.; equivalently, if the so-called Dobrushin-Lanford-Ruelle equations hold for all f ∈ L^1(Ω, P) (f ∈ C(Ω, R)) and all Λ ⋐ Z^d:

∫_Ω f(ω) P(dω) = ∫_Ω (γ_Λ f)(ω) P(dω),

where

(γ_Λ f)(ω) = ∑_{σ_Λ ∈ A^Λ} γ_Λ(σ_Λ | ω_{Λ^c}) f(σ_Λ ω_{Λ^c}).

This is equivalent to saying that

E_P(f | F_{Λ^c}) = γ_Λ f,

where F_{Λ^c} is the σ-algebra generated by the spins in Λ^c.
Theorem 2.1. If Ψ is a uniformly absolutely convergent interaction, then the set of Gibbs measures for Ψ is not empty.
Thus the Dobrushin-Lanford-Ruelle (DLR) theory guarantees the existence of at least one Gibbs state for any interaction Ψ as above, i.e., a Gibbs state consistent with the corresponding specification. It is possible that multiple Gibbs states are consistent with the same specification; in this case, one speaks of a phase transition.
2.2.1 Translation Invariant Gibbs states
The lattice (group) Z^d acts on A^{Z^d} by translations:

Z^d ∋ k ↦ S^k,
where S^k(ω) = ω′ with

ω′_n = ω_{n+k} for all n ∈ Z^d,

i.e., the configuration ω is shifted by k. Recall that the measure P on Ω = A^{Z^d} is called translation invariant if

P(S^{−k} A) = P(A)

for every Borel set A.
When are Gibbs states translation invariant?
What are sufficient conditions for the existence of translation invariant Gibbs states?
Definition 2.2. An interaction Φ = {φ(Λ, ·) | Λ ⋐ Z^d} is called translation invariant if

φ_{Λ+k}(ω) = φ_Λ(S^k ω)   (φ_{Λ+k} = φ_Λ ∘ S^k)

for all Λ ⋐ Z^d, k ∈ Z^d, and every ω ∈ Ω = A^{Z^d}.
Example 2.3 (Ising Model of interacting spins). The alphabet A = {−1, 1} represents the two possible spin values “up” and “down”, and the interaction is defined as follows:

φ_Λ(ω) = −h ω_i, if Λ = {i},
φ_Λ(ω) = −J ω_i ω_j, if Λ = {i, j} and i ∼ j,
φ_Λ(ω) = 0, otherwise.

The notation i ∼ j indicates that sites i and j are nearest neighbours, i.e., ||i − j||_1 = 1. The parameter h is the strength of the external magnetic field, and the coupling constant J determines whether the interaction is ferromagnetic (J > 0) or anti-ferromagnetic (J < 0). In the ferromagnetic case, spins “prefer” to be aligned, i.e., aligned configurations have higher probabilities. The spins also prefer to line up with the external field.
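For the Ising interaction, the single-site kernels of (2.2.1) can be written in closed form: the energy terms containing site i sum to −ω_i(h + J ∑_{j∼i} ω_j), so the conditional probability of spin +1 given its neighbours is a logistic function of the local field. A sketch with illustrative parameter values:

```python
import math

def ising_site_kernel(neighbours, J=1.0, h=0.0, beta=0.5):
    """gamma_{i}(sigma_i = +1 | omega): probability that the spin at site i
    is +1, given its nearest-neighbour spins.  The energy terms containing
    site i are -sigma_i * (h + J * sum(neighbours)), which yields a
    logistic (two-state Gibbs) formula."""
    local_field = h + J * sum(neighbours)
    return 1.0 / (1.0 + math.exp(-2.0 * beta * local_field))

p_up = ising_site_kernel([+1, +1, +1, +1])   # fully aligned neighbourhood
p_dn = ising_site_kernel([-1, -1, -1, -1])
# Ferromagnetic J > 0: aligned configurations get higher probability,
# so p_up > 1/2 > p_dn, and by symmetry (h = 0) p_up + p_dn = 1.
```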
If φ is translation invariant, then so is the specification:

γ_Λ(S^k σ | S^k ω) = γ_{Λ+k}(σ | ω).
Theorem 2.4. If γ = {γ_Λ(·|·)} is a translation invariant specification, then the set of translation invariant Gibbs measures for γ is not empty.
2.2.2 Regularity of Gibbs states
If the interaction Ψ is as above, then the corresponding specification Γ = {γ_Λ(·|·)} has the following important regularity properties:
(i) Γ is uniformly non-null, or has finite energy: the probability kernels γ_Λ(·|·) are uniformly bounded away from 0 and 1, i.e., for every finite set Λ, there exist positive constants α_Λ, β_Λ such that

0 < α_Λ ≤ γ_Λ(σ_Λ | ω_{Λ^c}) ≤ β_Λ < 1

for all σ_Λ ω_{Λ^c} ∈ Ω.
(ii) Γ is quasi-local: for any Λ, the kernels are continuous in the product topology, i.e., as V ↗ Z^d,

sup_{ω,η,ζ,χ} |γ_Λ(ω_Λ | η_{V\Λ} ζ_{V^c}) − γ_Λ(ω_Λ | η_{V\Λ} χ_{V^c})| → 0;

in other words, changing remote spins (spins in V^c) has a diminishing effect on the distribution of the spins in Λ.
Consider now the inverse problem: suppose P is a probability measure on Ω. Can we decide whether it is Gibbs, equivalently, whether it is consistent with a Gibbs specification Γ = Γ^Ψ for some Ψ?
Naturally, for a measure to be Gibbs, it must, at least, be consistent with some uniformly non-null quasi-local specification Γ = {γ_Λ(·|·)}_Λ:

P(σ_Λ | ω_{Λ^c}) = γ_Λ(σ_Λ | ω_{Λ^c})   (2.2.3)

for P-a.a. ω. The necessary and sufficient conditions for (2.2.3) to hold are given by the following theorem.
Theorem 2.5. A measure P on Ω = A^{Z^d} with full support is consistent with some uniformly non-null quasi-local specification if and only if for every Λ ⋐ Z^d and all σ_Λ, the conditional probabilities

P(σ_Λ | ω_{V\Λ})

converge uniformly in ω as V ↗ Z^d.
Suppose now that the probability measure P has passed the test of Theorem 2.5, and is thus consistent with some uniformly non-null quasi-local specification Γ = {γ_Λ(·|·)}_Λ. Is it possible to find an interaction Ψ such that Γ = Γ^Ψ, i.e.,

γ_Λ(σ_Λ | ω_{Λ^c}) = (1/Z(ω_{Λ^c})) e^{−H^Ψ_Λ(σ_Λ ω_{Λ^c})} =: γ^Ψ_Λ(σ_Λ | ω_{Λ^c}),

with

H^Ψ_Λ(σ_Λ ω_{Λ^c}) = ∑_{V∩Λ ≠ ∅} Ψ(V, σ_Λ ω_{Λ^c})?

Depending on the answer, the measure P will or will not be Gibbs.
It turns out that the answer, given by Kozlov [5] and Sullivan [6], is always positive.
Theorem 2.6. If Γ = {γ_Λ(·|·)} is a uniformly non-null quasi-local specification, then there exists a uniformly absolutely convergent interaction Ψ such that

γ_Λ(σ_Λ | ω_{Λ^c}) = γ^Ψ_Λ(σ_Λ | ω_{Λ^c}).

Equivalently, if P is consistent with some uniformly non-null quasi-local specification, then there exists a uniformly absolutely convergent interaction Ψ such that P is a Gibbs state for Ψ.
However, if the uniformly non-null quasi-local specification is translation invariant, the corresponding interaction is not necessarily translation invariant. There is a procedure to derive a translation invariant interaction consistent with a given translation invariant specification, but the resulting interaction is not necessarily uniformly absolutely convergent; it does, however, satisfy a weaker summability condition. In fact, the following is an open problem in the theory of Gibbs states:
Open problem: Does every translation invariant uniformly non-null quasi-local specification allow a Gibbs representation for some translation invariant uniformly absolutely convergent interaction?
The theory of Gibbs states is extremely rich [7, 18]. In particular, there is a dual possibility to characterize Gibbs states locally (via specifications) or globally (via variational principles). A variety of probabilistic techniques is available for Gibbs distributions: limit theorems, coupling techniques, cluster expansions, large deviations.
2.3 Gibbs and other regular stochastic processes
Gibbs probability distributions have been highly useful in describing physical systems (ferromagnets, dilute gases, polymers). However, the intrinsically multi-dimensional theory of Gibbs random fields is of course applicable to one-dimensional random processes as well. To define a Gibbs random process {X_n} we have to prescribe the two-sided conditional distributions

P(X_0 = a_0 | X_{−∞}^{−1} = a_{−∞}^{−1}, X_1^{+∞} = a_1^{+∞}),  a_i ∈ A,   (2.3.1)
i.e., the conditional distribution of the present given the complete past and the complete future. Defining processes via the one-sided conditional probabilities

P(X_0 = a_0 | X_{−∞}^{−1} = a_{−∞}^{−1}),

using only past values, is probably more popular.
In fact, the full Gibbsian description has rarely been used outside the realm of Statistical Physics. Nevertheless, two-sided descriptions have definite advantages in some applications, as we will demonstrate in subsequent chapters.
On the other hand, the idea of using, in some sense, regular conditional probabilities to define a random process is very natural and has been employed on multiple occasions in various fields. Let us now review some of the most popular regular one-sided models.
As we have seen in the preceding sections, Gibbs measures are characterized by regular (quasi-local and non-null) specifications: two-sided conditional probabilities.
At the same time, many probabilistic models have been introduced in the past 100 years which require some form of regularity of one-sided conditional probabilities. In many cases, it turns out that one-sided regularity also implies two-sided regularity, and hence these processes are also Gibbs.1
Let us review some of these models.
2.3.1 Markov Chains
Markov chains were introduced by Markov in the early 1900s.
Definition 2.7. The process {X_n}_{n≥0} with values in A, |A| < ∞, is Markov if it has the so-called Markov property:

P(X_n = a_n | X_{n−1} = a_{n−1}, X_{n−2} = a_{n−2}, . . .) = P(X_n = a_n | X_{n−1} = a_{n−1})   (2.3.2)

for all n ≥ 1 and all a_n, a_{n−1}, . . . ∈ A, i.e., only the immediate past has influence on the present.
Clearly, (2.3.2) is a very strong form of regularity of conditional probabilities.
1From a practical point of view, studying functions of “one-sided regular” processes is also important. In fact, the basic question is: what happens to functions of regular processes?
We will mainly be interested in stationary Markov processes, or, equivalently, in time-homogeneous chains:

P(X_n = a_n | X_{n−1} = a_{n−1}) = p_{a_{n−1}, a_n}

depends only on the symbols a_{n−1}, a_n, and not on n. Time-homogeneous Markov chains are parametrized by transition probability matrices: non-negative stochastic matrices

P = (p_{a,a′})_{a,a′∈A},  p_{a,a′} ≥ 0,  ∑_{a′} p_{a,a′} = 1 for every a.

Hence

P(X_0 = a_0, X_1 = a_1, . . . , X_n = a_n) = P(X_0 = a_0) · ∏_{i=0}^{n−1} P(X_{i+1} = a_{i+1} | X_i = a_i)

(telescoping).
The law P of the Markov process (chain) {X_n} is shift-invariant if the initial distribution

π = (P(X_0 = a_1), . . . , P(X_0 = a_m)),  A = {a_1, . . . , a_m},

satisfies

πP = π.
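The stationary initial distribution can be approximated numerically by iterating the row vector π ↦ πP; a sketch for a hypothetical two-state chain:

```python
def stationary(P, iters=200):
    """Approximate the row vector pi with pi P = pi by power iteration,
    starting from the uniform distribution."""
    m = len(P)
    pi = [1.0 / m] * m
    for _ in range(iters):
        pi = [sum(pi[a] * P[a][b] for a in range(m)) for b in range(m)]
    return pi

P = [[0.9, 0.1],
     [0.3, 0.7]]          # a strictly positive stochastic matrix (illustrative)
pi = stationary(P)
# By hand: 0.1 * pi_0 = 0.3 * pi_1 and pi_0 + pi_1 = 1, so pi = (3/4, 1/4).
```

For a strictly positive P the iteration converges geometrically, at a rate given by the second-largest eigenvalue modulus of P.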
Remark 2.8. The law P of a stationary (shift-invariant) Markov chain gives us a measure on Ω = A^{Z_+} = {ω = (ω_0, ω_1, . . .) : ω_i ∈ A}. Any translation invariant measure on Ω = A^{Z_+} can be extended in a unique way to a measure on Ω = A^Z (by translation invariance we can determine the measure of any cylindric event).
- Exercise 2.9. Suppose P = (p_{a,a′})_{a,a′∈A} is a strictly positive stochastic matrix. Then there exists a unique initial distribution π which gives rise to a stationary Markov chain. Moreover, the corresponding stationary law P, when extended to Ω = A^Z, is Gibbs. Identify a Gibbs interaction Ψ of P.
2.3.2 k-step Markov chains
Sometimes Markov chains can be inadequate as models of processes of interest, e.g., when the process has longer “memory”. For such situations, Markov chains with longer memory – k-step Markov chains – have been introduced. They are characterized by the k-step Markov property:
P(Xn = an|Xn−1 = an−1, Xn−2 = an−2, . . . )= P(Xn = an|Xn−1 = an−1, . . . , Xn−k = an−k)
for all n ≥ k and all a_n, a_{n−1}, . . . ∈ A. One can show that any stationary process can be well approximated by k-step Markov chains as k → ∞ (Markov measures are dense among all shift-invariant measures). The main drawback of using k-step Markov chains in applications is the necessity to estimate a large number (= |A|^k · (|A| − 1)) of parameters (= elements of P = (p_{~a,a′})_{~a∈A^k, a′∈A}).
2.3.3 Variable Length Markov Chains
Variable Length Markov Chains (VLMC’s), also known as probabilistic tree sources in Information Theory, are characterized by the property that the conditional probabilities
P(Xn = an|Xn−1 = an−1, Xn−2 = an−2, . . . )
depend on finitely many preceding symbols, but the number of such symbols varies, depending on the context. More specifically, there exists a function

ℓ : A^{−N} → Z_+ (or {0, 1, . . . , N})

such that

P(X_n = a_n | X_{n−1} = a_{n−1}, X_{n−2} = a_{n−2}, . . .) = P(X_n = a_n | X_{n−ℓ(~a)}^{n−1} = a_{n−ℓ(~a)}^{n−1}),

where ~a = (a_{n−1}, a_{n−2}, . . .); if ℓ(~a) = 0, then P(X_n = a_n | X_{<n} = a_{<n}) = P(X_n = a_n).

These models are very popular in Information Theory; various efficient algorithms to estimate their parameters from data are known. If sup_{~a∈A^{−N}} ℓ(~a) = +∞, the process has infinite (unbounded) memory.
Example 2.10. Consider a binary process {X_n}_{n∈Z} such that

P(X_0 = 1 | X_{−∞}^{−1} = 0^k 1∗) = p_k,
P(X_0 = 0 | X_{−∞}^{−1} = 0^k 1∗) = 1 − p_k,  k ≥ 0,
P(X_0 = 1 | X_{−∞}^{−1} = 0^∞) = p_∞,
P(X_0 = 0 | X_{−∞}^{−1} = 0^∞) = 1 − p_∞,
(2.3.3)

i.e., the probability that the next symbol is 1 depends only on how far in the past the last occurrence of 1 took place.
Conditional probabilities (2.3.3) can be visualized with the help of a probabilistic tree: each terminal node carries (is assigned) a probability distribution on {0, 1}, and for a given context (a_{−1}, a_{−2}, . . .) the actual distribution is found by performing the corresponding walk, following the prescribed edges of the tree until a terminal node is reached.
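The process of Example 2.10 is easy to simulate: keep track of how many 0’s have occurred since the last 1, and emit the next symbol with probability p_k. A sketch, with the illustrative choice p_k = 0.2 + 0.6 · 2^{−k} (so that p_k decreases to p_∞ = 0.2):

```python
import random

def p(k):
    """Illustrative context probabilities p_k = P(X_0 = 1 | past = 0^k 1...),
    chosen so that p_k -> p_inf = 0.2 as k -> infinity."""
    return 0.2 + 0.6 * 0.5 ** k

def sample_vlmc(n, rng):
    """Emit a binary sequence whose conditional law depends only on the
    number k of 0's since the last occurrence of 1 (a renewal-type VLMC)."""
    out, k = [], 0          # k = distance to the last 1
    for _ in range(n):
        if rng.random() < p(k):
            out.append(1)
            k = 0
        else:
            out.append(0)
            k += 1
    return out

seq = sample_vlmc(500, random.Random(7))
```

The context length here is unbounded (it equals the current run of 0’s), so this is a VLMC with infinite memory.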
2.3.4 Countable Mixtures of Markov Chains (CMMC’s)
We have

P(a_0 | a_{<0}) = λ_0 P^{(0)}(a_0) + ∑_{k=1}^{∞} λ_k P^{(k)}(a_0 | a_{−k}^{−1}),   (2.3.4)

where P^{(k)} is an order-k transition probability and ∑_{k≥0} λ_k = 1. In such chains (2.3.4), the next symbol can be generated using a combination of two independent random steps. Firstly, with probabilities λ_k, the k-th Markov chain is selected, and then the next symbol is chosen according to P^{(k)}. Thus, each transition depends on a finite, but random, number of preceding symbols.

Note that the conditional probabilities (2.3.4) are continuous in the following sense:

|P(X_0 = a_0 | X_{−k}^{−1} = a_{−k}^{−1}, X_{−∞}^{−k−1} = b_{−∞}^{−k−1}) − P(X_0 = a_0 | X_{−k}^{−1} = a_{−k}^{−1}, X_{−∞}^{−k−1} = c_{−∞}^{−k−1})| ≤ ∑_{m>k} λ_m → 0 as k → ∞.
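Sampling one CMMC transition follows (2.3.4) literally: first draw the order k with probability λ_k, then apply the order-k chain. A sketch with invented ingredients (λ_k = 2^{−(k+1)}, the order-0 chain a fair coin, and the order-k chain simply copying the symbol k steps back):

```python
import random

def sample_cmmc_step(past, rng):
    """One transition of a toy binary CMMC.  Step 1: draw the memory
    length k with probability lambda_k = 2^{-(k+1)}.  Step 2: apply the
    order-k chain (order 0: fair coin; order k >= 1: copy a_{-k})."""
    u, k = rng.random(), 0
    while u >= 2.0 ** -(k + 1) and k < len(past):
        u -= 2.0 ** -(k + 1)
        k += 1
    return rng.randint(0, 1) if k == 0 else past[-k]

rng = random.Random(3)
seq = [rng.randint(0, 1) for _ in range(5)]   # arbitrary initial past
for _ in range(200):
    seq.append(sample_cmmc_step(seq, rng))
```

The continuity estimate above is visible in this construction: symbols further than k steps back can only matter when the drawn order exceeds k, which happens with probability ∑_{m>k} λ_m.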
2.3.5 g-measures (chains with complete connections)
The class of g-measures was introduced by M. Keane in 1972 [17]. These measures are characterized by regular one-sided conditional probabilities. In fact, similar definitions had already been given by various authors under a variety of names: e.g., chains with complete connections (chaînes à liaisons complètes) by Onicescu and Mihoc (1935), for which Doeblin and Fortet [14] established the first uniqueness results, and chains of infinite order by Harris (1955). In our presentation we use notation inspired by dynamical systems; it might look strange from a probabilistic point of view, but it is completely equivalent to the more familiar probabilistic way of defining processes given the past, discussed in the previous sections.
Let A be a finite set, and put

Ω = A^{Z_+} = {ω = (ω_0, ω_1, . . .) : ω_n ∈ A for all n ∈ Z_+}.

Denote by G = G_Ω the class of continuous (in the product topology), positive functions g : Ω → [0, 1] which are normalized: for all ω ∈ A^{Z_+},

∑_{a∈A} g(aω) = 1,

where aω ∈ Ω is the sequence (a, ω_0, ω_1, . . .).
Definition 2.11. A probability measure P on Ω = A^{Z_+} is called a g-measure if

P(ω_0 | ω_1, ω_2, . . .) = g(ω_0 ω_1 · · ·)  (P-a.s.),   (2.3.5)

equivalently, if for any f ∈ C(Ω, R) one has

∫_Ω f(ω) P(dω) = ∫_Ω [∑_{a_0∈A} f(a_0 ω) g(a_0 ω)] P(dω).   (2.3.6)
We will refer to (2.3.6) as the one-sided Dobrushin-Lanford-Ruelle equations. The function g can be viewed as a one-sided quasi-local non-null specification. One can show [9] that the expression in square brackets in (2.3.6) has the following probabilistic interpretation: namely, as the conditional expectation of f with respect to a certain σ-algebra. If A is the σ-algebra of Borel subsets of Ω, and F_1 = S^{−1}A = σ({ω_k}_{k≥1}), then

E_P(f | F_1)(ω) = ∑_{a_0∈A} f(a_0 ω_1 ω_2 . . .) g(a_0 ω_1 ω_2 . . .)

for all continuous f, and, more generally, for all f ∈ L^1(Ω, P).

All previous examples we considered – Markov chains, k-step Markov chains, VLMC’s with bounded memory, CMMC’s – fall into the class of g-measures. It is easy to see that the VLMC with infinite memory in Example 2.10:
P(1 | 0^k 1 ∗) = p_k,  k ≥ 0,   (2.3.7)
P(1 | 0^∞) = p_∞,   (2.3.8)

is also a g-measure provided

p_∞ = lim_{k→∞} p_k.
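The normalization condition defining a g-function can be checked numerically for this example. The following is a minimal sketch with hypothetical parameters of our own choosing (not from the text): p_k = 0.3 + 0.4·2^{−k}, so that p_∞ = lim p_k = 0.3; infinite tails are truncated to finite words.

```python
# Hypothetical parameters (ours): p_k = 0.3 + 0.4 * 2**(-k), so p_inf = 0.3.
def p(k):
    return 0.3 + 0.4 * 2 ** (-k)

def g(word):
    """g(a0 a1 a2 ...) evaluated on a truncated tail: the probability of
    the first symbol a0 given the context (a1, a2, ...)."""
    a0, tail = word[0], word[1:]
    if 1 in tail:
        p1 = p(tail.index(1))  # k = number of 0's before the first 1
    else:
        p1 = 0.3               # tail looks like 0^infinity: fall back to p_inf
    return p1 if a0 == 1 else 1.0 - p1

# Normalization: summing g(a0 w) over a0 gives 1 for any tail w.
tail = (0, 0, 1, 0, 1)
total = g((0,) + tail) + g((1,) + tail)
print(total)
```

The fallback to p_∞ on all-zero tails mimics the continuity requirement g must satisfy for the measure to be a g-measure.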
22 CHAPTER 2. GIBBS STATES
Theorem 2.12. For any g ∈ G_Ω, there exists at least one translation invariant g-measure.
Proof. Consider the transfer operator L_g acting on the continuous functions C(Ω), defined by

L_g f(ω) = ∑_{a∈A} g(aω) f(aω).

The operator L_g maps C(Ω) into C(Ω), and hence has a well-defined dual operator L_g^* acting on M(Ω) – the set of probability measures on Ω – given by

∫_Ω (L_g f) dP = ∫_Ω f d(L_g^* P).   (2.3.9)

Moreover, L_g^* maps the set of translation invariant measures M_S(Ω) into itself and is continuous. The Schauder-Tychonoff fixed point theorem states that any continuous map of a compact convex set into itself has a fixed point. Since the set of translation invariant probability measures M_S(Ω) is compact and convex, the operator L_g^* has a fixed point, say P. Since P = L_g^* P, equation (2.3.6) follows from (2.3.9), and therefore P is a g-measure.
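The action of the transfer operator can be made concrete when g depends on only two coordinates. In the sketch below (toy numbers ours, not from the text) a function f of the coordinate ω_0 alone stays a function of ω_0 under L_g, so power iteration runs on a vector; the iterates flatten to a constant, which — by the proof of Theorem 2.16 below — equals the g-measure of the cylinder f indicates.

```python
# Illustrative sketch (our numbers): g(a w) = G[a][w0] depends on two
# coordinates, so for f(w) = F[w0] the transfer operator acts as
# (L_g f)(w) = sum_a G[a][w0] * F[a], again a function of w0 alone.

A = 2
# Normalized g: sum_a G[a][j] = 1 for each j (columns sum to 1).
G = [[0.7, 0.4],
     [0.3, 0.6]]

def apply_Lg(F):
    return [sum(G[a][j] * F[a] for a in range(A)) for j in range(A)]

F = [1.0, 0.0]            # f = indicator of the cylinder [0]
for _ in range(60):
    F = apply_Lg(F)

# The iterates converge to a constant vector; that constant is the
# g-measure of [0] for the unique g-measure of this g.
print(F)
```

Here the limit constant is 4/7, the stationary weight of state 0 for the row-stochastic matrix obtained by transposing G.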
- Exercise 2.13. Verify that L_g^* indeed preserves M_S(Ω).
Theorem 2.14 (Walters). If g ∈ G_Ω, then any g-measure P has full support. Moreover, if g_1, g_2 ∈ G_Ω are such that there is a measure P which is simultaneously a g_1-measure and a g_2-measure, then

g_1 = g_2.
Proof. It is sufficient to prove that any g-measure P assigns positive probability to every cylinder set [c_0^{n-1}]. Since g is continuous and positive,

κ = inf_{ω∈Ω} g(ω) > 0.

Since P is a g-measure, for any continuous function f : Ω → R one has

∫_Ω f(ω) P(dω) = ∫_Ω ∑_{a_0∈A} f(a_0 ω) g(a_0 ω) P(dω),
and iterating the previous identity, one obtains

∫_Ω f(ω) P(dω) = ∫_Ω ∑_{(a_0^{n-1})∈A^n} f(a_0^{n-1} ω) ∏_{i=0}^{n-1} g(a_i^{n-1} ω) P(dω).

Consider

f(ω) = I_{[c_0^{n-1}]}(ω) = { 1 if ω_i = c_i, i = 0, . . . , n−1;  0 otherwise }.
Therefore,

P([c_0^{n-1}]) = ∫_Ω I_{[c_0^{n-1}]}(ω) P(dω) = ∫_Ω ∏_{i=0}^{n-1} g(c_i^{n-1} ω) P(dω) ≥ κ^n ∫_Ω P(dω) = κ^n > 0.
The second claim follows from the fact that if g_1 and g_2 have a common g-measure P, then

g_1 = g_2   (P-a.s.),

and hence, since P has full support and g_1, g_2 are continuous, everywhere on Ω.
Theorem 2.15. A fully supported measure P on Ω = A^{Z_+} is a g-measure for some g ∈ G(Ω) if and only if the sequence

P(ω_0 | ω_1 . . . ω_n) = P([ω_0 . . . ω_n]) / P([ω_1 . . . ω_n])

converges uniformly in ω as n → ∞.
Proof. Suppose P is a g-measure. Using the expression for the probabilities of cylinders obtained in the proof of the previous theorem, one has

P(c_0^{n-1}) / P(c_1^{n-1}) = ∫_Ω ( ∏_{i=0}^{n-1} g(c_i^{n-1} ω) ) P(dω) / ∫_Ω ( ∏_{i=1}^{n-1} g(c_i^{n-1} ω) ) P(dω).

Thus,

P(c_0 | c_1^{n-1}) ∈ [ inf_ω g(c_0^{n-1} ω), sup_ω g(c_0^{n-1} ω) ].
Since g is continuous on Ω = A^{Z_+}, and hence uniformly continuous, the n-th variation of g,

var_n(g) = sup_{c_0^{n-1}} sup_{ω,ω̄∈Ω} | g(c_0^{n-1} ω) − g(c_0^{n-1} ω̄) | → 0

as n → ∞, and thus

P(c_0 | c_1^{n-1}) / g(c) ∈ [1 − δ_n, 1 + δ_n],

where δ_n → 0.
In the opposite direction, the functions g_n(c) = P(c_0 | c_1^n) are local, since they depend only on finitely many coordinates, and thus the g_n's are continuous. By assumption, this sequence of continuous functions converges uniformly on the compact metric space Ω, so the limit function g is also continuous. By the Martingale Convergence Theorem the limit must satisfy

P(c_0 | c_1, c_2, . . .) = g(c_0, c_1, . . .)   P-a.e.,

and hence P is a g-measure.
As we have seen, since g is continuous,

var_n(g) = sup_{c,ω,ω̄} | g(c_0^{n-1} ω) − g(c_0^{n-1} ω̄) | → 0

as n → ∞. Many properties of g-measures depend on how quickly var_n(g) → 0. For example, if g has exponentially decaying variation,

var_n(g) ≤ c θ^n

for some c > 0, θ ∈ (0, 1), or, more generally, if g has summable variation,

∑_{n=1}^∞ var_n(g) < ∞,

then there exists a unique g-measure with good mixing properties.
Theorem 2.16. Suppose g ∈ G_Ω has summable variation; then there exists a unique g-measure P. Moreover, P (viewed as a measure on A^Z) is Gibbs for some uniformly absolutely summable interaction ψ.
Proof. For any f ∈ C(Ω) and every n ≥ 1, one has

L_g^n f(ω) = ∑_{a_0^{n-1}∈A^n} ∏_{i=0}^{n-1} g(a_i^{n-1} ω) f(a_0^{n-1} ω).
Consider arbitrary ω, ω̄ ∈ Ω such that ω_0^{k-1} = ω̄_0^{k-1}, k ∈ N. Then

| L_g^n f(ω) − L_g^n f(ω̄) | ≤ ∑_{a_0^{n-1}∈A^n} | ∏_{i=0}^{n-1} g(a_i^{n-1} ω) − ∏_{i=0}^{n-1} g(a_i^{n-1} ω̄) | · | f(a_0^{n-1} ω) |
  + ∑_{a_0^{n-1}∈A^n} ∏_{i=0}^{n-1} g(a_i^{n-1} ω̄) · | f(a_0^{n-1} ω) − f(a_0^{n-1} ω̄) |
  ≤ || f ||_{C(Ω)} ∑_{a_0^{n-1}∈A^n} | ∏_{i=0}^{n-1} g(a_i^{n-1} ω) − ∏_{i=0}^{n-1} g(a_i^{n-1} ω̄) | + var_{n+k}(f),

since

∑_{a_0^{n-1}∈A^n} ∏_{i=0}^{n-1} g(a_i^{n-1} ω̄) = 1

for all n ≥ 1 and every ω̄, because g is normalized. Let us now derive a uniform estimate of
C_n(a_0^{n-1}, ω, ω̄) := | ∏_{i=0}^{n-1} g(a_i^{n-1} ω) − ∏_{i=0}^{n-1} g(a_i^{n-1} ω̄) |
 = ∏_{i=0}^{n-1} g(a_i^{n-1} ω̄) · | ∏_{i=0}^{n-1} g(a_i^{n-1} ω) / ∏_{i=0}^{n-1} g(a_i^{n-1} ω̄) − 1 |.
One can continue as follows:

∏_{i=0}^{n-1} g(a_i^{n-1} ω) / ∏_{i=0}^{n-1} g(a_i^{n-1} ω̄) = exp( ∑_{i=0}^{n-1} [ log g(a_i^{n-1} ω) − log g(a_i^{n-1} ω̄) ] ) ∈ [e^{−D_k}, e^{D_k}],

where, taking into account that ω_0^{k-1} = ω̄_0^{k-1},

D_k = ∑_{j=k+1}^∞ var_j(log g) → 0  as k → ∞,

because if g has summable variation, then so does log g (see Exercise 2.17 below). Therefore, for D′_k = max(e^{D_k} − 1, 1 − e^{−D_k}) → 0, one has
C_n(a_0^{n-1}, ω, ω̄) ≤ D′_k ∏_{i=0}^{n-1} g(a_i^{n-1} ω̄),

and hence

| L_g^n f(ω) − L_g^n f(ω̄) | ≤ D′_k || f ||_{C(Ω)} + var_k(f).

The sequence of functions f_n = L_g^n f is therefore equicontinuous: for every ε > 0, there exists k ∈ N such that

| L_g^n f(ω) − L_g^n f(ω̄) | < ε

for all n and every ω, ω̄ ∈ Ω such that ω_0^{k-1} = ω̄_0^{k-1}, i.e., ρ(ω, ω̄) ≤ 2^{−k}. Moreover, the sequence {f_n} is uniformly bounded:

|| L_g^n f ||_{C(Ω)} ≤ || f ||_{C(Ω)}.

Hence, by the Arzelà-Ascoli theorem, the sequence {f_n} has a uniformly convergent subsequence {f_{n_j}}:

L_g^{n_j} f ⇒ f_* ∈ C(Ω).
Now we are going to show that in fact f_* is constant and L_g^n f → f_* as n → ∞. For any continuous function h let

α(h) = min_{ω∈Ω} h(ω).

It is easy to see that for any f ∈ C(Ω) one has

α(f) ≤ α(L_g f) ≤ . . . ≤ α(L_g^n f) ≤ . . . .

Moreover, taking into account that L_g^{n_j} f ⇒ f_* and the monotonicity of α(L_g^n f), we conclude that

α(f_*) = lim_n α(L_g^n f) = α(L_g f_*).

Furthermore, suppose ω̄ ∈ Ω is such that

α(f_*) = α(L_g f_*) = L_g f_*(ω̄).

But then

L_g f_*(ω̄) = ∑_{c_0∈A} g(c_0 ω̄) f_*(c_0 ω̄) = inf_{ω∈Ω} f_*(ω),

and hence, since g is strictly positive and normalized, for each c_0,

f_*(c_0 ω̄) = inf_{ω∈Ω} f_*(ω) = α(f_*).

Repeating the argument, we conclude that for any c_0^{n-1} ∈ A^n,

f_*(c_0^{n-1} ω̄) = α(f_*),

i.e., f_* attains its minimal value on every cylinder. Since f_* is continuous, f_* must therefore be constant, and then L_g^n f ⇒ f_* as n → ∞. In conclusion, for every f ∈ C(Ω), L_g^n f ⇒ f_* = const. In other words, we have a linear functional V : C(Ω) → R defined by

V(f) = f_*,

which is non-negative (f ≥ 0 implies V(f) ≥ 0) and normalized (V1 = 1). By the Riesz representation theorem, there exists a probability measure P on Ω such that

V(f) = ∫_Ω f dP.
Since obviously V(f) = V(L_g f), we have P = L_g^* P, and hence P is a g-measure. If P̄ is another g-measure, then integrating

L_g^n f → ∫_Ω f dP

with respect to P̄, we conclude that

∫_Ω f dP̄ = ∫_Ω f dP

for every continuous f, and hence P̄ = P.
- Exercise 2.17. Verify that if g ∈ G_Ω has summable variation, then log g has summable variation as well.
Remark 2.18. (i) Various uniqueness conditions have been obtained in the past; see [19] for a review. The major progress was achieved by Johansson and Öberg (2002), who proved uniqueness of g-measures for the class of g-functions with square summable variation:

∑_{n≥1} ( var_n(g) )^2 < ∞;

see also [16] for a recent result strengthening and subsuming the previous ones.
(ii) Bramson and Kalikow [13] constructed the first example of a continuous positive normalized g-function with multiple g-measures. Berger et al. [10] showed that (2 + ε)-summability of variations is not sufficient for uniqueness of g-measures.
(iii) One can try to generalize the theory of g-measures by consideringfunctions g which are not necessarily continuous, but the available theoryis still quite limited [9, 15].
2.4 Gibbs measures in Dynamical Systems
Among all Gibbs measures on a lattice system Ω = A^{Z^d}, a special subclass is formed by the translation invariant measures. The lattice group Z^d acts on Ω by shifts: (S^n ω)_k = ω_{k+n} for all ω ∈ Ω and k, n ∈ Z^d. A Gibbs measure P is invariant under this action if P(S^{−n} C) = P(C) for every event C and all n.
Shortly after the founding work of Dobrushin, Lanford, and Ruelle, Sinai [4] extended the notion of invariant Gibbs measures to more general phase spaces such as manifolds, i.e., spaces lacking the product structure of lattice systems, thus making the notion of Gibbs states applicable to invariant measures of Dynamical Systems. In subsequent works it was shown that the natural invariant measures (now known as the Sinai-Ruelle-Bowen measures) of a large class of chaotic dynamical systems are Gibbs. That sparked great interest in these objects in Dynamical Systems and Ergodic Theory.

In the Dynamical Systems literature one can find several definitions (not always equivalent) of Gibbs states. For example, the following definition due to Bowen [11] is very popular. Suppose f is a homeomorphism of a compact metric space (X, d), and P is an f-invariant Borel probability measure on X. Then P is called Gibbs in the Bowen sense (abbreviated to Bowen-Gibbs) for a continuous potential ψ : X → R if, for some P ∈ R and every sufficiently small ε > 0, there exists C = C(ε) > 1 such that the inequalities

1/C ≤ P( { y ∈ X : d(f^j(x), f^j(y)) < ε ∀ j = 0, . . . , n−1 } ) / exp( −nP + ∑_{j=0}^{n-1} ψ(f^j(x)) ) ≤ C

hold for all x ∈ X and every n ∈ N. In other words, the set of points staying ε-close to the orbit of x for n units of time has measure roughly e^{−H_n(x)}, with H_n(x) = nP − ∑_{j=0}^{n-1} ψ(f^j(x)). Capocaccia [1] gave the most general definition of Gibbs invariant measures of multidimensional actions (dynamical systems) on arbitrary spaces, hence extending the Dobrushin-Lanford-Ruelle (DLR) construction substantially.
Remark 2.19. For symbolic systems Ω = A^Z or A^{Z_+} with the dynamics given by the left shift S : Ω → Ω, the above definition can be simplified: a translation invariant measure P is called Bowen-Gibbs for a potential ψ ∈ C(Ω, R) if there exist constants C > 1 and P such that for any ω ∈ Ω and all n ∈ N, one has

1/C ≤ P( { ω̄ ∈ Ω : ω̄_j = ω_j ∀ j = 0, . . . , n−1 } ) / exp( −nP + ∑_{j=0}^{n-1} ψ(S^j ω) ) ≤ C.   (2.4.1)
2.4.1 Equilibrium states
Suppose ψ : Ω → R is a continuous function; again, for simplicity, assume that Ω = A^{Z^d} or A^{Z^d_+}. Define the topological pressure of ψ as

P(ψ) = sup_{P∈M_1(Ω,S)} [ h(P) + ∫ ψ dP ],

where the supremum is taken over the set of translation invariant probability measures on Ω.

A measure P ∈ M_1(Ω, S) is called an equilibrium state for the potential ψ ∈ C(Ω) if

P(ψ) = h(P) + ∫ ψ dP.

For a given potential ψ, denote the set of equilibrium states by ES_ψ(Ω, S). For symbolic systems, ES_ψ(Ω, S) ≠ ∅ for any continuous ψ.
Theorem 2.20 (Variational Principles). The following holds:

• Suppose Ω = A^{Z^d} and Ψ is a translation invariant uniformly absolutely convergent interaction. Then a translation invariant measure P is Gibbs for Ψ if and only if P is an equilibrium state for

ψ(ω) = − ∑_{V ⋐ Z^d : V ∋ 0} (1/|V|) Ψ(V, ω).

• Suppose Ω = A^{Z_+} and g is a positive continuous normalized function on Ω, i.e.,

∑_{a_0∈A} g(a_0 ω) = 1

for all ω ∈ A^{Z_+}. Then a translation invariant measure P is a g-measure if and only if P is an equilibrium state for ψ = log g.

• Suppose Ω = A^{Z_+} or Ω = A^Z; then a translation invariant measure P which is Bowen-Gibbs for ψ is also an equilibrium state for ψ.
In the last case the converse fails: for an arbitrary continuous function ψ, the corresponding equilibrium states need not be Bowen-Gibbs. However, under various additional assumptions on the "smoothness" of ψ, equilibrium states are Bowen-Gibbs.
Theorem 2.21 (Effectively Sinai (1972); see also Ruelle's book). Suppose P is a translation invariant g-measure for a function g ∈ G(A^{Z_+}) with var_n(g) ≤ Cθ^n for some C > 0, θ ∈ (0, 1), and all n ∈ N. Then P is Gibbs for some translation invariant uniformly absolutely convergent interaction φ = {φ_Λ(·)}_{Λ⋐Z}.
Idea: Consider ϕ = − log g ∈ C(A^Z); note that ϕ depends only on the coordinates (ω_0, ω_1, · · ·) ∈ A^{Z_+}. Let us try to find a UAC interaction φ_{Λ_n}(·), Λ_n = [0, n] ∩ Z_+ (with φ_Λ ≡ 0 if Λ is not an interval), such that

ϕ(ω) = − ∑_{n=0}^∞ φ([0, n], ω_0^n).

If we can, then P is Gibbs for φ.

Why? By the Variational Principle, any Gibbs measure for φ is also an equilibrium state for ψ = − ∑_{V∋0} (1/|V|) φ(V, ·), and vice versa.

Claim: ψ = − ∑_{V∋0} (1/|V|) φ(V, ·) and ϕ = − ∑_{n=0}^∞ φ([0, n], ·) = − ∑_{V : min V = 0} φ(V, ·) have equal sets of equilibrium states, the reason being that

∫ ψ dP = ∫ ϕ dP

for every translation invariant measure P.
∫ ψ dP = − ∑_{n=0}^∞ 1/(n+1) ∑_{i=−n}^{0} ∫ φ([0, n], ω_i^{n+i}) dP
 = − ∑_{n=0}^∞ 1/(n+1) · (n+1) ∫ φ([0, n], ω_0^n) dP = ∫ ϕ dP.
Easy fact to check: var_n(g) ≤ Cθ^n implies that var_n(ϕ) ≤ C′θ^n for some C′ > 0 as well.
Let ϕ̄ = −ϕ. Put

φ_{0}(ω_0) = ϕ̄(ω_0, →0),
φ_{0,1}(ω_0^1) = ϕ̄(ω_0^1, →0) − ϕ̄(ω_0, →0),
...
φ_{0,1,···,n}(ω_0^n) = ϕ̄(ω_0^n, →0) − ϕ̄(ω_0^{n-1}, →0).

Here →0 is any fixed sequence.

Clearly,

∑_{n=0}^{N} φ_{0,···,n}(ω_0^n) = ϕ̄(ω_0^N, →0) → ϕ̄(ω)  as N → ∞  (∀ω ∈ A^{Z_+}).

Moreover,

|| φ_{0,1,···,n}(·) ||_∞ ≤ var_{n-1}(ϕ̄) ≤ Cθ^{n-1},

and hence

∑_{V∋0} || φ_V(·) ||_∞ ≤ ∑_{n=0}^∞ (n+1) || φ_{0,···,n}(·) ||_∞ < ∞

as well.
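The telescoping construction above can be tried out numerically. The sketch below uses a toy potential of our own choosing (not from the text), ϕ̄(ω) = ∑_k 2^{−k} ω_k on {0,1}^{Z_+}, which has var_n(ϕ̄) = 2^{1−n}, and takes the all-zeros sequence as the reference tail →0; the computed interaction norms decay geometrically, so the uniform absolute convergence holds.

```python
# Toy potential (ours): phibar(w) = sum_k 2^{-k} w_k, var_n = 2^{1-n};
# the reference tail ->0 is the all-zeros sequence, so it contributes 0.

from itertools import product

def phibar(word):
    return sum(w * 2.0 ** (-k) for k, w in enumerate(word))

def phi_interval(word):
    """phi_{0,...,n}(w_0^n) = phibar(w_0^n, ->0) - phibar(w_0^{n-1}, ->0)."""
    if len(word) == 1:
        return phibar(word)
    return phibar(word) - phibar(word[:-1])

norms = []
for n in range(8):
    sup = max(abs(phi_interval(w)) for w in product((0, 1), repeat=n + 1))
    norms.append(sup)
print(norms)   # geometric decay, so sum_n (n+1)*norms[n] converges
```

For this potential the n-th interaction is exactly 2^{−n}·ω_n, so the norms equal 2^{−n}, comfortably within the bound var_{n−1}(ϕ̄) = 2^{2−n}.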
Chapter 3
Functions of Markov processes
Last modified on October 26, 2015
3.1 Markov Chains (recap)
We will consider Markov chains {X_n} with n ∈ Z_+ (or n ∈ Z) with values in a finite set A. Time-homogeneous Markov chains are defined/characterized by transition probability matrices

P = ( p_{ij} )_{i,j∈A},  p_{ij} = P(X_{n+1} = j | X_n = i),

where P is a stochastic matrix:
where P is a stochastic matrix:
• pij ≥ 0 for all i, j ∈ A,
• ∑j∈A
pij = 1 for all i ∈ A.
If an initial probability distribution π = (π_i)_{i∈A}, with π_i ≥ 0 for all i ∈ A and ∑_{i∈A} π_i = 1, satisfies

πP = π,

then the process {X_n}_{n≥0} defined by

(a) choosing X_0 ∈ A according to π,

(b) choosing X_{n+1} ∈ A according to p_{i,·}: if X_n = i, then X_{n+1} = j with probability p_{ij}, independently of all previous choices,

is a stationary process. Its translation invariant law P on A^{Z_+} is given by

P(X_n = a_n, X_{n+1} = a_{n+1}, . . . , X_{n+k} = a_{n+k}) = π(a_n) ∏_{i=0}^{k-1} p_{a_{n+i}, a_{n+i+1}}

for all n ≥ 0 and (a_n, . . . , a_{n+k}) ∈ A^{k+1}.
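Steps (a) and (b) can be simulated directly; the sketch below (example matrix ours) compares the empirical frequency of a cylinder with the formula π(a_0)·p_{a_0 a_1}.

```python
# Sketch (example data ours): simulate the stationary chain via steps
# (a) and (b) and check the cylinder formula empirically.

import random

random.seed(0)
P = {0: [0.9, 0.1], 1: [0.2, 0.8]}
pi = [2 / 3, 1 / 3]                  # solves pi P = pi

def sample_path(n):
    x = random.choices([0, 1], weights=pi)[0]        # step (a)
    path = [x]
    for _ in range(n - 1):                           # step (b)
        x = random.choices([0, 1], weights=P[x])[0]
        path.append(x)
    return path

trials = 20000
hits = sum(1 for _ in range(trials) if sample_path(2) == [0, 1])
print(hits / trials, pi[0] * P[0][1])   # both close to 1/15
```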
If P is a strictly positive matrix (p_{ij} > 0 for all i, j), then there exists a unique invariant π = πP. The corresponding measure P on A^{Z_+} has full support: every cylindric event

{X_n = a_n, . . . , X_{n+k} = a_{n+k}}

has positive probability; this follows from the fact that π > 0 as well.

If P = (p_{ij}) is not strictly positive, i.e., p_{ij} = 0 for some i, j ∈ A (equivalently, the transition from i to j is forbidden), then, in order to avoid some technical difficulties, we will assume that P is primitive:

P^n > 0 for some n ∈ N.

In this case, there exists a unique positive π satisfying π = πP. The support of the corresponding translation invariant measure P is a subshift of finite type.
Definition 3.1. Suppose C is an |A| × |A| matrix of 0's and 1's. Then

Ω_C^+ = { ω = (ω_0, ω_1, . . .) ∈ A^{Z_+} : C_{ω_i ω_{i+1}} = 1 ∀ i ≥ 0 },

Ω_C = { ω = (. . . , ω_{−1}, ω_0, ω_1, . . .) ∈ A^Z : C_{ω_i ω_{i+1}} = 1 ∀ i ∈ Z }

are the corresponding one- and two-sided subshifts of finite type (determined by C).
Remark 3.2. Subshifts of finite type (SFT's) are sometimes called topological Markov chains. SFT's were invented by IBM to encode information on magnetic tapes (and later CD's, DVD's, . . .).
If P is primitive, the support of P is the SFT Ω_C^+ (or, if P is viewed as an invariant measure for {X_n}_{n∈Z}, the two-sided SFT Ω_C), with C = (c_{ij}) being the support of P = (p_{ij}), i.e.,

c_{ij} = { 1, p_{ij} > 0;  0, p_{ij} = 0 }.
3.1.1 Regular measures on Subshifts of Finite Type
Notions of Gibbs and g-measures can be easily extended to SFT’s.
Definition 3.3. Suppose C is a primitive 0/1 matrix. Denote by G(Ω_C^+) the set of positive continuous normalized functions g : Ω_C^+ → (0, 1):

∑_{a∈A : aω∈Ω_C^+} g(aω) = 1

for all ω ∈ Ω_C^+. A probability measure P on Ω_C^+ is called a g-measure for g ∈ G(Ω_C^+) if

P(ω_0 | ω_1 ω_2 · · ·) = g(ω)  for P-a.a. ω = (ω_0, ω_1, . . .) ∈ Ω_C^+.
Theorem 3.4. A fully supported measure P on Ω_C^+ is a g-measure if and only if P(ω_0 | ω_1 · · · ω_n) converges uniformly on Ω_C^+ as n → ∞.
- Exercise 3.5. Suppose {X_n}_{n≥0} is a Markov chain for some primitive transition probability matrix P. Let P be the corresponding (unique) translation invariant measure on Ω_{C(P)}^+. Show that P is a g-measure, and identify the function g.
3.2 Function of Markov Processes
Definition 3.6. Suppose {X_n} is a Markov chain with values in a finite set A. Consider also a function ϕ : A → B, where |B| < |A| and ϕ is onto. The process {Y_n} given by

Y_n = ϕ(X_n) for all n

is called a function of a Markov chain.
A variety of names has been used for such processes {Y_n}:
• a Lumped Markov chain,
• an aggregated Markov chain,
• an amalgamated Markov chain,
• a 1-block factor of a Markov chain.
The law of {Y_n} will be denoted by Q. The measure Q is called the pushforward of P and is denoted by

Q = ϕ_* P = P ∘ ϕ^{−1},

meaning

Q(C) = P( ϕ^{−1}(C) )

for any measurable event C ⊂ B^{Z_+}.
- Exercise 3.7. If P is translation invariant, then so is Q.
Lumped/aggregated/amalgamated Markov chains can be found in many applications; in particular, when one has to simulate a Markov chain on a very large state space, one would like to accelerate the process by simulating a chain on a much smaller space.

Suppose B = {1, . . . , k}; the task is to find a partition of A into k sets, A = L_1 ⊔ · · · ⊔ L_k, with the corresponding function ϕ:

Y_n = ϕ(X_n) = j  if X_n ∈ L_j,  j = 1, . . . , k,

such that the resulting lumped process {Y_n} is Markov. In this case, clearly,

Q(Y_n = j) = P(X_n ∈ L_j).

Alternatively, one might be interested in finding a Markov {Y_n} such that

Q(Y_n = j) ≃ P(X_n ∈ L_j).
A classical¹ question is under what conditions the lumped chain {Y_n} is Markov. There are two versions of this question:

• Y_n = ϕ(X_n) is Markov for every initial distribution π of {X_n};

• Y_n = ϕ(X_n) is Markov for some initial distribution π of {X_n}.

The concepts of (strong) lumpability and weak lumpability correspond to these two situations, respectively.
¹Studied by Dynkin, Kemeny, Snell, Rosenblatt, . . .
Example 3.8. Consider the Markov chains and corresponding functions shown in the figure: the process {Y_n} corresponding to the left example is Markov, while the corresponding process on the right is not Markov.
Theorem 3.9. Y_n = ϕ(X_n) is Markov for every initial distribution π if and only if, with L_j = ϕ^{−1}(j), j ∈ B, for all i, j the probability

P(X_{n+1} ∈ L_j | X_n = a)

is constant over a ∈ L_i. The common value is precisely the transition probability

p^Y_{i,j} = P(Y_{n+1} = j | Y_n = i).
For a discussion of sufficient conditions for weak lumpability, see Kemeny and Snell.
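The criterion of Theorem 3.9 is easy to check mechanically. The sketch below (example matrices ours, not from the text) tests, for each pair of lumps, whether the row sums into the target lump agree across the source lump.

```python
# Sketch of the strong-lumpability test of Theorem 3.9 (example data ours).

def is_strongly_lumpable(P, lumps, tol=1e-12):
    """P: transition matrix as list of rows; lumps: partition of the states."""
    for Li in lumps:
        for Lj in lumps:
            # P(X_{n+1} in Lj | X_n = a) must be constant over a in Li.
            vals = [sum(P[a][b] for b in Lj) for a in Li]
            if max(vals) - min(vals) > tol:
                return False
    return True

# Lumping states {0, 1} together: rows 0 and 1 put mass 0.7 / 0.3
# into the two lumps, so the lumped process is Markov.
P_good = [[0.5, 0.2, 0.3],
          [0.1, 0.6, 0.3],
          [0.4, 0.4, 0.2]]
print(is_strongly_lumpable(P_good, [[0, 1], [2]]))   # True

# Here rows 0 and 1 disagree (0.3 vs 0.4 into lump {2}).
P_bad = [[0.5, 0.2, 0.3],
         [0.1, 0.5, 0.4],
         [0.4, 0.4, 0.2]]
print(is_strongly_lumpable(P_bad, [[0, 1], [2]]))    # False
```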
3.2.1 Functions of Markov Chains in Dynamical systems
Suppose Σ_1 ⊂ A^Z, Σ_2 ⊂ B^Z are irreducible SFT's. A surjective coding map ϕ : A → B extends to a continuous factor map ϕ : Σ_1 → Σ_2. The coding map ϕ is called Markovian if some Markov measure on Σ_1 is mapped to a Markov measure on Σ_2.

The problem of deciding whether the coding map ϕ is Markovian is somewhat complementary to the problems studied in probability theory (see the recent review by Boyle and Petersen). Here are several interesting results of this theory:
• Non-Markovian ϕ's exist: a Markov measure Q on Σ_2 has many preimages P with Q = P ∘ ϕ^{−1}, but none of them need be Markov.

• One can decide, for given ϕ and P, whether Q = P ∘ ϕ^{−1} is k-step Markov or not Markov at all.

The theory also offers many useful tools and concepts, e.g., compensation functions, which we will discuss later.
3.3 Hidden Markov Models
Hidden Markov Models (Chains) are random functions of Markov chains. The ingredients are:

(i) a Markov chain {X_n} with values in A, defined by a transition probability matrix P and an initial distribution π;

(ii) a family of emission probability kernels

Π = {Π_i(·)}_{i∈A},

where Π_i is a probability measure on the output space B.

The output (observable) process {Y_n} is constructed as follows: for every n, independently, let Y_n = b with probability Π_i(b) if X_n = i, i.e.,

P(Y_n = b | X_n = i) = Π_i(b).
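The construction from ingredients (π, P, Π) can be sketched in a few lines (all numbers below are our own illustrative choices, not from the text): sample the hidden chain step by step and emit each Y_n from the kernel of the current hidden state.

```python
# Minimal HMM sampler for the (pi, P, Pi) construction (numbers ours).

import random

random.seed(1)
pi = [0.5, 0.5]                      # initial distribution of X_0
P  = [[0.8, 0.2], [0.3, 0.7]]        # hidden transition matrix
Pi = [[0.9, 0.1], [0.2, 0.8]]        # emission kernels Pi_i(b)

def sample_hmm(n):
    x = random.choices([0, 1], weights=pi)[0]
    ys = []
    for _ in range(n):
        ys.append(random.choices([0, 1], weights=Pi[x])[0])  # emit Y_n
        x = random.choices([0, 1], weights=P[x])[0]          # move X_n
    return ys

print(sample_hmm(10))
```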
Example 3.10. Binary Symmetric Markov Chain transmitted through theBinary Symmetric Channel:
[Diagram: the binary symmetric channel maps each input bit 0/1 to itself with probability 1 − ε and flips it with probability ε.]
• The input {X_n}, X_n ∈ {0, 1}, is a symmetric Markov chain with transition probability matrix

P = ( 1−p  p
      p  1−p ),  where p ∈ (0, 1).
3.3. HIDDEN MARKOV MODELS 37
• The noise {Z_n} is an iid sequence of 0's and 1's with

P(Z_n = 1) = ε,  P(Z_n = 0) = 1 − ε.

• The output is

Y_n = X_n ⊕ Z_n,

equivalently,

Y_n = { X_n with probability 1 − ε;  X̄_n with probability ε },

where X̄_n is the bit flip of X_n.
We can represent {Y_n} as a deterministic function of an extended Markov chain

X^ext_n = (X_n, Z_n) ∈ {0, 1}^2,

with transition probability matrix P^ext (states ordered (0,0), (0,1), (1,0), (1,1)):

P^ext = ( (1−p)(1−ε)  (1−p)ε  p(1−ε)  pε
          (1−p)(1−ε)  (1−p)ε  p(1−ε)  pε
          p(1−ε)  pε  (1−p)(1−ε)  (1−p)ε
          p(1−ε)  pε  (1−p)(1−ε)  (1−p)ε ),

since

P(X^ext_{n+1} = (a_{n+1}, b_{n+1}) | X^ext_n = (a_n, b_n)) = P(X_{n+1} = a_{n+1} | X_n = a_n) · P(Z_{n+1} = b_{n+1}).
This fact holds in greater generality.
Theorem 3.11. If Yn is a Hidden Markov Chain, then Yn = ϕ(Xextn ) for some
finite state Markov chain Xextn and some coding function ϕ.
- Exercise 3.12. Give a proof of this fact. (Hint: What can you say about(Xn, Yn) as a process?)
Formally, then, we only have to study functions of Markov chains. Sometimes, though, it is more convenient to keep the original HMM representation of {Y_n} in terms of (π, P, Π).
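The extended matrix P^ext of Example 3.10 is a product structure, so it can be built mechanically; a small sketch (parameter values ours):

```python
# Build P^ext for the binary symmetric channel example: the extended
# state is (x, z), transitions factor as P[x][x'] * Z[z'], and the
# coding function is phi(x, z) = x XOR z.

p, e = 0.3, 0.1

P = [[1 - p, p], [p, 1 - p]]        # input chain
Z = [1 - e, e]                      # iid noise distribution

# states ordered (x, z) = (0,0), (0,1), (1,0), (1,1)
Pext = [[P[x][x2] * Z[z2] for x2 in (0, 1) for z2 in (0, 1)]
        for x in (0, 1) for z in (0, 1)]

def phi(state):
    x, z = divmod(state, 2)
    return x ^ z

for row in Pext:
    print(row)        # each row sums to 1
```

Note that rows with the same x-component coincide, exactly as in the 4×4 matrix displayed in the text: the current noise bit does not influence the next extended state.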
3.4 Functions of Markov Chains from the Thermodynamic point of view
The principal question we will address now is whether the output process Y_n = ϕ(X_n), where {X_n} is Markov, is regular; more specifically, whether the law Q of {Y_n} is a g-measure or a Gibbs measure, and, if so, whether we can identify the corresponding interaction or g-function.
Example 3.13 (Blackwell-Furstenberg-Walters-Van den Berg). This example appeared independently in Probability Theory, Dynamical Systems, and Statistical Physics. Suppose {X_n}, X_n ∈ {−1, 1}, is a sequence of independent identically distributed random variables with

P(X_n = 1) = p,  P(X_n = −1) = 1 − p,  p ≠ 1/2.
Let Y_n = X_n · X_{n+1} for all n. Then Y_n = ϕ(X^ext_n), where X^ext_n = (X_n, X_{n+1}), ϕ((x, x′)) = x·x′, and {X^ext_n} is Markov with (states ordered (1,1), (1,−1), (−1,1), (−1,−1))

P = ( p  1−p  0  0
      0  0  p  1−p
      p  1−p  0  0
      0  0  p  1−p ).
The output process {Y_n} takes values in {−1, 1} and, as we will see, the measure Q has full support. It turns out that the process {Y_n} has extremely irregular conditional probabilities.

Let us compute the probability of a cylindric event

[y_0^n] = {Y_0 = y_0, Y_1 = y_1, . . . , Y_n = y_n},  y_i ∈ {−1, 1}.

To compute Q([y_0^n]), we have to determine ϕ^{−1}([y_0^n]), i.e., the full preimage of [y_0^n] in the {X_n}-space. It is easy to see that

ϕ^{−1}([y_0^n]) = {x_0 = 1, x_1 = y_0, x_2 = y_0 y_1, . . . , x_{n+1} = y_0 y_1 · · · y_n}
 ⊔ {x_0 = −1, x_1 = −y_0, x_2 = −y_0 y_1, . . . , x_{n+1} = −y_0 y_1 · · · y_n}.
3.4. FUNCTIONS OF MARKOV CHAINS FROM THE THERMODYNAMIC POINT OF VIEW39
Hence,

Q([y_0^n]) = P(X_0 = 1, X_{k+1} = y_0 · · · y_k, k = 0, 1, . . . , n)
 + P(X_0 = −1, X_{k+1} = −y_0 · · · y_k, k = 0, 1, . . . , n)
 = p^{#{1's in (1, y_0, y_0y_1, . . . , y_0y_1···y_n)}} · (1−p)^{#{(−1)'s in (1, y_0, y_0y_1, . . . , y_0y_1···y_n)}}
 + p^{#{1's in (−1, −y_0, −y_0y_1, . . . , −y_0y_1···y_n)}} · (1−p)^{#{(−1)'s in (−1, −y_0, −y_0y_1, . . . , −y_0y_1···y_n)}}.
Let

S_n = y_0 + y_0y_1 + · · · + y_0y_1···y_n = #{1's} − #{(−1)'s}

(counting the entries x_1, . . . , x_{n+1}). But since we also have #{1's} + #{(−1)'s} = n + 1, we conclude that

P([1, y_0, y_0y_1, . . . , y_0y_1···y_n]) = p^{1 + (n+1+S_n)/2} (1−p)^{(n+1−S_n)/2}

and

P([−1, −y_0, −y_0y_1, . . . , −y_0y_1···y_n]) = p^{(n+1−S_n)/2} (1−p)^{1 + (n+1+S_n)/2}.
Therefore,

Q([y_0^n]) = p^{1+(n+1+S_n)/2} (1−p)^{(n+1−S_n)/2} + p^{(n+1−S_n)/2} (1−p)^{1+(n+1+S_n)/2}
 = p^{(n+1)/2} (1−p)^{(n+1)/2} [ p ( p/(1−p) )^{S_n/2} + (1−p) ( (1−p)/p )^{S_n/2} ].
Similarly,

Q([y_1^n]) = p^{n/2} (1−p)^{n/2} [ p ( p/(1−p) )^{S̄_n/2} + (1−p) ( (1−p)/p )^{S̄_n/2} ],
where S̄_n = y_1 + y_1y_2 + · · · + y_1y_2···y_n. Since S_n = y_0(1 + S̄_n), one has, with λ = p/(1−p),

Q(y_0 = 1 | y_1^n) = √(p(1−p)) · ( p λ^{(S̄_n+1)/2} + (1−p) λ^{−(S̄_n+1)/2} ) / ( p λ^{S̄_n/2} + (1−p) λ^{−S̄_n/2} )
 = √(p(1−p)) · ( p √λ · λ^{S̄_n} + (1−p)/√λ ) / ( p λ^{S̄_n} + (1−p) )
 =: ( a λ^{S̄_n} + b ) / ( c λ^{S̄_n} + d ),

where

a/c = √(p(1−p)λ) = p  ≠  √(p(1−p)/λ) = 1−p = b/d,

since p ≠ 1/2.
Suppose for simplicity that λ > 1. For any y_1^n, one can choose a continuation z_{n+1}^{5n} such that the corresponding sum S̄ (now computed along y_1^n z_{n+1}^{5n}) is positive and large. Equally well, one can choose w_{n+1}^{5n} such that S̄ < 0 and |S̄| is large.

In the first case,

Q(y_0 = 1 | y_1^n z_{n+1}^{5n}) ≈ a/c,

and in the second case,

Q(y_0 = 1 | y_1^n w_{n+1}^{5n}) ≈ b/d.

Therefore, the conditional probabilities Q(y_0 = 1 | y_1 y_2 · · ·) are everywhere discontinuous. In some sense this is the worst and most irregular behavior possible.
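The discontinuity can be seen numerically from the closed form above. The sketch below (parameter value ours) evaluates (aλ^{S̄} + b)/(cλ^{S̄} + d) on two continuations of the same prefix, one driving S̄ far up and one far down; the two conditional probabilities approach a/c = p and b/d = 1 − p.

```python
# Numerical sketch of the everywhere-discontinuous conditional
# probabilities: lam = p/(1-p), S = y1 + y1*y2 + ... along the word.

p = 0.8
lam = p / (1 - p)
sq = (p * (1 - p)) ** 0.5
a, b = sq * p * lam ** 0.5, sq * (1 - p) / lam ** 0.5
c, d = p, 1 - p

def cond_prob(ys):
    """Q(y0 = 1 | ys) via the closed form (a lam^S + b)/(c lam^S + d)."""
    S, prod = 0, 1
    for y in ys:
        prod *= y
        S += prod
    return (a * lam ** S + b) / (c * lam ** S + d)

base = [1, -1, 1]
up   = base + [-1] + [1] * 40    # flips the running product to +1: S grows
down = base + [1] * 40           # running product stays -1: S plunges
print(cond_prob(up), cond_prob(down))
```

Both words share the prefix `base`, yet the conditional probabilities land near 0.8 and 0.2; no continuous g-function can reproduce this.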
Example 3.14 (Jitter in Optical Storage). Baggen et al., Entropy of a bit-shift channel. IMS Lecture Notes-Monograph Series, Vol. 48, pp. 274-285 (2006).
Figure 3.1: Images of DVD disks. The left image shows a DVD-ROM.The track is formed by pressing the disk mechanically onto a master disk.On the right is an image of a rewritable disk. The resolution has beenincreased to demonstrate the irregularities in the track produced by thelaser. These irregularities lead to higher probabilities of jitter errors.
Information is encoded not by the state of the disk (0/1, high/low), but inthe lengths of intervals (see Figure above). At the time the disk is read, thetransitions might be detected at wrong position due to noise, inter-symbolinterference, clock jitter, and some other distortions, the jitter being themain contributing factor.
Let X be the length of a continuous interval of low or high marks on the disk. Then, after reading, the detected length is

Y = X + ω_left − ω_right,

where ω_left, ω_right take values −1, 0, 1, meaning that the transition between the "low"-"high" or "high"-"low" runs was detected one time unit too early, on time, or one unit too late, respectively.
For technical reasons, so-called run-length limited sequences are used. By definition, a (d, k)-RLL sequence has at least d and at most k 0's between 1's. So a "high" mark of 4 units, followed by a "low" of 3 units, followed by a "high" of 4 units, corresponds to the RLL-sequence 100010010001 written to the disk. (The CD uses a (2,10)-RLL code.)
Shamai and Zehavi proposed the following model of a bit-shift channel.

The signal. Let A = {d, . . . , k}, where d, k ∈ N, d < k and d ≥ 2. The input space is A^Z = {x = (x_i) : x_i ∈ A} with the Bernoulli measure P = p^Z, p = (p_d, . . . , p_k).
The jitter. Let us assume that at the left endpoint of each interval the jitter produces an independent error with distribution

P_J(ω_left = −1) = P_J(ω_left = 1) = ε,  P_J(ω_left = 0) = 1 − 2ε,  ε > 0.

Let {ω_n} be the sequence of these independent errors. If X_n is the length of the n-th interval, then the detected length Y_n is given by

Y_n = X_n + ω_n − ω_{n+1}.
Since ω^ext_n = (ω_n, ω_{n+1}) is Markov, {Y_n} is a hidden Markov process, i.e., a function of a Markov process. The corresponding measure Q does not have continuous conditional probabilities. The irregularity is less severe than in the previous example, but one easily finds so-called bad configurations for Q. For example, the configuration

Y_0 = d − 2, Y_1 = . . . = Y_n = d

has effectively one preimage:

φ^{−1}[ (d−2) d . . . d ] ⊆ [ d . . . d ] × [ −1 1 . . . 1 ]
        (n d's)            (n+1 d's)      (n 1's)

and hence

Q(Y_0 = d−2, Y_1 = . . . = Y_n = d) ≤ p_d^{n+1} ε^{n+1}.
On the other hand, [d . . . d] has a large preimage, and hence large probability: one can show that for some C > 0,

Q(Y_1 = . . . = Y_n = d) ≥ (n/C) Q(Y_0 = d−2, Y_1 = . . . = Y_n = d).

Thus

Q(Y_0 = d−2 | Y_1 = . . . = Y_n = d) ≤ C/n → 0  as n → ∞,

and therefore Q is not a g-measure: a g-function would have to be continuous and strictly positive, while this conditional probability tends to 0 along a sequence of cylinders.
The common feature of the previous two examples is that the output process Y = {Y_n} is a function of a Markov chain {X_n} whose transition probability matrix P has some zero entries. The following result shows that this is the principal cause of the irregular behavior of conditional probabilities.
Theorem 3.15. Suppose {X_n}, X_n ∈ A, is a stationary Markov chain with a strictly positive transition probability matrix

P = (p_{i,j})_{i,j∈A} > 0.

Suppose also that φ : A → B is a surjective map, and

Y_n = φ(X_n) for every n.
(i) Then the sequence of conditional probabilities
Q(y0|y1:n) = Q(Y0 = y0|Y1:n = y1:n)
converges uniformly (in y = y0:∞ ∈ BZ+) as n→ ∞ to the limit denoted by
g(y) = Q(y0|y1:∞).
Hence, Q is a g-measure.
(ii) Moreover, there exist C > 0 and θ ∈ (0, 1) such that

β_n := sup_{y_{0:n−1}} sup_{u_{n:∞}, v_{n:∞}} | g(y_{0:n−1} u_{n:∞}) − g(y_{0:n−1} v_{n:∞}) | ≤ Cθ^n   (3.4.1)
for all n ∈ N.
(iii) Therefore, Q (viewed as a measure on B^Z) is Gibbs in the DLR sense.
This result has been obtained in the literature a number of times using different methods; depending on the method employed, the resulting estimates of

θ* = lim sup_n (β_n)^{1/n}

vary greatly.
The sufficient condition P > 0 is definitely far from being necessary. For example, Marcus and Han extended the result of Theorem 3.15 to the case when every column of P is either zero or strictly positive.
However, the problem of finding the necessary and sufficient conditionsfor functions of Markov chains to be “regular” is still open.
A rather curious observation is that in all known cases, if a function of a Markov process, Y_n = φ(X_n), is regular as a process, e.g., Q is a g-measure, then Q is "highly regular", e.g., has an exponential decay of β_n.
Chapter 4
Hidden Gibbs Processes
Last modified on April 24, 2015
4.1 Functions of Gibbs Processes
4.2 Yayama
Definition 4.1 (Bowen). Let Ω = A^Z. A Borel probability measure P on Ω is called a Gibbs measure for φ : Ω → R if there exist C > 0 and P ∈ R such that

1/C < P([x_0 x_1 · · · x_{n-1}]) / exp( (S_n φ)(x) − nP ) < C   (4.2.1)

for every x ∈ Ω and n ∈ N, where

(S_n φ)(x) = ∑_{k=0}^{n-1} φ(T^k x).
Note that for every m, n ≥ 1 and all x,

(S_{n+m} φ)(x) = (S_n φ)(x) + (S_m φ)(T^n x).

The sequence of functions φ_n(x) := (S_n φ)(x), n ≥ 1, thus satisfies

φ_{n+m}(x) = φ_n(x) + φ_m(T^n x).
Definition 4.2. A sequence of continuous functions {φ_n}_{n=1}^∞ is called

• almost additive if there is a constant C ≥ 0 such that for every n, m ∈ N and every x,

| φ_{n+m}(x) − φ_n(x) − φ_m(T^n x) | ≤ C;

• subadditive if

φ_{n+m}(x) ≤ φ_n(x) + φ_m(T^n x);

• asymptotically subadditive if for any ε > 0 there exists a subadditive sequence {ψ_n}_{n=1}^∞ on X such that

lim sup_{n→∞} (1/n) sup_{x∈X} | φ_n(x) − ψ_n(x) | ≤ ε.
Definition 4.3. Suppose {φ_n}_{n=1}^∞ is an asymptotically subadditive sequence of continuous functions. A Borel probability measure P is called a Gibbs measure for {φ_n} if there exist C > 0 and P ∈ R such that

1/C < P([x_0 x_1 · · · x_{n-1}]) / exp( φ_n(x) − nP ) < C   (4.2.2)

for every x and all n ∈ N.

The sequence {φ_n} is said to have bounded variation if there exists a constant M > 0 such that

sup_n sup_{x,x̄} { | φ_n(x) − φ_n(x̄) | : x_0^{n-1} = x̄_0^{n-1} } ≤ M.
Theorem 4.4. Suppose φ = {φ_n} is an almost additive sequence with bounded variation. Then there exists a unique invariant Gibbs measure P_φ for φ; P_φ is mixing. Moreover, P_φ is also the unique equilibrium state for φ:

h(P_φ) + lim_{n→∞} (1/n) ∫ φ_n dP_φ = sup_Q { h(Q) + lim_{n→∞} (1/n) ∫ φ_n dQ }.
Theorem 4.5 (Yuki Yayama). Suppose P is the unique invariant Gibbs measure on A^Z for the almost additive sequence φ = {φ_n} with bounded variation, and

Q = π_* P = P ∘ π^{−1},

where π : A^Z → B^Z is the single-site factor map.

Then Q is the unique invariant Gibbs measure for the asymptotically subadditive sequence ψ = {ψ_n} with bounded variation, where

ψ_n(y) = log ∑_{x_0^{n-1} ∈ π^{−1}(y_0^{n-1})} sup_{x* ∈ [x_0^{n-1}]} exp( φ_n(x*) ).
Remark. Almost Additive ⊆ Asymptotically Subadditive
4.3 Review: renormalization of g-measures
Theorem 4.6. Suppose P is a g-measure with β_n = var_n(g) → 0 sufficiently fast. Then Q = P ∘ π^{−1} is a g-measure as well, i.e., Q(y_0 | y_1^∞) = ḡ(y) with

β̄_n = var_n(ḡ) → 0.
Denker & Gordin (2000): β_n = O(e^{−αn}) ⇒ β̄_n = O(e^{−αn})
Chazottes & Ugalde (2009): ∑_n n^2 β_n < ∞ ⇒ ∑_n β̄_n < ∞
Kempton & Pollicott (2009): ∑_n n β_n < ∞ ⇒ ∑_n β̄_n < ∞
Redig & Wang (2010): ∑_n β_n < ∞ ⇒ β̄_n → 0
V. (2011): ∑_n β_n < ∞ ⇒ β̄_n → 0

If β_n ≈ e^{−αn}, then β̄_n ≈ e^{−α√n} [CU, KP], and ≈ e^{−αn} [DG, RW].
If β_n ≈ n^{−α}, then β̄_n ≈ n^{−α+2} [CU, RW], and ≈ n^{−α+1} [KP].
OPEN PROBLEM: Identify a class of (generalized) Gibbs measures closedunder renormalization transformations – single site/one-block factors.
4.4 Cluster Expansions
Q(y_0^{n-1}) = ∑_{x_0^{n-1} ∈ π^{−1}(y_0^{n-1})} P(x_0^{n-1}),

log Q(y_0^{n-1}) = log ( ∑_{x_0^{n-1} ∈ π^{−1}(y_0^{n-1})} P(x_0^{n-1}) ).
Example: the binary symmetric channel with parameter ε:

log Q(y_0^n) = log ( ∑_{x_0^n} P(x_0^n) (1−ε)^{|{i : y_i = x_i}|} ε^{|{i : y_i ≠ x_i}|} ).

The emission probabilities satisfy

π_{x_i}(y_i) = Prob(Y_i = y_i | X_i = x_i) = (1−ε) I[x_i = y_i] + ε (1 − I[x_i = y_i])
 = (1−ε) ( q + (1−q) I[x_i = y_i] ),

where q = ε/(1−ε).
Thus

Q(y_0^n) = (1−ε)^{n+1} ∑_{x_0^n} P(x_0^n) ∏_{i=0}^{n} ( q + (1−q) I[x_i = y_i] )
 = (1−ε)^{n+1} ∑_{x_0^n} P(x_0^n) ∑_{A⊆[0,n]} ∏_{i∈A} (1−q) I[x_i = y_i] ∏_{i∉A} q
 = (1−ε)^{n+1} ∑_{A⊆[0,n]} (1−q)^{|A|} q^{n+1−|A|} P(x_0^n : x_i = y_i ∀ i ∈ A),

where we used

∑_{x_0^n} P(x_0^n) ∏_{i∈A} I[x_i = y_i] = P(x_0^n : x_i = y_i ∀ i ∈ A).
Q(y_0^n) = (1−ε)^{n+1} ∑_{A⊆[0,n]} (1−q)^{|A|} q^{n+1−|A|} P(x_0^n : x_i = y_i ∀i ∈ A)    (substitute B = A^c = [0,n] \ A)

= (1−ε)^{n+1} ∑_{B⊆[0,n]} (1−q)^{n+1−|B|} q^{|B|} P(x_0^n : x_i = y_i ∀i ∈ B^c)

= [(1−ε)(1−q)]^{n+1} ∑_{B⊆[0,n]} ( q/(1−q) )^{|B|} P(x_0^n : x_i = y_i ∀i ∈ B^c)

= (1−2ε)^{n+1} P(y_0^n) ∑_{B⊆[0,n]} ( q/(1−q) )^{|B|} P(x_0^n : x_i = y_i ∀i ∈ B^c) / P(y_0^n)

= (1−2ε)^{n+1} P(y_0^n) ∑_{B⊆[0,n]} ( q/(1−q) )^{|B|} P(y_{B^c}) / P(y_B y_{B^c})

= (1−2ε)^{n+1} P(y_0^n) ∑_{B⊆[0,n]} ( q/(1−q) )^{|B|} 1/P(y_B | y_{B^c}).

Hence

Q(y_0^n) = (1−2ε)^{n+1} P(y_0^n) ( 1 + ∑_{∅≠B⊆[0,n]} ( q/(1−q) )^{|B|} 1/P(y_B | y_{B^c}) ).
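The subset expansion above is easy to verify numerically by brute force for a short word. The sketch below (with illustrative, not textual, chain parameters) compares the direct channel sum with the expansion over subsets, using q = ε/(1−ε), so that (1−ε)(1−q) = 1−2ε.

```python
import itertools

pi0 = [0.6, 0.4]                  # initial distribution (illustrative)
P = [[0.7, 0.3], [0.2, 0.8]]      # input Markov chain (illustrative)
eps = 0.1                         # channel parameter
y = (0, 1, 1)                     # observed word y_0^n
n = len(y) - 1
states = [0, 1]

def p_chain(x):
    """P(x_0^n) for the input Markov chain."""
    prob = pi0[x[0]]
    for a, b in zip(x, x[1:]):
        prob *= P[a][b]
    return prob

def p_agree(sites):
    """P(x_0^n : x_i = y_i for all i in sites)."""
    return sum(p_chain(x) for x in itertools.product(states, repeat=n + 1)
               if all(x[i] == y[i] for i in sites))

# Left-hand side: push the chain through the binary symmetric channel
lhs = sum(p_chain(x)
          * (1 - eps) ** sum(a == b for a, b in zip(x, y))
          * eps ** sum(a != b for a, b in zip(x, y))
          for x in itertools.product(states, repeat=n + 1))

# Right-hand side: subset expansion, with q/(1-q) = eps/(1-2*eps)
ratio = eps / (1 - 2 * eps)
rhs = (1 - 2 * eps) ** (n + 1) * sum(
    ratio ** len(B) * p_agree([i for i in range(n + 1) if i not in B])
    for r in range(n + 2)
    for B in itertools.combinations(range(n + 1), r))

assert abs(lhs - rhs) < 1e-12
```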
Suppose P is Markov, and decompose B = I_1 ⊔ … ⊔ I_k into maximal subintervals. Then

( q/(1−q) )^{|B|} 1/P(y_B | y_{B^c}) = ∏_{j=1}^{k} ( q/(1−q) )^{|I_j|} 1/P(y_{I_j} | y_{I_j^c ∩ [0,n]}),

so that

Q(y_0^n) = (1−2ε)^{n+1} P(y_0^n) ( 1 + ∑_{I_1⊔…⊔I_k⊆[0,n]} z_{I_1} ⋯ z_{I_k} ).
Cluster expansions. Object: a partition function given as

Ξ_P(z) = 1 + ∑_{k≥1} (1/k!) ∑_{(γ_1,…,γ_k)∈P^k} φ(γ_1,…,γ_k) z_{γ_1} ⋯ z_{γ_k},

where
P = set of polymers,
z = {z_γ ∈ C : γ ∈ P} = fugacities or activities,
φ(γ_1) = 1,
φ(γ_1,…,γ_k) = ∏_{i≠j} I[γ_i ∼ γ_j],
and ∼ is a relation on P.
4.5 Cluster expansion

Theorem. If

Ξ_P(z) = 1 + ∑_{k≥1} (1/k!) ∑_{(γ_1,…,γ_k)∈P^k} φ(γ_1,…,γ_k) z_{γ_1} ⋯ z_{γ_k},

then

log Ξ_P(z) = ∑_{k≥1} (1/k!) ∑_{(γ_1,…,γ_k)∈P^k} φ^T(γ_1,…,γ_k) z_{γ_1} ⋯ z_{γ_k},

where the φ^T(γ_1,…,γ_k) are the truncated functions of order k. Convergence takes place on a polydisc

D = {z : |z_γ| ≤ ρ_γ, γ ∈ P}.
Truncated functions φ^T(γ_1,…,γ_k):

For k = 1, φ^T(γ_1) = 1. Suppose k > 1. Define the graph

G = G_{γ_1,…,γ_k} = (V(G), E(G)), V(G) = {1,…,k}, E(G) = { [i,j] : γ_i ≁ γ_j, 1 ≤ i < j ≤ k }.

If G_{γ_1,…,γ_k} is disconnected, then φ^T(γ_1,…,γ_k) = 0. If G_{γ_1,…,γ_k} is connected, then

φ^T(γ_1,…,γ_k) = ∑_{G′ ⊆ G_{γ_1,…,γ_k}, G′ connected spanning} (−1)^{#E(G′)},

where G′ ranges over all connected spanning subgraphs of G, and E(G′) is the edge set of G′.
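For k identical, pairwise incompatible polymers the graph G_{γ_1,…,γ_k} is the complete graph K_k, and the connected-spanning-subgraph sum can be evaluated by direct enumeration; comparing log(1+z) = ∑_{k≥1} (−1)^{k−1} z^k/k with the truncated expansion gives the known value φ^T = (−1)^{k−1}(k−1)!. A brute-force check (a sketch, not from the text):

```python
import itertools

def phi_T_complete(k):
    """Sum of (-1)^{#edges} over connected spanning subgraphs of K_k."""
    verts = list(range(k))
    edges = list(itertools.combinations(verts, 2))
    total = 0
    for r in range(len(edges) + 1):
        for sub in itertools.combinations(edges, r):
            adj = {v: set() for v in verts}
            for a, b in sub:
                adj[a].add(b)
                adj[b].add(a)
            # breadth-first search from vertex 0: does `sub` connect all of K_k?
            seen, frontier = {0}, [0]
            while frontier:
                v = frontier.pop()
                for w in adj[v]:
                    if w not in seen:
                        seen.add(w)
                        frontier.append(w)
            if len(seen) == k:
                total += (-1) ** len(sub)
    return total

# For k identical mutually incompatible polymers: phi^T = (-1)^{k-1} (k-1)!
assert [phi_T_complete(k) for k in (2, 3, 4)] == [-1, 2, -6]
```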
Definition 4.7. If γ_1, γ_2 are two intervals in Z_+, we say that γ_1 and γ_2 are compatible (denoted γ_1 ∼ γ_2) if γ_1 ∪ γ_2 is not an interval.
Lemma 4.8.

1 + ∑_{∅≠B⊆[0,n]} ( q/(1−q) )^{|B|} 1/P(y_B | y_{B^c}) = 1 + ∑_{k=1}^{∞} (1/k!) ∑_{(γ_1,…,γ_k)∈P_{[0,n]}^k} φ(γ_1,…,γ_k) z^{[0,n]}_{γ_1} ⋯ z^{[0,n]}_{γ_k},

where the set of polymers P_{[0,n]} is the collection of subintervals of [0,n], φ(γ_1) = 1, φ(γ_1,…,γ_k) = ∏_{i≠j} I[γ_i ∼ γ_j], and the fugacities are given by

z^{[0,n]}_{[a,b]} = ( ε/(1−2ε) )^{b−a+1} 1/P( y_{[a,b]} | y_{{a−1,b+1}∩[0,n]} ).
Corollary. For all y_0^n, one has

log Q(y_0 | y_{[1,n]}) = log p_{y_0,y_1} + log(1−2ε)
+ ∑_{k=1}^{∞} (1/k!) [ ∑_{(γ_1,…,γ_k)∈P^k_{[0,n]}, (∪_i γ_i)∩{0,1}≠∅} φ^T(γ_1,…,γ_k) z^{[0,n]}_{γ_1} ⋯ z^{[0,n]}_{γ_k}
− ∑_{(γ_1,…,γ_k)∈P^k_{[1,n]}, (∪_i γ_i)∩{1}≠∅} φ^T(γ_1,…,γ_k) z^{[1,n]}_{γ_1} ⋯ z^{[1,n]}_{γ_k} ]

= log p_{y_0,y_1} + log(1−2ε) + ∑_{m=1}^{∞} ( ε/(1−2ε) )^m ∑_{k=1}^{∞} C_{n,m,k}(y_{[0,n]}).

Observation 1: C_{n,m,k} = 0 for k > m.

Observation 2: If m < n, then C_{n,m,k} depends only on y_{[0,m+1]}.
C_{2,1,1}(y_{[0,2]}) = 1/P(y_0|y_1) + 1/P(y_1|y_0,y_2) − 1/P(y_1|y_2),

C_{3,2,1}(y_{[0,3]}) = 1/P(y_0,y_1|y_2) + 1/P(y_1,y_2|y_0,y_3) − 1/P(y_1,y_2|y_3),

C_{3,2,2}(y_{[0,3]}) = − 1/P(y_0|y_1)² − 1/(P(y_0|y_1)P(y_1|y_0,y_2)) − 1/P(y_1|y_0,y_2)²
+ 1/P(y_1|y_2)² − 1/(P(y_1|y_0,y_2)P(y_2|y_1,y_3)) + 1/(P(y_1|y_2)P(y_2|y_1,y_3)).
Theorem 4.9 (Fernández–V.). (i) The expansions

log Q(y_0 | y_1^n) = ∑_{k=0}^{∞} f_k^{(n)}(y_0^n) ε^k    (4.5.1)

converge absolutely and uniformly on the complex disc |ε| ≤ r with

r/(1−2r) ≤ p/4.    (4.5.2)

(ii) Uniformly on this disc,

lim_{n→∞} log Q(y_0 | y_1^n) = log g(y_0^∞) = ∑_{k=0}^{∞} f_k(y_0^∞) ε^k.    (4.5.3)

(iii) For every k and all n ≥ k + 1,

f_k^{(n)}(y_0^n) = f_k(y_0^{k+1}).    (4.5.4)
Chapter 5
Hidden Gibbs Fields
Part 2: General theory of hidden Gibbs models. The observable process is a function, either deterministic or random, of the underlying "signal" process, governed by a Gibbs distribution.

Random processes (d = 1): {X_n}_{n∈Z} → {Y_n}_{n∈Z}.

Random fields (d > 1): {X_n}_{n∈Z^d} → {Y_n}_{n∈Z^d}.
Many natural questions:

• Is the image of a Gibbs measure again Gibbs? (⇔ Is {Y_n} Gibbs?)
• What are the probabilistic properties of images of Gibbs measures?
• If the image measure is indeed Gibbs, can we determine the corresponding interaction?
• Given the output {Y_n}, what can be said about {X_n}?
• Given {Y_n}, either Gibbs or not, can we efficiently approximate it by a Gibbs process?

Dimension d = 1: it has been known for a long time that functions of Markov (hence, also of Gibbs) processes can be "irregular":

Y_n = X_n X_{n+1}, X_n ∼ Ber(p), X_n ∈ {−1, 1}.
Here one can "attribute" the singularity to the fact that X_n^ext = (X_n, X_{n+1}) is not fully supported. Functions of fully supported Markov chains (⇔ P > 0) are Gibbs.

In dimension d > 1 (the realm of statistical physics), one typically works with fully supported Gibbs states. It came as a surprise that functions of simple fully supported Markov random fields (like the Ising model) can be non-Gibbsian. The history is well documented: see papers (= books) by van Enter, Fernández, Sokal, Le Ny, Bricmont, Kupiainen, Lefevere, …
5.1 Renormalization Transformations in Statistical Mechanics

Textbook examples of RG transformations:

• Random field {σ_n}_{n∈Z^d}, lattice L = Z^d.
• Cover the lattice Z^d by disjoint (non-overlapping) boxes {B_x}_{x∈L̄}, L̄ ⊆ L.
• For every box B_x, let σ̄_x = F(σ_n | n ∈ B_x), with F random or deterministic.

Examples:

• Decimation: σ̄_n = σ_{2n}.
• Majority rule: σ_n ∈ {±1}, σ̄_x = majority of the σ_n's, n ∈ B_x.
• Kadanoff's copying with noise:

P(σ̄_x = 1 | σ_n, n ∈ B_x) = (1/Z) exp( β ∑_{n∈B_x} σ_n ),
P(σ̄_x = −1 | σ_n, n ∈ B_x) = (1/Z) exp( −β ∑_{n∈B_x} σ_n ).
• Spin-flip dynamics (Glauber spin flips):

– Each n ∈ Z^d has an independent Poissonian clock with intensity λ.
– Every time the clock rings, the spin is flipped: σ_n → −σ_n.

⇒ a continuous-time Markov process on {±1}^{Z^d}:

P_0 → P_t = S_t P_0.

For fixed t > 0, P_0 → P_t under S_t, where

(S_t σ)_n = σ_n with probability 1−ε_t, and −σ_n with probability ε_t.

Exercise: Show that ε_t = (1/2)(1 − e^{−2λt}) for all t ≥ 0.
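The exercise can be confirmed numerically: the spin differs from its initial value iff the Poisson(λt) number of clock rings is odd, and summing the Poisson distribution over odd values recovers the closed form (λ and t below are illustrative):

```python
import math

lam, t = 1.3, 0.7          # clock intensity and time (illustrative)
mu = lam * t               # mean number of rings in [0, t]

# The spin is flipped once per ring, so it differs from its initial
# value iff the number of rings is odd:
# eps_t = sum over odd k of e^{-mu} mu^k / k!
eps_t = sum(math.exp(-mu) * mu ** k / math.factorial(k)
            for k in range(1, 80, 2))

# Closed form: e^{-mu} sinh(mu) = (1 - e^{-2 mu}) / 2
assert abs(eps_t - 0.5 * (1 - math.exp(-2 * mu))) < 1e-12
```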
5.2 Examples of pathologies under renormalization

L = Z^d = ⋃_{x∈L̄} B_x, σ ∈ A^{Z^d} = A^L → σ̄ ∈ B^{L̄}.

Physical point of view:

exp(−β̄ H̄(σ̄)) = ∑_{σ → σ̄} exp(−β H(σ)).
E.g.

H = ∑_i J_i σ_i + ∑_{i,j} J_{ij} σ_i σ_j + ∑_{i,j,k} J_{ijk} σ_i σ_j σ_k + ⋯,
H̄ = ∑_i J̄_i σ̄_i + ∑_{i,j} J̄_{ij} σ̄_i σ̄_j + ⋯.

Try to define: J = (J_i, J_{ij}, …) → J̄ = (J̄_i, J̄_{ij}, …).
• It was implicitly assumed that renormalization is always successful:

∃ R : Potentials → Potentials.

• The first examples where some "difficulties"/pathologies were observed appeared in the early 1980's.
• Soon it was explained that the source of the problems is the lack of quasi-locality.

As we have seen, by the Kozlov–Sullivan theorem, P is Gibbs ⇐⇒ P is consistent with some specification γ = {γ_Λ(·|·)}_{Λ⋐L} which is

• uniformly non-null: ∀Λ ⋐ L, ∃ 0 < α_Λ < β_Λ < 1 such that

0 < α_Λ ≤ γ_Λ(σ_Λ | σ_{Λ^c}) ≤ β_Λ < 1;

• quasilocal (regular): ∀Λ ⋐ L,

sup_{σ,η,ξ} | γ_Λ(σ_Λ | σ_{W\Λ} η_{W^c}) − γ_Λ(σ_Λ | σ_{W\Λ} ξ_{W^c}) | → 0 as W ↑ L.
In principle, both properties can fail under renormalization. However, suppose P ∈ G(γ), where P is a measure on A^{Z^d}. A function ϕ : A → B (onto) extends to a map A^{Z^d} → B^{Z^d}. Let Q = ϕ_*P = P ∘ ϕ^{−1}. Then for all Λ ⋐ Z^d:

Q(σ̄_Λ) = ∑_{σ_Λ ∈ ϕ^{−1}(σ̄_Λ)} P(σ_Λ),

and

Q(σ̄_0 | σ̄_{Λ\0}) = ∑_{σ_Λ ∈ ϕ^{−1}(σ̄_Λ)} P(σ_Λ) / ∑_{σ_{Λ\0} ∈ ϕ^{−1}(σ̄_{Λ\0})} P(σ_{Λ\0})
= ∑_{σ_Λ ∈ ϕ^{−1}(σ̄_Λ)} P(σ_0 | σ_{Λ\0}) ( P(σ_{Λ\0}) / ∑_{σ′_{Λ\0} ∈ ϕ^{−1}(σ̄_{Λ\0})} P(σ′_{Λ\0}) ),

i.e.,

Q(σ̄_0 | σ̄_{Λ\0}) = ∑_{σ_Λ ∈ ϕ^{−1}(σ̄_Λ)} P(σ_0 | σ_{Λ\0}) W(σ_{Λ\0}).
• the P(σ_0 | σ_{Λ\0}) converge to a limit in [α_0, β_0],
• the W(σ_{Λ\0}) are probability weights.

⇒ Q inherits uniform non-nullness from P.

Conclusion: thus only quasi-locality can fail: Q may not be consistent with any quasi-local specification.
Absence of quasi-locality
Definition: A measure P on Ω is not quasi-local at ω ∈ Ω if there exist Λ ⋐ L and a quasi-local f : Ω → R such that no "version"/realization of E_P(f | F_{Λ^c})(·) is quasi-local at ω. (For a finite spin space, Ω = A^L, quasi-locality is the same as continuity in the product topology.)

Problem: the conditional expectations E_P(f | F_{Λ^c})(·) are defined only P-a.s. It seems impossible to check "all" versions.

Remark: Continuity means the Feller property: γ_Λ(f) ∈ C(Ω) for all f ∈ C(Ω).

Theorem 5.1 (Existence of a bad configuration). Suppose P is uniformly non-null. A point ω ∈ Ω is called a bad point if ∃δ > 0 such that ∀Λ ⋐ Z^d, ∃V ⋐ Z^d with Λ ⊂ V, and η, ξ ∈ Ω such that

P(ω_0 | ω_{Λ\0} η_{V\Λ} σ^+_{V^c}) − P(ω_0 | ω_{Λ\0} ξ_{V\Λ} σ^−_{V^c}) ≥ δ

for all σ^+, σ^− ∈ Ω. If ω ∈ Ω is a bad point, then P is not quasi-local.

Remark: Existence of a bad configuration for P is "stronger" than non-quasilocality of P. But in all known cases, non-quasilocality of P has been established by presenting a bad ω.
How to identify bad points?
Hidden phase transition scenario (van Enter, Fernández, Sokal): "Loss of Gibbsianity occurs when there is a hidden phase transition in the original system conditioned on the image spins."

• σ̄ ∈ Ω̄ image configuration; X_{σ̄} = ϕ^{−1}(σ̄) ⊆ Ω the fibre over σ̄.
• If |G_{X_{σ̄}}(φ)| = 1 for all σ̄ ∈ Ω̄, then ν = ϕ_*µ is Gibbs.
• If |G_{X_{σ̄}}(φ)| > 1 for some σ̄, then ν = ϕ_*µ is not Gibbs.
• The hypothesis is true in all known cases!
• Phase transitions: e.g. Ising-type models:

∃ β_c = β_c(φ): ∀β < β_c, |G(βφ)| = 1; ∀β > β_c, |G(βφ)| > 1.

• Take β < β_c ⇒ there is a unique phase. Conditioning on the image spins lowers the temperature beyond the critical value: |G_{X_{σ̄}}(βφ)| ≥ 2.
Examples

• Decimation of the Ising model:

H = −J ∑_{i∼j} σ_i σ_j.

Theorem 5.2. Let P be any Gibbs measure for the 2D Ising model with J > (1/2) cosh^{−1}(1+√2) ≈ 0.769… and zero magnetic field. Then Q = (T_2)_*P is not consistent with any quasi-local specification.
Kadanoff's transformation:

{σ_n}_{n∈Z^d} → {σ̄_n}_{n∈Z^d},

Π(σ̄_n = 1 | σ_n) = e^{p σ_n} / (2 cosh(p σ_n)), Π(σ̄_n = −1 | σ_n) = e^{−p σ_n} / (2 cosh(p σ_n)).

Theorem 5.3. Suppose d > 1 and 0 < p < ∞. Then ∃ J_0 = J_0(d, p): for all J > J_0 and any Gibbs measure P for H = −J ∑_{i∼j} σ_i σ_j, the renormalized measure Q is not Gibbs.

Hidden Markov random field → Markov random field; the single-site Kadanoff transformation is a BSC with ε = 1/(1+e^{2p}).
Repeated application of Kadanoff's transformations:

P[σ̄_n = σ_n] = 1−ε, P[σ̄_n ≠ σ_n] = ε.
Potts model:

Ω = {1,…,q}^{Z^d}, A = {1,…,q}, q spins/colors,

γ_Λ(σ_Λ | σ_{Λ^c}) = (1/Z) exp( 2β ∑_{i∈Λ, j∼i} I[σ_i = σ_j] ).

Phase diagram (Aizenman–Chayes–Chayes–Newman): there exists β_c = β_c(q,d) ∈ (0,+∞):

• ∀β < β_c (high temperature) ∃! unique Gibbs state;
• ∀β > β_c (low temperature) ∃ multiple Gibbs states.

Fuzzy Potts model (Maes, Vande Velde, Häggström):

ϕ : {1,…,q} → {1,…,s}, s < q,

σ̄_n = ϕ(σ_n) = 1 if σ_n ∈ {1,…,r_1}; 2 if σ_n ∈ {r_1+1,…,r_1+r_2}; …; s if σ_n ∈ {r_1+⋯+r_{s−1}+1,…,r_1+⋯+r_s = q}.

Agreement: r_1 = min{r_i | i > 1}.

Theorem 5.4. Q = ϕ_*P = P ∘ ϕ^{−1} (P ∈ G(βφ)) is

• Gibbs, if β < β_c(r_1, d);
• not Gibbs, if β > β_1 = (1/2) log[ (1+(r_1−1)p_c(d)) / (1−p_c(d)) ].
5.3 General Properties of Renormalised Gibbs Fields

First Fundamental Theorem

Theorem 5.5. If P_1, P_2 ∈ G_S(φ) (Gibbs for the same φ), then Q_1 = T_*P_1, Q_2 = T_*P_2 are either

• both non-Gibbs,
• or Gibbs for the same interaction ψ.

⇒ Single-valuedness of the RG transformation:

• it acts on interactions/Hamiltonians, not "measures";
• it excludes the possibility Q_1 ∈ G(ψ_1), Q_2 ∈ G(ψ_2) with G(ψ_1) ≠ G(ψ_2).

RG relation:

R_T = { (φ, ψ) : ∃µ ∈ G_S(φ), ∃ν ∈ G_S(ψ) : ν = T_*µ }.

Second Fundamental Theorem: if (φ_0, ψ_0) ∈ R_T and (φ_1, ψ_1) ∈ R_T, then

|||ψ_0 − ψ_1||| ≤ K |||φ_0 − φ_1|||,

where ||| · ||| is the norm modulo physical equivalence.

⇒ Continuity of RG maps.
5.4 Dobrushin’s reconstruction program
• Classify possible singularities of renormalized Gibbs states.
• Try to recover some basic thermodynamic results (e.g. VariationalPrinciple).
5.4.1 Generalized Gibbs states / "classification" of singularities

Reminder: P is Gibbs if P is consistent with a quasi-local specification Γ (P ∈ G(Γ)):

∀Λ : sup_{ω,η} | γ_Λ(ω_Λ | ω_{Λ_n\Λ} η_{Λ_n^c}) − γ_Λ(ω_Λ | ω_{Λ_n\Λ} ω_{Λ_n^c}) | → 0 as n → ∞.

A specification Γ = {γ_Λ} might not be quasi-local:

Ω_Γ := { ω : ∀Λ, sup_η | γ_Λ(ω_Λ | ω_{Λ_n\Λ} η_{Λ_n^c}) − γ_Λ(ω_Λ | ω_{Λ^c}) | → 0 as n → ∞ }.

Definition: P is almost Gibbs if P is consistent with a specification Γ (P ∈ G(Γ)) and

P(Ω_Γ) = 1.
Gibbs specifications are given by

γ_Λ(ω_Λ | ω_{Λ^c}) = γ^φ_Λ(ω_Λ | ω_{Λ^c}) = (1/Z) exp(−H^φ_Λ(ω_Λ ω_{Λ^c})).

If φ is uniformly absolutely convergent, then Γ^φ = {γ^φ_Λ} is quasi-local. The Hamiltonian

H^φ_Λ(ω) = ∑_{V∩Λ≠∅} φ(V, ω_V)

might not converge everywhere. Let

Ω_φ = { ω : H^φ_Λ(ω) converges ∀Λ }.

Definition: P is called weakly Gibbs for φ if

P(Ω_φ) = 1 and P ∈ G(Γ^φ).

Almost Gibbs: quasi-locality a.e. Weakly Gibbs: convergence a.e.

Definition: P is called intuitively weakly Gibbs for φ if P is weakly Gibbs for φ:

• P(Ω_φ) = 1, H_Λ converges on Ω_φ, and P ∈ G(Γ^φ);

and
• there exists Ω^reg_φ ⊆ Ω_φ with P(Ω^reg_φ) = 1 such that

γ^φ_Λ(ω_Λ | ω_{Λ_n\Λ} η_{Λ_n^c}) → γ^φ_Λ(ω_Λ | ω_{Λ^c})

for all ω, η ∈ Ω^reg_φ.

Theorem 5.6. G ⊊ AG ⊊ IWG ⊆ WG.
Open problem: is IWG ⊊ WG?
γ_Λ(ω_Λ | ω_{Λ_n\Λ} η_{Λ_n^c}) → γ_Λ(ω_Λ | ω_{Λ^c}):

• Gibbs: for all ω and for all η.
• Almost Gibbs: for a.a. ω and for all η.
• Intuitively weakly Gibbs: for a.a. ω and for a.a. η.
Definition: Let Ω = A^{Z^d}. A function f : Ω → R is called quasi-local at ω ∈ Ω in the direction θ ∈ Ω if f(ω) = lim_{Λ↑Z^d} f(ω_Λ θ_{Λ^c}).

Remark: f is quasi-local at ω ⇐⇒ f is quasi-local at ω in all directions θ ∈ Ω.

A specification Γ = {γ_Λ(·|·)} is called quasi-local at ω ∈ Ω in the direction θ ∈ Ω if, for every Λ, γ_Λ(ω_Λ | ω_{Λ^c}) = f_Λ(ω) is quasi-local at ω in the direction θ
⇐⇒ γ_Λ g is quasi-local at ω in the direction θ (∀Λ, ∀ quasi-local g).

Consider the following sets:

Ω^θ_qloc(Γ) = Ω^θ_q(Γ) = { ω ∈ Ω : Γ is quasi-local at ω in the direction θ },
Ω_qloc(Γ) = Ω_q(Γ) = { ω ∈ Ω : Γ is quasi-local at ω }.
Proposition 5.7.
• If P is consistent with a specification Γ and ∃θ ∈ Ω such that P(Ω^θ_q(Γ)) = 1, then P is weakly Gibbs.
• If P is intuitively weakly Gibbs, then it is consistent with a specification Γ such that

P[ {θ ∈ Ω : P(Ω^θ_q(Γ)) = 1} ] = 1.

• If P is almost Gibbs, then it is consistent with a specification Γ such that

P(Ω_q(Γ)) = 1.
Möbius transform: let E be a finite set, and consider two collections

F = {F_A}_{A⊆E}, G = {G_A}_{A⊆E}.

Then the following are equivalent:

• ∀A : G_A = ∑_{B⊆A} F_B;
• ∀A : F_A = ∑_{B⊆A} (−1)^{|A\B|} G_B.
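The equivalence is easy to confirm numerically on a small ground set E; the random values F_A below are arbitrary test data, not from the text:

```python
import itertools
import random

E = (0, 1, 2, 3)
subsets = [frozenset(s) for r in range(len(E) + 1)
           for s in itertools.combinations(E, r)]

rng = random.Random(0)
F = {A: rng.uniform(-1.0, 1.0) for A in subsets}

# Forward transform: G_A = sum over B subset of A of F_B
G = {A: sum(F[B] for B in subsets if B <= A) for A in subsets}

# Moebius inversion: F_A = sum over B subset of A of (-1)^{|A \ B|} G_B
for A in subsets:
    recovered = sum((-1) ** len(A - B) * G[B] for B in subsets if B <= A)
    assert abs(recovered - F[A]) < 1e-9
```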
• Gibbs specifications:

γ_Λ(ω_Λ | σ_{Λ^c}) = (1/Z(σ_{Λ^c})) exp(−H_Λ(ω_Λ σ_{Λ^c}))

⇒ γ_Λ(ω_Λ | θ_{Λ^c}) / γ_Λ(θ_Λ | θ_{Λ^c}) = exp( −[H_Λ(ω_Λ θ_{Λ^c}) − H_Λ(θ_Λ θ_{Λ^c})] ).

• Fix θ ∈ Ω. Then

∑_{V∩Λ≠∅} φ_V(ω_Λ θ_{Λ^c}) = H_Λ(ω_Λ θ_{Λ^c}) = − log [ γ_Λ(ω_Λ | θ_{Λ^c}) / γ_Λ(θ_Λ | θ_{Λ^c}) ].

Vacuum potential:

φ^θ_Γ(A, ω) = − ∑_{Λ⊆A} (−1)^{|A\Λ|} log [ γ_Λ(ω_Λ | θ_{Λ^c}) / γ_Λ(θ_Λ | θ_{Λ^c}) ]
⇐⇒ log [ γ_Λ(ω_Λ | θ_{Λ^c}) / γ_Λ(θ_Λ | θ_{Λ^c}) ] = − ∑_{A⊆Λ} φ^θ_Γ(A, ω).

Hence

−H^θ_Λ(ω) = − ∑_{A∩Λ≠∅} φ^θ_Γ(A, ω) = − ∑_{A⊆Λ} φ^θ_Γ(A, ω) − ∑_{A∩Λ≠∅, A∩Λ^c≠∅} φ^θ_Γ(A, ω),

and

H^θ_Λ(ω_Λ θ_{Λ^c}) = ∑_{A⊆Λ} φ^θ_Γ(A, ω_A) + 0.
Definition: φ = {φ(Λ, ·)} has vacuum θ ∈ Ω if φ(A, ω_A) = 0 whenever ω_i = θ_i for some i ∈ A. Therefore in the sum above:

∑_{A∩Λ≠∅, A∩Λ^c≠∅} φ(A, ω_A θ_{A^c}) = 0.

Exercise: Show that

φ^θ_Γ(A, ω) := − ∑_{Λ⊆A} (−1)^{|A\Λ|} log [ γ_Λ(ω_Λ | θ_{Λ^c}) / γ_Λ(θ_Λ | θ_{Λ^c}) ]

has vacuum θ; it is the desired vacuum potential. But:

• the potential might not be convergent;
• even if it is convergent, it might not be consistent with Γ.

Proposition 5.8. The vacuum potential φ^θ_Γ is convergent in ω and consistent with Γ in ω if and only if ω ∈ Ω^θ_Γ; in that case

∑_{A⊆L, A∩Λ≠∅} φ^θ_Γ(A, ω) = − log [ γ_Λ(ω_Λ | ω_{L\Λ} θ_{L^c}) / γ_Λ(θ_Λ | ω_{L\Λ} θ_{L^c}) ].
5.4.2 Variational Principles

Definition: Suppose P, Q are translation-invariant measures on Ω = A^{Z^d}. Then the relative entropy density is

h(Q|P) = lim_{n→∞} (1/|Λ_n|) ∑_{ω∈A^{Λ_n}} Q(ω) log [ Q(ω)/P(ω) ],

provided the limit exists.

Theorem 5.9. If P is Gibbs for a UAC potential φ, then h(Q|P) exists for any Q ∈ M_S(Ω).

Theorem 5.10 (Variational Principle). If P ∈ G(φ) is a translation-invariant measure and Q is a translation-invariant measure, then

h(Q|P) = 0 ⇔ Q ∈ G(φ).
h(Q|P) = lim_{Λ_n↑Z^d} (1/|Λ_n|) ∑_{ω_{Λ_n}} Q(ω_{Λ_n}) log [ Q(ω_{Λ_n})/P(ω_{Λ_n}) ]
= −h(Q) − lim_{Λ_n↑Z^d} (1/|Λ_n|) ∑_{ω_{Λ_n}} Q(ω_{Λ_n}) log P(ω_{Λ_n}).

For Gibbs P:

h(Q|P) = P(f_φ) − h(Q) − ∫ f_φ dQ ≥ 0.
Variational principles for generalized Gibbs states

Theorem 5.11. Suppose h(Q|P) = 0. Then

Q(σ_o | σ_{o^c}) = P(σ_o | σ_{o^c}) (Q-a.s.)

But what does this mean for general Q and P, which may be mutually singular? Q(σ_o | σ_{o^c}) is defined Q-a.s., while P(σ_o | σ_{o^c}) is defined P-a.s. One can conclude very little if, for example, P is not regular in some sense and Q "sits" on a set where P is irregular.

Side remark: there exists P such that h(Q|P) = 0 for all Q.

In the category of weakly Gibbsian measures one cannot do anything "sensible":

Theorem 5.12. There exist weakly Gibbsian measures P_1, P_2 with

h(P_1|P_2) = h(P_2|P_1) = 0,

but

P_1 ∉ G(Γ_2) and P_2 ∉ G(Γ_1),

where Γ_i is the weakly Gibbsian specification for P_i, i = 1, 2.
Category of Almost Gibbs Measures:
Theorem 5.13. Suppose P is almost Gibbs for Γ (P(Ω_Γ) = 1) and Q is concentrated on Ω_Γ (Q(Ω_Γ) = 1). Then

h(Q|P) = 0 ⇒ Q ∈ G(Γ),

and hence Q is almost Gibbs as well.

Theorem 5.14. If P ∈ G(Γ), then

Ω_Γ = { ω : P(σ_Λ | ω_{Λ_n\Λ}) → γ_Λ(σ_Λ | ω_{Λ^c}) }

has P-measure 1. If

h(Q|P) = 0 and Q(Ω_Γ) = 1,

then Q ∈ G(Γ).
This implies the variational principle for intuitively weakly Gibbsian states.
Theorem 5.15. Suppose P ∈ G(φ) on Ω = A^{Z^d}, and Q = ϕ_*P = P ∘ ϕ^{−1}, where ϕ : A^{Z^d} → B^{Z^d} is a single-site renormalization transformation. Then Q might be non-Gibbs, but

• h(Q̄|Q) exists for any Q̄ ∈ M_s(B^{Z^d});
• if h(Q̄|Q) = 0, then there exists P̄ ∈ G(φ) with Q̄ = ϕ_*P̄ = P̄ ∘ ϕ^{−1}.

This is a complete recovery of the variational principle for Gibbs states in the case of hidden Gibbs states. The key is the following result: if Q = ϕ_*P, then

h(Q̄|Q) = inf_{ϕ_*P̄ = Q̄} h(P̄|P),

and the infimum is achieved. Hence

h(Q̄|Q) = 0 ⇒ h(P̄|P) = 0 ⇒ P̄ is Gibbs.
Theorem 5.16 (First Fundamental Theorem). If P_1, P_2 ∈ G(φ) on A^{Z^d} and ϕ : A^{Z^d} → B^{Z^d} is single-site, then Q_1 = ϕ_*P_1, Q_2 = ϕ_*P_2 are either both Gibbs for the same potential ψ, or both non-Gibbs. Indeed,

0 ≤ h(Q_1|Q_2) ≤ h(P_1|P_2) = 0 ⇒ h(Q_1|Q_2) = h(Q_2|Q_1) = 0.

If, say, Q_2 is Gibbs, then h(Q_1|Q_2) = 0 implies that Q_1 is Gibbs for the same potential.
Corollary 5.17 (Corollary of the First Fundamental Theorem). Suppose ϕ_* : G_{A^{Z^d}}(φ) → G_{B^{Z^d}}(ψ), P ↦ Q; then ϕ_*(G_{A^{Z^d}}(φ)) ⊆ G_{B^{Z^d}}(ψ).

The variational principle tells us that

ϕ_*(G_{A^{Z^d}}(φ)) = G_{B^{Z^d}}(ψ) (onto).

"Conjecture": modulo trivial counter-examples,

ϕ_* : G_{A^{Z^d}}(φ) → G_{B^{Z^d}}(ψ)

is a bijection.
RG relation:

R_ϕ = { (φ, ψ) : φ is a UAC interaction on A^{Z^d}, ψ is a UAC interaction on B^{Z^d}, ∃P ∈ G(φ), Q ∈ G(ψ) : Q = ϕ_*P }.

It is a relation by the First Fundamental Theorem. Let R_ϕ(ψ) = {φ : (φ, ψ) ∈ R_ϕ} be the fibre over ψ; R_ϕ(ψ) is typically non-trivial, since ϕ is many-to-one.

Theorem 5.18. If ψ_1, ψ_2 are uniformly absolutely convergent interactions on B^{Z^d}, then R_ϕ(ψ_1) and R_ϕ(ψ_2) are isomorphic: if (φ_1, ψ_1) ∈ R_ϕ, then φ_2 = φ_1 − ψ_1∘ϕ + ψ_2∘ϕ satisfies (φ_2, ψ_2) ∈ R_ϕ.
Theorem 5.19 (Second Fundamental Theorem). If (φ_1, ψ_1) ∈ R_ϕ and (φ_2, ψ_2) ∈ R_ϕ, then |||ψ_1 − ψ_2||| ≤ K · |||φ_1 − φ_2|||.
Claim: K = 1.

What is ||| · |||? It is a supremum norm on f_φ modulo physical equivalence. Suppose φ = {φ(Λ, ·)}_{Λ⋐Z^d} is a UAC interaction, and φ̃ = {φ(Λ, ·) + c_Λ}_{Λ⋐Z^d} is also UAC; then G(φ) = G(φ̃) (since γ^φ_Λ = γ^{φ̃}_Λ ∀Λ). Define

|||φ_1 − φ_2||| = inf_{g coboundary, c constant} ‖ f_{φ_1} − f_{φ_2} + g − c ‖_∞.
After the last slides

Definition: For f : A^{Z^d} → R, Λ_n := [0, n−1]^d ∩ Z^d, T : A^{Z^d} → A^{Z^d}, the pressure is

P(f) := lim_{n→∞} (1/|Λ_n|) log ∑_{ω_{Λ_n}∈A^{Λ_n}} exp( sup_{ω̄ : ω̄_{Λ_n} = ω_{Λ_n}} ∑_{k∈Λ_n} f(T^k ω̄) ).

Theorem 5.20 (Weak Variational Principle). For f ∈ C(A^{Z^d}),

P(f_Φ) = sup_P [ h(P) + ∫ f_Φ dP ].

P is an equilibrium state if P(f_Φ) = h(P) + ∫ f_Φ dP.
(1) Relative pressure with respect to P:

P(f|P) = lim_{n→∞} (1/|Λ_n|) log ∫ exp( ∑_{k∈Λ_n} f(T^k ω) ) P(dω),

provided the limit exists. Suppose P is Gibbs, P ∈ G(Φ). Then

log ∑_{ω_{Λ_n}∈A^{Λ_n}} exp( ∑_{k∈Λ_n} f(T^k ω) ) P(ω_{Λ_n})
= log ∑_{ω_{Λ_n}∈A^{Λ_n}} exp( ∑_{k∈Λ_n} f(T^k ω) ) exp( ∑_{k∈Λ_n} f_Φ(T^k ω) − |Λ_n| P(f_Φ) )
= log ∑_{ω_{Λ_n}∈A^{Λ_n}} exp( ∑_{k∈Λ_n} f(T^k ω) + ∑_{k∈Λ_n} f_Φ(T^k ω) − |Λ_n| P(f_Φ) ),

so that

P(f|P) = P(f + f_Φ) − P(f_Φ).
(2) For Gibbs P:

h(Q|P) = P(f_Φ) − h(Q) − ∫ f_Φ dQ.

Theorem 5.21.

P(f|P) = sup_Q [ ∫ f dQ − h(Q|P) ],
h(Q|P) = sup_{f∈C(A^{Z^d})} [ ∫ f dQ − P(f|P) ].
For η ∈ B^{Z^d}, let X_η = ϕ^{−1}(η) = {ω : ϕ(ω) = η} = ∏_{k∈Z^d} ϕ^{−1}(η_k) be the fibre over η, and define the relative pressure

P(f, ϕ)(η) = P(f | X_η) := lim_{n→∞} (1/|Λ_n|) log ∑_{ω_{Λ_n} : ϕ(ω_{Λ_n}) = η_{Λ_n}} exp( sup ∑_{k∈Λ_n} f(T^k ω) ).

For fixed η, P(·, ϕ)(η) : C(A^{Z^d}) → R is a convex real-valued functional,

|P(f_1, ϕ)(η) − P(f_2, ϕ)(η)| ≤ ‖f_1 − f_2‖_∞,

and P(f, ϕ)(η) = P(f, ϕ)(S^k η) (shift invariance).
Theorem 5.22. Let ϕ : A^{Z^d} → B^{Z^d}, f ∈ C(A^{Z^d}), Q ∈ M_S(B^{Z^d}). Then

h(Q) + ∫_{B^{Z^d}} P(f, ϕ)(y) Q(dy) = sup_{Q=ϕ_*P} [ h(P) + ∫_{A^{Z^d}} f dP ].

Theorem 5.23.

P ∈ ES_{A^{Z^d}}(f) ⇐⇒ Q = ϕ_*P ∈ ES_{B^{Z^d}}(P(f, ϕ)),

and P is a relative equilibrium state for f over Q.

Suppose Q = ϕ_*P, P ∈ G(Φ) = ES(f_Φ). Then

h(Q) + ∫ P(f_Φ, ϕ) dQ = sup_{Q̄} [ h(Q̄) + ∫ P(f_Φ, ϕ) dQ̄ ].

Suppose Q ∈ G(Ψ) = ES(f_Ψ). Then

h(Q) + ∫ f_Ψ dQ = sup_{Q̄} [ h(Q̄) + ∫ f_Ψ dQ̄ ].

Here

f*_Ψ(ω) = lim_{n→∞} (1/|Λ_n|) ∑_{k∈Λ_n} f_Ψ(S^k ω),

and

lim_{n→∞} (1/|Λ_n|) ∑_{k∈Λ_n} f_Ψ(S^k ω) = P(f_Φ, π), Q-a.s. for every ergodic Q.
5.5 Criteria for Preservation of Gibbs property under renormalization

Suppose P ∈ G(φ), Q = ϕ_*P ∈ G(ψ), where ϕ : A^{Z^d} → B^{Z^d} is a single-site transformation:

Q = ϕ_*P = P ∘ ϕ^{−1} ⇔ Q(y_Λ) = ∑_{x_Λ∈ϕ^{−1}(y_Λ)} P(x_Λ).

In particular,

Q(y_{Λ_n}) = ∑_{x_{Λ_n}∈ϕ^{−1}(y_{Λ_n})} P(x_{Λ_n}).

• If P is Gibbs, then

P(x_{Λ_n}) = exp( (S_{Λ_n} f_φ)(x) − |Λ_n| P(f_φ) ) · D_{Λ_n}(x).

• If Q is Gibbs, then

Q(y_{Λ_n}) = exp( (S_{Λ_n} f_ψ)(y) − |Λ_n| P(f_ψ) ) · D̄_{Λ_n}(y),

where

(S_Λ g)(z) = ∑_{k∈Λ} g(S^k z)

and

log D_{Λ_n}(x), log D̄_{Λ_n}(y) = o(|Λ_n|).
Therefore,

log Q(y_{Λ_n}) = ∑_{k∈Λ_n} f_ψ(S^k y) − |Λ_n| P(f_ψ) + o(|Λ_n|)
= log ∑_{x_{Λ_n}∈ϕ^{−1}(y_{Λ_n})} exp( ∑_{k∈Λ_n} f_φ(T^k x) − |Λ_n| P(f_φ) ) + o(|Λ_n|).

Divide by |Λ_n| and take lim sup / lim:

(1/|Λ_n|) log Q(y_{Λ_n}) = (1/|Λ_n|) ∑_{k∈Λ_n} f_ψ(S^k y) − P(f_ψ) + o(1) → f*_ψ(y) − P(f_ψ),

while the right-hand side converges to P(f_φ, π)(y) − P(f_φ). Thus,

f*_ψ(y) − P(f_ψ) = P(f_φ, π)(y) − P(f_φ).
Lemma 5.24. If P ∈ G(φ) and Q = ϕ_*P ∈ G(ψ), then

f*_ψ(y) = P(f_φ, π)(y) + const,

where

f*_ψ(y) = lim sup_{n→∞} (1/|Λ_n|) ∑_{k∈Λ_n} f_ψ(S^k y),

and P(f_φ, π)(y) is the relative pressure of f_φ on the fibre X_y,

X_y = { x ∈ A^{Z^d} : ϕ(x) = y } = ϕ^{−1}(y).

Problem/Question: Given P(f_φ, π)(y), a translation-invariant function on B^{Z^d}, decide whether there exists a "smooth" f_ψ : B^{Z^d} → R such that f*_ψ(y) = P(f_φ, π)(y).
After the last slides

Conditional measures on fibres:

∫ f(x) µ_c(dx) = (1/ν(c)) ∫_{π^{−1}(c)} f(x) µ(dx)

(when µ(c) > 0 this is (1/µ(c)) ∫_c f(x) µ(dx)). Consider

P_{y^n} := P(· | y_0^n) = P(· | π^{−1}(y_0^n)),

and ask whether P_{y^n} → P_y.

Theorem 5.25 (Tjur). Suppose y ∈ Y is a Tjur point, i.e., P_{y^n} → P_y as n → ∞. Then the map Y_0 → M(X), y ↦ P_y, is continuous on the set Y_0 of Tjur points.

Theorem 5.26. Q = P ∘ π^{−1}. If Y_0 = Y, then Q is Gibbs.

Take y ∈ Y and the sequence {P_{y^n}}_{n=1}^∞ of measures on X_y (∈ M(X_y)), and let

m_y := set of all limit points of {P_{y^n}} ( ⊆ G_{X_y}(φ) = G_φ(X_y) ).

• |G_{X_y}(φ)| = 1 for every y if and only if Q is Gibbs.
• If |m_y| = 1 for every y, then Q is Gibbs. The converse is an open problem.
Part II
Applications
Chapter 6
Hidden Markov Models
Hidden Markov models, briefly discussed in Chapter 3, are extremely popular models with a broad range of applications, from biological sequence analysis to speech recognition. The theory, as well as possible applications, is very well covered in the literature [20–24].

In this chapter we briefly review basic facts of the theory of hidden Markov models and discuss thermodynamic aspects of some of this theory.
6.1 Three basic problems for HMM’s
As defined earlier, a hidden Markov process is a random function of a Markov process. More specifically, suppose {X_n} is a Markov chain with values in A, transition probability matrix P and initial distribution π. Then {Y_n} is constructed as follows: independently for every n, Y_n is chosen according to an emission distribution, which depends only on the value of the underlying Markov chain X_n:

P(Y_n = j | X_n = i) = Π_{ij}, ∀i ∈ A, j ∈ B.

The key property of hidden Markov chains is that Y_n depends only on the value of X_n.
Markov processes are characterized by the Markov property: the future (X_{n+1}, X_{n+2}, …) and the past (…, X_{n−2}, X_{n−1}) are conditionally independent given the present X_n:

{…, X_{n−3}, X_{n−2}, X_{n−1}} ⊥_{X_n} {X_{n+1}, X_{n+2}, …}.

Hidden Markov processes inherit this property: the future (Y_{n+1}, Y_{n+2}, …) and the past (…, Y_{n−2}, Y_{n−1}) are conditionally independent given the present X_n:

{…, Y_{n−3}, Y_{n−2}, Y_{n−1}} ⊥_{X_n} {Y_{n+1}, Y_{n+2}, …},

while

{…, Y_{n−3}, Y_{n−2}, Y_{n−1}} ⊥̸_{Y_n} {Y_{n+1}, Y_{n+2}, …},

unless {Y_n} is itself Markov.
The hidden Markov process is fully determined by λ = (π, P, Π). As in the case of functions of Markov chains, we will denote by Q = Q_λ the law of {Y_n} with parameters λ. We will continue to use P to denote the joint distribution of {X_n} and {Y_n}. There are three important questions one should be able to answer:

(1) The Evaluation Problem: Given the parameters λ = (π, P, Π) of the model, what is an efficient method to evaluate Q[Y_{0:n} = y_{0:n}]? Note that

Q[Y_{0:n} = y_{0:n}] = ∑_{x_{0:n}∈A^{n+1}} P[X_{0:n} = x_{0:n}] Π[Y_{0:n} = y_{0:n} | X_{0:n} = x_{0:n}]
= ∑_{x_{0:n}∈A^{n+1}} π_{x_0} ∏_{i=0}^{n−1} P_{x_i,x_{i+1}} ∏_{i=0}^{n} Π_{x_i,y_i},

and the sum involves exponentially many terms.

(2) The Decoding or State Estimation Problem: Given the observations Y_{0:n} = y_{0:n}, and assuming that the parameters (π, P, Π) are known, find the most probable "input" realization X_{0:n}.

(3) The Learning Problem: Given the observations Y_{0:n} = y_{0:n}, estimate the parameters λ = (π, P, Π). For example, one could be interested in the maximum likelihood estimate

λ̂ = argmax_λ Q_λ(Y_{0:n} = y_{0:n}).
6.1.1 The Evaluation Problem

It turns out that one does not need to evaluate exponentially many terms in order to "score" the sequence y_0^n. For 0 ≤ m ≤ n, let us introduce the forward variables

α_m(i) = P[Y_0 = y_0, …, Y_m = y_m, X_m = i], i ∈ A,

and the backward variables

β_m(i) = P[Y_{m+1} = y_{m+1}, …, Y_n = y_n | X_m = i], i ∈ A.

Exercise 6.1. Show that the forward and backward variables satisfy the following recurrence relations:

α_0(i) = π(i) Π_{i y_0}, α_{m+1}(i) = ∑_{i′∈A} α_m(i′) P_{i′i} Π_{i y_{m+1}}, m = 0, …, n−1,

and

β_n(i) = 1, β_m(i) = ∑_{i′∈A} β_{m+1}(i′) P_{ii′} Π_{i′ y_{m+1}}, m = 0, …, n−1.
Proposition 6.2. For every 0 ≤ m ≤ n, one has

Q[Y_0 = y_0, …, Y_n = y_n] = ∑_{i∈A} α_m(i) β_m(i).

Proof. One has

Q[Y_{0:n} = y_{0:n}] = ∑_i P[Y_{0:n} = y_{0:n}, X_m = i]
= ∑_i P[Y_{0:n} = y_{0:n} | X_m = i] P[X_m = i]
= ∑_i P[Y_{0:m} = y_{0:m}, Y_{m+1:n} = y_{m+1:n} | X_m = i] P[X_m = i]
= ∑_i P[Y_{0:m} = y_{0:m} | X_m = i] P[Y_{m+1:n} = y_{m+1:n} | X_m = i] P[X_m = i]
= ∑_i P[Y_{0:m} = y_{0:m}, X_m = i] P[Y_{m+1:n} = y_{m+1:n} | X_m = i]
= ∑_i α_m(i) β_m(i).
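The recursions of Exercise 6.1 and the identity of Proposition 6.2 can be checked directly on a small model; the parameters below are illustrative, and the brute-force sum over all hidden paths serves as the reference value:

```python
import itertools

pi0 = [0.5, 0.5]
P = [[0.9, 0.1], [0.3, 0.7]]      # transition matrix (illustrative)
Pi = [[0.8, 0.2], [0.25, 0.75]]   # emission matrix Pi[x][y]
y = [0, 1, 1, 0]
states = [0, 1]
n = len(y) - 1

# Forward recursion: alpha_m(i) = P[Y_0^m = y_0^m, X_m = i]
alpha = [[pi0[i] * Pi[i][y[0]] for i in states]]
for m in range(n):
    alpha.append([sum(alpha[-1][j] * P[j][i] for j in states) * Pi[i][y[m + 1]]
                  for i in states])

# Backward recursion: beta_m(i) = P[Y_{m+1}^n = y_{m+1}^n | X_m = i]
beta = [[1.0 for _ in states]]
for m in range(n - 1, -1, -1):
    beta.insert(0, [sum(P[i][j] * Pi[j][y[m + 1]] * beta[0][j] for j in states)
                    for i in states])

# Brute-force likelihood for comparison
def joint(x):
    p = pi0[x[0]] * Pi[x[0]][y[0]]
    for t in range(1, len(y)):
        p *= P[x[t - 1]][x[t]] * Pi[x[t]][y[t]]
    return p

Q_y = sum(joint(x) for x in itertools.product(states, repeat=len(y)))

# Proposition 6.2: sum_i alpha_m(i) beta_m(i) = Q(y_0^n) for every m
for m in range(len(y)):
    assert abs(sum(alpha[m][i] * beta[m][i] for i in states) - Q_y) < 1e-12
```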
6.1.2 The Decoding or State Estimation Problem

Suppose we observed y_0^n ∈ B^{n+1} as the realization of the hidden Markov chain Y_0^n, and we would now like to estimate the most "probable" sequence of states that led to this output.

Let us introduce the following probability distributions on A:

P_t = P(X_t = · | Y_0^n = y_0^n), t = 0, …, n.

For t < n, P_t is called a smoothing probability; for t = n, P_t is called the filtering probability. The definition also makes sense for t > n; then one speaks of prediction probabilities.

If we know or can estimate P_t, we can estimate the state X_t as

X̂_t = argmax_{i∈A} P_t(i) = argmax_{i∈A} P(X_t = i | Y_0^n = y_0^n).

This is an optimal choice from the Bayesian point of view, as it leads to the minimal expected loss

(1/(n+1)) ∑_{t=0}^{n} I[X̂_t ≠ X_t].

On the other hand, it is very well possible that, as a realization of a Markov chain, {X̂_t : t = 0, …, n} has very low probability (possibly even probability 0). That is why one sometimes applies the maximum a posteriori (MAP) decision rule:

X̂_0^n = argmax_{x_0^n∈A^{n+1}} P[X_0^n = x_0^n | Y_0^n = y_0^n]. (6.1.1)

The famous Viterbi algorithm provides an efficient method to find maximizers in (6.1.1). However, the Viterbi algorithm does not necessarily minimize the number of "symbol" errors. We will concentrate on studying properties of smoothing probabilities.
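A minimal sketch of the Viterbi recursion for (6.1.1), in log domain; the model parameters are illustrative, and the brute-force maximization over all 2^{n+1} hidden paths is included only to check the dynamic program:

```python
import itertools
import math

def viterbi(y, pi0, P, Pi):
    """A maximizer of P[X_0^n = x | Y_0^n = y], computed in log domain."""
    states = range(len(pi0))
    delta = [math.log(pi0[i] * Pi[i][y[0]]) for i in states]
    back = []
    for t in range(1, len(y)):
        ptr, new = [], []
        for i in states:
            j = max(states, key=lambda jp: delta[jp] + math.log(P[jp][i]))
            ptr.append(j)
            new.append(delta[j] + math.log(P[j][i] * Pi[i][y[t]]))
        back.append(ptr)
        delta = new
    path = [max(states, key=lambda i: delta[i])]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path

pi0 = [0.6, 0.4]
P = [[0.7, 0.3], [0.4, 0.6]]
Pi = [[0.9, 0.1], [0.2, 0.8]]
y = [0, 1, 1, 0, 0]

def joint(x):
    p = pi0[x[0]] * Pi[x[0]][y[0]]
    for t in range(1, len(y)):
        p *= P[x[t - 1]][x[t]] * Pi[x[t]][y[t]]
    return p

best = viterbi(y, pi0, P, Pi)
# The Viterbi path attains the brute-force maximum of the joint probability
assert abs(joint(tuple(best)) -
           max(joint(x) for x in itertools.product([0, 1], repeat=len(y)))) < 1e-15
```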
6.2 Gibbs property of Hidden Markov Chains

In Chapter 3, we stated without proof that functions of Markov chains with strictly positive transition probability matrices are regular. In this section, we give a proof of this theorem in the case of hidden Markov chains.
Theorem 6.3. Suppose {Y_n} is a hidden Markov chain for a stationary Markov chain {X_n} with P > 0 and emission matrix Π > 0. Then {Y_n} is regular, i.e., Q is a g-measure, as well as Gibbs.

Exercise 6.4. Under the conditions of Theorem 6.3, show that there exist α and β such that

0 < α ≤ Q[Y_0 = y_0 | Y_1^n = y_1^n] ≤ β < 1

for all n ∈ N and y ∈ B^{Z_+}.

Proof. For any y = (y_0, y_1, …) ∈ B^{Z_+}, let

g_n(y) = g_n(y_0^n) = Q[Y_0 = y_0 | Y_1^n = y_1^n],
g_n(y, i) = P[Y_0 = y_0 | Y_1^n = y_1^n, X_{n+1} = i],
ḡ_n(y) = max_i g_n(y, i), g̲_n(y) = min_i g_n(y, i).
Step 1: g̲_n(y) ≤ g_n(y) ≤ ḡ_n(y) for all y ∈ B^{Z_+}. Since

g_n(y) = Q[y_0 | y_1^n] = ∑_i P(y_0 | y_1^n, X_{n+1} = i) · P(X_{n+1} = i | y_1^n),

we have

g_n(y) ≤ ḡ_n(y) · ( ∑_{i∈A} P(X_{n+1} = i | y_1^n) ) = ḡ_n(y).

Similarly, g_n(y) ≥ g̲_n(y).

Step 2: One has

g_{n+1}(y, i) = ∑_j g_n(y, j) w(j, i),

where w(j, i) > 0 and ∑_j w(j, i) = 1 are given by

w(j, i) = P_{ji} P[X_{n+1} = j | y_0^n] P[Y_{n+1} = y_{n+1} | X_{n+1} = j] / ∑_j P_{ji} P[X_{n+1} = j | y_0^n] P[Y_{n+1} = y_{n+1} | X_{n+1} = j]
= P_{ji} P[X_{n+1} = j | y_0^n] Π_{j y_{n+1}} / ∑_j P_{ji} P[X_{n+1} = j | y_0^n] Π_{j y_{n+1}}.

Let g̲_n(y) = g_n(y, j_*). Then

g_{n+1}(y, i) ≤ g̲_n(y) · w(j_*, i) + ḡ_n(y) · (1 − w(j_*, i)) ≤ g̲_n(y) · min_j w(j, i) + ḡ_n(y) · (1 − min_j w(j, i)).

Step 3: The inequalities above imply the following key inequalities: for some ρ ∈ (0, 1),

ḡ_{n+1}(y) ≤ ρ g̲_n(y) + (1−ρ) ḡ_n(y),
g̲_{n+1}(y) ≥ ρ ḡ_n(y) + (1−ρ) g̲_n(y),
and hence

0 ≤ ḡ_{n+1}(y) − g̲_{n+1}(y) ≤ (1−2ρ) ( ḡ_n(y) − g̲_n(y) ) ≤ (1−2ρ)^{n+1}.

Note that we need the conditions P > 0 and Π > 0 to conclude that ρ ∈ (0, 1).

Finally, one has

g_{n+1}(y) − g_n(y) ≤ ḡ_{n+1}(y) − g̲_n(y) ≤ (1−ρ)( ḡ_n(y) − g̲_n(y) )

and similarly

−(1−ρ)( ḡ_n(y) − g̲_n(y) ) ≤ g_{n+1}(y) − g_n(y).

Therefore, for every n ≤ m,

|g_n(y) − g_m(y)| ≤ θ^n / (1−θ), θ = 1−ρ > 0.

This means that the sequence {g_n(·)} converges uniformly, and hence the limiting function g = lim_n g_n is a continuous function, which is also strictly positive (cf. Exercise 6.4). Moreover, the estimates above show that var_n(g) decays exponentially, and hence Q, viewed as a measure on B^Z, is a Gibbs measure in the DLR sense.
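The contraction in Steps 2–3 can be observed numerically: since g_{n+1}(y, ·) is a convex combination of g_n(y, ·), the bracket ḡ_n − g̲_n cannot increase. The sketch below computes g_n(y, i) by brute force for a small HMM with illustrative parameters and a stationary initial distribution:

```python
import itertools

pi0 = [4 / 7, 3 / 7]               # stationary for P below
P = [[0.7, 0.3], [0.4, 0.6]]
Pi = [[0.9, 0.1], [0.2, 0.8]]
states = [0, 1]
y = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]

def p_word_end(word, end):
    """P[Y_0^{m-1} = word, X_m = end], brute force over hidden paths."""
    m = len(word)
    total = 0.0
    for x in itertools.product(states, repeat=m + 1):
        if x[m] != end:
            continue
        p = pi0[x[0]]
        for t in range(m):
            p *= Pi[x[t]][word[t]] * P[x[t]][x[t + 1]]
        total += p
    return total

brackets = []
for n in range(1, len(y)):
    # g_n(y, i) = P[Y_0 = y_0 | Y_1^n = y_1^n, X_{n+1} = i]
    # (stationarity lets us compute the denominator as a word of length n)
    g = [p_word_end(y[:n + 1], i) / p_word_end(y[1:n + 1], i) for i in states]
    brackets.append(max(g) - min(g))

# The bracket max_i - min_i is non-increasing and shrinks geometrically
assert all(b2 <= b1 + 1e-12 for b1, b2 in zip(brackets, brackets[1:]))
assert brackets[-1] < brackets[0]
```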
Chapter 7
Denoising
Last time we saw that if {Y_n} is a hidden Markov chain obtained from a Markov chain {X_n} with parameters (π, P) and emission kernels Π = (Π_{i,j})_{i∈A,j∈B}, then, given a finite realization Y_0^n = y_0^n, we can easily estimate

P(X_t = i | Y_0^n = y_0^n), 0 ≤ t ≤ n.
Reminder:

Lemma 7.1.

P(X_t = i | Y_0^n = y_0^n) = α_t(i) β_t(i) / ∑_{j∈A} α_t(j) β_t(j),

where {α_t(·), β_t(·)}_{t=0}^{n} are the forward and backward variables (probabilities) given by

α_0(i) = π(i) Π_{i y_0}, α_{t+1}(i) = ∑_{ι∈A} α_t(ι) P_{ιi} Π_{i y_{t+1}}, t = 0, 1, …, n−1,

and

β_n(i) = 1, β_t(i) = ∑_{ι∈A} β_{t+1}(ι) P_{iι} Π_{ι y_{t+1}}, t = 0, 1, …, n−1.

In fact, for any t,

α_t(i) = P[Y_0^t = y_0^t, X_t = i], β_t(i) = P[Y_{t+1}^n = y_{t+1}^n | X_t = i].
Decoding: if we know the distribution P(X_t = · | Y_0^n = y_0^n), or equivalently the probabilities P(X_t = ·, Y_0^n = y_0^n), we can predict/guess X_t from the observed data Y_0^n = y_0^n.
Example: X_t ∈ {0, 1}; then the optimal choice is

X̂_t = 0 if θ_t > 1, and X̂_t = 1 if θ_t < 1,

where

θ_t = P(X_t = 0, Y_0^n = y_0^n) / P(X_t = 1, Y_0^n = y_0^n) = P(X_t = 0 | Y_0^n = y_0^n) / P(X_t = 1 | Y_0^n = y_0^n):

predict the most probable outcome.

The key to the proof is the hidden Markov property:

{…, Y_{t−2}, Y_{t−1}} ⊥_{X_t} {Y_{t+1}, Y_{t+2}, …}.
Let us now derive another expression for P(X_t = i | Y_0^n = y_0^n), using only the fact that Y_t depends only on X_t, and not even using the fact that {X_t} is Markov.

We want to derive P(X_t = i | Y_0^n = y_0^n), or equivalently P(X_t = i, Y_0^n = y_0^n). Consider the unconditional probabilities

P[X_t = i, Y_0^n = y_0^n] = P[X_t = i, Y_t = y_t, Y_{n\t} = y_{n\t}]
= P[Y_t = y_t, Y_{n\t} = y_{n\t} | X_t = i] P[X_t = i]
= P[Y_t = y_t | X_t = i] P[Y_{n\t} = y_{n\t} | X_t = i] P[X_t = i]
= Π_{i y_t} P[X_t = i, Y_{n\t} = y_{n\t}],

where

n\t = {0, …, n} \ {t}.

Thus,

P[X_t = i, Y_t = y_t | Y_{n\t} = y_{n\t}] = Π_{i y_t} P[X_t = i | Y_{n\t} = y_{n\t}].

Let us sum over all possible i: by the law of total probability,

P[Y_t = y_t | Y_{n\t} = y_{n\t}] = ∑_i Π_{i y_t} P[X_t = i | Y_{n\t} = y_{n\t}],

or, in vector form:

P[Y_t = · | Y_{n\t} = y_{n\t}] = Π^T P[X_t = · | Y_{n\t} = y_{n\t}].
Assume that ΠT is invertible. Then,
P[Xt = ·|Yn\t = yn\t] = Π−TP[Yt = ·|Yn\t = yn\t].
Hence we obtained the following result.
Proposition 7.2. If Yn is a random function of Xn, i.e. if P(Yt = b|Xt =a) = Πab for all t, independently, and the square matrix Π is invertible, then
P(Xt = ·|Yn\t = yn\t) =(
Π−1)TP[Yt = ·|Yn\t = yn\t]
for any t ∈ 0, 1, . . . , n and yn\t ∈ Bn : P(Yn\t = yn\t) > 0.
P(Xt = ·|Yn\t = yn\t) = Π−TP[Yt = ·|Yn\t = yn\t],
P(Xt = ·, Yn\t = yn\t) = Π−TP[Yt = ·, Yn\t = yn\t]. (7.0.1)
But we have also seen
P(Xt = i, Yt = j, Yn\t = yn\t) = ΠijP[Xt = i, Yn\t = yn\t]. (7.0.2)
Hence combining (7.0.1) and (7.0.2), Π = [Π1| . . . |ΠM],
P(Xt = ·, Yn0 = yn
0) = πyt P[Yt = i, Yn\t = yn\t],
whereu v = (u1v1, . . . , uMvM), u, v ∈ RM.
Equivalently,
P(Xt = ·, Yn0 = yn
0) = πyt Π−TP[Yt = ·, Yn\t = yn\t].
Proposition 7.3.
$$P(X_t = \cdot,\, Y_0^n = y_0^n) = \pi_{y_t} \odot \Pi^{-T}\, P[Y_t = \cdot,\, Y_{n\setminus t} = y_{n\setminus t}].$$

Corollary 7.4.
$$P[X_t = \cdot \mid Y_0^n = y_0^n] = \frac{\pi_{y_t} \odot \Pi^{-T}\, P[Y_t = \cdot,\, Y_{n\setminus t} = y_{n\setminus t}]}{\big\langle \pi_{y_t} \odot \Pi^{-T}\, P[Y_t = \cdot,\, Y_{n\setminus t} = y_{n\setminus t}],\, \mathbf{1} \big\rangle}.$$
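As a sanity check, the identity of Proposition 7.2 can be verified numerically by brute-force enumeration on a small hidden Markov chain; the parameters `mu`, `P`, `Pi` below are illustrative choices, not from the text.

```python
import itertools
import numpy as np

# Illustrative 2-state hidden Markov chain: initial law mu, transitions P,
# invertible channel matrix Pi with Pi[a, b] = P(Y_t = b | X_t = a).
mu = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.2, 0.8]])
Pi = np.array([[0.9, 0.1], [0.2, 0.8]])

n, t = 3, 1          # observe Y_0, ..., Y_n; look at position t
y = (0, 1, 0, 1)     # a fixed observed word; y[t] will be summed out

def joint(x, yy):
    """P(X_0^n = x, Y_0^n = yy), multiplying transition and channel terms."""
    p = mu[x[0]] * Pi[x[0], yy[0]]
    for k in range(1, len(x)):
        p *= P[x[k - 1], x[k]] * Pi[x[k], yy[k]]
    return p

# lhs[i] = P(X_t = i, Y_{n\t} = y_{n\t}); py[j] = P(Y_t = j, Y_{n\t} = y_{n\t})
lhs, py = np.zeros(2), np.zeros(2)
for x in itertools.product([0, 1], repeat=n + 1):
    for yt in [0, 1]:
        yy = list(y)
        yy[t] = yt
        p = joint(x, yy)
        lhs[x[t]] += p
        py[yt] += p

rhs = np.linalg.inv(Pi).T @ py   # Pi^{-T} P(Y_t = ., Y_{n\t} = y_{n\t})
assert np.allclose(lhs, rhs)
```

Note that only the conditional independence of $Y_t$ from the rest given $X_t$ is used; the Markov structure of $X$ plays no role in the identity.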
Remark: The key observation
$$P(X_t = \cdot,\, Y_{n\setminus t} = y_{n\setminus t}) = \Pi^{-T}\, P[Y_t = \cdot,\, Y_{n\setminus t} = y_{n\setminus t}]$$
is due to Weissman, Ordentlich, Seroussi, Verdú, and Weinberger, IEEE Trans. Inf. Theory 51 (2005).
2. Denoising Problem

$$A^{n+1} \ni X_0^n \to \text{CHANNEL} \to Y_0^n \in A^{n+1},$$
$$P(Y_k = j \mid X_k = i) = \Pi_{ij}, \qquad \Pi = (\Pi_{ij}) \text{ a square matrix}.$$
A denoiser is a function $F : A^{n+1} \to A^{n+1}$ defined by $F(Y_0^n) = \hat X_0^n$.
Loss function $\Lambda : A \times A \to [0, +\infty)$. Average cumulative loss:
$$L_F(X_0^n, y_0^n) = \frac{1}{n+1} \sum_{t=0}^{n} \Lambda\big( X_t, F(y_0^n)_t \big) = \frac{1}{n+1} \sum_{t=0}^{n} \Lambda(X_t, \hat X_t).$$
Suppose $W$ is a random variable with values in $A$, distributed according to some probability $P$, and a loss function $\Lambda : A \times A \to [0, +\infty)$ is given. Suppose your prediction is $\hat w$. The expected loss is
$$\sum_{a \in A} \Lambda(a, \hat w)\, P(a) = E_P\, \Lambda(W, \hat w).$$
Optimal prediction (= minimizing the expected loss):
$$\hat w = \arg\min_{b \in A} \sum_{a \in A} \Lambda(a, b)\, P(a) = \arg\min_{b \in A} \lambda_b^T \cdot \vec P,$$
where, for $A = \{1, \ldots, M\}$, $\vec P = (P(W = \cdot))$ is an $M \times 1$ column vector and $\Lambda = [\lambda_1 | \lambda_2 | \ldots | \lambda_M]$, each $\lambda_i$ being a column of $\Lambda$.
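In code, the optimal single-experiment prediction is just an argmin over the columns of the loss matrix; the alphabet size, loss, and distribution below are illustrative, a minimal sketch rather than anything prescribed by the text.

```python
import numpy as np

def optimal_prediction(Lambda, P):
    """Return argmin_b lambda_b^T P, i.e. argmin_b sum_a Lambda[a, b] * P[a].

    Lambda: (M, M) loss matrix, Lambda[a, b] = loss of predicting b when truth is a.
    P: length-M probability vector P(W = a).
    """
    expected_losses = Lambda.T @ P   # entry b equals lambda_b^T P
    return int(np.argmin(expected_losses))

# Under Hamming (0/1) loss the optimal prediction is the most probable letter:
M = 3
Lambda = np.ones((M, M)) - np.eye(M)   # 0 on the diagonal, 1 elsewhere
P = np.array([0.2, 0.5, 0.3])
assert optimal_prediction(Lambda, P) == 1
```

With 0/1 loss the expected loss of predicting $b$ is $1 - P(b)$, so the argmin is the mode of $P$, in agreement with "predict the most probable outcome" above.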
This was the optimal choice for a single experiment.
Suppose $\{W_n\}_{n \geq 0}$ is a realization of some ergodic process with marginal distribution $P(\cdot)$ on $A$. Then $\hat W = \arg\min_{b \in A} \lambda_b^T \cdot \vec P$ is still an optimal choice: by the Ergodic Theorem,
$$\frac{1}{n+1} \sum_{t=0}^{n} \Lambda(W_t, \hat W) \to E_P\, \Lambda(W_0, \hat W) = \min$$
almost surely.
In practice (= the denoising setup) $P$ is not constant: $P_t = P[X_t = \cdot \mid Y_0^n]$ changes with $t$, but choosing the optimal $\hat X_t$ for every $t$ is still the "right" strategy.
$$L_{\hat X} = L_F = \frac{1}{n+1} \sum_{t=0}^{n} \Lambda\big( X_t, \hat X_t(Y_0^n) \big).$$
Example: 0/1 loss: $\Lambda(a_1, a_2) = \begin{cases} 0, & a_1 = a_2, \\ 1, & a_1 \neq a_2. \end{cases}$ Then
$$L_F = \frac{1}{n+1} \big| \{ t : X_t \neq \hat X_t \} \big|.$$
• $X_0^n$ is unknown to us; $Y_0^n$ is observed. We would like to find $\hat X_0^n$ which minimizes $L_F$.

• $L_F$ is a random variable, depending on $X_0^n \sim P$ and $Y_0^n \sim \Pi(\, \cdot \mid X_0^n)$.

– Semi-stochastic setting: $X_0^\infty \in A^{\mathbb{Z}_+}$ is fixed; the randomness/averaging comes from the channel.

– Stochastic setting: average both w.r.t. $X$ and w.r.t. the channel.

(quenched vs. annealed)
Denoiser:
$$\hat X_t(y_0^n) = \arg\min_{x \in A} \lambda_x^T \cdot P(X_t = \cdot \mid Y_0^n = y_0^n), \qquad \forall t = 0, \ldots, n.$$
• NON-UNIVERSAL SETUP: We know the distribution of $X_0^n$ and the channel ($\Pi$); hence we can compute $P(X_t = \cdot \mid Y_0^n = y_0^n)$. Example: hidden Markov setup, where we know $X_0^n \sim$ a Markov chain with parameters $(\pi, P)$.

• UNIVERSAL SETUP (known channel): We do not know the probability law of $X_0^n$, but we know the channel characteristics ($\Pi$).
$$\hat X_t = \arg\min_{x \in A} \lambda_x^T\, P[X_t = \cdot \mid Y_0^n = y_0^n] = \arg\min_{x \in A} \lambda_x^T\, P[X_t = \cdot,\, Y_0^n = y_0^n]$$
(the positive normalizing constant does not affect the arg min). We have shown that in case $\Pi$ is invertible,
$$P[X_t = \cdot,\, Y_0^n = y_0^n] = \pi_{y_t} \odot \Pi^{-T}\, P[Y_t = \cdot,\, Y_{n\setminus t} = y_{n\setminus t}].$$
Example:
(Figure: the binary symmetric channel; $0 \to 0$ and $1 \to 1$ with probability $1 - \varepsilon$, crossovers $0 \to 1$ and $1 \to 0$ with probability $\varepsilon$.)

$X_t \in \{0, 1\}$, $\Pi = \begin{pmatrix} 1-\varepsilon & \varepsilon \\ \varepsilon & 1-\varepsilon \end{pmatrix}$, $0 < \varepsilon < \frac{1}{2}$. Then
$$\Pi^{-1} = \frac{1}{1-2\varepsilon} \begin{pmatrix} 1-\varepsilon & -\varepsilon \\ -\varepsilon & 1-\varepsilon \end{pmatrix}.$$
$\Lambda : \{0, 1\}^2 \to \mathbb{R}_+$ is the Hamming 0/1-loss:
$$\Lambda(a, \hat a) = \begin{cases} 0, & a = \hat a, \\ 1, & a \neq \hat a. \end{cases}$$
$$\hat X_t = \begin{cases} 0, & \text{if } \theta_t \geq 1, \\ 1, & \text{if } \theta_t < 1, \end{cases} \qquad \theta_t = \frac{P[X_t = 0,\, Y_0^n = y_0^n]}{P[X_t = 1,\, Y_0^n = y_0^n]}.$$
Suppose $y_t = 0$, so $\pi_0 = \begin{pmatrix} 1-\varepsilon \\ \varepsilon \end{pmatrix}$. Then
$$\begin{pmatrix} P[X_t = 0,\, Y_0^n = y_0^n] \\ P[X_t = 1,\, Y_0^n = y_0^n] \end{pmatrix} = \begin{pmatrix} 1-\varepsilon \\ \varepsilon \end{pmatrix} \odot \frac{1}{1-2\varepsilon} \begin{pmatrix} 1-\varepsilon & -\varepsilon \\ -\varepsilon & 1-\varepsilon \end{pmatrix} \begin{pmatrix} P(Y_t = 0,\, Y_{n\setminus t} = y_{n\setminus t}) \\ P(Y_t = 1,\, Y_{n\setminus t} = y_{n\setminus t}) \end{pmatrix}.$$
Let $a = P(Y_t = 0 \mid Y_{n\setminus t} = y_{n\setminus t})$, $1 - a = P(Y_t = 1 \mid Y_{n\setminus t} = y_{n\setminus t})$. Then $\hat X_t = 0$ iff $u_t \geq v_t$, where
$$\begin{pmatrix} u_t \\ v_t \end{pmatrix} = \begin{pmatrix} 1-\varepsilon \\ \varepsilon \end{pmatrix} \odot \begin{pmatrix} 1-\varepsilon & -\varepsilon \\ -\varepsilon & 1-\varepsilon \end{pmatrix} \begin{pmatrix} a \\ 1-a \end{pmatrix}$$
(the common positive factor $\frac{1}{1-2\varepsilon}$ does not affect the comparison). Equivalently,
$$(1-\varepsilon)^2 a - \varepsilon(1-\varepsilon)(1-a) \geq -\varepsilon^2 a + \varepsilon(1-\varepsilon)(1-a),$$
$$\big[ (1-\varepsilon)^2 + \varepsilon(1-\varepsilon) + \varepsilon^2 + \varepsilon(1-\varepsilon) \big]\, a \geq 2\varepsilon(1-\varepsilon),$$
$$a \geq 2\varepsilon(1-\varepsilon).$$
The remaining case $y_t = 1$ can be treated similarly, and we obtain the following denoiser:
$$\hat X_t = \begin{cases} y_t, & \text{if } P(Y_t = y_t \mid Y_{n\setminus t} = y_{n\setminus t}) \geq 2\varepsilon(1-\varepsilon), \\ 1 - y_t, & \text{if } P(Y_t = y_t \mid Y_{n\setminus t} = y_{n\setminus t}) < 2\varepsilon(1-\varepsilon). \end{cases}$$
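The resulting rule ("keep the observed bit unless it is sufficiently improbable given its context") is a one-liner; a minimal sketch, in which the conditional probability $P(Y_t = y_t \mid Y_{n\setminus t})$ is assumed to be given:

```python
def bsc_denoise_bit(y_t, p_yt_given_rest, eps):
    """Denoise one bit of binary-symmetric-channel output.

    y_t: observed bit (0 or 1);
    p_yt_given_rest: P(Y_t = y_t | Y_{n\\t} = y_{n\\t});
    eps: crossover probability, 0 < eps < 1/2.
    Keeps y_t iff P(Y_t = y_t | rest) >= 2*eps*(1 - eps); otherwise flips it.
    """
    return y_t if p_yt_given_rest >= 2 * eps * (1 - eps) else 1 - y_t

# With eps = 0.1 the threshold is 0.18: a bit observed as 0 is kept when
# P(Y_t = 0 | rest) = 0.5, and flipped when P(Y_t = 0 | rest) = 0.05.
assert bsc_denoise_bit(0, 0.5, 0.1) == 0
assert bsc_denoise_bit(0, 0.05, 0.1) == 1
```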
Problem: How does one obtain
$$P(Y_t = \cdot \mid Y_{n\setminus t} = y_{n\setminus t})$$
from a single realization/observation $Y_0^n = y_0^n$?
DUDE: Discrete Universal Denoiser

$$P(Y_t = y_t \mid Y_{n\setminus t} = y_{n\setminus t}) = P(Y_t = y_t \mid Y_0^{t-1} = y_0^{t-1},\, Y_{t+1}^n = y_{t+1}^n) \cong P(Y_t = y_t \mid Y_{t-k}^{t-1} = y_{t-k}^{t-1},\, Y_{t+1}^{t+k} = y_{t+1}^{t+k}).$$
Use these truncated-context probabilities in denoising.

Remark: "$\cong$" is essentially a GIBBSIAN idea!
Conditional probabilities
$$P(Y_t = y_t \mid Y_{t-k}^{t-1} = y_{t-k}^{t-1},\, Y_{t+1}^{t+k} = y_{t+1}^{t+k})$$
are easy to estimate: for a given left context $a_{t-k}^{t-1}$, right context $b_{t+1}^{t+k}$, and any letter $c_t$, count the number $m(a_{t-k}^{t-1}, c_t, b_{t+1}^{t+k})$ of occurrences of the word $a_{t-k}^{t-1}\, c_t\, b_{t+1}^{t+k}$ in $Y_0^n$, and set
$$\hat P(Y_t = c_t \mid Y_{t-k}^{t-1} = a_{t-k}^{t-1},\, Y_{t+1}^{t+k} = b_{t+1}^{t+k}) = \frac{m(a_{t-k}^{t-1}, c_t, b_{t+1}^{t+k})}{\sum_{c_t' \in A} m(a_{t-k}^{t-1}, c_t', b_{t+1}^{t+k})},$$
with context length $k = \lceil c \log_{|A|} n \rceil$, $c < \frac{1}{2}$.
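The counting estimator above can be sketched as follows; `context_probabilities` is a hypothetical helper (not the full DUDE algorithm), and the test sequence is illustrative.

```python
from collections import Counter

def context_probabilities(y, k):
    """Empirical conditional probabilities of the middle letter given length-k
    left and right contexts, estimated by counting occurrences m(a, c, b)
    of words a c b in the sequence y."""
    m = Counter()
    for t in range(k, len(y) - k):
        left, c, right = tuple(y[t - k:t]), y[t], tuple(y[t + 1:t + 1 + k])
        m[(left, c, right)] += 1

    def p_hat(c, left, right):
        # Normalize the count of (left, c, right) by all counts sharing the context.
        total = sum(cnt for (l, _, r), cnt in m.items() if l == left and r == right)
        return m[(left, c, right)] / total if total else 0.0

    return p_hat

y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
p_hat = context_probabilities(y, k=1)
# In this alternating sequence, the letter between a 0 and a 0 is always 1.
assert p_hat(1, (0,), (0,)) == 1.0
```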
DUDE works. SEMI-STOCHASTIC SETTING: $X = (X_0, X_1, \ldots) \in A^{\mathbb{Z}_+}$ is fixed. Let $S_k$ denote the class of sliding-window denoisers of window half-width $k$; then
$$D_k(x_0^n, z_0^n) = \min_{\hat X^n \in S_k} L_{\hat X^n}(x_{k+1}^{n-k}, z_0^n) = \min_{f : A^{2k+1} \to A} \frac{1}{n-2k} \sum_{t=k+1}^{n-k} \Lambda\big( x_t, f(z_{t-k}^{t+k}) \big).$$
Theorem 7.5. Provided $k_n |A|^{2k_n} = O(n / \log n)$:

• $\lim_{n\to\infty} \Big( L_{\hat X^n_{\mathrm{DUDE}}}(X_0^n, Y_0^n) - D_{k_n}(X_0^n, Y_0^n) \Big) = 0$ (a.s.);

• $P\Big( L_{\hat X^n_{\mathrm{DUDE}}}(X_0^n, Y_0^n) - D_{k_n}(X_0^n, Y_0^n) > \delta \Big) \leq C_1 (k+1) |A|^{2k+1} \exp\Big( -\dfrac{(n-2k)\,\delta}{C_2 (k+1) |A|^{2k}} \Big)$.
STOCHASTIC SETTING: both the input $X_0^n$ and the output $Y_0^n$ are random.

• $D(P_{X^n}, \Pi) = \min_{\hat X^n \in D_n} E\big[ L_{\hat X^n}(X_0^n, Y_0^n) \big]$.
• For stationary processes $X$:
$$D(P_X, \Pi) = \lim_{n\to\infty} D(P_{X^n}, \Pi) = \inf_{n \geq 1} D(P_{X^n}, \Pi).$$

Theorem 7.6. Provided $k_n |A|^{2k_n} = O(n)$,
$$\lim_{n\to\infty} E\, L_{\hat X^n_{\mathrm{DUDE}}}(X_0^n, Y_0^n) = D(P_X, \Pi).$$
Remark: The idea of the approximation
$$P(Y_t = \cdot \mid Y_{n\setminus t} = y_{n\setminus t}) \cong P(Y_t = \cdot \mid Y_{t-k}^{t-1} = y_{t-k}^{t-1},\, Y_{t+1}^{t+k} = y_{t+1}^{t+k})$$
is similar ($\cong$ identical) to the "approximation" of
$$P(Y_t = \cdot \mid Y_{-\infty}^{t-1}) \quad \text{or} \quad P(Y_t = \cdot \mid Y_0^{t-1} = y_0^{t-1})$$
by
$$P(Y_t = \cdot \mid Y_{t-k}^{t-1} = y_{t-k}^{t-1}),$$
i.e., to saying that $Y_t$ is essentially a $k$-step Markov chain.
Fact: If $\{Y_t\}_{t \in \mathbb{Z}}$ is a stationary process, consider its $k$-step Markov approximation $\{\tilde Y_t\}$ defined by
$$\tilde P[\tilde Y_t = y_t \mid \tilde Y_{t-k}^{t-1} = y_{t-k}^{t-1}] = P[Y_t = y_t \mid Y_{t-k}^{t-1} = y_{t-k}^{t-1}];$$
then $\tilde P \to P$ as $k \to \infty$.
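A one-sided analogue of the context counting used by DUDE, namely estimating the $k$-step transition probabilities $P(Y_t = c \mid Y_{t-k}^{t-1})$ from a single sample path, can be sketched as follows (a hypothetical helper with an illustrative sequence):

```python
from collections import Counter

def markov_k_probabilities(y, k):
    """Estimate P(Y_t = c | Y_{t-k}^{t-1} = a) by counting k-step transitions."""
    counts = Counter()
    for t in range(k, len(y)):
        counts[(tuple(y[t - k:t]), y[t])] += 1

    def p_hat(c, context):
        total = sum(cnt for (ctx, _), cnt in counts.items() if ctx == context)
        return counts[(context, c)] / total if total else 0.0

    return p_hat

y = [0, 0, 1, 0, 0, 1, 0, 0, 1]
p_hat = markov_k_probabilities(y, k=2)
# After the context (0, 0) this periodic sequence always produces a 1.
assert p_hat(1, (0, 0)) == 1.0
```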
However, it is known that one can do "better": e.g., the same quality of approximation can be achieved with fewer parameters. Information theory has developed various efficient algorithms, e.g., for estimating Variable Length Markov Models from data.

The natural question of whether one can do "better" in the two-sided context as well, i.e., what are good models and methods for constructing appropriate two-sided approximations, was also raised in information theory. However, it turns out that it is not easy to extend the existing efficient one-sided algorithms to the two-sided case.
Appendix A
Prerequisites in Measure Theory,Probability, and Ergodic Theory
Last modified on April 3, 2015
We start by briefly recalling basic notions and facts which will be used in subsequent chapters.
A.1 Notation
The sets of natural numbers, integers, non-negative integers, and real numbers will be denoted by $\mathbb{N}, \mathbb{Z}, \mathbb{Z}_+, \mathbb{R}$, respectively. The set of extended real numbers $\overline{\mathbb{R}}$ is $\mathbb{R} \cup \{\pm\infty\}$. If $\Lambda$ and $V$ are non-empty sets, by $V^\Lambda$ we will denote the space of $V$-valued functions defined on $\Lambda$:
$$V^\Lambda = \{ f : \Lambda \to V \}.$$
If $\Pi \subseteq \Lambda$ and $f \in V^\Lambda$, by $f_\Pi$ we will denote the restriction of $f$ to $\Pi$, i.e., $f_\Pi \in V^\Pi$. If $\Pi_1$ and $\Pi_2$ are disjoint, $f \in V^{\Pi_1}$, and $g \in V^{\Pi_2}$, by $f_{\Pi_1} g_{\Pi_2}$ we will denote the unique element $h \in V^{\Pi_1 \cup \Pi_2}$ such that $h_{\Pi_1} = f$ and $h_{\Pi_2} = g$.
A.2 Measure theory
A.2.1. Suppose $\Omega$ is a non-empty set. A collection $\mathcal{A}$ of subsets of $\Omega$ is called a σ-algebra if it satisfies the following properties:

(i) $\Omega \in \mathcal{A}$;

(ii) if $A \in \mathcal{A}$, then $A^c = \Omega \setminus A \in \mathcal{A}$;

(iii) if $\{A_n\}_{n \in \mathbb{N}}$ is a sequence with $A_n \in \mathcal{A}$ for every $n \in \mathbb{N}$, then $\cup_n A_n \in \mathcal{A}$.

Equivalently, a σ-algebra $\mathcal{A}$ is a collection of subsets of $\Omega$ closed under countable operations such as intersection, union, complement, and symmetric difference. An element of $\mathcal{A}$ is called a measurable subset of $\Omega$. An intersection of two σ-algebras is again a σ-algebra. This property allows us, for a given collection $\mathcal{C}$ of subsets of $\Omega$, to define uniquely the minimal σ-algebra $\sigma(\mathcal{C})$ containing $\mathcal{C}$, as the intersection of all σ-algebras containing $\mathcal{C}$.
A pair (Ω,A), where A is some σ-algebra of subsets of Ω, is called ameasurable space. If the space Ω is metrizable (e.g., Ω = [0, 1] or R), onecan define the Borel σ-algebra of Ω, denoted by B(Ω), as the minimalσ-algebra containing all open subsets of Ω.
A.2.2. Suppose $(\Omega_1, \mathcal{A}_1)$ and $(\Omega_2, \mathcal{A}_2)$ are two measurable spaces. The map $T : \Omega_1 \to \Omega_2$ is called $(\mathcal{A}_1, \mathcal{A}_2)$-measurable if for any $A_2 \in \mathcal{A}_2$ the full preimage of $A_2$ is a measurable subset of $\Omega_1$, i.e.,
$$T^{-1} A_2 = \{ \omega \in \Omega_1 : T(\omega) \in A_2 \} \in \mathcal{A}_1.$$
By definition, a random variable is a measurable map from $(\Omega_1, \mathcal{A}_1)$ into $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. If $(\Omega, \mathcal{A})$ is a measurable space and $T : \Omega \to \Omega$ is a measurable map, then $T$ is called a measurable transformation of $(\Omega, \mathcal{A})$, or an endomorphism of $(\Omega, \mathcal{A})$. If, furthermore, $T$ is invertible and $T^{-1} : \Omega \to \Omega$ is also measurable, then $T$ is called a measurable isomorphism of $(\Omega, \mathcal{A})$.
A.2.3. Suppose $(\Omega, \mathcal{A})$ is a measurable space. The function $\mu : \mathcal{A} \to [0, +\infty]$ is called a measure if

• $\mu(\emptyset) = 0$;

• for any countable collection $\{A_n\}_n$ of pairwise disjoint sets, $A_n \in \mathcal{A}$, one has
$$\mu\Big( \bigcup_n A_n \Big) = \sum_n \mu(A_n). \tag{A.2.1}$$

The triple $(\Omega, \mathcal{A}, \mu)$ is called a measure space. If $\mu(\Omega) = 1$, then $\mu$ is called a probability measure, and $(\Omega, \mathcal{A}, \mu)$ a probability (measure) space.
A.2.4. Suppose $(\Omega, \mathcal{A}, \mu)$ is a measure space. Then for any $A \in \mathcal{A}$, the indicator function of $A$ is
$$\mathbb{I}_A(\omega) = \begin{cases} 1, & \omega \in A, \\ 0, & \omega \notin A. \end{cases}$$
The Lebesgue integral of $f = \mathbb{I}_A$ is defined as $\int \mathbb{I}_A \, d\mu = \mu(A)$.

A function $f$ is called simple if there exist $K \in \mathbb{N}$, measurable sets $\{A_k\}_{k=1}^K$ with $A_k \in \mathcal{A}$, and non-negative numbers $\{\alpha_k\}_{k=1}^K$ such that
$$f(\omega) = \sum_{k=1}^K \alpha_k \mathbb{I}_{A_k}(\omega).$$
The Lebesgue integral of the simple function $f$ is defined as
$$\int f \, d\mu = \sum_{k=1}^K \alpha_k \mu(A_k).$$
Suppose $\{f_n\}$ is a monotonically increasing sequence of simple functions and $f = \lim_n f_n$. Then the Lebesgue integral of $f$ is defined as
$$\int f \, d\mu = \lim_n \int f_n \, d\mu.$$
It turns out that every non-negative measurable function $f : \Omega \to \mathbb{R}_+$ can be represented as a limit of an increasing sequence of simple functions.

For a measurable function $f : \Omega \to \mathbb{R}$, let $f^+$ and $f^-$ be the positive and the negative parts of $f$, respectively, i.e., $f = f^+ - f^-$. The function $f$ is called Lebesgue integrable if
$$\int f^+ \, d\mu, \ \int f^- \, d\mu < +\infty,$$
and the Lebesgue integral of $f$ is then defined as
$$\int f \, d\mu = \int f^+ \, d\mu - \int f^- \, d\mu.$$
The set of all Lebesgue integrable functions on $(\Omega, \mathcal{A}, \mu)$ will be denoted by $L^1(\Omega, \mathcal{A}, \mu)$, or by $L^1(\Omega)$ or $L^1(\mu)$ if the latter do not lead to confusion.

If $(\Omega, \mathcal{A}, \mu)$ is a probability space, we will sometimes denote the Lebesgue integral $\int f \, d\mu$ by $E_\mu f$ or $E f$.
A.2.5. Conditional expectation with respect to a sub-σ-algebra. Suppose $(\Omega, \mathcal{A}, \mu)$ is a probability space, and $\mathcal{F}$ is a sub-σ-algebra of $\mathcal{A}$. Suppose $f \in L^1(\Omega, \mathcal{A}, \mu)$ is an $\mathcal{A}$-measurable Lebesgue integrable function. The conditional expectation of $f$ given $\mathcal{F}$, denoted by $E(f \mid \mathcal{F})$, is by definition an $\mathcal{F}$-measurable function on $\Omega$ such that
$$\int_C E(f \mid \mathcal{F}) \, d\mu = \int_C f \, d\mu$$
for every $C \in \mathcal{F}$.
A.2.6. Martingale convergence theorem. Suppose $(\Omega, \mathcal{A}, \mu)$ is a measure space, and $\{\mathcal{A}_n\}$ is a sequence of sub-σ-algebras of $\mathcal{A}$ such that $\mathcal{A}_n \subseteq \mathcal{A}_{n+1}$ for all $n$. Denote by $\mathcal{A}_\infty$ the minimal σ-algebra containing all $\mathcal{A}_n$. Then for any $f \in L^1(\Omega, \mathcal{A}, \mu)$,
$$E_\mu(f \mid \mathcal{A}_n) \to E_\mu(f \mid \mathcal{A}_\infty)$$
µ-almost surely and in $L^1(\Omega)$.
A.2.7. Absolutely continuous measures and the Radon-Nikodym theorem. Suppose $\nu$ and $\mu$ are two measures on the measurable space $(\Omega, \mathcal{A})$. The measure $\nu$ is absolutely continuous with respect to $\mu$ (denoted by $\nu \ll \mu$) if
$$\nu(A) = 0 \ \text{ for all } A \in \mathcal{A} \text{ such that } \mu(A) = 0.$$
The Radon-Nikodym theorem states that if $\mu$ and $\nu$ are two σ-finite measures on $(\Omega, \mathcal{A})$ and $\nu \ll \mu$, then there exists a non-negative measurable function $f$, called the (Radon-Nikodym) density of $\nu$ with respect to $\mu$, such that
$$\nu(A) = \int_A f \, d\mu \quad \text{for all } A \in \mathcal{A}.$$
A.3 Stochastic processes

Suppose $(\Omega, \mathcal{A}, \mu)$ is a probability space. A stochastic process $\{X_n\}$ is a collection of random variables $X_n : \Omega \to \mathbb{R}$ indexed by $n \in \mathcal{T}$, where the time index set is
$$\mathcal{T} = \mathbb{Z}_+ \ \text{or} \ \mathbb{Z}.$$
A stochastic process can be described by its finite-dimensional distributions (marginals): for every $(n_1, \ldots, n_k) \in \mathcal{T}^k$ and Borel sets $A_{n_1}, \ldots, A_{n_k} \subset \mathbb{R}$, put
$$\mu_{n_1, \ldots, n_k}(A_{n_1}, \ldots, A_{n_k}) := \mu\big( \{ \omega \in \Omega : X_{n_1}(\omega) \in A_{n_1}, \ldots, X_{n_k}(\omega) \in A_{n_k} \} \big).$$
In the opposite direction, a consistent family of finite-dimensional distributions can be used to define a stochastic process.

The process $\{X_n\}$ is called stationary if for all $(n_1, \ldots, n_k) \in \mathcal{T}^k$, $t \in \mathbb{Z}_+$, and Borel sets $A_{n_1}, \ldots, A_{n_k} \subseteq \mathbb{R}$ one has
$$\mu_{n_1+t, \ldots, n_k+t}(A_{n_1}, \ldots, A_{n_k}) = \mu_{n_1, \ldots, n_k}(A_{n_1}, \ldots, A_{n_k}),$$
i.e., the time shift does not affect the finite-dimensional marginal distributions.
A.4 Ergodic theory

Ergodic theory originates from the Boltzmann-Maxwell ergodic hypothesis and is the study of measure-preserving dynamical systems
$$(\Omega, \mathcal{A}, \mu, T),$$
where

• $(\Omega, \mathcal{A}, \mu)$ is a (Lebesgue) probability space;

• $T : \Omega \to \Omega$ is measure-preserving: for all $A \in \mathcal{A}$,
$$\mu(T^{-1} A) = \mu\big( \{ \omega \in \Omega : T(\omega) \in A \} \big) = \mu(A).$$
Example 1. Let $\Omega = \mathbb{T}^1 = \mathbb{R}/\mathbb{Z} \cong [0, 1)$, $\mu =$ Lebesgue measure:

• Circle rotation: $T_\alpha(x) = x + \alpha \mod 1$;

• Doubling map: $T(x) = 2x \mod 1$.
Example 2. Consider a finite set (alphabet) $A = \{1, \ldots, N\}$, and put
$$\Omega = A^{\mathbb{Z}_+} = \{ \omega = (\omega_n)_{n \geq 0} : \omega_n \in A \}.$$
The corresponding Borel σ-algebra $\mathcal{A}$ is generated by the cylinder sets
$$[a_{i_1}, \ldots, a_{i_n}] = \{ \omega : \omega_{i_1} = a_{i_1}, \ldots, \omega_{i_n} = a_{i_n} \}, \qquad a_{i_k} \in A.$$
A measure $p = (p(1), \ldots, p(N))$ on $A$ can be extended to a measure $\mu = p^{\mathbb{Z}_+}$ on $(\Omega, \mathcal{A})$:
$$\mu\big( \{ \omega : \omega_{i_1} = a_{i_1}, \ldots, \omega_{i_n} = a_{i_n} \} \big) = p(a_{i_1}) \cdots p(a_{i_n}).$$
The measure $\mu$ is clearly preserved by the left shift $\sigma : \Omega \to \Omega$,
$$(\sigma\omega)_n = \omega_{n+1} \quad \forall n.$$
The quadruple $(\Omega, \mathcal{A}, \mu, \sigma)$ is called the $p$-Bernoulli shift.
Definition A.1. A measure-preserving dynamical system $(\Omega, \mathcal{A}, \mu, T)$ is ergodic if every invariant set is trivial:
$$A = T^{-1} A \ \Rightarrow\ \mu(A) = 0 \text{ or } 1;$$
equivalently, if every invariant function is constant:
$$f(\omega) = f(T(\omega)) \ (\mu\text{-a.e.}) \ \Rightarrow\ f(\omega) = \text{const} \ (\mu\text{-a.e.}).$$
Theorem A.2 (Birkhoff's Pointwise Ergodic Theorem). Suppose $(\Omega, \mathcal{A}, \mu, T)$ is an ergodic measure-preserving dynamical system. Then for all $f \in L^1(\Omega, \mu)$,
$$\frac{1}{n} \sum_{t=0}^{n-1} f(T^t(x)) \to \int f(x) \, \mu(dx) \quad \text{as } n \to \infty,$$
µ-almost surely and in $L^1$.
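The theorem is easy to illustrate numerically for the irrational circle rotation of Example 1, which is ergodic for Lebesgue measure; the rotation angle, test function, and initial point below are illustrative choices.

```python
import math

# Circle rotation T(x) = x + alpha mod 1 with irrational alpha is ergodic
# w.r.t. Lebesgue measure, so Birkhoff averages of f converge to its integral.
alpha = math.sqrt(2) - 1                # an irrational rotation angle
f = lambda x: 1.0 if x < 0.5 else 0.0   # indicator of [0, 1/2), integral = 1/2

x, n = 0.1, 200_000
avg = 0.0
for t in range(n):
    avg += f(x)
    x = (x + alpha) % 1.0
avg /= n

assert abs(avg - 0.5) < 0.01   # time average is close to the space average
```

(The doubling map, by contrast, is a poor candidate for such a floating-point experiment: iterating $x \mapsto 2x \bmod 1$ exhausts the binary mantissa after roughly 53 steps.)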
A.5 Entropy

Suppose $p = (p_1, \ldots, p_N)$ is a probability vector, i.e.,
$$p_i \geq 0, \qquad \sum_{i=1}^N p_i = 1.$$
The entropy of $p$ is
$$H(p) = -\sum_{i=1}^N p_i \log_2 p_i.$$
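Numerically this is a one-liner; a minimal sketch (the helper name and test vectors are ours, with the usual convention $0 \log 0 = 0$):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log2 p_i, with the convention 0 log 0 = 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

assert entropy([0.5, 0.5]) == 1.0                       # fair coin: 1 bit
assert entropy([1.0, 0.0]) == 0.0                       # deterministic: 0 bits
assert abs(entropy([0.25] * 4) - 2.0) < 1e-12           # uniform on 4 letters
```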
A.5.1 Shannon's entropy rate per symbol

Definition A.3. Suppose $Y = \{Y_k\}$ is a stationary process with values in a finite alphabet $A$, and $\mu$ is the corresponding translation invariant measure. Fix $n \in \mathbb{N}$, and consider the distribution of $n$-tuples $(Y_0, \ldots, Y_{n-1}) \in A^n$. Denote by $H_n$ the entropy of this distribution:
$$H_n = H(Y_0^{n-1}) = -\sum_{(a_0, \ldots, a_{n-1}) \in A^n} P[Y_0^{n-1} = a_0^{n-1}] \log_2 P[Y_0^{n-1} = a_0^{n-1}].$$
The entropy (rate) of the process $Y = \{Y_k\}$, equivalently, of the measure $P$, denoted by $h(Y)$, $h(P)$, or $h_\sigma(P)$, is defined as
$$h(Y) = h(P) = h_\sigma(P) = \lim_{n\to\infty} \frac{1}{n} H_n$$
(the limit exists!).

Similarly, the entropy can be defined for stationary random fields $\{Y_n\}_{n \in \mathbb{Z}^d}$, $Y_n \in A$. Let $\Lambda_n = [0, n-1]^d \cap \mathbb{Z}^d$. Then
$$h(Y) = \lim_{n\to\infty} -\frac{1}{|\Lambda_n|} \sum_{a_{\Lambda_n} \in A^{\Lambda_n}} P[Y_{\Lambda_n} = a_{\Lambda_n}] \log_2 P[Y_{\Lambda_n} = a_{\Lambda_n}];$$
again, one can easily show that the limit exists.
A.5.2 Kolmogorov-Sinai entropy of measure-preserving systems

Suppose $(\Omega, \mathcal{A}, \mu, T)$ is a measure-preserving dynamical system, and suppose
$$\mathcal{C} = \{C_1, \ldots, C_N\}$$
is a finite measurable partition of $\Omega$, i.e., a partition of $\Omega$ into measurable sets $C_k \in \mathcal{A}$.

For every $\omega \in \Omega$ and $n \in \mathbb{Z}_+$ (or $\mathbb{Z}$), put
$$Y_n(\omega) = Y_n^{\mathcal{C}}(\omega) = j \in \{1, \ldots, N\} \ \iff\ T^n(\omega) \in C_j.$$
Proposition: For any $\mathcal{C}$, the corresponding process $Y^{\mathcal{C}} = \{Y_n\}$, $Y_n : \Omega \to \{1, \ldots, N\}$, is a stationary process with
$$P[Y_0 = j_0, \ldots, Y_n = j_n] = \mu\big( \{ \omega \in \Omega : \omega \in C_{j_0}, \ldots, T^n(\omega) \in C_{j_n} \} \big).$$
Definition A.4. If $(\Omega, \mathcal{A}, \mu, T)$ is a measure-preserving dynamical system, and $\mathcal{C} = \{C_1, \ldots, C_N\}$ is a finite measurable partition of $\Omega$, then the entropy of $(\Omega, \mathcal{A}, \mu, T)$ with respect to $\mathcal{C}$ is defined as the Shannon entropy of the corresponding symbolic process $Y^{\mathcal{C}}$:
$$h_\mu(T, \mathcal{C}) = h(Y^{\mathcal{C}}).$$
Finally, the measure-theoretic, or Kolmogorov-Sinai, entropy of $(\Omega, \mathcal{A}, \mu, T)$ is defined as
$$h_\mu(T) = \sup_{\mathcal{C} \text{ finite}} h_\mu(T, \mathcal{C}).$$
The following theorem of Sinai eliminates the need to consider all finite partitions.

Definition A.5. A partition $\mathcal{C}$ is called generating (or a generator) for the dynamical system $(\Omega, \mathcal{A}, \mu, T)$ if the smallest σ-algebra containing the sets $T^n(C_j)$, $j = 1, \ldots, N$, $n \in \mathbb{Z}$, is $\mathcal{A}$.

Theorem A.6 (Ya. Sinai). If $\mathcal{C}$ is a generating partition, then
$$h_\mu(T) = h_\mu(T, \mathcal{C}).$$
Bibliography
[1] D. Capocaccia, A definition of Gibbs state for a compact set with $\mathbb{Z}^\nu$ action, Comm. Math. Phys. 48 (1976), no. 1, 85–88. MR0415675 (54 #3755) ↑28
[2] R. L. Dobrushin, Description of a random field by means of conditional probabilities and the conditions governing its regularity, Theor. Prob. Appl. 13 (1968), 197–224. ↑12
[3] O. E. Lanford III and D. Ruelle, Observables at infinity and states with short rangecorrelations in statistical mechanics, Comm. Math. Phys. 13 (1969), 194–215. ↑12
[4] Ja. G. Sinaĭ, Gibbs measures in ergodic theory, Uspehi Mat. Nauk 27 (1972), no. 4(166), 21–64 (Russian). MR0399421 (53 #3265) ↑27
[5] O. K. Kozlov, A Gibbs description of a system of random variables, Problemy PeredaciInformacii 10 (1974), no. 3, 94–103. ↑16
[6] W. G. Sullivan, Potentials for almost Markovian random fields, Comm. Math. Phys. 33(1973), 61–74. ↑16
[7] Hans-Otto Georgii, Gibbs measures and phase transitions, de Gruyter Studies in Math-ematics, vol. 9, Walter de Gruyter & Co., Berlin, 1988. ↑11, 16
[8] , Gibbs measures and phase transitions, 2nd ed., de Gruyter Studies in Math-ematics, vol. 9, Walter de Gruyter & Co., Berlin, 2011. MR2807681 (2012d:82015)↑11
[9] François Ledrappier, Principe variationnel et systèmes dynamiques symboliques, Z.Wahrscheinlichkeitstheorie und Verw. Gebiete 30 (1974), 185–202. MR0404584 (53
#8384) ↑21, 27
[10] N. Berger, C. Hoffman, and V. Sidoravicius, Nonuniqueness for specifications in `2+ε,ArXiv Mathematics e-prints (2003), available at math/0312344. ↑27
[11] Rufus Bowen, Equilibrium states and the ergodic theory of Anosov diffeomorphisms,Lecture Notes in Mathematics, Vol. 470, Springer-Verlag, Berlin, 1975. MR0442989
(56 #1364) ↑11, 28
[12] , Equilibrium states and the ergodic theory of Anosov diffeomorphisms, Secondrevised edition, Lecture Notes in Mathematics, vol. 470, Springer-Verlag, Berlin,2008. With a preface by David Ruelle; Edited by Jean-René Chazottes. MR2423393
(2009d:37038) ↑11
[13] Maury Bramson and Steven Kalikow, Nonuniqueness in g-functions, Israel J. Math.84 (1993), no. 1-2, 153–160, DOI 10.1007/BF02761697. MR1244665 (94h:28011) ↑27
[14] Wolfgang Doeblin and Robert Fortet, Sur des chaînes à liaisons complètes, Bull. Soc.Math. France 65 (1937), 132–148 (French). MR1505076 ↑20
[15] Sandro Gallo and Frédéric Paccaut, On non-regular g-measures, Nonlinearity 26(2013), no. 3, 763–776, DOI 10.1088/0951-7715/26/3/763. MR3033569 ↑27
[16] Anders Johansson, Anders Öberg, and Mark Pollicott, Unique Bernoulli g-measures,J. Eur. Math. Soc. (JEMS) 14 (2012), no. 5, 1599–1615, DOI 10.4171/JEMS/342.MR2966661 ↑27
[17] Michael Keane, Strongly mixing g-measures, Invent. Math. 16 (1972), 309–324.MR0310193 (46 #9295) ↑20
[18] David Ruelle, Thermodynamic formalism, Encyclopedia of Mathematics and its Ap-plications, vol. 5, Addison-Wesley Publishing Co., Reading, Mass., 1978. The math-ematical structures of classical equilibrium statistical mechanics; With a forewordby Giovanni Gallavotti and Gian-Carlo Rota. MR511655 (80g:82017) ↑11, 16
[19] R. Fernández and G. Maillard, Chains with complete connections and one-dimensionalGibbs measures, Electron. J. Probab. 9 (2004), 145–176. ↑27
[20] Iain L. MacDonald and Walter Zucchini, Hidden Markov and other models for discrete-valued time series, Monographs on Statistics and Applied Probability, vol. 70, Chap-man & Hall, London, 1997. MR1692202 (2000g:62009) ↑75
[21] Timo Koski, Hidden Markov models for bioinformatics, Computational Biology Series,vol. 2, Kluwer Academic Publishers, Dordrecht, 2001. MR1888250 (2003g:62003)↑75
[22] Walter Zucchini and Iain L. MacDonald, Hidden Markov models for time series, Mono-graphs on Statistics and Applied Probability, vol. 110, CRC Press, Boca Raton, FL,2009. An introduction using R. MR2523850 (2010k:62001) ↑75
[23] L. Rabiner and B.H. Juang, An introduction to hidden Markov models, ASSP Magazine,IEEE 3 (1986), no. 1, 4-16. ↑75
[24] Yariv Ephraim and Neri Merhav, Hidden Markov processes, IEEE Trans. Inform.Theory 48 (2002), no. 6, 1518–1569, DOI 10.1109/TIT.2002.1003838. Special issue onShannon theory: perspective, trends, and applications. MR1909472 (2003f:94024)↑75