Source: folk.uio.no/geirs/STK9200/Juri_Graphical.pdf

Page 1

Probabilistic graphical models

Juri Kuronen

September 5, 2019

Page 2

Introduction to graphical models

Koller et al. (2007):

Graphical models are an elegant framework to compactly represent complex, real-world phenomena.

In a nutshell, graphical models combine dealing with uncertainty through the use of probability theory and coping with complexity through the use of graph theory.

The framework is general: many common statistical models, such as hidden Markov models, Ising models etc., can be described as graphical models.

Page 3

Introduction to graphical models

At a high level, the goal is to efficiently represent a joint distribution over some set of random variables $X_V = \{X_1, \ldots, X_d\}$, here assumed to be discrete and taking values in $\mathcal{X}_V = \prod_{j=1}^{d} \mathcal{X}_j$.

Even in the simplest case where the variables are binary-valued, a joint distribution requires the specification of $2^d$ numbers.

However, it is often the case that there is some structure in the distribution that allows us to factor the representation of the distribution into modular components.

The structure that graphical models exploit is the independence properties that exist in many real-world phenomena.

Page 4

Introduction to graphical models

The two common types of graphical models are Bayesian networks (aka directed graphical models, belief networks or causal networks) and Markov networks (aka undirected graphical models or Markov random fields).

The structure of a graphical model is presented by a graph, which compactly encodes the dependence relations between the variables.

A set of numerical parameters over this structure then specifies the joint distribution of the model.

Page 5

A brief history

Gibbs (1902) in statistical physics: representing the distribution of a system of interacting particles.

Wright (1921) in genetics: studying inheritance in natural species; Wright's work was later utilized in economics (Wold, 1954) and social sciences (Blalock, 1971).

Breakthrough in the 80's with proper theoretical development by Lauritzen and Spiegelhalter (1988): "Local computations with probabilities on graphical structures and their application to expert systems".

Bayesian network framework by Pearl (1988), soon utilized in expert systems, e.g. Pathfinder by Heckerman et al. (1992), a system that assists surgical pathologists with the diagnosis of lymph-node diseases.

Since then applied to a large number of fields, including bioinformatics, social science, control theory, image processing, marketing analysis, etc.

Page 6

Graphs in the context of graphical models

In mathematics, and more specifically in graph theory, a graph consists of a set of objects called vertices or nodes and a set of edges linking pairs of vertices which are in some sense related.

Formally, a graph is a pair $G = (V, E)$, where $V = \{1, \ldots, d\}$ is the vertex set and $E \subset V \times V$ is the edge set.

In the context of graphical models, $G$ is a graph over $X_V$, where each node in $V$ corresponds to a random variable in the set $X_V$ and the edges $E$ represent dependencies between the variables.
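As a small illustration (not part of the original slides; the `neighbors` helper is my own), the pair $G = (V, E)$ maps directly onto basic Python data structures:

```python
# A graph G = (V, E): a vertex set and a set of unordered edges.
# Vertices are labeled 1..d as in the slides; the numbers are arbitrary.
d = 4
V = set(range(1, d + 1))
E = {frozenset(e) for e in [(1, 2), (1, 3), (2, 4), (3, 4)]}

def neighbors(j, edges):
    """Vertices sharing an edge with j."""
    return {k for e in edges if j in e for k in e if k != j}
```

For example, `neighbors(1, E)` returns `{2, 3}`.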

Page 7

Bayesian networks

Figure: An example of a directed (acyclic) graph over $X_1, \ldots, X_6$ (edges: $X_1 \to X_2$, $X_2 \to X_3$, $X_1 \to X_4$, $X_3 \to X_4$, $X_3 \to X_5$, $X_5 \to X_6$).

Bayesian networks:

Useful for modeling probabilistic influence between variables that have clear directionality.

Often used to represent causal relationships.

For example hidden Markov models or neural networks can be considered special cases of Bayesian networks.

Page 8

Factorization in Bayesian networks

Figure: An example of a directed (acyclic) graph over $X_1, \ldots, X_6$ (edges: $X_1 \to X_2$, $X_2 \to X_3$, $X_1 \to X_4$, $X_3 \to X_4$, $X_3 \to X_5$, $X_5 \to X_6$).

Factorization in Bayesian networks:

$p(X_1, \ldots, X_6) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_1, X_2) \cdots p(X_6 \mid X_1, \ldots, X_5)$

$= p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_2)\, p(X_4 \mid X_1, X_3)\, p(X_5 \mid X_3)\, p(X_6 \mid X_5)$

$= \prod_{j=1}^{6} p(X_j \mid X_{\mathrm{parents}(j)})$
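The chain-rule factorization above can be turned into a few lines of code. A hedged sketch (the network, CPT layout and numbers below are invented for illustration, using a smaller chain $X_1 \to X_2 \to X_3$):

```python
# Joint probability of a Bayesian network: p(x) = prod_j p(x_j | x_parents(j)).
# Toy chain X1 -> X2 -> X3 with binary states; the CPT numbers are invented.
parents = {1: (), 2: (1,), 3: (2,)}
# cpt[j][parent_states][x_j] = p(x_j | x_parents(j) = parent_states)
cpt = {
    1: {(): {0: 0.6, 1: 0.4}},
    2: {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    3: {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.1, 1: 0.9}},
}

def joint(x):
    """p(x) for a full assignment x = {variable: state}."""
    p = 1.0
    for j, pa in parents.items():
        p *= cpt[j][tuple(x[k] for k in pa)][x[j]]
    return p
```

Summing `joint` over all $2^3$ assignments gives 1, as required of a distribution.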

Page 9

Markov networks

Figure: An example of an undirected graph over $X_1, \ldots, X_7$ (edges: $X_1$–$X_2$, $X_1$–$X_3$, $X_2$–$X_4$, $X_3$–$X_4$, $X_4$–$X_5$, $X_4$–$X_6$, $X_5$–$X_6$, $X_5$–$X_7$, $X_6$–$X_7$).

Markov networks:

Useful in modeling phenomena where one cannot naturally ascribe a directionality to the interactions between variables.

The interactions can be thought of as more like "correlations".

Simpler to understand independence structure and factorization, but computationally more demanding.

Page 10

Markov properties in Markov networks

Figure: An example of an undirected graph over $X_1, \ldots, X_7$ (edges: $X_1$–$X_2$, $X_1$–$X_3$, $X_2$–$X_4$, $X_3$–$X_4$, $X_4$–$X_5$, $X_4$–$X_6$, $X_5$–$X_6$, $X_5$–$X_7$, $X_6$–$X_7$).

The edge set $E$ of a Markov network implies the following Markov properties (in the context of Markov networks, the set of neighbors of node $j$ is called the Markov blanket of $j$, denoted by $\mathrm{mb}(j)$):

Pairwise Markov property: $X_j \perp\!\!\!\perp X_{j'} \mid X_{V \setminus \{j, j'\}}$ for all $(j, j') \notin E$.

Local Markov property: $X_j \perp\!\!\!\perp X_{V \setminus (\{j\} \cup \mathrm{mb}(j))} \mid X_{\mathrm{mb}(j)}$ for all $j \in V$.

Global Markov property: $X_A \perp\!\!\!\perp X_B \mid X_S$ for all disjoint subsets $A, B, S \subset V$ for which $S$ separates $A$ from $B$.
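The global Markov property rests on graph separation, which is easy to check mechanically: $S$ separates $A$ from $B$ if every path between them passes through $S$. A sketch (the function and adjacency encoding are mine; the graph is the 7-node example from these slides):

```python
from collections import deque

def separates(S, A, B, adj):
    """True if the set S blocks every path between A and B in the undirected graph adj."""
    S, targets = set(S), set(B)
    start = set(A) - S
    seen, queue = set(start), deque(start)
    while queue:
        u = queue.popleft()
        if u in targets:
            return False
        for v in adj[u]:
            if v not in S and v not in seen:
                seen.add(v)
                queue.append(v)
    return True

# Adjacency of the 7-node example graph, read off the clique factorization
# phi(X1,X2) phi(X1,X3) phi(X2,X4) phi(X3,X4) phi(X4,X5,X6) phi(X5,X6,X7).
adj = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3, 5, 6},
       5: {4, 6, 7}, 6: {4, 5, 7}, 7: {5, 6}}
```

For this graph, `separates({4}, {1, 2, 3}, {5, 6, 7}, adj)` is `True`, matching $X_{\{1,2,3\}} \perp\!\!\!\perp X_{\{5,6,7\}} \mid X_4$.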

Page 11

Factorization in Markov networks

The independence properties of the joint probability distribution $p(X_V)$ encoded in graph $G$ imply that $p(X_V)$ factorizes as a product of potential functions over the structure of $G$:

$p(X_V) = \frac{1}{Z} \prod_{C \in \mathcal{C}(G)} \phi_C(X_C),$

where $\mathcal{C}(G)$ contains the cliques of graph $G$, $\phi_C(X_C)$ is a function taking real values and $Z$ is a normalizing constant, called the partition function, defined as

$Z = \sum_{x \in \mathcal{X}} \prod_{C \in \mathcal{C}(G)} \phi_C(x_C).$

Inference in general unstructured Markov networks is difficult due to the intractability of the partition function $Z$.
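To make the partition function concrete, here is a brute-force evaluation on a toy binary model (the potential values are invented). The sum over $\mathcal{X}$ has exponentially many terms, which is exactly the intractability referred to above:

```python
from itertools import product

# Toy pairwise Markov network over binary X1, X2, X3 with cliques {1,2} and {2,3}.
# Potential values are invented for illustration.
phi = {
    (1, 2): {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0},
    (2, 3): {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0},
}

def unnormalized(x):
    """Product of clique potentials for a full outcome x = (x1, x2, x3)."""
    return phi[(1, 2)][(x[0], x[1])] * phi[(2, 3)][(x[1], x[2])]

# The partition function sums the clique products over the whole outcome space.
Z = sum(unnormalized(x) for x in product((0, 1), repeat=3))

def p(x):
    return unnormalized(x) / Z
```

Here the outcome space has only $2^3$ elements; for $d$ binary variables the sum has $2^d$ terms.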

Page 12

Factorization in Markov networks

Figure: An example of an undirected graph over $X_1, \ldots, X_7$ (edges: $X_1$–$X_2$, $X_1$–$X_3$, $X_2$–$X_4$, $X_3$–$X_4$, $X_4$–$X_5$, $X_4$–$X_6$, $X_5$–$X_6$, $X_5$–$X_7$, $X_6$–$X_7$).

Factorization in Markov networks:

$p(X_1, \ldots, X_7) = \frac{1}{Z}\, \phi(X_1, X_2)\, \phi(X_1, X_3)\, \phi(X_2, X_4)\, \phi(X_3, X_4)\, \phi(X_4, X_5, X_6)\, \phi(X_5, X_6, X_7)$

Page 13

Structure learning of graphical models

Although the ultimate goal of a graphical model is to efficiently represent a multivariate distribution, the graph alone is also useful for gaining insight into complex dependency patterns among large collections of variables.

The structure learning problem is hard: the space of graphs is a combinatorial space consisting of a superexponential number of structures, $2^{O(d^2)}$.

For example, the number of undirected graph structures with 10 nodes is already 35 184 372 088 832.

Structure learning algorithms can generally be classified into two broad categories: they either use a constraint-based or a score-based approach to optimize the network topology.
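The superexponential count quoted above is $2^{\binom{d}{2}}$, one bit per possible edge; a quick stdlib check reproduces the 10-node figure:

```python
from math import comb

def n_undirected_graphs(d):
    """Number of labeled undirected graphs on d nodes: one bit per possible edge."""
    return 2 ** comb(d, 2)

print(n_undirected_graphs(10))  # 35184372088832
```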

Page 14

Constraint-based structure learning

Constraint-based approaches infer the structure through a series of statistical independence tests.

The local nature of the tests makes this an attractive approach from the computational scalability perspective.

However, a particular drawback is that the individual tests are sensitive to noise, which can result in incorrect independence assumptions.

For example, an edge discarded by an earlier test can no longer be recovered.

Page 15

Score-based structure learning

Score-based approaches operate globally by formulating the structure learning problem as an optimization problem.

This requires a scoring function by which the plausibility of the different candidates can be evaluated.

Additionally, this requires an optimization algorithm for finding high-scoring graphs, since an exhaustive evaluation is in general infeasible.

Page 16

Bayesian–Dirichlet score for Bayesian networks

Let's start by introducing a Bayesian approach for learning Bayesian networks (Heckerman et al., 1995).

In the Bayesian approach, we are of course interested in the posterior probability:

$p(G \mid X_V) = \frac{p(X_V \mid G)\, p(G)}{p(X_V)}.$

A natural approach would be to take the maximum likelihood of $p(X_V \mid G)$. However, we run into overfitting problems.

Instead, $p(X_V \mid G)$ is the marginal likelihood, evaluated as

$p(X_V \mid G) = \int p(X_V \mid \theta_G, G)\, p(\theta_G \mid G)\, d\theta_G.$

The model averaging in the marginal likelihood is a Bayesian way to deal with overfitting.

Page 17

Bayesian–Dirichlet score for Bayesian networks

With some assumptions, the integral in the marginal likelihood can be solved in closed form.

First, assume that the data is a multinomial sample from a Bayesian network such that we have

$\theta_{jkl} = p(x_j = l \mid x_{\mathrm{pa}(j)} = k)$

for all $l = 1, \ldots, r_j$ and $k = 1, \ldots, q_j$, where $r_j = |\mathcal{X}_j|$ and $q_j = |\mathcal{X}_{\mathrm{parents}(j)}|$.

We can then write

$p(x_V \mid \theta_G, G) = \prod_{j=1}^{d} \prod_{k=1}^{q_j} \prod_{l=1}^{r_j} \theta_{jkl}^{n_{jkl}}.$

Here, $n_{jkl}$ is the number of observations where $x_j$ is in state $l$ and its parents are in state $k$.
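The counts $n_{jkl}$ are simple sufficient statistics; a sketch of how they might be tallied (the data layout, 0-based variable indexing and names are my own):

```python
from collections import Counter

# parents[j] lists the parent indices of variable j; toy network X0 -> X1 (0-based).
parents = {0: (), 1: (0,)}
# Each row is one joint observation (x0, x1).
data = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

def counts(data, parents):
    """n[j][(k, l)] = number of rows with parents of j in state k and x_j = l."""
    n = {j: Counter() for j in parents}
    for row in data:
        for j, pa in parents.items():
            k = tuple(row[i] for i in pa)
            n[j][(k, row[j])] += 1
    return n
```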

Page 18

Bayesian–Dirichlet score for Bayesian networks

Three key assumptions are related to parameter independence (PI). The parameters associated with each variable (global PI) and the parameters associated with each state of the parents of a variable (local PI) are assumed independent. Finally, it's assumed that $p(\theta_{jk} \mid G_1) = p(\theta_{jk} \mid G_2)$ if $X_j$ has the same parents in graphs $G_1$ and $G_2$ (parameter modularity).

This allows us to write

$p(\theta_G \mid G) = \prod_{j=1}^{d} p(\theta_j \mid G) = \prod_{j=1}^{d} \prod_{k=1}^{q_j} p(\theta_{jk} \mid G),$

where $\theta_G = \bigcup_{j=1}^{d} \theta_j$, $\theta_j = \bigcup_{k=1}^{q_j} \theta_{jk}$ and $\theta_{jk} = \bigcup_{l=1}^{r_j} \theta_{jkl}$.

The PI assumptions don't always hold in practice, but they are required from a computational convenience perspective.

Page 19

Bayesian–Dirichlet score for Bayesian networks

Finally, assuming that $\theta_{jk} \sim \mathrm{Dirichlet}(\alpha_{jk1}, \ldots, \alpha_{jkr_j})$, we get

$p(X_V \mid G) = \prod_{j=1}^{d} \prod_{k=1}^{q_j} \frac{\Gamma(\alpha_{jk})}{\Gamma(\alpha_{jk} + n_{jk})} \prod_{l=1}^{r_j} \frac{\Gamma(\alpha_{jkl} + n_{jkl})}{\Gamma(\alpha_{jkl})}.$

Here $\alpha_{jk} = \sum_{l=1}^{r_j} \alpha_{jkl}$ and $n_{jk} = \sum_{l=1}^{r_j} n_{jkl}$.

Note that the above factorizes straightforwardly as $p(X_V \mid G) = \prod_{j=1}^{d} p(X_j \mid X_{\mathrm{parents}(j)}, G)$. Thus, we can write:

$p(G \mid X_V) = \mathrm{const} \cdot p(G) \cdot \prod_{j=1}^{d} p(X_j \mid X_{\mathrm{parents}(j)}, G).$

The constant is $\frac{1}{p(X_V)}$, which disappears when comparing graph structures.
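The closed-form expression is usually evaluated in log space via the log-gamma function. A per-family sketch under the simplifying assumption of a symmetric hyperparameter $\alpha_{jkl} = \alpha$ for all $l$ (the function name and data layout are mine):

```python
from math import lgamma, exp

def log_ml_family(counts_jk, alpha=1.0):
    """Log marginal likelihood contribution of one variable.
    counts_jk[k] is the list of counts n_jkl over states l, for parent state k.
    Assumes the symmetric hyperparameter alpha_jkl = alpha for all l."""
    total = 0.0
    for n_jkl in counts_jk.values():
        a_jk = alpha * len(n_jkl)          # alpha_jk = sum_l alpha_jkl
        n_jk = sum(n_jkl)                  # n_jk = sum_l n_jkl
        total += lgamma(a_jk) - lgamma(a_jk + n_jk)
        for n in n_jkl:
            total += lgamma(alpha + n) - lgamma(alpha)
    return total
```

For a single binary variable with no parents, $\alpha = 1$ and one observation of each state, this gives $\Gamma(2)/\Gamma(4) \cdot \Gamma(2)\Gamma(2)/(\Gamma(1)\Gamma(1)) = 1/6$.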

Page 20

Score-based structure learning of Markov networks

Factorizing the score into conditional distributions is not as straightforward for Markov networks due to the partition function.

For this reason, the earliest score-based methods were limited to models which constrained the underlying graph to be chordal, since such graphs can be perfectly represented by Bayesian networks.

However, the recent surge of pseudo-likelihood-based methods has made learning of general, non-chordal Markov network structures possible.

Page 21

Pseudo-Likelihood

In the pseudo-likelihood, introduced originally by Besag (1975), the joint probability of an outcome is replaced by a product of variable-wise conditional distributions:

$p(X_V \mid G) \approx \prod_{j=1}^{d} p(X_j \mid X_{V \setminus j}, G).$

Under certain assumptions, which generally hold if we assume that the data was generated from a Markov network, the pseudo-likelihood is a consistent estimator of the model parameters (Koller and Friedman, 2009).

The major advantage of this approximation is that the full conditional distributions for each variable have a surprisingly simple form:

$\prod_{j=1}^{d} p(X_j \mid X_{V \setminus j}, G) = \prod_{j=1}^{d} p(X_j \mid X_{\mathrm{mb}(j)}, G).$

Page 22

Derivation of $p(X_j \mid X_{V \setminus j}, G) = p(X_j \mid X_{\mathrm{mb}(j)}, G)$

Partition $\mathcal{C}(G)$ into two disjoint sets $\mathcal{C}_j = \{C \in \mathcal{C}(G) : j \in C\}$ and $\mathcal{C}_{\setminus j} = \{C \in \mathcal{C}(G) : j \notin C\}$. That is, $\mathcal{C}(G) = \mathcal{C}_j \cup \mathcal{C}_{\setminus j}$. Now,

$p(x_j \mid x_{V \setminus j}, G) = \frac{p(x_j, x_{V \setminus j} \mid G)}{\sum_{x_j} p(x_j, x_{V \setminus j} \mid G)} = \frac{p(x_V \mid G)}{\sum_{x_j} p(x_V \mid G)} = \frac{\frac{1}{Z} \left[\prod_{C \in \mathcal{C}_j} \phi(x_C)\right] \left[\prod_{C \in \mathcal{C}_{\setminus j}} \phi(x_C)\right]}{\frac{1}{Z} \sum_{x_j} \left[\prod_{C \in \mathcal{C}_j} \phi(x_C)\right] \left[\prod_{C \in \mathcal{C}_{\setminus j}} \phi(x_C)\right]} = \frac{\prod_{C \in \mathcal{C}_j} \phi(x_C)}{\sum_{x_j} \prod_{C \in \mathcal{C}_j} \phi(x_C)}$

The key observation is that the problematic normalizing constant $Z$ has disappeared, replaced with local normalizing constants that are easily computed. Notice also that the last expression contains only factors involving $X_j$ and its neighboring variables (the Markov blanket of $X_j$), allowing efficient computation. This, in passing, demonstrates the local Markov property.
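The cancellation can be verified numerically on a toy chain $X_1 - X_2 - X_3$ (made-up potentials): the conditional of $X_1$ computed from the full unnormalized joint coincides with the locally normalized product over the cliques containing $X_1$, and $X_3$ drops out entirely:

```python
# Chain Markov network X1 - X2 - X3 with hypothetical pairwise potentials.
phi12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi23 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0}

def joint_unnorm(x1, x2, x3):
    return phi12[(x1, x2)] * phi23[(x2, x3)]

def cond_full(x1, x2, x3):
    """p(x1 | x2, x3) from the full joint; the constant Z cancels in the ratio."""
    return joint_unnorm(x1, x2, x3) / sum(joint_unnorm(v, x2, x3) for v in (0, 1))

def cond_local(x1, x2):
    """The same conditional using only the clique {1, 2} containing X1,
    i.e. conditioning on the Markov blanket mb(1) = {2} alone."""
    return phi12[(x1, x2)] / sum(phi12[(v, x2)] for v in (0, 1))
```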

Page 23

Marginal Pseudo-Likelihood (MPL)

With the pseudo-likelihood, a similar formulation as in the Bayesian–Dirichlet score can be used for Markov network structure learning: the parameters $\theta_{jkl}$ are instead defined with respect to $p(x_j = l \mid x_{\mathrm{mb}(j)} = k)$.

Additionally, define the prior $p(G)$ in terms of mutually independent prior beliefs on the individual Markov blankets:

$\log p(G) = \sum_{j=1}^{d} \log p(\mathrm{mb}(j) \mid G).$

Now, the MPL score is (with some abuse of notation):

$\log p(G \mid X_V) = \sum_{j=1}^{d} \mathrm{MPL}(j \mid \mathrm{mb}(j)) = \sum_{j=1}^{d} \left[\log p(X_j \mid X_{\mathrm{mb}(j)}, G) + \log p(\mathrm{mb}(j) \mid G)\right].$

Page 24

MPL optimization problem

Since a graph can be defined by its collection of Markov blankets, we are left with the following optimization problem:

$\underset{\{\mathrm{mb}(j)\}_{j=1}^{d}}{\arg\max} \left[\sum_{j=1}^{d} \mathrm{MPL}(j \mid \mathrm{mb}(j))\right].$

For a globally consistent graph structure, the above is subject to $j' \in \mathrm{mb}(j) \Leftrightarrow j \in \mathrm{mb}(j')$ for all $j, j' \in V$.

Due to the vast discrete optimization space, the optimization problem is clearly intractable for large systems.

Page 25

Two-phase search algorithm

Pensar et al. (2017) utilize a two-phase search algorithm to solve the problem:

In the first phase, the restriction in the problem is relaxed, resulting in $d$ independent Markov blanket discovery problems.

Each Markov blanket can be learned using a greedy hill-climbing algorithm that is based on two basic operations.

At each iteration, the algorithm adds to the Markov blanket the node that induces the greatest score improvement.

Each addition step is interleaved with a deletion phase, where the algorithm instead chooses a node to delete if it results in a score improvement.

The $d$ solutions can be combined into a consistent solution as:

$E_\vee = \{(j, j') \in V \times V : j \in \mathrm{mb}(j') \vee j' \in \mathrm{mb}(j)\}.$
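Combining the $d$ blanket estimates with the OR-rule is a one-liner over edge sets; a sketch with hypothetical (and deliberately inconsistent) blanket estimates:

```python
def combine_or(mb):
    """E_or: keep edge {j, j'} if j' is in mb(j) OR j is in mb(j')."""
    return {frozenset((j, k)) for j, blanket in mb.items() for k in blanket}

# Hypothetical first-phase output for 4 variables; note mb(3) omits both 2 and 4.
mb = {1: {2}, 2: {1, 3}, 3: set(), 4: {3}}
```

The edges {2, 3} and {3, 4} survive even though mb(3) is empty, which is exactly the asymmetry the OR-rule resolves.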

Page 26

Local MPL optimization

Algorithm 1: Hill-climbing algorithm to approximately solve $\widehat{\mathrm{mb}}(j)$.

    mb(j), m̂b(j) ← ∅
    while m̂b(j) has changed do
        mb(j) ← m̂b(j)
        foreach j' ∈ V \ {mb(j) ∪ {j}} do
            if MPL(j | mb(j) ∪ j') > MPL(j | m̂b(j)) then
                m̂b(j) ← mb(j) ∪ j'
        while m̂b(j) has changed and |m̂b(j)| > 2 do
            mb(j) ← m̂b(j)
            foreach j' ∈ mb(j) do
                if MPL(j | mb(j) \ j') > MPL(j | m̂b(j)) then
                    m̂b(j) ← mb(j) \ j'
    return m̂b(j)
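Algorithm 1 can be sketched generically in Python with the MPL score abstracted into a callback (all names are mine; unlike the pseudocode above, this sketch accepts any improving move rather than the single best one, and omits the blanket-size condition):

```python
def greedy_blanket(j, V, score):
    """Greedy add/delete search for the Markov blanket of node j.
    score(j, mb) stands in for MPL(j | mb); higher is better."""
    best = frozenset()
    changed = True
    while changed:
        changed = False
        # Addition step: try adding each node outside the current blanket.
        for k in V - best - {j}:
            if score(j, best | {k}) > score(j, best):
                best, changed = best | {k}, True
        # Deletion step: try removing each node currently in the blanket.
        for k in set(best):
            if score(j, best - {k}) > score(j, best):
                best, changed = best - {k}, True
    return best
```

With a toy score that rewards a fixed target blanket, the search recovers that blanket.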

Page 27

Final optimization phase

In the second phase, the original problem is solved with respect to the reduced model space $\mathcal{G}_\vee$ from $E_\vee$, which is in general considerably smaller than $\mathcal{G}$.

Considering the first phase solution as a prescan that identifies eligible edges, another hill-climbing procedure is applied on $E_\vee$ that in each iteration chooses the highest scoring neighboring graph structure (differing by 1 edge).

The collection of neighboring graph structures is denoted by $\mathcal{N}_{\mathcal{G}}(G)$.

Because of the variable-wise factorization, local edge changes in this second phase cause a recalculation of the score for only two variables, meaning that each iteration can be carried out efficiently by caching the edge-wise score differences.

Page 28

Marginal Pseudo-Likelihood (MPL)

Algorithm 2: Global hill-climbing algorithm.

Algorithm 2: Global hill-climbing algorithm.

    G, Ĝ ← ∅
    while Ĝ has changed do
        G ← Ĝ
        foreach G' ∈ 𝒩_𝒢(G) do
            if p(X_V | G') > p(X_V | Ĝ) then
                Ĝ ← G'
    return Ĝ

Page 29

Summary of MPL

Pseudo-likelihood is used to decompose the score into dvariable-wise scores.

The score can be defined similarly to the Bayesian–Dirichletscore, allowing learning of general, non-chordal Markovnetworks.

Can naturally break the graph optimization into dsubproblems.

Smart search strategy utilizing two sequential hill-climbingalgorithms.

Page 30

References

Besag, J. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society, Series D (The Statistician), 24:179–195, 1975.

Blalock, H. M. Causal Models in the Social Sciences. Macmillan, London, 1971.

Gibbs, J. W. Elementary Principles in Statistical Mechanics Developed with Especial Reference to the Rational Foundation of Thermodynamics. Yale University Press, 1902.

Heckerman, D., Geiger, D. and Chickering, D. M. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, 20:197–243, 1995.

Koller, D., Friedman, N., Getoor, L. and Taskar, B. Graphical models in a nutshell. In: Getoor, L. and Taskar, B. Introduction to Statistical Relational Learning. MIT Press, 2007.

Page 31

References

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Lauritzen, S. L. and Spiegelhalter, D. J. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B (Methodological), 50:157–224, 1988.

Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.

Pensar, J., Nyman, H., Niiranen, J. and Corander, J. Marginal Pseudo-Likelihood Learning of Discrete Markov Network Structures. Bayesian Analysis, 12(4):1195–1215, 2017.

Wold, H. Causality and Econometrics. Econometrica, 22:162–177, 1954.

Wright, S. Correlation and Causation. Journal of Agricultural Research, 20:557–585, 1921.
