Source: folk.uio.no/geirs/STK9200/Juri_Graphical.pdf

Page 1

Probabilistic graphical models

Juri Kuronen

September 5, 2019

Page 2

Introduction to graphical models

Koller et al. (2007):

Graphical models are an elegant framework to compactly represent complex, real-world phenomena.

In a nutshell, graphical models combine dealing with uncertainty through the use of probability theory and coping with complexity through the use of graph theory.

The framework is general: many common statistical models, such as hidden Markov models, Ising models etc., can be described as graphical models.

Page 3

Introduction to graphical models

At a high level, the goal is to efficiently represent a joint distribution over some set of random variables $X_V = \{X_1, \ldots, X_d\}$, here assumed to be discrete and taking values in $\mathcal{X}_V = \prod_{j=1}^{d} \mathcal{X}_j$.

Even in the simplest case where the variables are binary-valued, a joint distribution requires the specification of $2^d$ numbers.

However, it is often the case that there is some structure in the distribution that allows us to factor the representation of the distribution into modular components.

The structure that graphical models exploit is the independence properties that exist in many real-world phenomena.

Page 4

Introduction to graphical models

The two common types of graphical models are Bayesian networks (aka directed graphical models, belief networks or causal networks) and Markov networks (aka undirected graphical models or Markov random fields).

The structure of a graphical model is presented by a graph, which compactly encodes the dependence relations between the variables.

A set of numerical parameters over this structure then specifies the joint distribution of the model.

Page 5

A brief history

Gibbs (1902) in statistical physics: representing the distribution of a system of interacting particles.

Wright (1921) in genetics: studying inheritance in natural species; Wright's work was later utilized in economics (Wold, 1954) and social sciences (Blalock, 1971).

Breakthrough in the 80's with proper theoretical development by Lauritzen and Spiegelhalter (1988): "Local computations with probabilities on graphical structures and their application to expert systems".

Bayesian network framework by Pearl (1988), soon utilized in expert systems, e.g. Pathfinder by Heckerman et al. (1992), a system that assists surgical pathologists with the diagnosis of lymph-node diseases.

Since then applied to a large number of fields, including bioinformatics, social science, control theory, image processing, marketing analysis, etc.

Page 6

Graphs in the context of graphical models

In mathematics, and more specifically in graph theory, a graph consists of a set of objects called vertices or nodes and a set of edges linking pairs of vertices which are in some sense related.

Formally, a graph is a pair $G = (V, E)$, where $V = \{1, \ldots, d\}$ is the vertex set and $E \subset V \times V$ is the edge set.

In the context of graphical models, $G$ is a graph over $X_V$, where each node in $V$ corresponds to a random variable in the set $X_V$ and the edges $E$ represent dependencies between the variables.
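As a small illustration (not part of the original slides; the `neighbors` helper is my own), the pair $G = (V, E)$ maps directly onto basic Python data structures:

```python
# A graph G = (V, E): a vertex set and a set of unordered edges.
# Vertices are labeled 1..d as in the slides; the numbers are arbitrary.
d = 4
V = set(range(1, d + 1))
E = {frozenset(e) for e in [(1, 2), (1, 3), (2, 4), (3, 4)]}

def neighbors(j, edges):
    """Vertices sharing an edge with j."""
    return {k for e in edges if j in e for k in e if k != j}
```

For example, `neighbors(1, E)` returns `{2, 3}`.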

Page 7

Bayesian networks

Figure: An example of a directed (acyclic) graph over $X_1, \ldots, X_6$ (edges: $X_1 \to X_2$, $X_2 \to X_3$, $X_1 \to X_4$, $X_3 \to X_4$, $X_3 \to X_5$, $X_5 \to X_6$).

Bayesian networks:

Useful for modeling probabilistic influence between variables that have clear directionality.

Often used to represent causal relationships.

For example hidden Markov models or neural networks can be considered special cases of Bayesian networks.

Page 8

Factorization in Bayesian networks

Figure: An example of a directed (acyclic) graph over $X_1, \ldots, X_6$ (edges: $X_1 \to X_2$, $X_2 \to X_3$, $X_1 \to X_4$, $X_3 \to X_4$, $X_3 \to X_5$, $X_5 \to X_6$).

Factorization in Bayesian networks:

$p(X_1, \ldots, X_6) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_1, X_2) \cdots p(X_6 \mid X_1, \ldots, X_5)$

$= p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_2)\, p(X_4 \mid X_1, X_3)\, p(X_5 \mid X_3)\, p(X_6 \mid X_5)$

$= \prod_{j=1}^{6} p(X_j \mid X_{\mathrm{parents}(j)})$
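The chain-rule factorization above can be turned into a few lines of code. A hedged sketch (the network, CPT layout and numbers below are invented for illustration, using a smaller chain $X_1 \to X_2 \to X_3$):

```python
# Joint probability of a Bayesian network: p(x) = prod_j p(x_j | x_parents(j)).
# Toy chain X1 -> X2 -> X3 with binary states; the CPT numbers are invented.
parents = {1: (), 2: (1,), 3: (2,)}
# cpt[j][parent_states][x_j] = p(x_j | x_parents(j) = parent_states)
cpt = {
    1: {(): {0: 0.6, 1: 0.4}},
    2: {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    3: {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.1, 1: 0.9}},
}

def joint(x):
    """p(x) for a full assignment x = {variable: state}."""
    p = 1.0
    for j, pa in parents.items():
        p *= cpt[j][tuple(x[k] for k in pa)][x[j]]
    return p
```

Summing `joint` over all $2^3$ assignments gives 1, as required of a distribution.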

Page 9

Markov networks

Figure: An example of an undirected graph over $X_1, \ldots, X_7$ (edges: $X_1$–$X_2$, $X_1$–$X_3$, $X_2$–$X_4$, $X_3$–$X_4$, $X_4$–$X_5$, $X_4$–$X_6$, $X_5$–$X_6$, $X_5$–$X_7$, $X_6$–$X_7$).

Markov networks:

Useful in modeling phenomena where one cannot naturally ascribe a directionality to the interactions between variables.

The interactions can be thought of as more like "correlations".

Simpler to understand independence structure and factorization, but computationally more demanding.

Page 10

Markov properties in Markov networks

Figure: An example of an undirected graph over $X_1, \ldots, X_7$ (edges: $X_1$–$X_2$, $X_1$–$X_3$, $X_2$–$X_4$, $X_3$–$X_4$, $X_4$–$X_5$, $X_4$–$X_6$, $X_5$–$X_6$, $X_5$–$X_7$, $X_6$–$X_7$).

The edge set $E$ of a Markov network implies the following Markov properties (in the context of Markov networks, the set of neighbors of node $j$ is called the Markov blanket of $j$, denoted by $\mathrm{mb}(j)$):

Pairwise Markov property: $X_j \perp\!\!\!\perp X_{j'} \mid X_{V \setminus \{j, j'\}}$ for all $(j, j') \notin E$.

Local Markov property: $X_j \perp\!\!\!\perp X_{V \setminus (\{j\} \cup \mathrm{mb}(j))} \mid X_{\mathrm{mb}(j)}$ for all $j \in V$.

Global Markov property: $X_A \perp\!\!\!\perp X_B \mid X_S$ for all disjoint subsets $A, B, S \subset V$ for which $S$ separates $A$ from $B$.
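The global Markov property rests on graph separation, which is easy to check mechanically: $S$ separates $A$ from $B$ if every path between them passes through $S$. A sketch (the function and adjacency encoding are mine; the graph is the 7-node example from these slides):

```python
from collections import deque

def separates(S, A, B, adj):
    """True if the set S blocks every path between A and B in the undirected graph adj."""
    S, targets = set(S), set(B)
    start = set(A) - S
    seen, queue = set(start), deque(start)
    while queue:
        u = queue.popleft()
        if u in targets:
            return False
        for v in adj[u]:
            if v not in S and v not in seen:
                seen.add(v)
                queue.append(v)
    return True

# Adjacency of the 7-node example graph, read off the clique factorization
# phi(X1,X2) phi(X1,X3) phi(X2,X4) phi(X3,X4) phi(X4,X5,X6) phi(X5,X6,X7).
adj = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3, 5, 6},
       5: {4, 6, 7}, 6: {4, 5, 7}, 7: {5, 6}}
```

For this graph, `separates({4}, {1, 2, 3}, {5, 6, 7}, adj)` is `True`, matching $X_{\{1,2,3\}} \perp\!\!\!\perp X_{\{5,6,7\}} \mid X_4$.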

Page 11

Factorization in Markov networks

The independence properties of the joint probability distribution $p(X_V)$ encoded in graph $G$ imply that $p(X_V)$ factorizes as a product of potential functions over the structure of $G$:

$p(X_V) = \frac{1}{Z} \prod_{C \in \mathcal{C}(G)} \phi_C(X_C),$

where $\mathcal{C}(G)$ contains the cliques of graph $G$, $\phi_C(X_C)$ is a function taking real values and $Z$ is a normalizing constant, called the partition function, defined as

$Z = \sum_{x \in \mathcal{X}} \prod_{C \in \mathcal{C}(G)} \phi_C(x_C).$

Inference in general unstructured Markov networks is difficult due to the intractability of the partition function $Z$.
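To make the partition function concrete, here is a brute-force evaluation on a toy binary model (the potential values are invented). The sum over $\mathcal{X}$ has exponentially many terms, which is exactly the intractability referred to above:

```python
from itertools import product

# Toy pairwise Markov network over binary X1, X2, X3 with cliques {1,2} and {2,3}.
# Potential values are invented for illustration.
phi = {
    (1, 2): {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0},
    (2, 3): {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0},
}

def unnormalized(x):
    """Product of clique potentials for a full outcome x = (x1, x2, x3)."""
    return phi[(1, 2)][(x[0], x[1])] * phi[(2, 3)][(x[1], x[2])]

# The partition function sums the clique products over the whole outcome space.
Z = sum(unnormalized(x) for x in product((0, 1), repeat=3))

def p(x):
    return unnormalized(x) / Z
```

Here the outcome space has only $2^3$ elements; for $d$ binary variables the sum has $2^d$ terms.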

Page 12

Factorization in Markov networks

Figure: An example of an undirected graph over $X_1, \ldots, X_7$ (edges: $X_1$–$X_2$, $X_1$–$X_3$, $X_2$–$X_4$, $X_3$–$X_4$, $X_4$–$X_5$, $X_4$–$X_6$, $X_5$–$X_6$, $X_5$–$X_7$, $X_6$–$X_7$).

Factorization in Markov networks:

$p(X_1, \ldots, X_7) = \frac{1}{Z}\, \phi(X_1, X_2)\, \phi(X_1, X_3)\, \phi(X_2, X_4)\, \phi(X_3, X_4)\, \phi(X_4, X_5, X_6)\, \phi(X_5, X_6, X_7)$

Page 13

Structure learning of graphical models

Although the ultimate goal of a graphical model is to efficiently represent a multivariate distribution, the graph alone is also useful for gaining insight into complex dependency patterns among large collections of variables.

The structure learning problem is hard: the space of graphs is a combinatorial space consisting of a superexponential number of structures, $2^{O(d^2)}$.

For example, the number of undirected graph structures with 10 nodes is already 35 184 372 088 832.

Structure learning algorithms can generally be classified into two broad categories: they either use a constraint-based or a score-based approach to optimize the network topology.
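The superexponential count quoted above is $2^{\binom{d}{2}}$, one bit per possible edge; a quick stdlib check reproduces the 10-node figure:

```python
from math import comb

def n_undirected_graphs(d):
    """Number of labeled undirected graphs on d nodes: one bit per possible edge."""
    return 2 ** comb(d, 2)

print(n_undirected_graphs(10))  # 35184372088832
```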

Page 14

Constraint-based structure learning

Constraint-based approaches infer the structure through a series of statistical independence tests.

The local nature of the tests makes this an attractive approach from the computational scalability perspective.

However, a particular drawback is that the individual tests are sensitive to noise, which can result in incorrect independence assumptions.

For example, an edge discarded by an earlier test can no longer be recovered.

Page 15

Score-based structure learning

Score-based approaches operate globally by formulating the structure learning problem as an optimization problem.

This requires a scoring function by which the plausibility of the different candidates can be evaluated.

Additionally, this requires an optimization algorithm for finding high-scoring graphs, since an exhaustive evaluation is in general infeasible.

Page 16

Bayesian–Dirichlet score for Bayesian networks

Let's start by introducing a Bayesian approach for learning Bayesian networks (Heckerman et al., 1995).

In the Bayesian approach, we are of course interested in the posterior probability:

$p(G \mid X_V) = \frac{p(X_V \mid G)\, p(G)}{p(X_V)}.$

A natural approach would be to take the maximum likelihood of $p(X_V \mid G)$. However, we run into overfitting problems.

Instead, $p(X_V \mid G)$ is the marginal likelihood, evaluated as

$p(X_V \mid G) = \int p(X_V \mid \theta_G, G)\, p(\theta_G \mid G)\, d\theta_G.$

The model averaging in the marginal likelihood is a Bayesian way to deal with overfitting.

Page 17

Bayesian–Dirichlet score for Bayesian networks

With some assumptions, the integral in the marginal likelihood can be solved in closed form.

First, assume that the data is a multinomial sample from a Bayesian network such that we have

$\theta_{jkl} = p(x_j = l \mid x_{\mathrm{pa}(j)} = k)$

for all $l = 1, \ldots, r_j$ and $k = 1, \ldots, q_j$, where $r_j = |\mathcal{X}_j|$ and $q_j = |\mathcal{X}_{\mathrm{parents}(j)}|$.

We can then write

$p(x_V \mid \theta_G, G) = \prod_{j=1}^{d} \prod_{k=1}^{q_j} \prod_{l=1}^{r_j} \theta_{jkl}^{n_{jkl}}.$

Here, $n_{jkl}$ is the number of observations where $x_j$ is in state $l$ and its parents are in state $k$.
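The counts $n_{jkl}$ are simple sufficient statistics; a sketch of how they might be tallied (the data layout, 0-based variable indexing and names are my own):

```python
from collections import Counter

# parents[j] lists the parent indices of variable j; toy network X0 -> X1 (0-based).
parents = {0: (), 1: (0,)}
# Each row is one joint observation (x0, x1).
data = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

def counts(data, parents):
    """n[j][(k, l)] = number of rows with parents of j in state k and x_j = l."""
    n = {j: Counter() for j in parents}
    for row in data:
        for j, pa in parents.items():
            k = tuple(row[i] for i in pa)
            n[j][(k, row[j])] += 1
    return n
```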

Page 18

Bayesian–Dirichlet score for Bayesian networks

Three key assumptions are related to parameter independence (PI). The parameters associated with each variable (global PI) and the parameters associated with each state of the parents of a variable (local PI) are assumed independent. Finally, it's assumed that $p(\theta_{jk} \mid G_1) = p(\theta_{jk} \mid G_2)$ if $X_j$ has the same parents in graphs $G_1$ and $G_2$ (parameter modularity).

This allows us to write

$p(\theta_G \mid G) = \prod_{j=1}^{d} p(\theta_j \mid G) = \prod_{j=1}^{d} \prod_{k=1}^{q_j} p(\theta_{jk} \mid G),$

where $\theta_G = \bigcup_{j=1}^{d} \theta_j$, $\theta_j = \bigcup_{k=1}^{q_j} \theta_{jk}$ and $\theta_{jk} = \bigcup_{l=1}^{r_j} \theta_{jkl}$.

The PI assumptions don't always hold in practice, but they are required from a computational convenience perspective.

Page 19

Bayesian–Dirichlet score for Bayesian networks

Finally, assuming that $\theta_{jk} \sim \mathrm{Dirichlet}(\alpha_{jk1}, \ldots, \alpha_{jkr_j})$, we get

$p(X_V \mid G) = \prod_{j=1}^{d} \prod_{k=1}^{q_j} \frac{\Gamma(\alpha_{jk})}{\Gamma(\alpha_{jk} + n_{jk})} \prod_{l=1}^{r_j} \frac{\Gamma(\alpha_{jkl} + n_{jkl})}{\Gamma(\alpha_{jkl})}.$

Here $\alpha_{jk} = \sum_{l=1}^{r_j} \alpha_{jkl}$ and $n_{jk} = \sum_{l=1}^{r_j} n_{jkl}$.

Note that the above factorizes straightforwardly as $p(X_V \mid G) = \prod_{j=1}^{d} p(X_j \mid X_{\mathrm{parents}(j)}, G)$. Thus, we can write:

$p(G \mid X_V) = \mathrm{const} \cdot p(G) \cdot \prod_{j=1}^{d} p(X_j \mid X_{\mathrm{parents}(j)}, G).$

The constant is $\frac{1}{p(X_V)}$, which disappears when comparing graph structures.
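The closed-form expression is usually evaluated in log space via the log-gamma function. A per-family sketch under the simplifying assumption of a symmetric hyperparameter $\alpha_{jkl} = \alpha$ for all $l$ (the function name and data layout are mine):

```python
from math import lgamma, exp

def log_ml_family(counts_jk, alpha=1.0):
    """Log marginal likelihood contribution of one variable.
    counts_jk[k] is the list of counts n_jkl over states l, for parent state k.
    Assumes the symmetric hyperparameter alpha_jkl = alpha for all l."""
    total = 0.0
    for n_jkl in counts_jk.values():
        a_jk = alpha * len(n_jkl)          # alpha_jk = sum_l alpha_jkl
        n_jk = sum(n_jkl)                  # n_jk = sum_l n_jkl
        total += lgamma(a_jk) - lgamma(a_jk + n_jk)
        for n in n_jkl:
            total += lgamma(alpha + n) - lgamma(alpha)
    return total
```

For a single binary variable with no parents, $\alpha = 1$ and one observation of each state, this gives $\Gamma(2)/\Gamma(4) \cdot \Gamma(2)\Gamma(2)/(\Gamma(1)\Gamma(1)) = 1/6$.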

Page 20

Score-based structure learning of Markov networks

Factorizing the score into conditional distributions is not as straightforward for Markov networks due to the partition function.

For this reason, the earliest score-based methods were limited to models which constrained the underlying graph to be chordal, since such graphs can be perfectly represented by Bayesian networks.

However, the recent surge of pseudo-likelihood-based methods has made learning of general, non-chordal Markov network structures possible.

Page 21

Pseudo-Likelihood

In the pseudo-likelihood, introduced originally by Besag (1975), the joint probability of an outcome is replaced by a product of variable-wise conditional distributions:

$p(X_V \mid G) \approx \prod_{j=1}^{d} p(X_j \mid X_{V \setminus j}, G).$

Under certain assumptions, which generally hold if we assume that the data was generated from a Markov network, the pseudo-likelihood is a consistent estimator of the model parameters (Koller and Friedman, 2009).

The major advantage of this approximation is that the full conditional distributions for each variable have a surprisingly simple form:

$\prod_{j=1}^{d} p(X_j \mid X_{V \setminus j}, G) = \prod_{j=1}^{d} p(X_j \mid X_{\mathrm{mb}(j)}, G).$

Page 22

Derivation of $p(X_j \mid X_{V \setminus j}, G) = p(X_j \mid X_{\mathrm{mb}(j)}, G)$

Partition $\mathcal{C}(G)$ into two disjoint sets $\mathcal{C}_j = \{C \in \mathcal{C}(G) : j \in C\}$ and $\mathcal{C}_{\setminus j} = \{C \in \mathcal{C}(G) : j \notin C\}$. That is, $\mathcal{C}(G) = \mathcal{C}_j \cup \mathcal{C}_{\setminus j}$. Now,

$p(x_j \mid x_{V \setminus j}, G) = \frac{p(x_j, x_{V \setminus j} \mid G)}{\sum_{x_j} p(x_j, x_{V \setminus j} \mid G)} = \frac{p(x_V \mid G)}{\sum_{x_j} p(x_V \mid G)} = \frac{\frac{1}{Z} \left[\prod_{C \in \mathcal{C}_j} \phi(x_C)\right] \left[\prod_{C \in \mathcal{C}_{\setminus j}} \phi(x_C)\right]}{\frac{1}{Z} \sum_{x_j} \left[\prod_{C \in \mathcal{C}_j} \phi(x_C)\right] \left[\prod_{C \in \mathcal{C}_{\setminus j}} \phi(x_C)\right]} = \frac{\prod_{C \in \mathcal{C}_j} \phi(x_C)}{\sum_{x_j} \prod_{C \in \mathcal{C}_j} \phi(x_C)}$

The key observation is that the problematic normalizing constant $Z$ has disappeared, replaced with local normalizing constants that are easily computed. Notice also that the last expression contains only factors involving $X_j$ and its neighboring variables (the Markov blanket of $X_j$), allowing efficient computation. This, in passing, demonstrates the local Markov property.
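The cancellation can be verified numerically on a toy chain $X_1 - X_2 - X_3$ (made-up potentials): the conditional of $X_1$ computed from the full unnormalized joint coincides with the locally normalized product over the cliques containing $X_1$, and $X_3$ drops out entirely:

```python
# Chain Markov network X1 - X2 - X3 with hypothetical pairwise potentials.
phi12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi23 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0}

def joint_unnorm(x1, x2, x3):
    return phi12[(x1, x2)] * phi23[(x2, x3)]

def cond_full(x1, x2, x3):
    """p(x1 | x2, x3) from the full joint; the constant Z cancels in the ratio."""
    return joint_unnorm(x1, x2, x3) / sum(joint_unnorm(v, x2, x3) for v in (0, 1))

def cond_local(x1, x2):
    """The same conditional using only the clique {1, 2} containing X1,
    i.e. conditioning on the Markov blanket mb(1) = {2} alone."""
    return phi12[(x1, x2)] / sum(phi12[(v, x2)] for v in (0, 1))
```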

Page 23

Marginal Pseudo-Likelihood (MPL)

With the pseudo-likelihood, a similar formulation as in the Bayesian–Dirichlet score can be used for Markov network structure learning: the parameters $\theta_{jkl}$ are instead defined with respect to $p(x_j = l \mid x_{\mathrm{mb}(j)} = k)$.

Additionally, define the prior $p(G)$ in terms of mutually independent prior beliefs on the individual Markov blankets:

$\log p(G) = \sum_{j=1}^{d} \log p(\mathrm{mb}(j) \mid G).$

Now, the MPL score is (with some abuse of notation):

$\log p(G \mid X_V) = \sum_{j=1}^{d} \mathrm{MPL}(j \mid \mathrm{mb}(j)) = \sum_{j=1}^{d} \left[\log p(X_j \mid X_{\mathrm{mb}(j)}, G) + \log p(\mathrm{mb}(j) \mid G)\right].$

Page 24

MPL optimization problem

Since a graph can be defined by its collection of Markov blankets, we are left with the following optimization problem:

$\underset{\{\mathrm{mb}(j)\}_{j=1}^{d}}{\arg\max} \left[\sum_{j=1}^{d} \mathrm{MPL}(j \mid \mathrm{mb}(j))\right].$

For a globally consistent graph structure, the above is subject to $j' \in \mathrm{mb}(j) \Leftrightarrow j \in \mathrm{mb}(j')$ for all $j, j' \in V$.

Due to the vast discrete optimization space, the optimization problem is clearly intractable for large systems.

Page 25

Two-phase search algorithm

Pensar et al. (2017) utilize a two-phase search algorithm to solve the problem:

In the first phase, the restriction in the problem is relaxed, resulting in $d$ independent Markov blanket discovery problems.

Each Markov blanket can be learned using a greedy hill-climbing algorithm that is based on two basic operations.

At each iteration, the algorithm adds to the Markov blanket the node that induces the greatest score improvement.

Each addition step is interleaved with a deletion phase, where the algorithm instead chooses a node to delete if it results in a score improvement.

The $d$ solutions can be combined into a consistent solution as:

$E_\vee = \{(j, j') \in V \times V : j \in \mathrm{mb}(j') \vee j' \in \mathrm{mb}(j)\}.$
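Combining the $d$ blanket estimates with the OR-rule is a one-liner over edge sets; a sketch with hypothetical (and deliberately inconsistent) blanket estimates:

```python
def combine_or(mb):
    """E_or: keep edge {j, j'} if j' is in mb(j) OR j is in mb(j')."""
    return {frozenset((j, k)) for j, blanket in mb.items() for k in blanket}

# Hypothetical first-phase output for 4 variables; note mb(3) omits both 2 and 4.
mb = {1: {2}, 2: {1, 3}, 3: set(), 4: {3}}
```

The edges {2, 3} and {3, 4} survive even though mb(3) is empty, which is exactly the asymmetry the OR-rule resolves.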

Page 26

Local MPL optimization

Algorithm 1: Hill-climbing algorithm to approximately solve $\widehat{\mathrm{mb}}(j)$.

    mb(j), m̂b(j) ← ∅
    while m̂b(j) has changed do
        mb(j) ← m̂b(j)
        foreach j' ∈ V \ {mb(j) ∪ {j}} do
            if MPL(j | mb(j) ∪ j') > MPL(j | m̂b(j)) then
                m̂b(j) ← mb(j) ∪ j'
        while m̂b(j) has changed and |m̂b(j)| > 2 do
            mb(j) ← m̂b(j)
            foreach j' ∈ mb(j) do
                if MPL(j | mb(j) \ j') > MPL(j | m̂b(j)) then
                    m̂b(j) ← mb(j) \ j'
    return m̂b(j)
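Algorithm 1 can be sketched generically in Python with the MPL score abstracted into a callback (all names are mine; unlike the pseudocode above, this sketch accepts any improving move rather than the single best one, and omits the blanket-size condition):

```python
def greedy_blanket(j, V, score):
    """Greedy add/delete search for the Markov blanket of node j.
    score(j, mb) stands in for MPL(j | mb); higher is better."""
    best = frozenset()
    changed = True
    while changed:
        changed = False
        # Addition step: try adding each node outside the current blanket.
        for k in V - best - {j}:
            if score(j, best | {k}) > score(j, best):
                best, changed = best | {k}, True
        # Deletion step: try removing each node currently in the blanket.
        for k in set(best):
            if score(j, best - {k}) > score(j, best):
                best, changed = best - {k}, True
    return best
```

With a toy score that rewards a fixed target blanket, the search recovers that blanket.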

Page 27

Final optimization phase

In the second phase, the original problem is solved with respect to the reduced model space $\mathcal{G}_\vee$ from $E_\vee$, which is in general considerably smaller than $\mathcal{G}$.

Considering the first phase solution as a prescan that identifies eligible edges, another hill-climbing procedure is applied on $E_\vee$ that in each iteration chooses the highest scoring neighboring graph structure (differing by 1 edge).

The collection of neighboring graph structures is denoted by $\mathcal{N}_{\mathcal{G}}(G)$.

Because of the variable-wise factorization, local edge changes in this second phase cause a recalculation of the score for only two variables, meaning that each iteration can be carried out efficiently by caching the edge-wise score differences.

Page 28

Marginal Pseudo-Likelihood (MPL)

Algorithm 2: Global hill-climbing algorithm.

Algorithm 2: Global hill-climbing algorithm.

    G, Ĝ ← ∅
    while Ĝ has changed do
        G ← Ĝ
        foreach G' ∈ 𝒩_𝒢(G) do
            if p(X_V | G') > p(X_V | Ĝ) then
                Ĝ ← G'
    return Ĝ

Page 29

Summary of MPL

Pseudo-likelihood is used to decompose the score into dvariable-wise scores.

The score can be defined similarly to the Bayesian–Dirichletscore, allowing learning of general, non-chordal Markovnetworks.

Can naturally break the graph optimization into dsubproblems.

Smart search strategy utilizing two sequential hill-climbingalgorithms.

Page 30

References

Besag, J. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society, Series D (The Statistician), 24:179–195, 1975.

Blalock, H. M. Causal Models in the Social Sciences. Macmillan, London, 1971.

Gibbs, J. W. Elementary Principles in Statistical Mechanics Developed with Especial Reference to the Rational Foundation of Thermodynamics. Yale University Press, 1902.

Heckerman, D., Geiger, D. and Chickering, D. M. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, 20:197–243, 1995.

Koller, D., Friedman, N., Getoor, L. and Taskar, B. Graphical models in a nutshell. In: Getoor, L. and Taskar, B. Introduction to Statistical Relational Learning. MIT Press, 2007.

Page 31

References

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Lauritzen, S. L. and Spiegelhalter, D. J. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B (Methodological), 50:157–224, 1988.

Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.

Pensar, J., Nyman, H., Niiranen, J. and Corander, J. Marginal Pseudo-Likelihood Learning of Discrete Markov Network Structures. Bayesian Analysis, 12(4):1195–1215, 2017.

Wold, H. Causality and Econometrics. Econometrica, 22:162–177, 1954.

Wright, S. Correlation and Causation. Journal of Agricultural Research, 20:557–585, 1921.
