Three Concepts: Probability © Henry Tirri, Petri Myllymäki 1998-2006
Learning Bayesian Networks
23.02.2006
p(x | θ_s, S^h) = ∏_{i=1}^n p(x_i | pa_i, θ_i, S^h)

joint distribution function (left-hand side); local distribution functions (the factors on the right)

S^h is the hypothesis that structure S (minimal) encodes the joint distribution function.
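As a concrete sketch of this factorization, a two-node network X1 → X2 over binary variables can be evaluated as a product of its local distribution functions. The parameter values below are assumptions chosen for illustration, not from the slides:

```python
# Hypothetical binary network S: X1 -> X2, illustrating
# p(x | theta_s, S^h) = prod_i p(x_i | pa_i, theta_i, S^h).

theta1 = 0.6       # assumed p(X1 = true)
theta2_g1 = 0.8    # assumed p(X2 = true | X1 = true)
theta2_gn1 = 0.3   # assumed p(X2 = true | X1 = false)

def local(p_true, value):
    """Local distribution function for a binary node."""
    return p_true if value else 1.0 - p_true

def joint(x1, x2):
    """Joint distribution = product of the local distribution functions."""
    return local(theta1, x1) * local(theta2_g1 if x1 else theta2_gn1, x2)

# Sanity check: the four joint probabilities sum to one.
total = sum(joint(a, b) for a in (True, False) for b in (True, False))
```

Any Bayesian network factors this way; only the set of parents pa_i changes from structure to structure.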
Example

S: X1 → X2,   θ_s = {θ_1, θ_{2|1}, θ_{2|¬1}}

p(x_1, ¬x_2 | θ_s, S^h) = p(x_1 | θ_1, S^h) p(¬x_2 | x_1, θ_{2|1}, S^h) = θ_1 (1 − θ_{2|1})

p(x_1, x_2 | θ_s, S^h) = p(x_1 | θ_1, S^h) p(x_2 | x_1, θ_{2|1}, S^h) = θ_1 θ_{2|1}
Updating or "Learning" Parameters (S^h is certain, θ_s is uncertain)

Let D = {x_1, …, x_m} be a random sample from p(x | θ_s, S^h).

p(θ_s | D, S^h) ∝ p(θ_s | S^h) p(D | θ_s, S^h) = p(θ_s | S^h) ∏_{l=1}^m p(x_l | θ_s, S^h)

Prediction averages over the posterior:

p(x_{m+1} | D, S^h) = ∫ p(x_{m+1} | θ_s, D, S^h) p(θ_s | D, S^h) dθ_s
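For a single binary parameter with a conjugate Beta prior, both the posterior and the predictive integral have closed forms. The prior and sample below are assumptions for illustration:

```python
# Sketch: Beta-Bernoulli conjugate update for one binary parameter theta,
# as in p(theta | D, S^h) ∝ p(theta | S^h) * prod_l p(x_l | theta, S^h).

alpha, beta = 1.0, 1.0         # assumed Beta(1, 1) prior (uniform)
D = [True, True, False, True]  # assumed observed sample

n_true = sum(D)
n_false = len(D) - n_true

# Conjugacy: the posterior is Beta(alpha + n_true, beta + n_false).
post_a, post_b = alpha + n_true, beta + n_false

# The predictive p(x_{m+1} = true | D) integrates out theta
# and equals the posterior mean of a Beta distribution.
p_next_true = post_a / (post_a + post_b)
```

With more data, the posterior concentrates and the predictive approaches the empirical frequency.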
Example

S: X1 → X2,   θ_s = {θ_1, θ_{2|1}, θ_{2|¬1}}

(Figure: plate-style diagram — parameter nodes Θ_1, Θ_{2|1}, Θ_{2|¬1} are shared across samples; each sample (sample 1, sample 2, …) has its own nodes X1, X2.)
Exact computation of p(θ_s|D,S^h)

• No missing data, no hidden variables
• Independent parameters
• Local distribution functions from the exponential family, conjugate priors

p(θ_s | S^h) = ∏_{i=1}^n p(θ_i | S^h)
Parameter Independence

(Figure: the same plate diagram — Θ_1, Θ_{2|1}, Θ_{2|¬1} appear as mutually independent parameter nodes feeding the X1, X2 nodes of every sample.)
No Missing Data

Each parameter can be updated independently:

p(θ_{2|1} | N^1_{2|1}, N^2_{2|1}, S^h) ∝ p(θ_{2|1}) θ_{2|1}^{N^1_{2|1}} (1 − θ_{2|1})^{N^2_{2|1}}
Updating Parameters

S: X1 → X2

Data (D):  X1     X2
x1:        true   false
x2:        false  true
x3:        true   true
x4:        true   true
x5:        false  true
x6:        true   false

p(θ_1 | D) ∝ p(θ_1) θ_1^4 (1 − θ_1)^2
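Counting the slide's six cases reproduces the exponents in these likelihoods; a minimal sketch:

```python
# The slide's data set D, as (X1, X2) pairs.
D = [
    (True,  False),  # x1
    (False, True),   # x2
    (True,  True),   # x3
    (True,  True),   # x4
    (False, True),   # x5
    (True,  False),  # x6
]

# Counts for theta_1 = p(X1 = true): likelihood theta_1^4 (1 - theta_1)^2.
n1_true = sum(1 for x1, _ in D if x1)
n1_false = len(D) - n1_true

# Counts for theta_{2|1}: restrict to the cases with X1 = true.
n2_true_g1 = sum(1 for x1, x2 in D if x1 and x2)
n2_false_g1 = sum(1 for x1, x2 in D if x1 and not x2)
```

These counts are exactly the sufficient statistics that enter the Beta posterior updates on the following slides.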
Updating Parameters

Data (D):  X1     X2
x1:        true   false
x2:        false  true
x3:        true   true
x4:        true   true
x5:        false  true
x6:        true   false

p(θ_{2|1} | D) ∝ p(θ_{2|1}) θ_{2|1}^2 (1 − θ_{2|1})^2
Updating Parameters

Data (D):  X1     X2
x1:        true   false
x2:        false  true
x3:        true   true
x4:        true   true
x5:        false  true
x6:        true   false

p(θ_{2|¬1} | D) ∝ p(θ_{2|¬1}) θ_{2|¬1}^2 (1 − θ_{2|¬1})^0
Conjugate Priors

Beta(θ_1 | α_1)
Beta(θ_{2|1} | α_{2|1})
Beta(θ_{2|¬1} | α_{2|¬1})

(Figure: the plate diagram again, with a Beta prior attached to each parameter node Θ_1, Θ_{2|1}, Θ_{2|¬1}.)
Missing Data

Data:  x1: ?  true   (X1 unobserved, X2 = true)

p(θ_{2|1} | x) = p(θ_{2|1} | x_1, x_2) p(x_1 | x_2) + p(θ_{2|1} | ¬x_1, x_2) p(¬x_1 | x_2)

The two posterior terms are Beta distributions; p(x_1 | x_2) and p(¬x_1 | x_2) are the mixing coefficients.

For real problems, exact calculations are intractable.
Approximate computation of p(θ_s|D,S^h)

• Monte Carlo: Gibbs sampling, importance sampling (e.g., Neal 93)
• Gaussian approximation (e.g., Kass et al. 88)
• MAP: θ̃_s, the maximum of the posterior; as m → ∞,

p(θ_s | D, S^h) → c · p(D | θ̃_s, S^h) e^{−(θ_s − θ̃_s)^t A_s (θ_s − θ̃_s)/2}
Gibbs Sampling
Geman and Geman (1984)

Basic idea: Given X = {x_1, …, x_n} and p(X), estimate E(f(X)) as follows:

• Initialize X somehow
• Repeat
  - For i = 1, …, n: sample each x_i according to p(x_i | X \ x_i)
  - Compute f(X) using current values of X
• Average of f(X) → E(f(X)), regardless of the initial X.
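A minimal Gibbs sampler for a toy joint over two binary variables follows this recipe directly. The joint probabilities and f(X) = x1·x2 are assumptions chosen so the answer is known in closed form:

```python
import random

# Toy Gibbs sampler: estimate E[f(X)] = p(x1=1, x2=1) for an assumed joint.
random.seed(0)

# Assumed joint over (x1, x2).
p = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.2, (0, 0): 0.3}

def cond(i, other):
    """Full conditional p(x_i = 1 | the other variable's current value)."""
    if i == 1:
        num, den = p[(1, other)], p[(1, other)] + p[(0, other)]
    else:
        num, den = p[(other, 1)], p[(other, 1)] + p[(other, 0)]
    return num / den

x1, x2 = 0, 0                 # initialize somehow
total, n_iter = 0, 20000
for _ in range(n_iter):
    x1 = 1 if random.random() < cond(1, x2) else 0   # sample x1 | x2
    x2 = 1 if random.random() < cond(2, x1) else 0   # sample x2 | x1
    total += x1 * x2          # f(X) = x1 * x2

estimate = total / n_iter     # approaches p(1, 1) = 0.4
```

In practice one discards an initial burn-in period; for this tiny chain the bias is negligible.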
Using Gibbs Sampling to Approximate p(θ|D,S^h)

• Initialize θ_s
• Repeat
  - Sample the missing data D_l (given D and θ_s), producing a completed data set D_c
  - Compute p(θ_s | D_c, S^h)
  - Sample θ_s from p(θ_s | D_c, S^h)
• Average of p(θ_s | D_c, S^h) → p(θ_s | D, S^h)
Gaussian Approximation
Tierney and Kadane (1986), Kass et al. (1988)

Basic idea: for large data sets, p(θ_s | D, S^h) will often be approximately Gaussian in θ_s, peaked around its maximum.
Gaussian Approximation

Define g(θ) ≡ log(p(D|θ) p(θ)). Expand g(θ) about its maximum θ̃:

g(θ) ≈ g(θ̃) − ½ (θ − θ̃)^t A (θ − θ̃),   A ≡ −g''(θ̃)

so that

p(D|θ) p(θ) = e^{g(θ)} ≈ p(D|θ̃) p(θ̃) e^{−½ (θ − θ̃)^t A (θ − θ̃)}
Computational Considerations

• Find θ̃:
  - Gradient methods
  - Monte-Carlo methods
  - EM, if p(D|θ_s,S^h) is in the exponential family
• Compute g''(θ̃):
  - Numerical methods (Meng and Rubin 91)
  - Likelihood ratio tests (e.g., Raftery 94)
  - Via inference, if variables discrete (Thiesson 95)
Expectation-Maximization (EM) Algorithm (Dempster et al. (1977))

• Initialize parameters
• Expectation step: compute the expected sufficient statistics, e.g.

  E(N_{12} | θ_s) = ∑_{l=1}^N p(x_1, x_2 | x_l, θ_s)

• Maximization step: choose parameters so as to maximize their posterior probability given the expected sufficient statistics, e.g.

  θ_{2|1} := (α_{12} + E(N_{12} | θ_s) − 1) / (α_{12} + α_{1¬2} + E(N_{12} | θ_s) + E(N_{1¬2} | θ_s) − 2)

• Iterate. Parameters will converge to a local MAP value.
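The two steps can be sketched for the running network S: X1 → X2 with one missing value of X1. The data set, the Beta(2, 2) priors, and the starting point below are all assumptions:

```python
# EM sketch for S: X1 -> X2 with one case missing X1.
# E-step: expected counts, weighting the missing case by p(x1 | x2, theta).
# M-step: MAP of each Beta posterior, (a + N - 1) / (2a + N_total - 2).

data = [(True, True), (True, False), (None, True), (False, True)]  # (x1, x2)
a = 2.0  # assumed pseudo-count of every Beta(a, a) prior

theta1, theta2_g1, theta2_gn1 = 0.5, 0.5, 0.5   # assumed starting point
for _ in range(50):
    c1 = c12 = c1n2 = cn12 = cn1n2 = 0.0
    for x1, x2 in data:
        if x1 is None:
            # Posterior weight that the hidden X1 is true, given X2.
            p1 = theta1 * (theta2_g1 if x2 else 1 - theta2_g1)
            p0 = (1 - theta1) * (theta2_gn1 if x2 else 1 - theta2_gn1)
            w = p1 / (p1 + p0)
        else:
            w = 1.0 if x1 else 0.0
        c1 += w
        if x2:
            c12 += w
            cn12 += 1 - w
        else:
            c1n2 += w
            cn1n2 += 1 - w
    theta1 = (a + c1 - 1) / (2 * a + len(data) - 2)
    theta2_g1 = (a + c12 - 1) / (2 * a + c12 + c1n2 - 2)
    theta2_gn1 = (a + cn12 - 1) / (2 * a + cn12 + cn1n2 - 2)
```

Each iteration increases the posterior, so the parameters settle at a local MAP value.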
Learning Structure (both S^h and θ_s are uncertain)

(Figure: two candidate structures S_a and S_b over X1, X2 — which one?)

model averaging:

p(x_{m+1} | D) = p(x_{m+1} | D, S_a^h) p(S_a^h | D) + p(x_{m+1} | D, S_b^h) p(S_b^h | D)
Learning Structure (both S^h and θ_s are uncertain)

p(S^h | D) ∝ p(S^h) p(D | S^h)

p(D | S^h) = ∫ p(D | θ_s, S^h) p(θ_s | S^h) dθ_s      (marginal likelihood)

Parameters updated as before.
Model Selection

• The number of possible structures for a given domain is more than exponential in the number of variables
• Solution: Use only one or a handful of models
• Necessary components:
  - Search method
  - Scoring method
One Reasonable Score: Posterior Probability of a Structure

p(S^h | D) ∝ p(S^h) p(D | S^h)

p(D | S^h) = ∫ p(D | θ_s, S^h) p(θ_s | S^h) dθ_s

structure prior: p(S^h);  parameter prior: p(θ_s | S^h);  likelihood: p(D | θ_s, S^h)
Relation to cross-validation (Dawid 84)

log p(D | S^h) = ∑_{l=1}^m log p(x_l | x_1, …, x_{l−1}, S^h)
             = log p(x_1 | S^h) + log p(x_2 | x_1, S^h) + log p(x_3 | x_1, x_2, S^h) + …

Obs! The last term in the sum is related to cross-validation.
Predictive Score
Spiegelhalter et al. (1993)

pred(S^h) = ∑_{l=1}^m log p(y_l | x_l, {x_1, y_1}, …, {x_{l−1}, y_{l−1}}, S^h)

(Figure: Y = disease; X1, X2, …, Xn = symptoms.)
Exact computation of p(D|S^h)

• No missing data, no hidden variables
• Independent parameters
• Local distribution functions from the exponential family, conjugate priors
• (Prior modularity)

p(θ_s | S^h) = ∏_{i=1}^n p(θ_i | S^h)
Prior Modularity

If Xi has the same parents in S1 and S2, then

p(θ_i | S_1^h) = p(θ_i | S_2^h)

That is, the parameter priors are modular: these quantities for Xi depend only on the structure that is local to Xi (i.e., Pai) and not on all of S.
Example: All Local Distributions Multinomial

Local distribution functions:

p(x_i^k | pa_i^j, θ_i, S^h) = θ_{ijk}

Conjugate prior:

p(θ_{ij} | S^h) ∝ ∏_k θ_{ijk}^{α_{ijk} − 1}
Bayesian Dirichlet Score
Cooper and Herskovits (1991)

p(D | S^h) = ∏_{i=1}^n ∏_{j=1}^{q_i} [ Γ(α_{ij}) / Γ(α_{ij} + N_{ij}) ] ∏_{k=1}^{r_i} [ Γ(α_{ijk} + N_{ijk}) / Γ(α_{ijk}) ]

N_{ijk}: # cases where X_i = x_i^k and Pa_i = pa_i^j
r_i: number of states of X_i
q_i: number of instances of the parents of X_i
α_{ij} = ∑_{k=1}^{r_i} α_{ijk},   N_{ij} = ∑_{k=1}^{r_i} N_{ijk}
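The score multiplies many Gamma-function ratios, so it is best computed in log space with `lgamma`. A sketch for the simplest case — one binary variable with no parents (q_i = 1, r_i = 2) — with an assumed uniform Dirichlet prior and the 4-true / 2-false counts used earlier:

```python
import math

def bd_term(alphas, counts):
    """One (i, j) factor of the BD score, in log space:
    Gamma(a)/Gamma(a+N) * prod_k Gamma(a_k + N_k)/Gamma(a_k)."""
    a, N = sum(alphas), sum(counts)
    log_p = math.lgamma(a) - math.lgamma(a + N)
    for a_k, N_k in zip(alphas, counts):
        log_p += math.lgamma(a_k + N_k) - math.lgamma(a_k)
    return log_p

# Assumed Beta(1, 1) prior; counts N_1 = 4 (true), N_2 = 2 (false).
log_ml = bd_term([1.0, 1.0], [4, 2])
```

For a full network the log-score is a sum of such terms over all i and j, which is what makes the score decomposable.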
Approximations for p(D|S^h)

• Monte Carlo
• Gaussian approximation:

log p(D | S^h) = log p(D | θ̃_s, S^h) + log p(θ̃_s | S^h) + (d/2) log(2π) − (1/2) log|A| + O(1/m)

• Bayesian Information Criterion (BIC):

log p(D | S^h) = log p(D | θ̂_s, S^h) − (d/2) log(m) + O(1)
Bayesian Information Criterion (BIC)
Schwarz (1978)

log p(D | S^h) ≈ log p(D | θ̂_s, S^h) − (d/2) log m

• No priors needed
• BIC = − approximation of stochastic complexity (Rissanen 87)
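A minimal numeric sketch of the BIC formula, again for a single binary variable (d = 1 free parameter) with the assumed 4-true / 2-false counts:

```python
import math

# BIC sketch: log p(D|S^h) ≈ log p(D | theta_hat, S^h) - (d/2) log m.
n_true, n_false = 4, 2          # assumed counts
m = n_true + n_false            # sample size
d = 1                           # one free parameter

theta_hat = n_true / m          # maximum-likelihood estimate
loglik = n_true * math.log(theta_hat) + n_false * math.log(1 - theta_hat)
bic = loglik - (d / 2) * math.log(m)
```

Note that, unlike the full marginal likelihood, this needs no prior: only the ML fit and a dimension penalty.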
Another Approximation
Cheeseman & Stutz (1995)

log p(D | S^h) ≈ log p_EM(D | S^h)

p_EM(D | S^h) ≈ ∏_{i=1}^n ∏_{j=1}^{q_i} [ Γ(α_{ij}) / Γ(α_{ij} + E(N_{ij})) ] ∏_{k=1}^{r_i} [ Γ(α_{ijk} + E(N_{ijk})) / Γ(α_{ijk}) ]

• Accuracy (CH96): Gaussian > CS >> BIC
• As efficient as BIC
Problems When There Are Many Possible Structures

• Prior assessment
• Search
Assumptions for Constructing Parameter Priors
Geiger and Heckerman (1995)

Relatively small # of assessments ⇒ parameter priors for all structures in a given domain
Assumptions That Simplify Assessment of Priors

• Parameter independence
• Prior modularity
• (Conjugate priors)
• Marginal likelihood equivalence
Distribution Equivalence

Suppose the local likelihoods are restricted to some family F (e.g., discrete, linear regression). Two network structures for X are distribution equivalent wrt F if they encode the same distributions on X.
Independence and Distribution Equivalence

S1, S2 distribution eqv wrt some F ⇒ S1, S2 independence eqv,
but the converse does not hold for all F. E.g., for sigmoid local likelihoods

p(x_i | pa_i, θ_i, S^h) = 1 / (1 + exp(a_i + ∑_{x_j ∈ pa_i} b_{ij} x_j))

(Figure: two independence-equivalent structures over X1, X2, X3) are not distribution equivalent.
Assumption: Covered-Arc-Reversal Equivalence

Given local likelihoods restricted to F, any two structures for X that differ only by a covered arc reversal are distribution equivalent.

Examples:
- unrestricted discrete
- linear regression (e.g., Shachter and Kenley)

S1, S2 independence eqv ⇒ S1, S2 distribution eqv wrt F
θ_x: Parameters for the Joint Likelihood

p(θ_x | S_c^h),   p(x | θ_x, S_c^h)

(Figure: the six complete structures over {X1, X2, X3}, one per variable ordering: X1 X2 X3, X1 X3 X2, X2 X1 X3, X2 X3 X1, X3 X1 X2, X3 X2 X1.)
θ_x: Discrete Case

p(x | θ_x, S_c^h) = p(x_1, …, x_n | θ_x, S_c^h) = θ_{x_1,…,x_n}

θ_{x_1,…,x_n} = ∏_{i=1}^n θ_{x_i | x_1,…,x_{i−1}}

∂θ_x / ∂θ_{s_c} = ∏_{i=1}^n [ θ_{x_i | x_1,…,x_{i−1}} ]^{(∏_{j=i+1}^n r_j) − 1}
θ_x: Linear-Regression Case

p(x | θ_x, S_c^h) = n(x | µ, W)

µ_i = m_i + ∑_{j=1}^{i−1} b_{ji} µ_j

W(i+1) = [ W(i) + b_{i+1} b_{i+1}^t / v_{i+1}    −b_{i+1} / v_{i+1} ]
         [ −b_{i+1}^t / v_{i+1}                   1 / v_{i+1}       ]
Discrete Case: Dirichlet Priors are Inevitable

Given parameter independence and likelihood equivalence:

p(θ_x | S_c^h) = c · ∏_{x_1,…,x_n} θ_{x_1,…,x_n}^{α·p(x_1,…,x_n | S_c^h) − 1}
Discrete Case

parameter independence + likelihood equivalence ⇓ (for X1 → X2 versus X1 ← X2)

f_1(θ_1) f_{2|1}(θ_{2|1}) f_{2|¬1}(θ_{2|¬1}) / (θ_1(1−θ_1)) = f_2(θ_2) f_{1|2}(θ_{1|2}) f_{1|¬2}(θ_{1|¬2}) / (θ_2(1−θ_2))
Assessments for the BDe Score

• p(θ_x | S_c^h) is Dirichlet
• α — equivalent sample size
• p(X_1, …, X_n | S_c^h) — prior Bayesian network for {X1, …, Xn}
BDe Score
Heckerman, Geiger, and Chickering (1994)

p(D | S^h) = ∏_{i=1}^n ∏_{j=1}^{q_i} [ Γ(α_{ij}) / Γ(α_{ij} + N_{ij}) ] ∏_{k=1}^{r_i} [ Γ(α_{ijk} + N_{ijk}) / Γ(α_{ijk}) ]

α_{ijk} = α · p(X_i = x_i^k, Pa_i = pa_i^j | S_c^h)

"Equivalent structures get equal scores."
Linear-Regression Case: Normal-Wishart Prior is Inevitable

Given parameter independence and likelihood equivalence:

p(θ_x | S_c^h) = nW(θ_x | µ_0, α_µ, W_0, α_W)
Combine Domain Knowledge and Data

Prior network + equivalent sample size (α) + data ⇒ priors for all structures ⇒ learned structure

(Figure: a prior network over X1, …, X4; a data table of cases x1, …, xN over X1, X2, X3, …, Xn; and the learned structure.)
Structure Priors

p(S^h | D) ∝ p(S^h, D) = p(S^h) p(D | S^h)

p(S^h) is the structure prior.
Structure Prior

p(S^h) ∝ κ^δ,   0 < κ < 1

where δ is the number of arcs that differ between S^h and the prior network.
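This prior is easy to compute once structures are encoded as arc sets; a small sketch with assumed arc sets (note that this simple encoding counts a reversed arc as two differences):

```python
# Structure-prior sketch: p(S^h) ∝ kappa^delta, where delta is the number
# of arcs on which S^h differs from the prior network.  Assumed values.

kappa = 0.5
prior_arcs = {(1, 2), (3, 4)}   # assumed prior network: X1->X2, X3->X4

def structure_prior(arcs):
    """Unnormalized prior; symmetric difference counts differing arcs."""
    delta = len(arcs ^ prior_arcs)
    return kappa ** delta

p_same = structure_prior({(1, 2), (3, 4)})   # delta = 0
p_one_off = structure_prior({(1, 2)})        # delta = 1
```

Because the prior factors over arcs, it combines cleanly with decomposable scores during search.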
Search Methods

The score decomposes into local terms:

score(S, D) = ∏_{i=1}^n s(X_i, Pa_i, D)
Special Case: Each Node Has at Most One Parent

score(S^h, D) = ∑_{i=1}^n log s(X_i, Pa_i, D) = ∑_{i=1}^n w(X_i, Pa_i, D) + ∑_{i=1}^n log s(X_i, ∅, D)

w(X_i, X_j, D) ≡ log s(X_i, X_j, D) − log s(X_i, ∅, D)
Each Node Has at Most One Parent

The network with the highest probability is the one for which Σ w(X_i, Pa_i, D) is a maximum.

• Scores without likelihood + prior equivalence: maximum branchings (Edmonds)
• Scores with likelihood + prior equivalence: maximum spanning tree
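In the equivalent-score case w(Xi, Xj, D) is symmetric, so a standard maximum-weight spanning tree (here via a small Prim-style sketch) finds the best one-parent-per-node network. The weights are made-up numbers for illustration:

```python
# Maximum-weight spanning tree over assumed symmetric weights w(Xi, Xj, D).

w = {
    0: {1: 0.9, 2: 0.1},
    1: {0: 0.9, 2: 0.5},
    2: {0: 0.1, 1: 0.5},
}

def max_spanning_tree(w):
    nodes = set(w)
    in_tree = {min(nodes)}      # start from an arbitrary (fixed) node
    edges = []
    while in_tree != nodes:
        # Pick the heaviest edge leaving the current tree.
        i, j = max(((i, j) for i in in_tree for j in nodes - in_tree),
                   key=lambda e: w[e[0]][e[1]])
        edges.append((i, j))
        in_tree.add(j)
    return edges

tree = max_spanning_tree(w)
```

Any orientation of the resulting tree gives a structure with the same (equivalent) score.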
General Case: Each Node Has at Most k Parents

Finding the best structure is NP-hard for k > 1
• Greedy (+/− restarts)
• Best-first search
• Simulated annealing
• Other Monte-Carlo approaches
Local Search

• Initialize the structure (prior structure, empty graph, max spanning tree, or random graph)
• Repeat:
  - Score all possible single changes
  - If any change improves the score, perform the best change
  - Otherwise, return the saved structure
• Extension: multiple restarts
Simulated Annealing (Metropolis)

• Initialize the structure (prior structure, empty graph, max spanning tree, or random graph)
• Repeat until it is time to quit:
  - Pick a random change and compute p = exp(Δscore/T)
  - Perform the change with probability p
• Return the saved structure
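The acceptance rule p = exp(Δscore/T) can be made concrete on a toy problem. Here "structures" are bit tuples marking the presence of three candidate arcs, and the score function is an assumption chosen just to have a known optimum:

```python
import math
import random

# Toy simulated annealing over 3-bit "structures"; score is assumed.
random.seed(1)

def score(s):
    # Assumed score with unique maximum at (1, 0, 0).
    return -((s[0] - 1) ** 2 + s[1] + s[2])

s = (0, 1, 1)        # initial structure
best = s
T = 2.0              # initial temperature
for step in range(500):
    i = random.randrange(3)
    cand = tuple(v ^ 1 if k == i else v for k, v in enumerate(s))  # flip one arc
    delta = score(cand) - score(s)
    # Always accept improvements; accept worsenings with prob exp(delta/T).
    if delta >= 0 or random.random() < math.exp(delta / T):
        s = cand
    if score(s) > score(best):
        best = s
    T *= 0.99        # cooling schedule
```

As T shrinks, worsening moves are accepted less often and the chain settles into a high-scoring structure.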
Evaluation Methodology

A gold standard network generates a random sample (data), possibly corrupted by noise (η); score + search, starting from a prior network, produces the learned network(s).

Two measures of utility of the learned network:
- Cross entropy (gold standard network, learned network)
- Structural difference
(Figure: gold standard network, prior network, data, and learned network; in the learned network, arcs are marked A (added), D (deleted), or R (reversed) relative to the gold standard.)
Search over equivalence classes
Spirtes and Meek (1995)

(Figure: gold standard network, a prior network with no arcs, data of 10,000 cases, and the learned network with one arc marked D.)
Learning with Hidden Variables

Special case of learning with missing data. E.g.: AutoClass (Cheeseman)

(Figure: hidden variable H (unobserved) with children X1, X2, …, Xn (observed).)
Identifying Hidden Variables

• Prior knowledge (e.g., AutoClass)
• Dependency cliques