Machine Learning, Chapter 6 CSE 574, Spring 2004

Bayesian Belief Networks

• Naïve Bayes assumes that the values of its attributes $a_1, \ldots, a_n$ are conditionally independent given the target value $v$

• Independence assumption is overly restrictive

• Bayesian Belief Networks provide an intermediate approach
  • less constraining than Naïve Bayes
  • more tractable than avoiding conditional independence assumptions altogether


Statistically dependent and independent variables


Bayesian Belief Network

• Describes the probability distribution governing a set of variables
  • by specifying a set of conditional independence assumptions
  • along with a set of conditional probabilities

• Allows conditional independence assumptions to apply to subsets of the variables

• Less constraining than the global assumption of conditional independence made by the Naïve Bayes classifier


Probability Distribution over a set of variables

• Random variables $Y_1, \ldots, Y_n$

• Each variable $Y_i$ can take on a set of possible values $V(Y_i)$

• The joint space of the variables is the cross-product $V(Y_1) \times V(Y_2) \times \ldots \times V(Y_n)$

• Each item in the joint space corresponds to one possible assignment of values to $\langle Y_1, \ldots, Y_n \rangle$

• A Bayesian Belief Network specifies the joint probability distribution over this joint space


Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z; that is

$$(\forall x_i, y_j, z_k)\;\; P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$$

which can be written more compactly as

$$P(X \mid Y, Z) = P(X \mid Z)$$
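A minimal numeric check of this definition: the sketch below builds a toy joint distribution in which X is conditionally independent of Y given Z by construction (all probability values are invented for illustration) and verifies both forms of the equality.

```python
# Toy joint distribution with X conditionally independent of Y given Z,
# built so that P(x, y, z) = P(z) P(x|z) P(y|z).
from itertools import product

p_z = {0: 0.3, 1: 0.7}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.6, 1: 0.4}}   # p_y_given_z[z][y]

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product([0, 1], repeat=3)}

def p(pred):
    """Sum the joint probability over all assignments satisfying pred."""
    return sum(pr for xyz, pr in joint.items() if pred(*xyz))

# Check P(X=x | Y=y, Z=z) == P(X=x | Z=z) for every assignment.
for x, y, z in product([0, 1], repeat=3):
    lhs = p(lambda a, b, c: (a, b, c) == (x, y, z)) / p(lambda a, b, c: (b, c) == (y, z))
    rhs = p(lambda a, b, c: (a, c) == (x, z)) / p(lambda a, b, c: c == z)
    assert abs(lhs - rhs) < 1e-12, (x, y, z)
print("P(X | Y, Z) = P(X | Z) holds for all assignments")
```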


Conditional Independence, continued

• Naïve Bayes
  • Assumes instance attribute $A_1$ is conditionally independent of instance attribute $A_2$ given the target value $V$
  • Allows Naïve Bayes to calculate

$$P(A_1, A_2 \mid V) = P(A_1 \mid A_2, V)\, P(A_2 \mid V) = P(A_1 \mid V)\, P(A_2 \mid V)$$
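A small illustration of what the assumption buys: the joint conditional over two attributes is obtained from the per-attribute conditionals alone, with no joint table over attribute combinations needed. The probabilities below are placeholders, not from the lecture.

```python
# Under conditional independence, P(A1, A2 | V) = P(A1 | V) * P(A2 | V).
p_a1_given_v = {'pos': 0.8, 'neg': 0.3}   # P(A1=True | V) -- illustrative
p_a2_given_v = {'pos': 0.6, 'neg': 0.1}   # P(A2=True | V) -- illustrative

for v in ('pos', 'neg'):
    joint_cond = p_a1_given_v[v] * p_a2_given_v[v]   # P(A1=T, A2=T | V=v)
    print(f"P(A1=T, A2=T | V={v}) = {joint_cond:.2f}")
```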


Bayesian Belief Network Example

• Boolean variables (present or absent)
  • Storm
  • BusTourGroup
  • Lightning
  • Campfire
  • Thunder
  • ForestFire

• Specify conditional probabilities between terms


Bayesian Belief Network

[Figure: an example network over variables Y1 through Y6, showing the conditional probability table for variable Y4]


Probabilities stored in Bayesian Network

• Parents($Y_i$) denotes the set of immediate predecessors of $Y_i$ in the network

• $P(y_i \mid \text{Parents}(Y_i))$ are the values stored in the conditional probability table associated with node $Y_i$


Representation

• A Bayesian network represents the joint probability distribution

• The joint probability for any desired assignment of values $\langle y_1, \ldots, y_n \rangle$ to the tuple of variables $\langle Y_1, \ldots, Y_n \rangle$ is computed by

$$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid \text{Parents}(Y_i))$$


Example for Joint Probability Calculation

[Figure: a network over variables A, B, X, C, D in which A and B are parents of X, and C and D are children of X; the calculation appears on the next slide]


Joint Probability Calculation

• $P(a_3, b_1, x_2, c_3, d_2)$
  $= P(a_3)\, P(b_1)\, P(x_2 \mid a_3, b_1)\, P(c_3 \mid x_2)\, P(d_2 \mid x_2)$
  $= 0.25 \times 0.6 \times 0.4 \times 0.5 \times 0.4 = 0.012$
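The same computation as a short sketch. Only the five CPT entries used above appear on the slide, so only this single assignment is evaluated.

```python
# The joint probability of a full assignment is the product of one CPT
# entry per variable: P(y1, .., yn) = prod_i P(yi | Parents(Yi)).
from math import prod

factors = [
    0.25,  # P(A = a3)
    0.60,  # P(B = b1)
    0.40,  # P(X = x2 | A = a3, B = b1)
    0.50,  # P(C = c3 | X = x2)
    0.40,  # P(D = d2 | X = x2)
]
print(f"P(a3, b1, x2, c3, d2) = {prod(factors):.3f}")  # 0.012
```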


Representation

• A Bayesian network allows causal knowledge to be represented
  • Lightning causes Thunder
  • Once we know the values of Storm and BusTourGroup, no additional information about Campfire is provided by Lightning and Thunder


Causal relationships

• State of an automobile
  • Temperature of engine
  • Pressure of brake fluid
  • Pressure of air in tires
  • Voltages in the wires

• Oil pressure and tire air pressure are not causally related; engine temperature and oil temperature are


Representation

• Arcs represent the assertion that a variable is conditionally independent of its non-descendants given its parents
  • Thunder is conditionally independent of the other variables given the value of Lightning
  • $P(T, L \mid F) = P(T \mid L, F)\, P(L \mid F)$ by the chain rule
    $= P(T \mid L)\, P(L \mid F)$ by the network


Inference Tasks

1. Infer the value of some target variable given the observed values of other variables
  • $P(F \mid S, B, L, C, T)$
    $= P(F, T, B \mid S, L, C)\, /\, P(T, B \mid S, L, C)$
    $= P(F \mid S, L, C)\, P(T, B \mid S, L, C)\, /\, P(T, B \mid S, L, C)$
2. Infer the probability distribution of the target variable
3. Infer a subset of variables when some other variables are known


Inference Efficiency

• Exact inference for arbitrary networks is NP-hard

• Approximate inference sacrifices precision to gain efficiency
  • e.g., Monte Carlo methods that randomly sample the unobserved variables (a rejection-sampling sketch follows below)

• Approximate methods can also be NP-hard, but are useful in many cases
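A minimal rejection-sampling sketch of Monte Carlo inference. It reuses the example variables Storm, BusTourGroup, and Campfire from earlier slides, but the network fragment and all prior/CPT numbers below are invented for illustration; they are not from the lecture.

```python
import random

def sample_network(rng):
    """Draw one full sample by sampling each node given its parents."""
    storm = rng.random() < 0.2       # P(Storm) -- assumed
    bus_tour = rng.random() < 0.5    # P(BusTourGroup) -- assumed
    # P(Campfire | Storm, BusTourGroup) -- assumed CPT
    p_campfire = {(True, True): 0.2, (True, False): 0.1,
                  (False, True): 0.8, (False, False): 0.3}[(storm, bus_tour)]
    campfire = rng.random() < p_campfire
    return {'Storm': storm, 'BusTourGroup': bus_tour, 'Campfire': campfire}

def rejection_query(target, evidence, n=100_000, seed=0):
    """Estimate P(target | evidence) by discarding samples that
    disagree with the evidence and counting hits among the rest."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n):
        s = sample_network(rng)
        if all(s[v] == val for v, val in evidence.items()):
            kept += 1
            hits += all(s[v] == val for v, val in target.items())
    return hits / kept if kept else float('nan')

# Should be close to the assumed CPT entry 0.1 in this tiny example.
print(rejection_query({'Campfire': True}, {'Storm': True, 'BusTourGroup': False}))
```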


Learning Bayesian Networks

• Learning Bayesian networks from training data

• Several different settings for this problem:

1. Network structure
  • given in advance, or
  • inferred from the training data

2. Network variables
  • observable in the training data, or
  • possibly unobservable


Learning Bayesian Networks

Network structure given in advance

1. Variables are fully observable in the training data
  • Estimate the conditional probability table entries just as for a Naïve Bayes classifier (a counting sketch follows below)

2. Only some variable values are observable
  • More difficult
  • Similar to learning the weights for hidden units in ANNs
    • ANNs have input and output values specified
    • Hidden unit values are learned from the training examples
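For the fully observable case, here is a minimal counting sketch of how each CPT entry becomes a conditional relative frequency, just as the per-attribute estimates in Naïve Bayes. The training examples are invented.

```python
from collections import Counter

# Fully observed training examples (hypothetical data).
data = [
    {'Storm': True,  'BusTourGroup': False, 'Campfire': False},
    {'Storm': True,  'BusTourGroup': True,  'Campfire': False},
    {'Storm': False, 'BusTourGroup': True,  'Campfire': True},
    {'Storm': False, 'BusTourGroup': True,  'Campfire': True},
    {'Storm': False, 'BusTourGroup': False, 'Campfire': False},
]

def estimate_cpt(data, child, parents):
    """Return P(child=True | parent values) as count ratios."""
    joint, marginal = Counter(), Counter()
    for d in data:
        key = tuple(d[p] for p in parents)
        marginal[key] += 1
        joint[key] += d[child]
    return {key: joint[key] / marginal[key] for key in marginal}

print(estimate_cpt(data, 'Campfire', ['Storm', 'BusTourGroup']))
```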


Gradient Ascent for learning probabilities

• Learns the entries in the conditional probability tables

• Gradient ascent searches through the space of all possible entries for the conditional probability tables

• The objective function maximized during the ascent is the probability $P(D \mid h)$ of the observed training data $D$ given the hypothesis $h$

• This corresponds to searching for the maximum likelihood hypothesis


Gradient Ascent Training of Bayesian Networks

• Network structure is given, but only some variables are observable

• Learns the entries in the conditional probability tables

• Searches through a space of hypotheses corresponding to the set of all possible entries for the conditional probability tables

• The objective function maximized is $P(D \mid h)$, the probability of the observed data given the hypothesis $h$


Gradient Ascent Training of Bayesian Networks

• Maximize $P(D \mid h)$ with respect to the parameters that define the conditional probability tables of the Bayesian network

• $w_{ijk}$ is a single entry in one of the tables in the known structure of the Bayesian network
  • $i$ = index of the variable $Y_i$
  • $j$ = index of a value of that variable
  • $k$ = index of an assignment of values to the variable's parents


Bayesian Network Gradient Ascent Notation

[Figure: the conditional probability table for the variable Campfire, whose parents are Storm and BusTourGroup]

$w_{ijk}$ is the probability that $Y_i$ will take on the value $y_{ij}$ given that its immediate parents $U_i$ take on the values given by $u_{ik}$

If $w_{ijk}$ is the top-right entry of the table, then $Y_i$ is the variable Campfire, $U_i$ is the parent tuple $\langle$Storm, BusTourGroup$\rangle$, $y_{ij} = True$, and $u_{ik} = \langle False, False \rangle$
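One way to picture the indexing, with the CPT stored as a nested table keyed by the parent assignment $u_{ik}$ and the child value $y_{ij}$. The probability values below are placeholders, not the lecture's table.

```python
# CPT for Y_i = Campfire: outer key is u_ik = (Storm, BusTourGroup),
# inner key is y_ij, and the stored value is w_ijk.
cpt_campfire = {
    (True,  True):  {True: 0.4, False: 0.6},
    (True,  False): {True: 0.1, False: 0.9},
    (False, True):  {True: 0.8, False: 0.2},
    (False, False): {True: 0.2, False: 0.8},
}

# The "top-right" entry from the slide: y_ij = True, u_ik = <False, False>.
w = cpt_campfire[(False, False)][True]
print(w)
```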


Gradient Ascent Training of Bayesian Networks

• Maximize $P(D \mid h)$ by following the gradient of $\ln P(D \mid h)$

• $w_{ijk}$ is a single entry in one of the tables in the Bayesian network

• It can be shown that each of these derivatives can be calculated as

$$\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P(Y_i = y_{ij},\, U_i = u_{ik} \mid d)}{w_{ijk}}$$


Gradient Ascent calculation example

[Figure: the conditional probability table for Campfire, as on the notation slide above]

To calculate the derivative of $\ln P(D \mid h)$ with respect to the upper-rightmost entry in the table, we have to calculate $P(\text{Campfire} = True, \text{Storm} = False, \text{BusTourGroup} = False \mid d)$ for each training example $d$ in $D$


Gradient Ascent Training

• When variables are unobservable for a training example $d$,
  • the required probability can be calculated from the observed variables in $d$ using standard Bayesian network inference

• The required calculations are easily performed during most Bayesian network inference
  • so learning can be performed at little additional cost whenever the Bayesian network is used for inference and new evidence is subsequently obtained


Derivation of Key Equation for Gradient Ascent Training

The key equation to be derived:

$$\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P(Y_i = y_{ij},\, U_i = u_{ik} \mid d)}{w_{ijk}}$$

Denoting $P(D \mid h)$ as $P_h(D)$:

$$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \frac{\partial}{\partial w_{ijk}} \ln \prod_{d \in D} P_h(d) = \sum_{d \in D} \frac{\partial \ln P_h(d)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial P_h(d)}{\partial w_{ijk}}$$


Derivation of Key Equation for Gradient Ascent Training of Bayesian Networks, continued

Introducing the values of $Y_i$ and its parents $U_i$ by the law of total probability, then factoring with the product rule:

$$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} \sum_{j', k'} P_h(d \mid y_{ij'}, u_{ik'})\, P_h(y_{ij'}, u_{ik'})$$

$$= \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}} \sum_{j', k'} P_h(d \mid y_{ij'}, u_{ik'})\, P_h(y_{ij'} \mid u_{ik'})\, P_h(u_{ik'})$$


Derivation of Key Equation for Gradient Ascent Training of Bayesian Networks, continued

Since $w_{ijk} = P_h(y_{ij} \mid u_{ik})$, only the term with $j' = j$ and $k' = k$ depends on $w_{ijk}$:

$$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}}\, P_h(d \mid y_{ij}, u_{ik})\, P_h(y_{ij} \mid u_{ik})\, P_h(u_{ik})$$

$$= \sum_{d \in D} \frac{1}{P_h(d)} \frac{\partial}{\partial w_{ijk}}\, P_h(d \mid y_{ij}, u_{ik})\, w_{ijk}\, P_h(u_{ik})$$

$$= \sum_{d \in D} \frac{1}{P_h(d)}\, P_h(d \mid y_{ij}, u_{ik})\, P_h(u_{ik})$$


Derivation of Key Equation for Gradient Ascent Training of Bayesian Networks, continued

Applying Bayes' theorem to $P_h(d \mid y_{ij}, u_{ik})$:

$$\frac{\partial \ln P_h(D)}{\partial w_{ijk}} = \sum_{d \in D} \frac{1}{P_h(d)} \frac{P_h(y_{ij}, u_{ik} \mid d)\, P_h(d)}{P_h(y_{ij}, u_{ik})}\, P_h(u_{ik})$$

$$= \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)\, P_h(u_{ik})}{P_h(y_{ij}, u_{ik})}$$

$$= \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{P_h(y_{ij} \mid u_{ik})}$$

$$= \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$$


Gradient Ascent Training Procedure

• Two-step procedure (sketched below)

1. Update each $w_{ijk}$ using the training data $D$:

$$w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$$

where $\eta$ is the learning rate

2. Renormalize the weights $w_{ijk}$ to assure that

$$0 \le w_{ijk} \le 1 \qquad \text{and} \qquad \sum_j w_{ijk} = 1$$
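A sketch of this two-step update for a single node's table, under the assumption that the posteriors $P_h(y_{ij}, u_{ik} \mid d)$ have already been computed by network inference for each training example; the numbers in the usage example are invented.

```python
def gradient_ascent_step(cpt, posteriors, eta=0.01):
    """cpt[k][j] holds w_ijk; posteriors[d][k][j] holds P_h(y_ij, u_ik | d)."""
    # Step 1: w_ijk <- w_ijk + eta * sum_d P_h(y_ij, u_ik | d) / w_ijk
    new = {k: {j: w + eta * sum(post[k][j] / w for post in posteriors)
               for j, w in row.items()}
           for k, row in cpt.items()}
    # Step 2: renormalize each row so that sum_j w_ijk = 1 (keeping each in [0, 1])
    return {k: {j: w / sum(row.values()) for j, w in row.items()}
            for k, row in new.items()}

cpt = {(False, False): {True: 0.5, False: 0.5}}        # one parent configuration
posteriors = [{(False, False): {True: 0.3, False: 0.1}}]  # one training example d
print(gradient_ascent_step(cpt, posteriors))
```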


Properties of Algorithm for Gradient Ascent Training of Bayesian Networks

• Converges to a locally maximum likelihood hypothesis
  • for the conditional probabilities in the Bayesian network

• Only finds a locally optimal solution

• Alternative to gradient ascent
  • when not all variables are observable
  • the EM algorithm, which also finds a locally maximum likelihood solution


Learning the Structure of Bayesian Networks

• Network structure is not known in advance

• A Bayesian scoring metric is used for choosing among alternative networks

• The heuristic search algorithm K2 learns network structure when the data is fully observable
  • a greedy search that trades network complexity for accuracy over the training data (a compact sketch follows below)

• Given 3000 training examples from a manually constructed Bayesian network containing 37 nodes and 46 arcs,
  • K2 was able to reconstruct the network almost exactly
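A compact sketch of the K2 idea, based on the published Cooper-Herskovits algorithm rather than anything shown in the lecture: nodes are processed in a fixed ordering, and each node greedily acquires the predecessor that most improves a Bayesian score (here the K2 metric in log form, via `lgamma`). The tiny dataset is invented.

```python
from math import lgamma
from collections import Counter

def log_k2_score(data, child, parents, values):
    """Log of the K2 (Cooper-Herskovits) score for one node and parent set."""
    r = len(values[child])
    n_jk, n_j = Counter(), Counter()
    for d in data:
        pa = tuple(d[p] for p in parents)
        n_j[pa] += 1
        n_jk[(pa, d[child])] += 1
    score = 0.0
    for pa in (n_j or {(): 0}):
        score += lgamma(r) - lgamma(n_j[pa] + r)   # (r-1)! / (N_j + r - 1)!
        score += sum(lgamma(n_jk[(pa, v)] + 1) for v in values[child])  # prod_k N_jk!
    return score

def k2(data, ordering, values, max_parents=2):
    """Greedy search: each node adds the predecessor that most improves
    its score, stopping when no single addition helps."""
    structure = {}
    for i, child in enumerate(ordering):
        parents, best = [], log_k2_score(data, child, [], values)
        while len(parents) < max_parents:
            candidates = [z for z in ordering[:i] if z not in parents]
            scored = [(log_k2_score(data, child, parents + [z], values), z)
                      for z in candidates]
            if not scored or max(scored)[0] <= best:
                break
            best, z = max(scored)
            parents.append(z)
        structure[child] = parents
    return structure

values = {'S': [0, 1], 'C': [0, 1]}
data = [{'S': 1, 'C': 1}, {'S': 1, 'C': 1}, {'S': 0, 'C': 0}, {'S': 0, 'C': 0}]
print(k2(data, ['S', 'C'], values))   # expect C to pick up parent S
```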