
Bayesian Networks

Tamara Berg

CS 560 Artificial Intelligence

Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Killian Weinberger, Deva Ramanan

Announcements

• HW3 released online this afternoon, due Oct 29, 11:59pm

• Demo by Mark Zhao!

Bayesian decision making

• Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E

• Inference problem: given some evidence E = e, what is P(X | e)?

• Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1,e1), …, (xn,en)}

Probabilistic inference

A general scenario:
• Query variables: X
• Evidence (observed) variables: E = e
• Unobserved variables: Y

If we know the full joint distribution P(X, E, Y), how can we perform inference about X?

$$P(X \mid E = e) \;=\; \frac{P(X, e)}{P(e)} \;=\; \frac{\sum_{y} P(X, e, y)}{P(e)}$$

Inference

• Inference: calculating some useful quantity from a joint probability distribution

• Examples:
  – Posterior probability: P(Q | E_1 = e_1, …, E_k = e_k)
  – Most likely explanation: argmax_q P(Q = q | e_1, …, e_k)

[Running example: the burglary network. B (Burglary) and E (Earthquake) are the parents of A (Alarm); A is the parent of J (JohnCalls) and M (MaryCalls).]

Inference – computing conditional probabilities

• Marginalization: P(X) = Σ_y P(X, y)
• Conditional probabilities: P(X | e) = P(X, e) / P(e)

Inference by Enumeration

• Given unlimited time, inference in BNs is easy
• Recipe:
  – State the marginal probabilities you need
  – Figure out ALL the atomic probabilities you need
  – Calculate and combine them
• Example (burglary network): P(+b | +j, +m) = P(+b, +j, +m) / P(+j, +m), where each term is a sum over the hidden variables E and A


Example: Enumeration

• In this simple method, we only need the BN to synthesize the joint entries
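For concreteness, here is a minimal Python sketch of enumeration on the burglary network; the CPT numbers are the usual textbook values (an assumption, they are not given on these slides):

```python
# Minimal sketch: inference by enumeration on the burglary network.
# CPT values below are the standard textbook numbers (assumed, not from the slides).
from itertools import product

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(+a | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(+j | A)
P_M = {True: 0.70, False: 0.01}                       # P(+m | A)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as a product of the local CPTs."""
    p = P_B[b] * P_E[e]
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

def query_B_given_jm():
    """P(B | +j, +m): sum all joint entries consistent with the evidence, then normalize."""
    unnormalized = {}
    for b in (True, False):
        unnormalized[b] = sum(joint(b, e, a, True, True)
                              for e, a in product((True, False), repeat=2))
    z = sum(unnormalized.values())
    return {b: p / z for b, p in unnormalized.items()}

print(query_B_given_jm())   # P(+b | +j, +m) is roughly 0.284 with these CPTs
```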


Probabilistic inference

A general scenario:
• Query variables: X
• Evidence (observed) variables: E = e
• Unobserved variables: Y

If we know the full joint distribution P(X, E, Y), how can we perform inference about X?

Problems:
• Full joint distributions are too large
• Marginalizing out Y may involve too many summation terms

$$P(X \mid E = e) \;=\; \frac{P(X, e)}{P(e)} \;=\; \frac{\sum_{y} P(X, e, y)}{P(e)}$$

Inference by Enumeration?


Variable Elimination

• Why is inference by enumeration on a Bayes Net inefficient?
  – You join up the whole joint distribution before you sum out the hidden variables
  – You end up repeating a lot of work!
• Idea: interleave joining and marginalizing!
  – Called "Variable Elimination"
  – Choosing the order to eliminate variables that minimizes work is NP-hard, but *anything* sensible is much faster than inference by enumeration

General Variable Elimination

• Query: P(Q | E_1 = e_1, …, E_k = e_k)

• Start with initial factors:
  – Local CPTs (but instantiated by evidence)
• While there are still hidden variables (not Q or evidence):
  – Pick a hidden variable H
  – Join all factors mentioning H
  – Eliminate (sum out) H
• Join all remaining factors and normalize
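The recipe above maps directly onto code. Below is a minimal, generic sketch; the factor representation and the helper names (pointwise_product, sum_out, eliminate) are illustrative choices, not from the slides:

```python
# Generic variable-elimination sketch: factors are {assignment_tuple: value} dicts
# over an explicit variable list; domains maps each variable name to its values.
from itertools import product

class Factor:
    def __init__(self, variables, table):
        self.variables = list(variables)    # e.g. ["A", "B"]
        self.table = dict(table)            # {(a_val, b_val): probability}

def pointwise_product(f, g, domains):
    """Join two factors: multiply entries that agree on shared variables."""
    variables = f.variables + [v for v in g.variables if v not in f.variables]
    table = {}
    for assignment in product(*[domains[v] for v in variables]):
        env = dict(zip(variables, assignment))
        fv = f.table[tuple(env[v] for v in f.variables)]
        gv = g.table[tuple(env[v] for v in g.variables)]
        table[assignment] = fv * gv
    return Factor(variables, table)

def sum_out(var, f):
    """Eliminate var by summing the factor's entries over its values."""
    keep = [v for v in f.variables if v != var]
    table = {}
    for assignment, value in f.table.items():
        env = dict(zip(f.variables, assignment))
        key = tuple(env[v] for v in keep)
        table[key] = table.get(key, 0.0) + value
    return Factor(keep, table)

def eliminate(factors, hidden_vars, domains):
    """For each hidden variable: join all factors mentioning it, then sum it out."""
    for h in hidden_vars:
        mentioning = [f for f in factors if h in f.variables]
        rest = [f for f in factors if h not in f.variables]
        joined = mentioning[0]
        for f in mentioning[1:]:
            joined = pointwise_product(joined, f, domains)
        factors = rest + [sum_out(h, joined)]
    return factors   # join the survivors and normalize to answer the query
```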


Example: Variable elimination

Query: What is the probability that a student attends class, given that they pass the exam?

[based on slides taken from UMBC CMSC 671, 2005]

P(pr | at, st)     at   st
      0.9           T    T
      0.5           T    F
      0.7           F    T
      0.1           F    F

[Network: attend and study are the parents of prepared; prepared, attend, and fair are the parents of pass.
 Priors: P(at) = 0.8, P(st) = 0.6, P(fa) = 0.9]

P(pa | at, pre, fa)    pr   at   fa
      0.9               T    T    T
      0.1               T    T    F
      0.7               T    F    T
      0.1               T    F    F
      0.7               F    T    T
      0.1               F    T    F
      0.2               F    F    T
      0.1               F    F    F

Join study factors

[Factor graph: attend, study → prepared; prepared, attend, fair → pass.  P(at) = 0.8, P(st) = 0.6, P(fa) = 0.9]

                          Original              Joint         Marginal
prep  study  attend    P(pr|at,st)  P(st)    P(pr,st|at)    P(pr|at)
  T     T      T           0.9       0.6        0.54          0.74
  T     F      T           0.5       0.4        0.20
  T     T      F           0.7       0.6        0.42          0.46
  T     F      F           0.1       0.4        0.04
  F     T      T           0.1       0.6        0.06          0.26
  F     F      T           0.5       0.4        0.20
  F     T      F           0.3       0.6        0.18          0.54
  F     F      F           0.9       0.4        0.36

(The CPT P(pa | at, pre, fa) is unchanged; see the previous slide.)

Marginalize out study

[Factor graph: the joined factor over (prepared, study) replaces the separate study node.  P(at) = 0.8, P(fa) = 0.9]

(Same table as above: summing the Joint column over study gives the Marginal column, P(pr | at).)

(The CPT P(pa | at, pre, fa) is unchanged.)

Remove “study”

[Factor graph: attend → prepared; prepared, attend, fair → pass.  P(at) = 0.8, P(fa) = 0.9]

P(pr | at)     pr   at
    0.74        T    T
    0.46        T    F
    0.26        F    T
    0.54        F    F

(The CPT P(pa | at, pre, fa) is unchanged.)

Join factors “fair”

[Factor graph: attend → prepared; prepared, attend, fair → pass.  P(at) = 0.8, P(fa) = 0.9]

(The factor P(pr | at) is unchanged; see above.)

                               Original                 Joint             Marginal
pa   pre  attend  fair    P(pa|at,pre,fa)   P(fa)   P(pa,fa|at,pre)   P(pa|at,pre)
 T    T     T      T           0.9           0.9         0.81            0.82
 T    T     T      F           0.1           0.1         0.01
 T    T     F      T           0.7           0.9         0.63            0.64
 T    T     F      F           0.1           0.1         0.01
 T    F     T      T           0.7           0.9         0.63            0.64
 T    F     T      F           0.1           0.1         0.01
 T    F     F      T           0.2           0.9         0.18            0.19
 T    F     F      F           0.1           0.1         0.01

Marginalize out “fair”

(Same table as above: summing the Joint column over fair gives the Marginal column, P(pa | at, pre).  The factor P(pr | at) is unchanged.)

Marginalize out “fair”

[Factor graph: attend → prepared; prepared, attend → pass.  P(at) = 0.8]

P(pr | at)       prep   attend
    0.74           T      T
    0.46           T      F
    0.26           F      T
    0.54           F      F

P(pa | at, pre)    pa   pre   attend
    0.82            T    T      T
    0.64            T    T      F
    0.64            T    F      T
    0.19            T    F      F

Join factors “prepared”

[Factor graph: attend → prepared → pass.  P(at) = 0.8]

                     Original                Joint         Marginal
pa   pre  attend   P(pa|at,pr)  P(pr|at)  P(pa,pr|at)     P(pa|at)
 T    T     T         0.82        0.74       0.6068        0.7732
 T    T     F         0.64        0.46       0.2944        0.3970
 T    F     T         0.64        0.26       0.1664
 T    F     F         0.19        0.54       0.1026

Join factors “prepared”

(Summing the Joint column over prepared gives the Marginal column, P(pa | at).)

Join factors “prepared”

[Factor graph: attend → pass.  P(at) = 0.8]

P(pa | at)     pa   attend
   0.7732       T     T
   0.3970       T     F

Join factors

[Remaining factors: P(at) and P(pa | at)]

                Original              Joint       Normalized
pa   attend   P(pa|at)   P(at)      P(pa,at)      P(at|pa)
 T     T       0.7732     0.8        0.61856        0.89
 T     F       0.3970     0.2        0.07940        0.11

Join factors

Answer: P(at = T | pa = T) = 0.61856 / (0.61856 + 0.0794) ≈ 0.89, so a student who passes the exam attended class with probability about 0.89.
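As a sanity check on the elimination above, here is an illustrative brute-force enumeration sketch; the CPT values are the ones from the slides, the code itself is not:

```python
# Brute-force check of the worked example: P(attend | pass = True).
from itertools import product

P_at, P_st, P_fa = 0.8, 0.6, 0.9
P_pr = {(True, True): 0.9, (True, False): 0.5,       # P(pr=T | at, st)
        (False, True): 0.7, (False, False): 0.1}
P_pa = {(True, True, True): 0.9, (True, True, False): 0.1,    # P(pa=T | pr, at, fa)
        (True, False, True): 0.7, (True, False, False): 0.1,
        (False, True, True): 0.7, (False, True, False): 0.1,
        (False, False, True): 0.2, (False, False, False): 0.1}

def bern(p, v):                       # probability of Boolean value v under Bernoulli(p)
    return p if v else 1 - p

unnorm = {True: 0.0, False: 0.0}
for at, st, fa, pr in product((True, False), repeat=4):
    p = bern(P_at, at) * bern(P_st, st) * bern(P_fa, fa)
    p *= bern(P_pr[(at, st)], pr)
    p *= P_pa[(pr, at, fa)]           # evidence: pass = True
    unnorm[at] += p

z = unnorm[True] + unnorm[False]
print(round(unnorm[True] / z, 2))     # roughly 0.89, matching variable elimination
```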

Bayesian network inference: Big picture

• Exact inference is intractable
  – There exist techniques to speed up computations, but worst-case complexity is still exponential except in some classes of networks
• Approximate inference
  – Sampling, variational methods, message passing / belief propagation…

Approximate Inference

Sampling


Sampling – the basics ...

• Scrooge McDuck gives you an ancient coin.

• He wants to know what is P(H)

• You have no homework, and nothing good is on television – so you toss it 1 Million times.

• You obtain 700000x Heads, and 300000x Tails.

• What is P(H)?


Sampling – the basics ...

• P(H) = 0.7
• Why? The relative frequency 700,000 / 1,000,000 is the maximum-likelihood estimate of P(H).

Approximate Inference

• Sampling is a hot topic in machine learning, and it's really simple
• Basic idea:
  – Draw N samples from a distribution
  – Compute an approximate posterior probability
  – Show this converges to the true probability P
• Why sample?
  – Learning: get samples from a distribution you don't know
  – Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)


Forward Sampling

[Network: Cloudy → Sprinkler, Cloudy → Rain, (Sprinkler, Rain) → WetGrass]

P(C):           +c 0.5    -c 0.5

P(S | C):   +c: +s 0.1    -s 0.9
            -c: +s 0.5    -s 0.5

P(R | C):   +c: +r 0.8    -r 0.2
            -c: +r 0.2    -r 0.8

P(W | S, R):   +s, +r: +w 0.99   -w 0.01
               +s, -r: +w 0.90   -w 0.10
               -s, +r: +w 0.90   -w 0.10
               -s, -r: +w 0.01   -w 0.99

Samples:
  +c, -s, +r, +w
  -c, +s, -r, +w

Forward Sampling

• This process generates samples with probability S_PS(x_1, …, x_n) = ∏_i P(x_i | Parents(X_i)), i.e. the BN's joint probability

• If we sample n events, then in the limit we would converge to the true joint probabilities.

• Therefore, the sampling procedure is consistent
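A minimal Python sketch of prior (forward) sampling for the Cloudy / Sprinkler / Rain / WetGrass network above (the variable encoding and the function name sample_bn are illustrative):

```python
import random

# Forward sampling: sample each variable in topological order from its CPT.
# CPT values are the ones shown on the slide.
def sample_bn():
    c = random.random() < 0.5
    s = random.random() < (0.1 if c else 0.5)             # P(+s | C)
    r = random.random() < (0.8 if c else 0.2)             # P(+r | C)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    w = random.random() < p_w                             # P(+w | S, R)
    return {"C": c, "S": s, "R": r, "W": w}

samples = [sample_bn() for _ in range(10000)]
```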


Example

• We'll get a bunch of samples from the BN:
  +c, -s, +r, +w
  +c, +s, +r, +w
  -c, +s, +r, -w
  +c, -s, +r, +w
  -c, -s, -r, +w
• If we want to know P(W)
  – We have counts <+w:4, -w:1>
  – Normalize to get P(W) = <+w:0.8, -w:0.2>
  – This will get closer to the true distribution with more samples
  – Can estimate anything else, too (see the counting sketch below)
  – What about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)?
  – Fast: can use fewer samples if less time (what's the drawback?)

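Continuing the forward-sampling sketch above, queries are answered by counting (illustrative; assumes the samples list produced by sample_bn()):

```python
# Estimate P(+w) and P(+c | +w) by counting forward samples.
n_w = sum(s["W"] for s in samples)
print("P(+w)      ~", n_w / len(samples))

matching = [s for s in samples if s["W"]]        # keep samples consistent with +w
print("P(+c | +w) ~", sum(s["C"] for s in matching) / len(matching))
```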

Rejection Sampling

• Let's say we want P(C)
  – No point keeping all samples around
  – Just tally counts of C as we go
• Let's say we want P(C | +s)
  – Same thing: tally C outcomes, but ignore (reject) samples which don't have S = +s
  – This is called rejection sampling (see the sketch after the sample list below)
  – It is also consistent for conditional probabilities (i.e., correct in the limit)

+c, -s, +r, +w

+c, +s, +r, +w

-c, +s, +r, -w

+c, -s, +r, +w

-c, -s, -r, +w

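A minimal rejection-sampling sketch for P(C | +s), reusing the illustrative sample_bn() helper from the forward-sampling snippet above:

```python
# Rejection sampling: draw forward samples, keep only those matching the evidence.
def rejection_sample(query, evidence, n=100000):
    counts = {True: 0, False: 0}
    for _ in range(n):
        s = sample_bn()
        if all(s[var] == val for var, val in evidence.items()):
            counts[s[query]] += 1            # tally only evidence-consistent samples
    total = counts[True] + counts[False]
    return {v: c / total for v, c in counts.items()}

print(rejection_sample("C", {"S": True}))    # estimate of P(C | +s)
```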

Likelihood Weighting

• Problem with rejection sampling:
  – If evidence is unlikely, you reject a lot of samples
  – You don't exploit your evidence as you sample
  – Consider P(B | +a)
• Idea: fix evidence variables and sample the rest
• Problem: sample distribution not consistent!
• Solution: weight by probability of evidence given parents

[Burglary → Alarm]
  Rejection sampling for P(B | +a) wastes most samples:      -b,-a   -b,-a   -b,-a   -b,-a   +b,+a
  With the evidence A = +a fixed, every sample is usable:     -b,+a   -b,+a   -b,+a   -b,+a   +b,+a

Likelihood Weighting

• Sampling distribution if z sampled and e fixed evidence:

  $$S_{WS}(z, e) = \prod_{i} P(z_i \mid \text{Parents}(Z_i))$$

• Now, samples have weights:

  $$w(z, e) = \prod_{j} P(e_j \mid \text{Parents}(E_j))$$

• Together, the weighted sampling distribution is consistent:

  $$S_{WS}(z, e) \cdot w(z, e) = \prod_{i} P(z_i \mid \text{Parents}(Z_i)) \prod_{j} P(e_j \mid \text{Parents}(E_j)) = P(z, e)$$


Likelihood Weighting


(CPTs for the Cloudy / Sprinkler / Rain / WetGrass network as on the forward-sampling slide above.)

Samples:

+c, +s, +r, +w…


Inference: sum the weights of samples that match the query value, then divide by the total sample weight. What is P(C | +w, +r)?

Likelihood Weighting Example

Cloudy   Rainy   Sprinkler   Wet Grass   Weight
  0        1        1           1         0.495
  0        0        1           1         0.45
  0        0        1           1         0.45
  0        0        1           1         0.45
  1        0        1           1         0.09
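A minimal likelihood-weighting sketch for the network above, with evidence S = +s and W = +w to match the example table (the code layout and function names are illustrative):

```python
import random

# Likelihood weighting: evidence variables are fixed to their observed values and
# contribute a weight P(e | parents); all other variables are sampled as usual.
def weighted_sample(evidence):
    sample, weight = {}, 1.0
    sample["C"] = random.random() < 0.5                   # Cloudy has no parents
    p_s = 0.1 if sample["C"] else 0.5                     # P(+s | C)
    if "S" in evidence:
        sample["S"] = evidence["S"]
        weight *= p_s if evidence["S"] else 1 - p_s
    else:
        sample["S"] = random.random() < p_s
    sample["R"] = random.random() < (0.8 if sample["C"] else 0.2)   # P(+r | C)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(sample["S"], sample["R"])]
    if "W" in evidence:
        sample["W"] = evidence["W"]
        weight *= p_w if evidence["W"] else 1 - p_w
    else:
        sample["W"] = random.random() < p_w
    return sample, weight

def lw_query(query, evidence, n=100000):
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        s, w = weighted_sample(evidence)
        totals[s[query]] += w                             # sum weights per query value
    z = totals[True] + totals[False]
    return {v: t / z for v, t in totals.items()}

print(lw_query("C", {"S": True, "W": True}))              # estimate of P(C | +s, +w)
```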

Likelihood Weighting
• Likelihood weighting is good
  – We have taken evidence into account as we generate the sample
  – E.g. here, W's value will get picked based on the evidence values of S, R
  – More of our samples will reflect the state of the world suggested by the evidence
• Likelihood weighting doesn't solve all our problems
  – Evidence influences the choice of downstream variables, but not upstream ones (C isn't more likely to get a value matching the evidence)
• We would like to consider evidence when we sample every variable


Summary

• Sampling can be really effective
  – The dominating approach to inference in BNs
• Approaches:
  – Forward (/Prior) Sampling
  – Rejection Sampling
  – Likelihood Weighted Sampling

Bayesian decision making

• Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E

• Inference problem: given some evidence E = e, what is P(X | e)?

• Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1,e1), …, (xn,en)}

Learning in Bayes Nets

• Task 1: Given the network structure and given data (observed settings for the variables) learn the CPTs for the Bayes Net.

• Task 2: Given only the data (and possibly a prior over Bayes Nets), learn the entire Bayes Net (both Net structure and CPTs).

Parameter learning
• Suppose we know the network structure (but not the parameters), and have a training set of complete observations

Training set:
  Sample   C   S   R   W
    1      T   F   T   T
    2      F   T   F   T
    3      T   F   F   F
    4      T   T   T   T
    5      F   T   F   T
    6      T   F   T   F
    …      …   …   …   …

[Network: C → S, C → R, (S, R) → W, with all CPT entries unknown ("?")]

Parameter learning
• Suppose we know the network structure (but not the parameters), and have a training set of complete observations
  – P(X | Parents(X)) is given by the observed frequencies of the different values of X for each combination of parent values
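A minimal counting sketch for one CPT, P(W | S, R), from the complete observations above (the column order C, S, R, W follows the table; the code itself is illustrative):

```python
# Maximum-likelihood CPT estimation by counting parent/child co-occurrences.
from collections import Counter

data = [  # (C, S, R, W) rows as in the training set above
    ("T", "F", "T", "T"), ("F", "T", "F", "T"), ("T", "F", "F", "F"),
    ("T", "T", "T", "T"), ("F", "T", "F", "T"), ("T", "F", "T", "F"),
]

parent_counts, joint_counts = Counter(), Counter()
for c, s, r, w in data:
    parent_counts[(s, r)] += 1
    joint_counts[(s, r, w)] += 1

# P(W = w | S = s, R = r) = count(s, r, w) / count(s, r)
cpt = {key: joint_counts[key] / parent_counts[key[:2]] for key in joint_counts}
print(cpt)
```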

Parameter learning
• Incomplete observations
• Expectation maximization (EM) algorithm for dealing with missing data

Training set:
  Sample   C   S   R   W
    1      ?   F   T   T
    2      ?   T   F   T
    3      ?   F   F   F
    4      ?   T   T   T
    5      ?   T   F   T
    6      ?   F   T   F
    …      …   …   …   …

[Network: C → S, C → R, (S, R) → W, with all CPT entries unknown ("?"); C is never observed]

Learning from incomplete data

• Observe real-world data
  – Some variables are unobserved
  – Trick: Fill them in
• EM algorithm
  – E = Expectation
  – M = Maximization

(Training set as above, with C unobserved in every sample.)

EM-Algorithm

• Initialize CPTs (e.g. randomly, uniformly)
• Repeat until convergence
  – E-Step:
    • Compute P(hidden variable = x | observed values in sample i, current CPTs) for all samples i and values x (inference)
  – M-Step:
    • Update the CPT tables from the resulting expected counts

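The E/M loop above can be written compactly. Below is an illustrative Python sketch for the training set above, where only C is hidden; the CPT layout, the random initialization, and the fixed iteration count are assumptions, not from the slides:

```python
import random

# EM for the sprinkler-style network C -> S, C -> R, (S, R) -> W with C unobserved.
# Parameters are P(variable = True | parents); all variables are Boolean.
data = [  # (S, R, W) rows; C is missing in every sample
    (False, True, True), (True, False, True), (False, False, False),
    (True, True, True), (True, False, True), (False, True, False),
]

random.seed(0)                      # random init avoids the symmetric fixed point
p_c = random.random()
p_s = {True: random.random(), False: random.random()}   # P(+s | C)
p_r = {True: random.random(), False: random.random()}   # P(+r | C)
p_w = {(s, r): 0.5 for s in (True, False) for r in (True, False)}  # P(+w | S, R)

def bern(p, v):
    return p if v else 1 - p

for _ in range(100):                                     # "repeat until convergence"
    # E-step: q_i = P(C = True | s_i, r_i, w_i) under the current CPTs.
    q = []
    for s, r, w in data:
        like = {c: bern(p_c, c) * bern(p_s[c], s) * bern(p_r[c], r) * bern(p_w[(s, r)], w)
                for c in (True, False)}
        q.append(like[True] / (like[True] + like[False]))

    # M-step: re-estimate CPTs from expected counts.
    p_c = sum(q) / len(data)
    for c, weights in ((True, q), (False, [1 - qi for qi in q])):
        total = sum(weights)
        p_s[c] = sum(wt for wt, (s, r, w) in zip(weights, data) if s) / total
        p_r[c] = sum(wt for wt, (s, r, w) in zip(weights, data) if r) / total
    for key in p_w:                                      # W's parents are fully observed,
        rows = [w for s, r, w in data if (s, r) == key]  # so use plain relative frequencies
        if rows:
            p_w[key] = sum(rows) / len(rows)

print(p_c, p_s, p_r, p_w)
```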

Parameter learning

• What if the network structure is unknown?
  – Structure learning algorithms exist, but they are pretty complicated…

(Training set as above, with all variables observed.)

[Which network structure over C, S, R, W generated the data?]

Summary: Bayesian networks

• Joint Distribution
• Independence
• Inference
• Parameters
• Learning

Naïve Bayes

Bayesian inference, Naïve Bayes model

http://xkcd.com/1236/

Bayes Rule

• The product rule gives us two ways to factor a joint probability:

  $$P(A, B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$$

• Therefore,

  $$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

• Why is this useful?
  – Key tool for probabilistic inference: can get diagnostic probability from causal probability
    • E.g., P(Cavity | Toothache) from P(Toothache | Cavity)

Rev. Thomas Bayes (1702-1761)

Bayes Rule example

• Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?


$$P(\text{Rain} \mid \text{Predict}) = \frac{P(\text{Predict} \mid \text{Rain})\, P(\text{Rain})}{P(\text{Predict})}
 = \frac{P(\text{Predict} \mid \text{Rain})\, P(\text{Rain})}{P(\text{Predict} \mid \text{Rain})\, P(\text{Rain}) + P(\text{Predict} \mid \neg\text{Rain})\, P(\neg\text{Rain})}$$

$$= \frac{0.9 \times 0.014}{0.9 \times 0.014 + 0.1 \times 0.986} = \frac{0.0126}{0.0126 + 0.0986} \approx 0.111$$
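A quick numeric check (illustrative), using the unrounded prior 5/365:

```python
# Bayes rule for the wedding example.
p_rain = 5 / 365
p_pred_given_rain, p_pred_given_dry = 0.9, 0.1
posterior = (p_pred_given_rain * p_rain) / (
    p_pred_given_rain * p_rain + p_pred_given_dry * (1 - p_rain))
print(round(posterior, 3))   # 0.111
```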

Probabilistic inference
• Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e
• Examples: X = {spam, not spam}, e = email message
            X = {zebra, giraffe, hippo}, e = image features

Bayesian decision theory
• Let x be the value predicted by the agent and x* be the true value of X.
• The agent has a loss function, which is 0 if x = x* and 1 otherwise
• Expected loss:

  $$\sum_x L(x, x^*)\, P(x \mid e) = \sum_{x \neq x^*} P(x \mid e) = 1 - P(x^* \mid e)$$

• What is the estimate of X that minimizes the expected loss?
  – The one that has the greatest posterior probability P(x | e)
  – This is called the Maximum a Posteriori (MAP) decision

MAP decision
• Value x of X that has the highest posterior probability given the evidence E = e:

  $$x^* = \arg\max_x P(X = x \mid E = e) = \arg\max_x \frac{P(E = e \mid X = x)\, P(X = x)}{P(E = e)} = \arg\max_x P(E = e \mid X = x)\, P(X = x)$$

  $$\underbrace{P(x \mid e)}_{\text{posterior}} \;\propto\; \underbrace{P(e \mid x)}_{\text{likelihood}}\; \underbrace{P(x)}_{\text{prior}}$$

• Maximum likelihood (ML) decision:

  $$x^* = \arg\max_x P(e \mid x)$$

Naïve Bayes model
• Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X
• MAP decision involves estimating

  $$P(X \mid E_1, \ldots, E_n) \;\propto\; P(E_1, \ldots, E_n \mid X)\, P(X)$$

  – If each feature Ei can take on k values, how many entries are in the (conditional) joint probability table P(E1, …, En | X = x)?

Naïve Bayes model
• Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X
• MAP decision involves estimating

  $$P(X \mid E_1, \ldots, E_n) \;\propto\; P(E_1, \ldots, E_n \mid X)\, P(X)$$

• We can make the simplifying assumption that the different features are conditionally independent given the hypothesis:

  $$P(E_1, \ldots, E_n \mid X) = \prod_{i=1}^{n} P(E_i \mid X)$$

  – If each feature can take on k values, what is the complexity of storing the resulting distributions?

Exchangeability

De Finetti's theorem on exchangeability (the "bag of words" theorem): the joint probability distribution underlying the data is invariant to permutation of the observations.

Naïve Bayes model
• Posterior:

  $$P(X = x \mid E_1 = e_1, \ldots, E_n = e_n) \;\propto\; \underbrace{P(X = x)}_{\text{prior}}\, \underbrace{\prod_{i=1}^{n} P(E_i = e_i \mid X = x)}_{\text{likelihood}}$$

• MAP decision:

  $$x^* = \arg\max_x P(x \mid e) = \arg\max_x P(x) \prod_{i=1}^{n} P(e_i \mid x)$$

A Simple Example – Naïve Bayes

C – class variable, F – features

We only specify (parameters):
• the prior over class labels, P(C)
• how each feature depends on the class, P(F | C)

Slide from Dan Klein

Spam filter
• MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)


• We have P(spam | message) ∝ P(message | spam) P(spam) and P(¬spam | message) ∝ P(message | ¬spam) P(¬spam)

• To enable classification, we need to be able to estimate the likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam)


Naïve Bayes Representation
• Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam)
• Likelihood: bag of words representation
  – The message is a sequence of words (w1, …, wn)
  – The order of the words in the message is not important
  – Each word is conditionally independent of the others given message class (spam or not spam)
  – Thus, the problem is reduced to estimating marginal likelihoods of individual words P(wi | spam) and P(wi | ¬spam)

$$P(\text{message} \mid \text{spam}) = P(w_1, \ldots, w_n \mid \text{spam}) = \prod_{i=1}^{n} P(w_i \mid \text{spam})$$

Summary: Decision rule

• General MAP rule for Naïve Bayes:

  $$\underbrace{P(x \mid e)}_{\text{posterior}} \;\propto\; \underbrace{P(x)}_{\text{prior}}\, \underbrace{\prod_{i=1}^{n} P(e_i \mid x)}_{\text{likelihood}}, \qquad x^* = \arg\max_x P(x) \prod_{i=1}^{n} P(e_i \mid x)$$

• For the spam filter: $P(\text{spam} \mid w_1, \ldots, w_n) \;\propto\; P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})$

• Thus, the filter should classify the message as spam if

  $$P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}) \;>\; P(\neg\text{spam}) \prod_{i=1}^{n} P(w_i \mid \neg\text{spam})$$
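Putting the pieces together, a toy bag-of-words Naïve Bayes filter might look like the sketch below; the training data and the add-one (Laplace) smoothing are illustrative choices, not from the slides:

```python
# Minimal bag-of-words Naive Bayes spam filter sketch.
import math
from collections import Counter

train = [
    ("win money now", "spam"), ("cheap money offer", "spam"),
    ("meeting schedule today", "ham"), ("lunch today with the team", "ham"),
]

class_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_counts}
for text, label in train:
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, c):
    # log P(c) + sum_i log P(w_i | c), with add-one smoothing for unseen words
    logp = math.log(class_counts[c] / len(train))
    total = sum(word_counts[c].values())
    for w in text.split():
        logp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return logp

def classify(text):
    return max(class_counts, key=lambda c: log_posterior(text, c))

print(classify("cheap money today"))   # -> 'spam' on this toy training set
```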

[Slides from Dan Klein: parameter estimates for the spam filter]
• P(spam), P(ham): percentage of documents in the training set labeled as spam/ham
• P(w | spam): in the documents labeled as spam, occurrence percentage of each word (e.g. # times "the" occurred / # total words)
• P(w | ham): in the documents labeled as ham, occurrence percentage of each word (e.g. # times "the" occurred / # total words)
