
Bayesian Networks

Tamara Berg

CS 560 Artificial Intelligence

Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Killian Weinberger, Deva Ramanan


Announcements

• HW3 released online this afternoon, due Oct 29, 11:59pm

• Demo by Mark Zhao!


Bayesian decision making

• Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E

• Inference problem: given some evidence E = e, what is P(X | e)?

• Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1,e1), …, (xn,en)}


Probabilistic inference

A general scenario:
– Query variables: X
– Evidence (observed) variables: E = e
– Unobserved variables: Y

If we know the full joint distribution P(X, E, Y), how can we perform inference about X?

$P(X \mid E = e) \;=\; \dfrac{P(X, e)}{P(e)} \;=\; \dfrac{\sum_{y} P(X, e, y)}{P(e)}$
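As a concrete, purely illustrative sketch of this computation in Python (the joint table, variable names, and numbers below are invented for the example): with three binary variables we can marginalize out Y and condition on E = e directly.

```python
from itertools import product

# Toy joint distribution P(X, E, Y) over three binary variables,
# stored as {(x, e, y): probability}. The numbers are illustrative only.
joint = {(x, e, y): p for (x, e, y), p in zip(
    product([0, 1], repeat=3),
    [0.10, 0.05, 0.15, 0.20, 0.05, 0.15, 0.10, 0.20])}

def query_x_given_e(joint, e_obs):
    """P(X | E = e_obs): sum out Y, then normalize over X."""
    unnormalized = {}
    for (x, e, y), p in joint.items():
        if e == e_obs:
            unnormalized[x] = unnormalized.get(x, 0.0) + p  # sum over y
    z = sum(unnormalized.values())                          # this is P(e)
    return {x: p / z for x, p in unnormalized.items()}

print(query_x_given_e(joint, e_obs=1))
```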


Inference

• Inference: calculating some useful quantity from a joint probability distribution

• Examples:
– Posterior probability: P(Q | E = e)
– Most likely explanation: argmax_q P(q | e)

[Burglary network figure: B, E → A; A → J, M]


Inference – computing conditional probabilities

Marginalization: $P(x) = \sum_{y} P(x, y)$
Conditional probabilities: $P(x \mid e) = \dfrac{P(x, e)}{P(e)}$


Inference by Enumeration

• Given unlimited time, inference in BNs is easy
• Recipe:
– State the marginal probabilities you need
– Figure out ALL the atomic probabilities you need
– Calculate and combine them

• Example:

[Burglary network figure: B, E → A; A → J, M]
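A small Python sketch of enumeration on this burglary network (my own illustration; the CPT numbers below are the usual textbook values for this example, used only to make the code runnable).

```python
from itertools import product

# Burglary network: B, E -> A;  A -> J, M. CPTs are the standard textbook numbers.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(+a | b, e)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                    # P(+j | a)
P_M = {True: 0.70, False: 0.01}                    # P(+m | a)

def bern(v, p):
    return p if v else 1 - p

def enumerate_query(j_obs=True, m_obs=True):
    """P(B | j_obs, m_obs): sum the full joint over the hidden variables E and A."""
    post = {True: 0.0, False: 0.0}
    for b, e, a in product([True, False], repeat=3):
        post[b] += (bern(b, P_B) * bern(e, P_E) * bern(a, P_A[(b, e)]) *
                    bern(j_obs, P_J[a]) * bern(m_obs, P_M[a]))
    z = sum(post.values())
    return {b: p / z for b, p in post.items()}

print(enumerate_query())   # P(+b | +j, +m) is roughly 0.28 with these numbers
```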


Example: Enumeration

• In this simple method, we only need the BN to synthesize the joint entries


Probabilistic inference

A general scenario:
– Query variables: X
– Evidence (observed) variables: E = e
– Unobserved variables: Y

If we know the full joint distribution P(X, E, Y), how can we perform inference about X?

Problems:
– Full joint distributions are too large
– Marginalizing out Y may involve too many summation terms

$P(X \mid E = e) \;=\; \dfrac{P(X, e)}{P(e)} \;=\; \dfrac{\sum_{y} P(X, e, y)}{P(e)}$


Inference by Enumeration?


Variable Elimination

• Why is inference by enumeration on a Bayes Net inefficient?
– You join up the whole joint distribution before you sum out the hidden variables
– You end up repeating a lot of work!

• Idea: interleave joining and marginalizing!
– Called “Variable Elimination”
– Choosing the order to eliminate variables that minimizes work is NP-hard, but *anything* sensible is much faster than inference by enumeration


General Variable Elimination

• Query: P(Q | E_1 = e_1, …, E_k = e_k)

• Start with initial factors:
– Local CPTs (but instantiated by evidence)

• While there are still hidden variables (not Q or evidence):
– Pick a hidden variable H
– Join all factors mentioning H
– Eliminate (sum out) H

• Join all remaining factors and normalize
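A compact Python sketch of this loop, assuming binary variables and factors that have already been restricted to the evidence (the factor representation and helper names are mine, not the course's reference code).

```python
from itertools import product

# A factor is (variables, table): `variables` is a tuple of names and `table`
# maps a tuple of truth values (in that order) to a number. All variables are
# assumed binary, and evidence is assumed already baked into the factors.

def join(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    v1, t1 = f1
    v2, t2 = f2
    out_vars = v1 + tuple(v for v in v2 if v not in v1)
    table = {}
    for assign in product([True, False], repeat=len(out_vars)):
        a = dict(zip(out_vars, assign))
        table[assign] = t1[tuple(a[v] for v in v1)] * t2[tuple(a[v] for v in v2)]
    return out_vars, table

def sum_out(var, factor):
    """Eliminate `var` from a factor by summing it out."""
    variables, table = factor
    i = variables.index(var)
    out = {}
    for assign, p in table.items():
        key = assign[:i] + assign[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return variables[:i] + variables[i + 1:], out

def variable_elimination(factors, hidden):
    """Interleave joining and marginalizing, then join the rest and normalize."""
    for h in hidden:
        related = [f for f in factors if h in f[0]]
        if not related:
            continue
        joined = related[0]
        for f in related[1:]:
            joined = join(joined, f)
        factors = [f for f in factors if h not in f[0]] + [sum_out(h, joined)]
    result = factors[0]
    for f in factors[1:]:
        result = join(result, f)
    variables, table = result
    z = sum(table.values())
    return variables, {k: v / z for k, v in table.items()}
```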


Example: Variable elimination

Query: What is the probability that a student attends class, given that they pass the exam?

[based on slides taken from UMBC CMSC 671, 2005]

P(pr | at, st):
  at  st  P(pr=T)
  T   T   0.9
  T   F   0.5
  F   T   0.7
  F   F   0.1

[Network figure: attend, study → prepared; attend, prepared, fair → pass]
Priors: P(at) = 0.8, P(st) = 0.6, P(fa) = 0.9

P(pa | at, pre, fa):
  pr  at  fa  P(pa=T)
  T   T   T   0.9
  T   T   F   0.1
  T   F   T   0.7
  T   F   F   0.1
  F   T   T   0.7
  F   T   F   0.1
  F   F   T   0.2
  F   F   F   0.1


Join study factors

[Network figure: attend, study → prepared; attend, prepared, fair → pass]
Priors: P(at) = 0.8, P(st) = 0.6, P(fa) = 0.9

Original → joint → marginal:
  prep  study  attend  P(pr|at,st)  P(st)  P(pr,st|at)  P(pr|at)
  T     T      T       0.9          0.6    0.54         0.74
  T     F      T       0.5          0.4    0.20
  T     T      F       0.7          0.6    0.42         0.46
  T     F      F       0.1          0.4    0.04
  F     T      T       0.1          0.6    0.06         0.26
  F     F      T       0.5          0.4    0.20
  F     T      F       0.3          0.6    0.18         0.54
  F     F      F       0.9          0.4    0.36

P(pa | at, pre, fa):
  pr  at  fa  P(pa=T)
  T   T   T   0.9
  T   T   F   0.1
  T   F   T   0.7
  T   F   F   0.1
  F   T   T   0.7
  F   T   F   0.1
  F   F   T   0.2
  F   F   F   0.1


Marginalize out study

[Network figure after joining study: attend → (prepared, study); attend, prepared, fair → pass]
Priors: P(at) = 0.8, P(fa) = 0.9

Original → joint → marginal:
  prep  study  attend  P(pr|at,st)  P(st)  P(pr,st|at)  P(pr|at)
  T     T      T       0.9          0.6    0.54         0.74
  T     F      T       0.5          0.4    0.20
  T     T      F       0.7          0.6    0.42         0.46
  T     F      F       0.1          0.4    0.04
  F     T      T       0.1          0.6    0.06         0.26
  F     F      T       0.5          0.4    0.20
  F     T      F       0.3          0.6    0.18         0.54
  F     F      F       0.9          0.4    0.36

P(pa | at, pre, fa):
  pr  at  fa  P(pa=T)
  T   T   T   0.9
  T   T   F   0.1
  T   F   T   0.7
  T   F   F   0.1
  F   T   T   0.7
  F   T   F   0.1
  F   F   T   0.2
  F   F   F   0.1


Remove “study”

[Network figure: attend → prepared; attend, prepared, fair → pass]
Priors: P(at) = 0.8, P(fa) = 0.9

P(pr | at):
  pr  at  P
  T   T   0.74
  T   F   0.46
  F   T   0.26
  F   F   0.54

P(pa | at, pre, fa):
  pr  at  fa  P(pa=T)
  T   T   T   0.9
  T   T   F   0.1
  T   F   T   0.7
  T   F   F   0.1
  F   T   T   0.7
  F   T   F   0.1
  F   F   T   0.2
  F   F   F   0.1


Join factors “fair”

[Network figure: attend → prepared; attend, prepared, fair → pass]
Priors: P(at) = 0.8, P(fa) = 0.9

P(pr | at):
  prep  attend  P
  T     T       0.74
  T     F       0.46
  F     T       0.26
  F     F       0.54

Original → joint → marginal:
  pa  pre  attend  fair  P(pa|at,pre,fa)  P(fa)  P(pa,fa|at,pre)  P(pa|at,pre)
  T   T    T       T     0.9              0.9    0.81             0.82
  T   T    T       F     0.1              0.1    0.01
  T   T    F       T     0.7              0.9    0.63             0.64
  T   T    F       F     0.1              0.1    0.01
  T   F    T       T     0.7              0.9    0.63             0.64
  T   F    T       F     0.1              0.1    0.01
  T   F    F       T     0.2              0.9    0.18             0.19
  T   F    F       F     0.1              0.1    0.01


Marginalize out “fair”

[Network figure after joining fair: attend → prepared; attend, prepared → (pass, fair)]
Prior: P(at) = 0.8

P(pr | at):
  prep  attend  P
  T     T       0.74
  T     F       0.46
  F     T       0.26
  F     F       0.54

Original → joint → marginal:
  pa  pre  attend  fair  P(pa|at,pre,fa)  P(fa)  P(pa,fa|at,pre)  P(pa|at,pre)
  T   T    T       T     0.9              0.9    0.81             0.82
  T   T    T       F     0.1              0.1    0.01
  T   T    F       T     0.7              0.9    0.63             0.64
  T   T    F       F     0.1              0.1    0.01
  T   F    T       T     0.7              0.9    0.63             0.64
  T   F    T       F     0.1              0.1    0.01
  T   F    F       T     0.2              0.9    0.18             0.19
  T   F    F       F     0.1              0.1    0.01


Marginalize out “fair”

[Network figure: attend → prepared; attend, prepared → pass]
Prior: P(at) = 0.8

P(pr | at):
  prep  attend  P
  T     T       0.74
  T     F       0.46
  F     T       0.26
  F     F       0.54

P(pa | at, pre):
  pa  pre  attend  P
  T   T    T       0.82
  T   T    F       0.64
  T   F    T       0.64
  T   F    F       0.19


Join factors “prepared”

[Network figure: attend → prepared; attend, prepared → pass]
Prior: P(at) = 0.8

Original → joint → marginal:
  pa  pre  attend  P(pa|at,pr)  P(pr|at)  P(pa,pr|at)  P(pa|at)
  T   T   T        0.82         0.74      0.6068       0.7732
  T   T   F        0.64         0.46      0.2944       0.397
  T   F   T        0.64         0.26      0.1664
  T   F   F        0.19         0.54      0.1026


Join factors “prepared”

[Network figure after joining prepared: attend → (pass, prepared)]
Prior: P(at) = 0.8

Original → joint → marginal:
  pa  pre  attend  P(pa|at,pr)  P(pr|at)  P(pa,pr|at)  P(pa|at)
  T   T   T        0.82         0.74      0.6068       0.7732
  T   T   F        0.64         0.46      0.2944       0.397
  T   F   T        0.64         0.26      0.1664
  T   F   F        0.19         0.54      0.1026


Join factors “prepared”

[Network figure: attend → pass]
Prior: P(at) = 0.8

P(pa | at):
  pa  attend  P
  T   T       0.7732
  T   F       0.397


Join factors

[Network figure: attend → pass]
Prior: P(at) = 0.8

Original → joint → normalized:
  pa  attend  P(pa|at)  P(at)  P(pa,at)  P(at|pa)
  T   T       0.7732    0.8    0.61856   0.89
  T   F       0.397     0.2    0.0794    0.11


Join factors

[Network figure: single joined node (attend, pass)]

Original → joint → normalized:
  pa  attend  P(pa|at)  P(at)  P(pa,at)  P(at|pa)
  T   T       0.7732    0.8    0.61856   0.89
  T   F       0.397     0.2    0.0794    0.11
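As a sanity check (my own addition, not part of the slides), the same posterior can be recovered by brute-force enumeration over all five variables using the CPTs of this example.

```python
from itertools import product

# CPTs from the attend/study/prepared/fair/pass example above.
P_at, P_st, P_fa = 0.8, 0.6, 0.9
P_pr = {(True, True): 0.9, (True, False): 0.5,     # P(pr=T | at, st)
        (False, True): 0.7, (False, False): 0.1}
P_pa = {(True, True, True): 0.9, (True, True, False): 0.1,   # P(pa=T | pr, at, fa)
        (True, False, True): 0.7, (True, False, False): 0.1,
        (False, True, True): 0.7, (False, True, False): 0.1,
        (False, False, True): 0.2, (False, False, False): 0.1}

def prob(v, p):          # P(var = v) for a Bernoulli with P(True) = p
    return p if v else 1 - p

posterior = {True: 0.0, False: 0.0}
for at, st, fa, pr, pa in product([True, False], repeat=5):
    if not pa:                       # condition on the evidence pass = T
        continue
    joint = (prob(at, P_at) * prob(st, P_st) * prob(fa, P_fa) *
             prob(pr, P_pr[(at, st)]) * prob(pa, P_pa[(pr, at, fa)]))
    posterior[at] += joint

z = sum(posterior.values())
print({k: round(v / z, 3) for k, v in posterior.items()})  # ~{True: 0.886, False: 0.114}
```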


Bayesian network inference: Big picture

• Exact inference is intractable
– There exist techniques to speed up computations, but worst-case complexity is still exponential except in some classes of networks

• Approximate inference
– Sampling, variational methods, message passing / belief propagation…


Approximate Inference

Sampling


Approximate Inference


Sampling – the basics ...

• Scrooge McDuck gives you an ancient coin.

• He wants to know what is P(H)

• You have no homework, and nothing good is on television – so you toss it 1 Million times.

• You obtain 700000x Heads, and 300000x Tails.

• What is P(H)?


Sampling – the basics ...

• P(H) = 0.7
• Why?
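One way to justify this (not spelled out on the slide): the relative frequency is the maximum-likelihood estimate of the coin's bias,

$\hat{P}(H) \;=\; \dfrac{\#\text{heads}}{\#\text{tosses}} \;=\; \dfrac{700{,}000}{1{,}000{,}000} \;=\; 0.7,$

and by the law of large numbers it converges to the true P(H) as the number of tosses grows.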


Approximate Inference

• Sampling is a hot topic in machine learning, and it’s really simple

• Basic idea:
– Draw N samples from a distribution
– Compute an approximate posterior probability
– Show this converges to the true probability P

• Why sample?
– Learning: get samples from a distribution you don’t know
– Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)


Forward Sampling

[Sprinkler network, shown twice (model and sampling order): Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass]

P(C):        +c 0.5   -c 0.5

P(S | C):    +c: +s 0.1, -s 0.9
             -c: +s 0.5, -s 0.5

P(R | C):    +c: +r 0.8, -r 0.2
             -c: +r 0.2, -r 0.8

P(W | S, R): +s, +r: +w 0.99, -w 0.01
             +s, -r: +w 0.90, -w 0.10
             -s, +r: +w 0.90, -w 0.10
             -s, -r: +w 0.01, -w 0.99

Samples:

+c, -s, +r, +w
-c, +s, -r, +w


Forward Sampling

• This process generates samples with probability $\prod_i P(x_i \mid \text{Parents}(X_i))$, i.e. the BN’s joint probability

• If we sample n events, then in the limit we would converge to the true joint probabilities.

• Therefore, the sampling procedure is consistent
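Here is a minimal Python sketch of forward (prior) sampling for the sprinkler network on this slide, using the CPT numbers shown above; the helper names and the final usage line are my own.

```python
import random

# CPTs from the slide: P(C), P(S|C), P(R|C), P(W|S,R); True means "+", False means "-".
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                    # P(+s | c)
P_R = {True: 0.8, False: 0.2}                    # P(+r | c)
P_W = {(True, True): 0.99, (True, False): 0.90,  # P(+w | s, r)
       (False, True): 0.90, (False, False): 0.01}

def sample_once():
    """Sample each variable in topological order, given its sampled parents."""
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return c, s, r, w

# Estimate P(+w) from N forward samples, exactly as on the next slide.
N = 100_000
samples = [sample_once() for _ in range(N)]
print(sum(w for *_, w in samples) / N)
```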


Example

• We’ll get a bunch of samples from the BN:
+c, -s, +r, +w

+c, +s, +r, +w

-c, +s, +r, -w

+c, -s, +r, +w

-c, -s, -r, +w

• If we want to know P(W)
– We have counts <+w:4, -w:1>
– Normalize to get P(W) = <+w:0.8, -w:0.2>
– This will get closer to the true distribution with more samples
– Can estimate anything else, too
– What about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)?
– Fast: can use fewer samples if less time (what’s the drawback?)

[Sprinkler network figure: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass]


Rejection Sampling

• Let’s say we want P(C)
– No point keeping all samples around
– Just tally counts of C as we go

• Let’s say we want P(C | +s)
– Same thing: tally C outcomes, but ignore (reject) samples which don’t have S = +s
– This is called rejection sampling
– It is also consistent for conditional probabilities (i.e., correct in the limit)

+c, -s, +r, +w

+c, +s, +r, +w

-c, +s, +r, -w

+c, -s, +r, +w

-c, -s, -r, +w

[Sprinkler network figure: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass]
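A sketch of rejection sampling for P(C | +s) on the same network, reusing the CPT constants and the sample_once helper from the forward-sampling sketch above (again, illustrative code of my own, not from the slides).

```python
def rejection_sample(evidence_s=True, n=100_000):
    """Estimate P(+c | S = evidence_s) by discarding samples that disagree
    with the evidence and tallying C over the rest."""
    kept, c_true = 0, 0
    for _ in range(n):
        c, s, r, w = sample_once()
        if s != evidence_s:          # reject: sample contradicts the evidence
            continue
        kept += 1
        c_true += c
    return c_true / kept if kept else float("nan")

print(rejection_sample())            # ~ P(+c | +s) = 1/6 ~ 0.17 with the slide's CPTs
```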


Likelihood Weighting

• Problem with rejection sampling:
– If evidence is unlikely, you reject a lot of samples
– You don’t exploit your evidence as you sample
– Consider P(B | +a)

• Idea: fix evidence variables and sample the rest

• Problem: sample distribution not consistent!
• Solution: weight by probability of evidence given parents

[Figure: Burglary → Alarm network, shown once for each set of samples below]

-b,-a   -b,-a   -b,-a   -b,-a   +b,+a

-b,+a   -b,+a   -b,+a   -b,+a   +b,+a


Likelihood Weighting

• Sampling distribution if z sampled and e fixed evidence: $S(z, e) = \prod_i P(z_i \mid \text{Parents}(Z_i))$

• Now, samples have weights: $w(z, e) = \prod_j P(e_j \mid \text{Parents}(E_j))$

• Together, the weighted sampling distribution is consistent: $S(z, e)\, w(z, e) = P(z, e)$

[Sprinkler network figure: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass]
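A Python sketch of likelihood weighting on the sprinkler network for the evidence +s, +w used in the worked example below; it reuses the CPT constants from the forward-sampling sketch earlier, and everything else (names, usage) is my own illustration.

```python
def likelihood_weighted_sample(evidence_s=True, evidence_w=True):
    """Sample non-evidence variables in topological order, fix evidence variables,
    and multiply in P(evidence | parents) as each evidence variable is reached."""
    weight = 1.0
    c = random.random() < P_C                    # sampled (not evidence)
    s = evidence_s                               # fixed evidence
    weight *= P_S[c] if s else 1 - P_S[c]
    r = random.random() < P_R[c]                 # sampled
    w = evidence_w                               # fixed evidence
    weight *= P_W[(s, r)] if w else 1 - P_W[(s, r)]
    return (c, s, r, w), weight

def estimate_c_posterior(n=100_000):
    """P(+c | +s, +w) = (total weight of samples with C = T) / (total weight)."""
    num = den = 0.0
    for _ in range(n):
        (c, *_), wt = likelihood_weighted_sample()
        den += wt
        if c:
            num += wt
    return num / den

print(estimate_c_posterior())
```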


Likelihood Weighting

P(C):        +c 0.5   -c 0.5

P(S | C):    +c: +s 0.1, -s 0.9
             -c: +s 0.5, -s 0.5

P(R | C):    +c: +r 0.8, -r 0.2
             -c: +r 0.2, -r 0.8

P(W | S, R): +s, +r: +w 0.99, -w 0.01
             +s, -r: +w 0.90, -w 0.10
             -s, +r: +w 0.90, -w 0.10
             -s, -r: +w 0.01, -w 0.99

Samples:

+c, +s, +r, +w…

[Sprinkler network, shown twice: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass]


Likelihood Weighting Example

Inference:
– Sum the weights of the samples that match the query value
– Divide by the total sample weight

• What is P(C | +s, +w)?

Cloudy  Rainy  Sprinkler  WetGrass  Weight
0       1      1          1         0.495
0       0      1          1         0.45
0       0      1          1         0.45
0       0      1          1         0.45
1       0      1          1         0.09
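Working through this table (my arithmetic, not on the slide): with evidence +s, +w, each sample's weight is P(+s | c) · P(+w | s, r) from the CPTs above, e.g. 0.5 · 0.99 = 0.495 for the first row and 0.1 · 0.9 = 0.09 for the last. The estimate is then

$\hat{P}(+c \mid +s, +w) \;=\; \dfrac{0.09}{0.495 + 0.45 + 0.45 + 0.45 + 0.09} \;=\; \dfrac{0.09}{1.935} \;\approx\; 0.047,$

which, with only five samples, is still far from the exact posterior (about 0.17 under these CPTs) but converges to it as more weighted samples are drawn.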


Likelihood Weighting
• Likelihood weighting is good

– We have taken evidence into account as we generate the sample

– E.g. here, W’s value will get picked based on the evidence values of S, R

– More of our samples will reflect the state of the world suggested by the evidence

• Likelihood weighting doesn’t solve all our problems
– Evidence influences the choice of downstream variables, but not upstream ones (C isn’t more likely to get a value matching the evidence)

• We would like to consider evidence when we sample every variable

[Sprinkler network figure: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass]


Summary

• Sampling can be really effective
– The dominating approach to inference in BNs

• Approaches:
– Forward (/Prior) Sampling
– Rejection Sampling
– Likelihood Weighted Sampling


Bayesian decision making

• Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E

• Inference problem: given some evidence E = e, what is P(X | e)?

• Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1,e1), …, (xn,en)}


Learning in Bayes Nets

• Task 1: Given the network structure and given data (observed settings for the variables) learn the CPTs for the Bayes Net.

• Task 2: Given only the data (and possibly a prior over Bayes Nets), learn the entire Bayes Net (both Net structure and CPTs).


Parameter learning
• Suppose we know the network structure (but not the parameters), and have a training set of complete observations

Sample C S R W

1 T F T T

2 F T F T

3 T F F F

4 T T T T

5 F T F T

6 T F T F

… … … …. …

[Network figure with unknown CPT entries (marked “?”)]

Training set


Parameter learning
• Suppose we know the network structure (but not the parameters), and have a training set of complete observations
– P(X | Parents(X)) is given by the observed frequencies of the different values of X for each combination of parent values
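A small Python sketch of this counting (my own illustration): estimating P(W | S, R) by relative frequency from the six complete samples shown in the training-set table above.

```python
from collections import Counter

# Complete observations (c, s, r, w), copied from the table above.
data = [(True, False, True, True), (False, True, False, True),
        (True, False, False, False), (True, True, True, True),
        (False, True, False, True), (True, False, True, False)]

def estimate_cpt_w(data):
    """P(W = T | S = s, R = r) = count(s, r, w=T) / count(s, r).
    Parent combinations that never occur in the data are simply left out."""
    joint, parents = Counter(), Counter()
    for _c, s, r, w in data:
        parents[(s, r)] += 1
        if w:
            joint[(s, r)] += 1
    return {sr: joint[sr] / n for sr, n in parents.items()}

print(estimate_cpt_w(data))
```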


Parameter learning
• Incomplete observations

• Expectation maximization (EM) algorithm for dealing with missing data

Sample C S R W

1 ? F T T

2 ? T F T

3 ? F F F

4 ? T T T

5 ? T F T

6 ? F T F

… … … …. …

[Network figure with unknown CPT entries (marked “?”)]

Training set


Learning from incomplete data

• Observe real-world data
– Some variables are unobserved
– Trick: Fill them in

• EM algorithm
– E = Expectation
– M = Maximization


Sample C S R W

1 ? F T T

2 ? T F T

3 ? F F F

4 ? T T T

5 ? T F T

6 ? F T F

… … … …. …


EM-Algorithm

• Initialize CPTs (e.g. randomly, uniformly)
• Repeat until convergence
– E-Step:
  • Compute P(X_i = x | observed data, current CPTs) for all i and x (inference)
– M-Step:
  • Update CPT tables from the resulting expected counts

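A minimal EM sketch for this setting (entirely my own illustration, not course code): the sprinkler network with Cloudy never observed, estimating P(C), P(S | C) and P(R | C) from the observed S and R columns. The CPT for W is omitted because W's parents are fully observed and can be estimated by plain counting, as on the earlier slide.

```python
def em_sprinkler(data, iters=100):
    """data: list of (s, r) observations; Cloudy is hidden in every sample."""
    # Initialize CPTs involving the hidden variable (asymmetric, to break symmetry).
    p_c = 0.6                        # P(C = T)
    p_s = {True: 0.3, False: 0.7}    # P(S = T | C = c)
    p_r = {True: 0.7, False: 0.3}    # P(R = T | C = c)

    def bern(v, p):                  # P(value v) for a Bernoulli with P(True) = p
        return p if v else 1 - p

    for _ in range(iters):
        # E-step: responsibility g_i = P(C = T | s_i, r_i) under the current CPTs.
        gammas = []
        for s, r in data:
            lik_t = p_c * bern(s, p_s[True]) * bern(r, p_r[True])
            lik_f = (1 - p_c) * bern(s, p_s[False]) * bern(r, p_r[False])
            gammas.append(lik_t / (lik_t + lik_f))

        # M-step: re-estimate the CPTs from the expected counts.
        p_c = sum(gammas) / len(data)
        for c_val in (True, False):
            w = gammas if c_val else [1 - g for g in gammas]
            tot = sum(w)
            p_s[c_val] = sum(wi for wi, (s, _) in zip(w, data) if s) / tot
            p_r[c_val] = sum(wi for wi, (_, r) in zip(w, data) if r) / tot
    return p_c, p_s, p_r

# Usage with the observed S, R columns from the incomplete-data table above:
data = [(False, True), (True, False), (False, False),
        (True, True), (True, False), (False, True)]
print(em_sprinkler(data))
```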

Parameter learning

• What if the network structure is unknown?
– Structure learning algorithms exist, but they are pretty complicated…

Sample C S R W

1 T F T T

2 F T F T

3 T F F F

4 T T T T

5 F T F T

6 T F T F

… … … …. …

Training set
[Network figure: C, S, R, W with unknown structure (marked “?”)]


Summary: Bayesian networks

• Joint Distribution
• Independence
• Inference
• Parameters
• Learning


Naïve Bayes


Bayesian inference, Naïve Bayes model

http://xkcd.com/1236/


Bayes Rule

• The product rule gives us two ways to factor a joint probability:

• Therefore,

• Why is this useful?
– Key tool for probabilistic inference: can get diagnostic probability from causal probability
• E.g., P(Cavity | Toothache) from P(Toothache | Cavity)

Product rule: $P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$

Bayes’ rule: $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$

Rev. Thomas Bayes (1702–1761)


Bayes Rule example
• Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?


Bayes Rule example
• Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?

$P(\text{Rain} \mid \text{Predict}) \;=\; \dfrac{P(\text{Predict} \mid \text{Rain})\,P(\text{Rain})}{P(\text{Predict})} \;=\; \dfrac{P(\text{Predict} \mid \text{Rain})\,P(\text{Rain})}{P(\text{Predict} \mid \text{Rain})\,P(\text{Rain}) + P(\text{Predict} \mid \neg\text{Rain})\,P(\neg\text{Rain})}$

$=\; \dfrac{0.9 \times 0.014}{0.9 \times 0.014 + 0.1 \times 0.986} \;=\; \dfrac{0.0126}{0.0126 + 0.0986} \;\approx\; 0.111$


Probabilistic inference
• Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e

– Examples: X = {spam, not spam}, e = email message
  X = {zebra, giraffe, hippo}, e = image features


Bayesian decision theory
• Let x be the value predicted by the agent and x* be the true value of X.
• The agent has a loss function, which is 0 if x = x* and 1 otherwise
• Expected loss:

$\sum_{x^*} L(x, x^*)\, P(x^* \mid e) \;=\; \sum_{x^* \neq x} P(x^* \mid e) \;=\; 1 - P(x \mid e)$

• What is the estimate of X that minimizes the expected loss?
– The one that has the greatest posterior probability P(x|e)
– This is called the Maximum a Posteriori (MAP) decision


MAP decision
• Value x of X that has the highest posterior probability given the evidence E = e:

$x^* \;=\; \arg\max_x P(X = x \mid E = e) \;=\; \arg\max_x \dfrac{P(E = e \mid X = x)\,P(X = x)}{P(E = e)} \;=\; \arg\max_x P(E = e \mid X = x)\,P(X = x)$

$\underbrace{P(x \mid e)}_{\text{posterior}} \;\propto\; \underbrace{P(e \mid x)}_{\text{likelihood}}\; \underbrace{P(x)}_{\text{prior}}$

• Maximum likelihood (ML) decision:

$x^* \;=\; \arg\max_x P(e \mid x)$


Naïve Bayes model
• Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X

• MAP decision involves estimating

$P(X \mid E_1, \dots, E_n) \;\propto\; P(E_1, \dots, E_n \mid X)\,P(X)$

– If each feature Ei can take on k values, how many entries are in the (conditional) joint probability table P(E1, …, En | X = x)?


Naïve Bayes model
• Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X

• MAP decision involves estimating

$P(X \mid E_1, \dots, E_n) \;\propto\; P(E_1, \dots, E_n \mid X)\,P(X)$

• We can make the simplifying assumption that the different features are conditionally independent given the hypothesis:

$P(E_1, \dots, E_n \mid X) \;=\; \prod_{i=1}^{n} P(E_i \mid X)$

– If each feature can take on k values, what is the complexity of storing the resulting distributions?


Exchangeability

De Finetti Theorem of exchangeability (bag of words theorem): the joint probability distribution underlying the data is invariant to permutation.


Naïve Bayes model
• Posterior:

$P(X = x \mid E_1 = e_1, \dots, E_n = e_n) \;\propto\; P(X = x)\, P(E_1 = e_1, \dots, E_n = e_n \mid X = x) \;=\; P(X = x) \prod_{i=1}^{n} P(E_i = e_i \mid X = x)$

(posterior ∝ prior × likelihood)

• MAP decision:

$x^* \;=\; \arg\max_x P(x \mid e) \;=\; \arg\max_x P(x) \prod_{i=1}^{n} P(e_i \mid x)$


A Simple Example – Naïve Bayes

C – class, F – features

We only specify (parameters):
– the prior over class labels: P(C)
– how each feature depends on the class: P(F_i | C)


Slide from Dan Klein


Spam filter
• MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)


Spam filter
• MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)

• We have P(spam | message) ∝ P(message | spam)P(spam) and P(¬spam | message) ∝ P(message | ¬spam)P(¬spam)

• To enable classification, we need to be able to estimate the likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam)


Naïve Bayes Representation
• Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam)
• Likelihood: bag of words representation
– The message is a sequence of words (w1, …, wn)
– The order of the words in the message is not important
– Each word is conditionally independent of the others given message class (spam or not spam)


Naïve Bayes Representation
• Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam)
• Likelihood: bag of words representation
– The message is a sequence of words (w1, …, wn)
– The order of the words in the message is not important
– Each word is conditionally independent of the others given message class (spam or not spam)

$P(\text{message} \mid \text{spam}) \;=\; P(w_1, \dots, w_n \mid \text{spam}) \;=\; \prod_{i=1}^{n} P(w_i \mid \text{spam})$

– Thus, the problem is reduced to estimating marginal likelihoods of individual words P(wi | spam) and P(wi | ¬spam)


Summary: Decision rule

• General MAP rule for Naïve Bayes:

$\underbrace{P(x \mid e_1, \dots, e_n)}_{\text{posterior}} \;\propto\; \underbrace{P(x)}_{\text{prior}} \underbrace{\prod_{i=1}^{n} P(e_i \mid x)}_{\text{likelihood}}, \qquad x^* \;=\; \arg\max_x P(x) \prod_{i=1}^{n} P(e_i \mid x)$

• In particular, $P(\text{spam} \mid w_1, \dots, w_n) \;\propto\; P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})$

• Thus, the filter should classify the message as spam if

$P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}) \;>\; P(\neg\text{spam}) \prod_{i=1}^{n} P(w_i \mid \neg\text{spam})$
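A compact Python sketch of this classifier (my own illustration, with made-up toy data; it adds Laplace smoothing so unseen words don't zero out the product, and works in log space to avoid underflow, neither of which is specified on the slides).

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (word_list, is_spam). Returns class priors and a word-likelihood fn."""
    docs = Counter()
    words = {True: Counter(), False: Counter()}
    vocab = set()
    for ws, is_spam in messages:
        docs[is_spam] += 1
        words[is_spam].update(ws)
        vocab.update(ws)
    prior = {c: docs[c] / sum(docs.values()) for c in (True, False)}
    total = {c: sum(words[c].values()) for c in (True, False)}
    def likelihood(w, c):            # P(w | class) with add-one (Laplace) smoothing
        return (words[c][w] + 1) / (total[c] + len(vocab))
    return prior, likelihood

def classify(ws, prior, likelihood):
    """Spam (True) if log P(spam) + sum_i log P(w_i | spam) beats the ham score."""
    def score(c):
        return math.log(prior[c]) + sum(math.log(likelihood(w, c)) for w in ws)
    return score(True) > score(False)

# Toy usage (hypothetical data):
train_data = [("win cash now".split(), True),
              ("cheap pills win".split(), True),
              ("meeting agenda attached".split(), False),
              ("lunch tomorrow".split(), False)]
prior, likelihood = train(train_data)
print(classify("win cash tomorrow".split(), prior, likelihood))
```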


Slide from Dan Klein


Slide from Dan Klein


Percentage of documents in training set labeled as spam/ham

Slide from Dan Klein


In the documents labeled as spam, occurrence percentage of each word (e.g. # times “the” occurred/# total words).

Slide from Dan Klein


In the documents labeled as ham, occurrence percentage of each word (e.g. # times “the” occurred/# total words).

Slide from Dan Klein