Bayesian Networks
Tamara Berg
CS 560 Artificial Intelligence
Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Kilian Weinberger, Deva Ramanan
Announcements
• HW3 released online this afternoon, due Oct 29, 11:59pm
• Demo by Mark Zhao!
Bayesian decision making
• Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E
• Inference problem: given some evidence E = e, what is P(X | e)?
• Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1,e1), …, (xn,en)}
Probabilistic inference

A general scenario:
• Query variables: X
• Evidence (observed) variables: E = e
• Unobserved variables: Y

If we know the full joint distribution P(X, E, Y), how can we perform inference about X?

  P(X | E = e) = P(X, e) / P(e) = (1 / P(e)) Σ_y P(X, e, y)
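The identity above can be checked with a few lines of code. The joint distribution below is a made-up toy example over three binary variables (the numbers are illustrative, not from the slides):

```python
# Toy joint P(X, E, Y) over three binary variables, illustrating
# P(X | E=e) = sum_y P(X, e, y) / P(e).
from itertools import product

# P(X, E, Y) as a dict; the eight made-up probabilities sum to 1.
joint = {
    (x, e, y): p
    for (x, e, y), p in zip(
        product([True, False], repeat=3),
        [0.20, 0.10, 0.15, 0.05, 0.10, 0.15, 0.05, 0.20],
    )
}

def query(joint, e):
    """Return P(X | E=e) by summing out Y and normalizing."""
    unnorm = {}
    for (x, ev, y), p in joint.items():
        if ev == e:
            unnorm[x] = unnorm.get(x, 0.0) + p
    z = sum(unnorm.values())          # this is P(e)
    return {x: p / z for x, p in unnorm.items()}

print(query(joint, True))
```

The normalizing constant P(e) falls out of the same sum, so the query never needs P(e) separately.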
Inference
• Inference: calculating some useful quantity from a joint probability distribution
• Examples:
  – Posterior probability: P(Q | E1 = e1, …, Ek = ek)
  – Most likely explanation: argmax_q P(Q = q | e1, …, ek)

(Alarm network: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls)
Inference – computing conditional probabilities
Marginalization:
  P(x) = Σ_y P(x, y)

Conditional probabilities:
  P(x | e) = P(x, e) / P(e)
Inference by Enumeration
• Given unlimited time, inference in BNs is easy
• Recipe:
  – State the marginal probabilities you need
  – Figure out ALL the atomic probabilities you need
  – Calculate and combine them
• Example: P(+b | +j, +m)

(Alarm network: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls)
Example: Enumeration
• In this simple method, we only need the BN to synthesize the joint entries
Probabilistic inference

A general scenario:
• Query variables: X
• Evidence (observed) variables: E = e
• Unobserved variables: Y

If we know the full joint distribution P(X, E, Y), how can we perform inference about X?

  P(X | E = e) = P(X, e) / P(e) = (1 / P(e)) Σ_y P(X, e, y)

Problems:
• Full joint distributions are too large
• Marginalizing out Y may involve too many summation terms
Inference by Enumeration?
Variable Elimination
• Why is inference by enumeration on a Bayes Net inefficient?
  – You join up the whole joint distribution before you sum out the hidden variables
  – You end up repeating a lot of work!
• Idea: interleave joining and marginalizing!
  – Called “Variable Elimination”
  – Choosing the order to eliminate variables that minimizes work is NP-hard, but *anything* sensible is much faster than inference by enumeration
General Variable Elimination
• Query: P(Q | E1 = e1, …, Ek = ek)
• Start with initial factors:
  – Local CPTs (but instantiated by evidence)
• While there are still hidden variables (not Q or evidence):
  – Pick a hidden variable H
  – Join all factors mentioning H
  – Eliminate (sum out) H
• Join all remaining factors and normalize
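The join/eliminate loop above can be sketched with dictionary-based factors. Everything below (the factor representation, the two-node R → W network, and its numbers) is an illustrative assumption, not part of the slides:

```python
# Minimal variable-elimination sketch: a factor is a (scope, table) pair,
# where scope is a tuple of variable names and table maps assignments
# (tuples of truth values, aligned with scope) to probabilities.
from itertools import product

def join(f1, f2):
    """Pointwise product of two factors over the union of their scopes."""
    s1, t1 = f1
    s2, t2 = f2
    scope = s1 + tuple(v for v in s2 if v not in s1)
    table = {}
    for vals in product([True, False], repeat=len(scope)):
        a = dict(zip(scope, vals))
        table[vals] = t1[tuple(a[v] for v in s1)] * t2[tuple(a[v] for v in s2)]
    return scope, table

def sum_out(f, var):
    """Marginalize var out of factor f."""
    scope, table = f
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    new_table = {}
    for vals, p in table.items():
        key = vals[:i] + vals[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + p
    return new_scope, new_table

# Hypothetical two-node network R -> W: P(R) and P(W | R).
f_r = (("R",), {(True,): 0.2, (False,): 0.8})
f_w = (("W", "R"), {(True, True): 0.9, (False, True): 0.1,
                    (True, False): 0.1, (False, False): 0.9})

# Eliminate R: join all factors mentioning R, then sum R out -> P(W).
scope, table = sum_out(join(f_w, f_r), "R")
print(scope, table)  # P(W=true) = 0.9*0.2 + 0.1*0.8 = 0.26
```

A real implementation would also pick an elimination order; as the slide notes, any sensible order beats full enumeration.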
Example: Variable elimination
Query: What is the probability that a student attends class, given that they pass the exam?
[based on slides taken from UMBC CMSC 671, 2005]
(Network: attend, study → prepared; attend, prepared, fair → pass)

P(at) = 0.8   P(st) = 0.6   P(fa) = 0.9

P(pr | at, st):
  at st | P(pr)
  T  T  | 0.9
  T  F  | 0.5
  F  T  | 0.7
  F  F  | 0.1

P(pa | at, pre, fa):
  pr at fa | P(pa)
  T  T  T  | 0.9
  T  T  F  | 0.1
  T  F  T  | 0.7
  T  F  F  | 0.1
  F  T  T  | 0.7
  F  T  F  | 0.1
  F  F  T  | 0.2
  F  F  F  | 0.1
Join study factors

(P(at) = 0.8, P(st) = 0.6, P(fa) = 0.9)

Join P(pr | at, st) with P(st):

  prep study attend | Original P(pr|at,st) | P(st) | Joint P(pr,st|at) | Marginal P(pr|at)
  T    T     T      | 0.9                  | 0.6   | 0.54              | 0.74
  T    F     T      | 0.5                  | 0.4   | 0.20              |
  T    T     F      | 0.7                  | 0.6   | 0.42              | 0.46
  T    F     F      | 0.1                  | 0.4   | 0.04              |
  F    T     T      | 0.1                  | 0.6   | 0.06              | 0.26
  F    F     T      | 0.5                  | 0.4   | 0.20              |
  F    T     F      | 0.3                  | 0.6   | 0.18              | 0.54
  F    F     F      | 0.9                  | 0.4   | 0.36              |

(P(pa | at, pre, fa) table unchanged.)
Marginalize out study

(P(at) = 0.8, P(fa) = 0.9; nodes prepared and study have been joined)

  prep study attend | P(pr|at,st) | P(st) | Joint P(pr,st|at) | Marginal P(pr|at)
  T    T     T      | 0.9         | 0.6   | 0.54              | 0.74
  T    F     T      | 0.5         | 0.4   | 0.20              |
  T    T     F      | 0.7         | 0.6   | 0.42              | 0.46
  T    F     F      | 0.1         | 0.4   | 0.04              |
  F    T     T      | 0.1         | 0.6   | 0.06              | 0.26
  F    F     T      | 0.5         | 0.4   | 0.20              |
  F    T     F      | 0.3         | 0.6   | 0.18              | 0.54
  F    F     F      | 0.9         | 0.4   | 0.36              |

(P(pa | at, pre, fa) table unchanged.)
17
Remove “study”

(Network is now: attend → prepared; attend, prepared, fair → pass. P(at) = 0.8, P(fa) = 0.9)

P(pr | at):
  pr at | P
  T  T  | 0.74
  T  F  | 0.46
  F  T  | 0.26
  F  F  | 0.54

(P(pa | at, pre, fa) table unchanged.)
Join factors “fair”

(P(at) = 0.8, P(fa) = 0.9)

P(pr | at):
  prep attend | P
  T    T      | 0.74
  T    F      | 0.46
  F    T      | 0.26
  F    F      | 0.54

Join P(pa | at, pre, fa) with P(fa):

  pa pre attend fair | Original P(pa|at,pre,fa) | P(fair) | Joint P(pa,fa|at,pre) | Marginal P(pa|at,pre)
  t  T   T      T    | 0.9 | 0.9 | 0.81 | 0.82
  t  T   T      F    | 0.1 | 0.1 | 0.01 |
  t  T   F      T    | 0.7 | 0.9 | 0.63 | 0.64
  t  T   F      F    | 0.1 | 0.1 | 0.01 |
  t  F   T      T    | 0.7 | 0.9 | 0.63 | 0.64
  t  F   T      F    | 0.1 | 0.1 | 0.01 |
  t  F   F      T    | 0.2 | 0.9 | 0.18 | 0.19
  t  F   F      F    | 0.1 | 0.1 | 0.01 |
Marginalize out “fair”

(P(at) = 0.8; nodes pass and fair have been joined)

P(pr | at):
  prep attend | P
  T    T      | 0.74
  T    F      | 0.46
  F    T      | 0.26
  F    F      | 0.54

  pa pre attend fair | P(pa|at,pre,fa) | P(fair) | Joint P(pa,fa|at,pre) | Marginal P(pa|at,pre)
  T  T   T      T    | 0.9 | 0.9 | 0.81 | 0.82
  T  T   T      F    | 0.1 | 0.1 | 0.01 |
  T  T   F      T    | 0.7 | 0.9 | 0.63 | 0.64
  T  T   F      F    | 0.1 | 0.1 | 0.01 |
  T  F   T      T    | 0.7 | 0.9 | 0.63 | 0.64
  T  F   T      F    | 0.1 | 0.1 | 0.01 |
  T  F   F      T    | 0.2 | 0.9 | 0.18 | 0.19
  T  F   F      F    | 0.1 | 0.1 | 0.01 |
Marginalize out “fair”

(P(at) = 0.8)

P(pr | at):
  prep attend | P
  T    T      | 0.74
  T    F      | 0.46
  F    T      | 0.26
  F    F      | 0.54

P(pa | at, pre):
  pa pre attend | P
  t  T   T      | 0.82
  t  T   F      | 0.64
  t  F   T      | 0.64
  t  F   F      | 0.19
Join factors “prepared”

(P(at) = 0.8)

Join P(pa | at, pr) with P(pr | at):

  pa pre attend | P(pa|at,pr) | P(pr|at) | Joint P(pa,pr|at) | Marginal P(pa|at)
  t  T   T      | 0.82 | 0.74 | 0.6068 | 0.7732
  t  T   F      | 0.64 | 0.46 | 0.2944 | 0.3970
  t  F   T      | 0.64 | 0.26 | 0.1664 |
  t  F   F      | 0.19 | 0.54 | 0.1026 |
Join factors “prepared”

(Nodes pass and prepared are now joined. P(at) = 0.8)

  pa pre attend | P(pa|at,pr) | P(pr|at) | Joint P(pa,pr|at) | Marginal P(pa|at)
  t  T   T      | 0.82 | 0.74 | 0.6068 | 0.7732
  t  T   F      | 0.64 | 0.46 | 0.2944 | 0.3970
  t  F   T      | 0.64 | 0.26 | 0.1664 |
  t  F   F      | 0.19 | 0.54 | 0.1026 |
Marginalize out “prepared”

(P(at) = 0.8)

P(pa | at):
  pa attend | P
  t  T      | 0.7732
  t  F      | 0.3970
Join factors

(P(at) = 0.8)

Join P(pa | at) with P(at), then normalize:

  pa attend | P(pa|at) | P(at) | Joint P(pa,at) | Normalized P(at|pa)
  T  T      | 0.7732   | 0.8   | 0.61856        | 0.89
  T  F      | 0.3970   | 0.2   | 0.0794         | 0.11
Join factors

(Nodes attend and pass are now joined.)

  pa attend | P(pa|at) | P(at) | Joint P(pa,at) | Normalized P(at|pa)
  T  T      | 0.7732   | 0.8   | 0.61856        | 0.89
  T  F      | 0.3970   | 0.2   | 0.0794         | 0.11

Answer: P(attend | pass) ≈ 0.89.
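The elimination result can be sanity-checked by brute-force enumeration of the full joint, using the CPTs from the example:

```python
# Brute-force check of the variable-elimination example:
# sum the full joint over all assignments consistent with pass = True.
from itertools import product

P_AT, P_ST, P_FA = 0.8, 0.6, 0.9

# P(prepared | attend, study), indexed (at, st).
P_PR = {(True, True): 0.9, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}

# P(pass | prepared, attend, fair), indexed (pr, at, fa) as in the slide.
P_PA = {(True, True, True): 0.9,   (True, True, False): 0.1,
        (True, False, True): 0.7,  (True, False, False): 0.1,
        (False, True, True): 0.7,  (False, True, False): 0.1,
        (False, False, True): 0.2, (False, False, False): 0.1}

def prob(at, st, fa, pr, pa):
    """Full joint probability of one complete assignment."""
    p = (P_AT if at else 1 - P_AT) * (P_ST if st else 1 - P_ST) \
        * (P_FA if fa else 1 - P_FA)
    p *= P_PR[(at, st)] if pr else 1 - P_PR[(at, st)]
    p *= P_PA[(pr, at, fa)] if pa else 1 - P_PA[(pr, at, fa)]
    return p

num = sum(prob(True, st, fa, pr, True)
          for st, fa, pr in product([True, False], repeat=3))
den = sum(prob(at, st, fa, pr, True)
          for at, st, fa, pr in product([True, False], repeat=4))
print(round(num / den, 2))  # → 0.89, matching the elimination result
```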
Bayesian network inference: Big picture
• Exact inference is intractable
  – There exist techniques to speed up computations, but worst-case complexity is still exponential except in some classes of networks
• Approximate inference
  – Sampling, variational methods, message passing / belief propagation…
Approximate Inference: Sampling
Sampling – the basics

• Scrooge McDuck gives you an ancient coin.
• He wants to know: what is P(H)?
• You have no homework, and nothing good is on television – so you toss it 1 million times.
• You obtain 700,000 heads and 300,000 tails.
• What is P(H)?

P(H) ≈ 700,000 / 1,000,000 = 0.7
• Why? The empirical frequency of heads converges to the true P(H) as the number of tosses grows (the law of large numbers).
Approximate Inference
• Sampling is a hot topic in machine learning, and it’s really simple
• Basic idea:
  – Draw N samples from a distribution
  – Compute an approximate posterior probability
  – Show this converges to the true probability P
• Why sample?
  – Learning: get samples from a distribution you don’t know
  – Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)
Forward Sampling

(Network: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass)

P(C):
  +c 0.5
  -c 0.5

P(S | C):
  +c: +s 0.1, -s 0.9
  -c: +s 0.5, -s 0.5

P(R | C):
  +c: +r 0.8, -r 0.2
  -c: +r 0.2, -r 0.8

P(W | S, R):
  +s, +r: +w 0.99, -w 0.01
  +s, -r: +w 0.90, -w 0.10
  -s, +r: +w 0.90, -w 0.10
  -s, -r: +w 0.01, -w 0.99

Samples:
  +c, -s, +r, +w
  -c, +s, -r, +w
  …
Forward Sampling

• This process generates samples with probability
  S_PS(x1, …, xn) = Π_i P(xi | Parents(Xi))
  …i.e. the BN’s joint probability
• If we sample n events, then in the limit we converge to the true joint probabilities.
• Therefore, the sampling procedure is consistent
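A minimal forward sampler for the Cloudy/Sprinkler/Rain/WetGrass network, using the CPTs from the slide (the helper names are my own):

```python
# Forward sampling: sample each variable in topological order from its
# CPT given the already-sampled values of its parents.
import random

random.seed(0)

def bernoulli(p):
    return random.random() < p

def sample_once():
    c = bernoulli(0.5)                 # P(+c) = 0.5
    s = bernoulli(0.1 if c else 0.5)   # P(+s | c)
    r = bernoulli(0.8 if c else 0.2)   # P(+r | c)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    w = bernoulli(p_w)                 # P(+w | s, r)
    return c, s, r, w

# Estimate P(+w) by counting over many samples.
n = 100_000
count = sum(1 for _ in range(n) if sample_once()[3])
print(count / n)  # close to the true P(+w) = 0.65
```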
Example
• We’ll get a bunch of samples from the BN:
  +c, -s, +r, +w
  +c, +s, +r, +w
  -c, +s, +r, -w
  +c, -s, +r, +w
  -c, -s, -r, +w
• If we want to know P(W)
  – We have counts <+w:4, -w:1>
  – Normalize to get P(W) = <+w:0.8, -w:0.2>
  – This will get closer to the true distribution with more samples
  – Can estimate anything else, too
  – What about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)?
  – Fast: can use fewer samples if less time (what’s the drawback?)

(Network: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass)
Rejection Sampling

• Let’s say we want P(C)
  – No point keeping all samples around
  – Just tally counts of C as we go
• Let’s say we want P(C | +s)
  – Same thing: tally C outcomes, but ignore (reject) samples which don’t have S = +s
  – This is called rejection sampling
  – It is also consistent for conditional probabilities (i.e., correct in the limit)

Samples:
  +c, -s, +r, +w   (rejected: S ≠ +s)
  +c, +s, +r, +w
  -c, +s, +r, -w
  +c, -s, +r, +w   (rejected)
  -c, -s, -r, +w   (rejected)

(Network: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass)
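Rejection sampling for P(C | +s) on the same network can be sketched as follows (sampler structure and helper names are assumptions; the CPT numbers are from the forward-sampling slide):

```python
# Rejection sampling: draw forward samples, keep only those that match
# the evidence S = true, then tally the query variable C.
import random

random.seed(0)

def bernoulli(p):
    return random.random() < p

def sample_once():
    c = bernoulli(0.5)
    s = bernoulli(0.1 if c else 0.5)
    r = bernoulli(0.8 if c else 0.2)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    return c, s, r, bernoulli(p_w)

kept = 0
c_true = 0
for _ in range(200_000):
    c, s, r, w = sample_once()
    if s:                 # reject samples that don't match the evidence
        kept += 1
        c_true += c

print(c_true / kept)  # estimate of P(+c | +s); true value is 1/6
```

Note that only about 30% of samples survive here (P(+s) = 0.3); with rarer evidence the rejection rate is the method's downfall, which motivates likelihood weighting below.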
Likelihood Weighting

• Problem with rejection sampling:
  – If evidence is unlikely, you reject a lot of samples
  – You don’t exploit your evidence as you sample
  – Consider P(B | +a):
      -b, -a   -b, -a   -b, -a   -b, -a   +b, +a
• Idea: fix evidence variables and sample the rest:
      -b, +a   -b, +a   -b, +a   -b, +a   +b, +a
• Problem: sample distribution not consistent!
• Solution: weight each sample by the probability of the evidence given its parents

(Network: Burglary → Alarm)
Likelihood Weighting

• Sampling distribution if z is sampled and the evidence e is fixed:
  S_WE(z, e) = Π_i P(zi | Parents(Zi))
• Now, samples have weights:
  w(z, e) = Π_i P(ei | Parents(Ei))
• Together, the weighted sampling distribution is consistent:
  S_WE(z, e) · w(z, e) = Π_i P(zi | Parents(Zi)) · Π_i P(ei | Parents(Ei)) = P(z, e)
Likelihood Weighting

(Same network and CPTs as in forward sampling: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass)

P(C):        +c 0.5,  -c 0.5
P(S | C):    +c: +s 0.1, -s 0.9;   -c: +s 0.5, -s 0.5
P(R | C):    +c: +r 0.8, -r 0.2;   -c: +r 0.2, -r 0.8
P(W | S, R): +s,+r: +w 0.99, -w 0.01;  +s,-r: +w 0.90, -w 0.10;
             -s,+r: +w 0.90, -w 0.10;  -s,-r: +w 0.01, -w 0.99

Samples:
  +c, +s, +r, +w
  …

Inference:
• Sum over weights that match the query value.
• Divide by total sample weight.
• What is P(C | +w, +r)?
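A sketch of likelihood weighting for the query P(C | +r, +w): R and W are clamped to the evidence, and each sample is weighted by P(+r | c) · P(+w | s, r). Helper names are my own; CPT numbers are from the slides (analytically, the answer is ≈ 0.794):

```python
# Likelihood weighting: sample non-evidence variables normally, fix the
# evidence variables, and multiply in P(evidence value | parents).
import random

random.seed(0)

def bernoulli(p):
    return random.random() < p

P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}

def weighted_sample():
    w = 1.0
    c = bernoulli(0.5)                 # sample C normally
    s = bernoulli(0.1 if c else 0.5)   # sample S normally
    r = True                           # evidence R = +r (fixed, not sampled)
    w *= 0.8 if c else 0.2             #   weight by P(+r | c)
    # Evidence W = +w is fixed; weight by P(+w | s, r).
    w *= P_W[(s, r)]
    return c, w

num = 0.0
den = 0.0
for _ in range(200_000):
    c, w = weighted_sample()
    den += w
    if c:
        num += w

print(num / den)  # estimate of P(+c | +r, +w)
```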
Likelihood Weighting Example

(Evidence here is Sprinkler = 1 and WetGrass = 1; five weighted samples:)

  Cloudy Rainy Sprinkler WetGrass | Weight
  0      1     1         1        | 0.495
  0      0     1         1        | 0.45
  0      0     1         1        | 0.45
  0      0     1         1        | 0.45
  1      0     1         1        | 0.09
Likelihood Weighting

• Likelihood weighting is good
  – We have taken evidence into account as we generate the sample
  – E.g. here, W’s value will get picked based on the evidence values of S, R
  – More of our samples will reflect the state of the world suggested by the evidence
• Likelihood weighting doesn’t solve all our problems
  – Evidence influences the choice of downstream variables, but not upstream ones (C isn’t more likely to get a value matching the evidence)
• We would like to consider evidence when we sample every variable

(Network: Cloudy → Sprinkler, Rain; Sprinkler, Rain → WetGrass)
Summary

• Sampling can be really effective
  – The dominating approach to inference in BNs
• Approaches:
  – Forward (/Prior) Sampling
  – Rejection Sampling
  – Likelihood Weighted Sampling
Bayesian decision making
• Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E
• Inference problem: given some evidence E = e, what is P(X | e)?
• Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1,e1), …, (xn,en)}
Learning in Bayes Nets
• Task 1: Given the network structure and given data (observed settings for the variables) learn the CPTs for the Bayes Net.
• Task 2: Given only the data (and possibly a prior over Bayes Nets), learn the entire Bayes Net (both Net structure and CPTs).
Parameter learning

• Suppose we know the network structure (but not the parameters), and have a training set of complete observations
Training set:

  Sample | C S R W
  1      | T F T T
  2      | F T F T
  3      | T F F F
  4      | T T T T
  5      | F T F T
  6      | T F T F
  …      | … … … …

(Network diagram: CPT entries unknown, marked “?”)
Parameter learning

• Suppose we know the network structure (but not the parameters), and have a training set of complete observations
  – P(X | Parents(X)) is given by the observed frequencies of the different values of X for each combination of parent values
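The frequency-counting rule can be sketched as follows, here estimating P(S | C) from the six complete samples in the training-set slide (the helper function is a hypothetical name):

```python
# Maximum-likelihood CPT estimation from complete data:
# P(child | parent) = count(parent, child) / count(parent).
from collections import Counter

# Columns: (C, S, R, W), matching the training-set slide.
data = [
    ("T", "F", "T", "T"),
    ("F", "T", "F", "T"),
    ("T", "F", "F", "F"),
    ("T", "T", "T", "T"),
    ("F", "T", "F", "T"),
    ("T", "F", "T", "F"),
]

def estimate_cpt(data, child_idx, parent_idx):
    """Estimate P(child | parent) by counting observed frequencies."""
    pair = Counter((row[parent_idx], row[child_idx]) for row in data)
    parent = Counter(row[parent_idx] for row in data)
    return {(p, c): n / parent[p] for (p, c), n in pair.items()}

cpt = estimate_cpt(data, child_idx=1, parent_idx=0)  # P(S | C)
print(cpt)  # e.g. P(S=T | C=T) = 1/4 = 0.25 on this data
```

With this few samples the estimates are crude (P(S=T | C=F) comes out as exactly 1.0); in practice one would smooth the counts.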
Parameter learning

• Incomplete observations
• Expectation maximization (EM) algorithm for dealing with missing data

Training set (C unobserved):

  Sample | C S R W
  1      | ? F T T
  2      | ? T F T
  3      | ? F F F
  4      | ? T T T
  5      | ? T F T
  6      | ? F T F
  …      | … … … …

(Network diagram: CPT entries unknown, marked “?”)
Learning from incomplete data

• Observe real-world data
  – Some variables are unobserved
  – Trick: fill them in
• EM algorithm
  – E = Expectation
  – M = Maximization

  Sample | C S R W
  1      | ? F T T
  2      | ? T F T
  3      | ? F F F
  4      | ? T T T
  5      | ? T F T
  6      | ? F T F
  …      | … … … …
EM-Algorithm

• Initialize CPTs (e.g. randomly, uniformly)
• Repeat until convergence
  – E-Step:
    • Compute P(hidden variable = x | observed values in sample i) for all samples i and values x (inference)
  – M-Step:
    • Update the CPT tables using the expected counts from the E-step
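A minimal EM sketch for a reduced version of this setup: a two-node network C → S where C is hidden and only S is observed. The data and initial parameters are made up for illustration:

```python
# EM for a two-node network C -> S with C hidden.
# Parameters: p_c = P(C=T) and p_s[c] = P(S=T | C=c).

observed_s = [True, True, False, True, True, False, False, True]

# Initialize CPTs (asymmetric, so the two components can differ).
p_c = 0.6
p_s = {True: 0.7, False: 0.3}

for _ in range(50):
    # E-step: posterior P(C=T | s) for each sample, under current CPTs.
    posteriors = []
    for s in observed_s:
        lik_t = p_c * (p_s[True] if s else 1 - p_s[True])
        lik_f = (1 - p_c) * (p_s[False] if s else 1 - p_s[False])
        posteriors.append(lik_t / (lik_t + lik_f))

    # M-step: re-estimate parameters from expected counts.
    n = len(observed_s)
    exp_t = sum(posteriors)          # expected number of C=T samples
    exp_f = n - exp_t
    p_c = exp_t / n
    p_s = {
        True: sum(q for q, s in zip(posteriors, observed_s) if s) / exp_t,
        False: sum(1 - q for q, s in zip(posteriors, observed_s) if s) / exp_f,
    }

print(p_c, p_s)
```

After the first M-step the model's marginal P(S=T) = p_c·p_s[T] + (1−p_c)·p_s[F] matches the empirical frequency 5/8 exactly; the hidden-variable split itself is not identifiable from S alone, which is typical of EM on under-constrained models.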
Parameter learning

• What if the network structure is unknown?
  – Structure learning algorithms exist, but they are pretty complicated…

Training set:

  Sample | C S R W
  1      | T F T T
  2      | F T F T
  3      | T F F F
  4      | T T T T
  5      | F T F T
  6      | T F T F
  …      | … … … …

(Network: C → S, R; S, R → W — the structure itself is the unknown, marked “?”)
Summary: Bayesian networks
• Joint distribution
• Independence
• Inference
• Parameters
• Learning
Naïve Bayes
Bayes Rule
• The product rule gives us two ways to factor a joint probability:
  P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• Therefore,
  P(A | B) = P(B | A) P(A) / P(B)
• Why is this useful?
  – Key tool for probabilistic inference: can get diagnostic probability from causal probability
    • E.g., P(Cavity | Toothache) from P(Toothache | Cavity)

Rev. Thomas Bayes (1702-1761)
Bayes Rule example

• Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?

P(Rain | Predict) = P(Predict | Rain) P(Rain) / P(Predict)

= P(Predict | Rain) P(Rain) / [P(Predict | Rain) P(Rain) + P(Predict | ¬Rain) P(¬Rain)]

= (0.9 × 0.014) / (0.9 × 0.014 + 0.1 × 0.986)

= 0.0126 / (0.0126 + 0.0986) ≈ 0.111
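The same calculation in code:

```python
# Wedding-rain example: Bayes rule with the denominator expanded by the
# law of total probability.
p_rain = 5 / 365            # prior P(Rain) ≈ 0.014
p_pred_given_rain = 0.9     # P(Predict | Rain)
p_pred_given_dry = 0.1      # P(Predict | ¬Rain)

p_predict = (p_pred_given_rain * p_rain
             + p_pred_given_dry * (1 - p_rain))
p_rain_given_pred = p_pred_given_rain * p_rain / p_predict

print(round(p_rain_given_pred, 3))  # → 0.111
```

Working with the exact fractions, the answer is (0.9·5) / (0.9·5 + 0.1·360) = 4.5/40.5 = 1/9: despite the forecast, rain is still unlikely because the prior is so small.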
Probabilistic inference

• Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e
  – Examples: X = {spam, not spam}, e = email message
              X = {zebra, giraffe, hippo}, e = image features
Bayesian decision theory

• Let x be the value predicted by the agent and x* be the true value of X.
• The agent has a loss function, which is 0 if x = x* and 1 otherwise
• Expected loss:
  Σ_{x*} L(x, x*) P(x* | e) = Σ_{x* ≠ x} P(x* | e) = 1 − P(x | e)
• What is the estimate of X that minimizes the expected loss?
  – The one that has the greatest posterior probability P(x | e)
  – This is called the Maximum a Posteriori (MAP) decision
MAP decision

• Value x of X that has the highest posterior probability given the evidence E = e:

  x* = argmax_x P(X = x | E = e)
     = argmax_x P(E = e | X = x) P(X = x) / P(E = e)
     = argmax_x P(E = e | X = x) P(X = x)

  posterior ∝ likelihood × prior:  P(x | e) ∝ P(e | x) P(x)

• Maximum likelihood (ML) decision:

  x* = argmax_x P(e | x)
Naïve Bayes model

• Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X
• MAP decision involves estimating
  P(X | E1, …, En) ∝ P(E1, …, En | X) P(X)
  – If each feature Ei can take on k values, how many entries are in the (conditional) joint probability table P(E1, …, En | X = x)?
Naïve Bayes model

• Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X
• MAP decision involves estimating
  P(X | E1, …, En) ∝ P(E1, …, En | X) P(X)
• We can make the simplifying assumption that the different features are conditionally independent given the hypothesis:
  P(E1, …, En | X) = Π_{i=1}^{n} P(Ei | X)
  – If each feature can take on k values, what is the complexity of storing the resulting distributions?
Exchangeability
De Finetti's theorem of exchangeability (the "bag of words" theorem): the joint probability distribution underlying the data is invariant to permutation of the observations.
Naïve Bayes model

• Posterior:

  P(X = x | E1 = e1, …, En = en)
    ∝ P(X = x) P(E1 = e1, …, En = en | X = x)
    = P(X = x) Π_{i=1}^{n} P(Ei = ei | X = x)

  (posterior ∝ prior × likelihood)

• MAP decision:

  x* = argmax_x P(x | e) = argmax_x P(x) Π_{i=1}^{n} P(ei | x)
A Simple Example – Naïve Bayes

C – class, F – features

We only specify (parameters):
  P(C)      – prior over class labels
  P(F | C)  – how each feature depends on the class

Slide from Dan Klein
Spam filter

• MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)
• We have P(spam | message) ∝ P(message | spam) P(spam) and P(¬spam | message) ∝ P(message | ¬spam) P(¬spam)
• To enable classification, we need to be able to estimate the likelihoods P(message | spam) and P(message | ¬spam) and the priors P(spam) and P(¬spam)
Naïve Bayes Representation

• Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam)
• Likelihood: bag of words representation
  – The message is a sequence of words (w1, …, wn)
  – The order of the words in the message is not important
  – Each word is conditionally independent of the others given the message class (spam or not spam)
  – Thus, the problem is reduced to estimating marginal likelihoods of individual words P(wi | spam) and P(wi | ¬spam):

  P(message | spam) = P(w1, …, wn | spam) = Π_{i=1}^{n} P(wi | spam)
Summary: Decision rule

• General MAP rule for Naïve Bayes:

  x* = argmax_x P(x | e) = argmax_x P(x) Π_{i=1}^{n} P(ei | x)

  (posterior ∝ prior × likelihood)

• For the spam filter:

  P(spam | w1, …, wn) ∝ P(spam) Π_{i=1}^{n} P(wi | spam)

• Thus, the filter should classify the message as spam if

  P(spam) Π_{i=1}^{n} P(wi | spam) > P(¬spam) Π_{i=1}^{n} P(wi | ¬spam)
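The decision rule can be sketched as a toy spam filter. All word probabilities, priors, and the vocabulary below are made-up illustrative numbers; real implementations compare log-probabilities, as here, to avoid numerical underflow on long messages:

```python
# Toy bag-of-words Naïve Bayes decision: classify as spam if
# log P(spam) + sum log P(w|spam) exceeds the same score for ham.
import math

p_spam = 0.4
p_ham = 0.6

# P(word | class) for a tiny hypothetical vocabulary.
p_word_spam = {"free": 0.05, "money": 0.04, "meeting": 0.001, "the": 0.06}
p_word_ham  = {"free": 0.002, "money": 0.003, "meeting": 0.02, "the": 0.06}

def classify(words):
    """MAP decision between 'spam' and 'ham' for a bag of words."""
    spam_score = math.log(p_spam) + sum(math.log(p_word_spam[w]) for w in words)
    ham_score = math.log(p_ham) + sum(math.log(p_word_ham[w]) for w in words)
    return "spam" if spam_score > ham_score else "ham"

print(classify(["free", "money"]))   # → spam
print(classify(["the", "meeting"]))  # → ham
```

In practice the per-word probabilities come from the labeled training counts described on the following slides, usually with smoothing so no word has probability zero.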
Parameter estimation (slides from Dan Klein):

• Percentage of documents in the training set labeled as spam/ham.
• In the documents labeled as spam, occurrence percentage of each word (e.g. # times “the” occurred / # total words).
• In the documents labeled as ham, occurrence percentage of each word (e.g. # times “the” occurred / # total words).