
Bayesian Networks

Tamara Berg

CS 560 Artificial Intelligence

Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Killian Weinberger, Deva Ramanan

Announcements

• HW3 released online this afternoon, due Oct 29, 11:59pm

• Demo by Mark Zhao!

Bayesian decision making

• Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E

• Inference problem: given some evidence E = e, what is P(X | e)?

• Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1,e1), …, (xn,en)}

Probabilistic inference

A general scenario:
• Query variables: X
• Evidence (observed) variables: E = e
• Unobserved variables: Y

If we know the full joint distribution P(X, E, Y), how can we perform inference about X?

$$P(X \mid E = e) \;=\; \frac{P(X, e)}{P(e)} \;=\; \frac{\sum_{y} P(X, e, y)}{P(e)}$$

Inference

• Inference: calculating some useful quantity from a joint probability distribution

• Examples:
  – Posterior probability: P(Q | E_1 = e_1, …, E_k = e_k)
  – Most likely explanation: argmax_q P(Q = q | e_1, …, e_k)

[Running example: the burglary network. B (Burglary) and E (Earthquake) are the parents of A (Alarm); A is the parent of J (JohnCalls) and M (MaryCalls).]

Inference – computing conditional probabilities

• Marginalization: P(X) = Σ_y P(X, y)
• Conditional probabilities: P(X | e) = P(X, e) / P(e)

Inference by Enumeration

• Given unlimited time, inference in BNs is easy
• Recipe:
  – State the marginal probabilities you need
  – Figure out ALL the atomic probabilities you need
  – Calculate and combine them
• Example (burglary network): P(+b | +j, +m) = P(+b, +j, +m) / P(+j, +m), where each term is a sum over the hidden variables E and A


Example: Enumeration

• In this simple method, we only need the BN to synthesize the joint entries
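For concreteness, here is a minimal Python sketch of enumeration on the burglary network; the CPT numbers are the usual textbook values (an assumption, they are not given on these slides):

```python
# Minimal sketch: inference by enumeration on the burglary network.
# CPT values below are the standard textbook numbers (assumed, not from the slides).
from itertools import product

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(+a | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(+j | A)
P_M = {True: 0.70, False: 0.01}                       # P(+m | A)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as a product of the local CPTs."""
    p = P_B[b] * P_E[e]
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

def query_B_given_jm():
    """P(B | +j, +m): sum all joint entries consistent with the evidence, then normalize."""
    unnormalized = {}
    for b in (True, False):
        unnormalized[b] = sum(joint(b, e, a, True, True)
                              for e, a in product((True, False), repeat=2))
    z = sum(unnormalized.values())
    return {b: p / z for b, p in unnormalized.items()}

print(query_B_given_jm())   # P(+b | +j, +m) is roughly 0.284 with these CPTs
```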


Probabilistic inference

A general scenario:
• Query variables: X
• Evidence (observed) variables: E = e
• Unobserved variables: Y

If we know the full joint distribution P(X, E, Y), how can we perform inference about X?

Problems:
• Full joint distributions are too large
• Marginalizing out Y may involve too many summation terms

$$P(X \mid E = e) \;=\; \frac{P(X, e)}{P(e)} \;=\; \frac{\sum_{y} P(X, e, y)}{P(e)}$$

Inference by Enumeration?


Variable Elimination

• Why is inference by enumeration on a Bayes Net inefficient?
  – You join up the whole joint distribution before you sum out the hidden variables
  – You end up repeating a lot of work!
• Idea: interleave joining and marginalizing!
  – Called "Variable Elimination"
  – Choosing the order to eliminate variables that minimizes work is NP-hard, but *anything* sensible is much faster than inference by enumeration

General Variable Elimination

• Query: P(Q | E_1 = e_1, …, E_k = e_k)

• Start with initial factors:
  – Local CPTs (but instantiated by evidence)
• While there are still hidden variables (not Q or evidence):
  – Pick a hidden variable H
  – Join all factors mentioning H
  – Eliminate (sum out) H
• Join all remaining factors and normalize
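The recipe above maps directly onto code. Below is a minimal, generic sketch; the factor representation and the helper names (pointwise_product, sum_out, eliminate) are illustrative choices, not from the slides:

```python
# Generic variable-elimination sketch: factors are {assignment_tuple: value} dicts
# over an explicit variable list; domains maps each variable name to its values.
from itertools import product

class Factor:
    def __init__(self, variables, table):
        self.variables = list(variables)    # e.g. ["A", "B"]
        self.table = dict(table)            # {(a_val, b_val): probability}

def pointwise_product(f, g, domains):
    """Join two factors: multiply entries that agree on shared variables."""
    variables = f.variables + [v for v in g.variables if v not in f.variables]
    table = {}
    for assignment in product(*[domains[v] for v in variables]):
        env = dict(zip(variables, assignment))
        fv = f.table[tuple(env[v] for v in f.variables)]
        gv = g.table[tuple(env[v] for v in g.variables)]
        table[assignment] = fv * gv
    return Factor(variables, table)

def sum_out(var, f):
    """Eliminate var by summing the factor's entries over its values."""
    keep = [v for v in f.variables if v != var]
    table = {}
    for assignment, value in f.table.items():
        env = dict(zip(f.variables, assignment))
        key = tuple(env[v] for v in keep)
        table[key] = table.get(key, 0.0) + value
    return Factor(keep, table)

def eliminate(factors, hidden_vars, domains):
    """For each hidden variable: join all factors mentioning it, then sum it out."""
    for h in hidden_vars:
        mentioning = [f for f in factors if h in f.variables]
        rest = [f for f in factors if h not in f.variables]
        joined = mentioning[0]
        for f in mentioning[1:]:
            joined = pointwise_product(joined, f, domains)
        factors = rest + [sum_out(h, joined)]
    return factors   # join the survivors and normalize to answer the query
```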


Example: Variable elimination

Query: What is the probability that a student attends class, given that they pass the exam?

[based on slides taken from UMBC CMSC 671, 2005]

P(pr | at, st)     at   st
      0.9           T    T
      0.5           T    F
      0.7           F    T
      0.1           F    F

[Network: attend and study are the parents of prepared; prepared, attend, and fair are the parents of pass.
 Priors: P(at) = 0.8, P(st) = 0.6, P(fa) = 0.9]

P(pa | at, pre, fa)    pr   at   fa
      0.9               T    T    T
      0.1               T    T    F
      0.7               T    F    T
      0.1               T    F    F
      0.7               F    T    T
      0.1               F    T    F
      0.2               F    F    T
      0.1               F    F    F

Join study factors

[Factor graph: attend, study → prepared; prepared, attend, fair → pass.  P(at) = 0.8, P(st) = 0.6, P(fa) = 0.9]

                          Original              Joint         Marginal
prep  study  attend    P(pr|at,st)  P(st)    P(pr,st|at)    P(pr|at)
  T     T      T           0.9       0.6        0.54          0.74
  T     F      T           0.5       0.4        0.20
  T     T      F           0.7       0.6        0.42          0.46
  T     F      F           0.1       0.4        0.04
  F     T      T           0.1       0.6        0.06          0.26
  F     F      T           0.5       0.4        0.20
  F     T      F           0.3       0.6        0.18          0.54
  F     F      F           0.9       0.4        0.36

(The CPT P(pa | at, pre, fa) is unchanged; see the previous slide.)

Marginalize out study

[Factor graph: the joined factor over (prepared, study) replaces the separate study node.  P(at) = 0.8, P(fa) = 0.9]

(Same table as above: summing the Joint column over study gives the Marginal column, P(pr | at).)

(The CPT P(pa | at, pre, fa) is unchanged.)

Remove “study”

[Factor graph: attend → prepared; prepared, attend, fair → pass.  P(at) = 0.8, P(fa) = 0.9]

P(pr | at)     pr   at
    0.74        T    T
    0.46        T    F
    0.26        F    T
    0.54        F    F

(The CPT P(pa | at, pre, fa) is unchanged.)

Join factors “fair”

[Factor graph: attend → prepared; prepared, attend, fair → pass.  P(at) = 0.8, P(fa) = 0.9]

(The factor P(pr | at) is unchanged; see above.)

                               Original                 Joint             Marginal
pa   pre  attend  fair    P(pa|at,pre,fa)   P(fa)   P(pa,fa|at,pre)   P(pa|at,pre)
 T    T     T      T           0.9           0.9         0.81            0.82
 T    T     T      F           0.1           0.1         0.01
 T    T     F      T           0.7           0.9         0.63            0.64
 T    T     F      F           0.1           0.1         0.01
 T    F     T      T           0.7           0.9         0.63            0.64
 T    F     T      F           0.1           0.1         0.01
 T    F     F      T           0.2           0.9         0.18            0.19
 T    F     F      F           0.1           0.1         0.01

Marginalize out “fair”

(Same table as above: summing the Joint column over fair gives the Marginal column, P(pa | at, pre).  The factor P(pr | at) is unchanged.)

Marginalize out “fair”

[Factor graph: attend → prepared; prepared, attend → pass.  P(at) = 0.8]

P(pr | at)       prep   attend
    0.74           T      T
    0.46           T      F
    0.26           F      T
    0.54           F      F

P(pa | at, pre)    pa   pre   attend
    0.82            T    T      T
    0.64            T    T      F
    0.64            T    F      T
    0.19            T    F      F

Join factors “prepared”

[Factor graph: attend → prepared → pass.  P(at) = 0.8]

                     Original                Joint         Marginal
pa   pre  attend   P(pa|at,pr)  P(pr|at)  P(pa,pr|at)     P(pa|at)
 T    T     T         0.82        0.74       0.6068        0.7732
 T    T     F         0.64        0.46       0.2944        0.3970
 T    F     T         0.64        0.26       0.1664
 T    F     F         0.19        0.54       0.1026

Join factors “prepared”

(Summing the Joint column over prepared gives the Marginal column, P(pa | at).)

Join factors “prepared”

[Factor graph: attend → pass.  P(at) = 0.8]

P(pa | at)     pa   attend
   0.7732       T     T
   0.3970       T     F

Join factors

[Remaining factors: P(at) and P(pa | at)]

                Original              Joint       Normalized
pa   attend   P(pa|at)   P(at)      P(pa,at)      P(at|pa)
 T     T       0.7732     0.8        0.61856        0.89
 T     F       0.3970     0.2        0.07940        0.11

Join factors

Answer: P(at = T | pa = T) = 0.61856 / (0.61856 + 0.0794) ≈ 0.89, so a student who passes the exam attended class with probability about 0.89.
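As a sanity check on the elimination above, here is an illustrative brute-force enumeration sketch; the CPT values are the ones from the slides, the code itself is not:

```python
# Brute-force check of the worked example: P(attend | pass = True).
from itertools import product

P_at, P_st, P_fa = 0.8, 0.6, 0.9
P_pr = {(True, True): 0.9, (True, False): 0.5,       # P(pr=T | at, st)
        (False, True): 0.7, (False, False): 0.1}
P_pa = {(True, True, True): 0.9, (True, True, False): 0.1,    # P(pa=T | pr, at, fa)
        (True, False, True): 0.7, (True, False, False): 0.1,
        (False, True, True): 0.7, (False, True, False): 0.1,
        (False, False, True): 0.2, (False, False, False): 0.1}

def bern(p, v):                       # probability of Boolean value v under Bernoulli(p)
    return p if v else 1 - p

unnorm = {True: 0.0, False: 0.0}
for at, st, fa, pr in product((True, False), repeat=4):
    p = bern(P_at, at) * bern(P_st, st) * bern(P_fa, fa)
    p *= bern(P_pr[(at, st)], pr)
    p *= P_pa[(pr, at, fa)]           # evidence: pass = True
    unnorm[at] += p

z = unnorm[True] + unnorm[False]
print(round(unnorm[True] / z, 2))     # roughly 0.89, matching variable elimination
```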

Bayesian network inference: Big picture

• Exact inference is intractable
  – There exist techniques to speed up computations, but worst-case complexity is still exponential except in some classes of networks
• Approximate inference
  – Sampling, variational methods, message passing / belief propagation…

Approximate Inference

Sampling


Sampling – the basics ...

• Scrooge McDuck gives you an ancient coin.

• He wants to know what is P(H)

• You have no homework, and nothing good is on television – so you toss it 1 Million times.

• You obtain 700000x Heads, and 300000x Tails.

• What is P(H)?


Sampling – the basics ...

• P(H) = 0.7
• Why? The relative frequency 700,000 / 1,000,000 is the maximum-likelihood estimate of P(H).

Approximate Inference

• Sampling is a hot topic in machine learning, and it's really simple
• Basic idea:
  – Draw N samples from a distribution
  – Compute an approximate posterior probability
  – Show this converges to the true probability P
• Why sample?
  – Learning: get samples from a distribution you don't know
  – Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)


Forward Sampling

[Network: Cloudy → Sprinkler, Cloudy → Rain, (Sprinkler, Rain) → WetGrass]

P(C):           +c 0.5    -c 0.5

P(S | C):   +c: +s 0.1    -s 0.9
            -c: +s 0.5    -s 0.5

P(R | C):   +c: +r 0.8    -r 0.2
            -c: +r 0.2    -r 0.8

P(W | S, R):   +s, +r: +w 0.99   -w 0.01
               +s, -r: +w 0.90   -w 0.10
               -s, +r: +w 0.90   -w 0.10
               -s, -r: +w 0.01   -w 0.99

Samples:
  +c, -s, +r, +w
  -c, +s, -r, +w

Forward Sampling

• This process generates samples with probability S_PS(x_1, …, x_n) = ∏_i P(x_i | Parents(X_i)), i.e. the BN's joint probability

• If we sample n events, then in the limit we would converge to the true joint probabilities.

• Therefore, the sampling procedure is consistent
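A minimal Python sketch of prior (forward) sampling for the Cloudy / Sprinkler / Rain / WetGrass network above (the variable encoding and the function name sample_bn are illustrative):

```python
import random

# Forward sampling: sample each variable in topological order from its CPT.
# CPT values are the ones shown on the slide.
def sample_bn():
    c = random.random() < 0.5
    s = random.random() < (0.1 if c else 0.5)             # P(+s | C)
    r = random.random() < (0.8 if c else 0.2)             # P(+r | C)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    w = random.random() < p_w                             # P(+w | S, R)
    return {"C": c, "S": s, "R": r, "W": w}

samples = [sample_bn() for _ in range(10000)]
```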


Example

• We'll get a bunch of samples from the BN:
  +c, -s, +r, +w
  +c, +s, +r, +w
  -c, +s, +r, -w
  +c, -s, +r, +w
  -c, -s, -r, +w
• If we want to know P(W)
  – We have counts <+w:4, -w:1>
  – Normalize to get P(W) = <+w:0.8, -w:0.2>
  – This will get closer to the true distribution with more samples
  – Can estimate anything else, too (see the counting sketch below)
  – What about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)?
  – Fast: can use fewer samples if less time (what's the drawback?)

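Continuing the forward-sampling sketch above, queries are answered by counting (illustrative; assumes the samples list produced by sample_bn()):

```python
# Estimate P(+w) and P(+c | +w) by counting forward samples.
n_w = sum(s["W"] for s in samples)
print("P(+w)      ~", n_w / len(samples))

matching = [s for s in samples if s["W"]]        # keep samples consistent with +w
print("P(+c | +w) ~", sum(s["C"] for s in matching) / len(matching))
```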

Rejection Sampling

• Let's say we want P(C)
  – No point keeping all samples around
  – Just tally counts of C as we go
• Let's say we want P(C | +s)
  – Same thing: tally C outcomes, but ignore (reject) samples which don't have S = +s
  – This is called rejection sampling (see the sketch after the sample list below)
  – It is also consistent for conditional probabilities (i.e., correct in the limit)

+c, -s, +r, +w

+c, +s, +r, +w

-c, +s, +r, -w

+c, -s, +r, +w

-c, -s, -r, +w

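A minimal rejection-sampling sketch for P(C | +s), reusing the illustrative sample_bn() helper from the forward-sampling snippet above:

```python
# Rejection sampling: draw forward samples, keep only those matching the evidence.
def rejection_sample(query, evidence, n=100000):
    counts = {True: 0, False: 0}
    for _ in range(n):
        s = sample_bn()
        if all(s[var] == val for var, val in evidence.items()):
            counts[s[query]] += 1            # tally only evidence-consistent samples
    total = counts[True] + counts[False]
    return {v: c / total for v, c in counts.items()}

print(rejection_sample("C", {"S": True}))    # estimate of P(C | +s)
```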

Likelihood Weighting

• Problem with rejection sampling:
  – If evidence is unlikely, you reject a lot of samples
  – You don't exploit your evidence as you sample
  – Consider P(B | +a)
• Idea: fix evidence variables and sample the rest
• Problem: sample distribution not consistent!
• Solution: weight by probability of evidence given parents

[Burglary → Alarm]
  Rejection sampling for P(B | +a) wastes most samples:      -b,-a   -b,-a   -b,-a   -b,-a   +b,+a
  With the evidence A = +a fixed, every sample is usable:     -b,+a   -b,+a   -b,+a   -b,+a   +b,+a

Likelihood Weighting

• Sampling distribution if z sampled and e fixed evidence:

  $$S_{WS}(z, e) = \prod_{i} P(z_i \mid \text{Parents}(Z_i))$$

• Now, samples have weights:

  $$w(z, e) = \prod_{j} P(e_j \mid \text{Parents}(E_j))$$

• Together, the weighted sampling distribution is consistent:

  $$S_{WS}(z, e) \cdot w(z, e) = \prod_{i} P(z_i \mid \text{Parents}(Z_i)) \prod_{j} P(e_j \mid \text{Parents}(E_j)) = P(z, e)$$


Likelihood Weighting


(CPTs for the Cloudy / Sprinkler / Rain / WetGrass network as on the forward-sampling slide above.)

Samples:

+c, +s, +r, +w…


Inference: sum the weights of samples that match the query value, then divide by the total sample weight. What is P(C | +w, +r)?

Likelihood Weighting Example

Cloudy   Rainy   Sprinkler   Wet Grass   Weight
  0        1        1           1         0.495
  0        0        1           1         0.45
  0        0        1           1         0.45
  0        0        1           1         0.45
  1        0        1           1         0.09
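A minimal likelihood-weighting sketch for the network above, with evidence S = +s and W = +w to match the example table (the code layout and function names are illustrative):

```python
import random

# Likelihood weighting: evidence variables are fixed to their observed values and
# contribute a weight P(e | parents); all other variables are sampled as usual.
def weighted_sample(evidence):
    sample, weight = {}, 1.0
    sample["C"] = random.random() < 0.5                   # Cloudy has no parents
    p_s = 0.1 if sample["C"] else 0.5                     # P(+s | C)
    if "S" in evidence:
        sample["S"] = evidence["S"]
        weight *= p_s if evidence["S"] else 1 - p_s
    else:
        sample["S"] = random.random() < p_s
    sample["R"] = random.random() < (0.8 if sample["C"] else 0.2)   # P(+r | C)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(sample["S"], sample["R"])]
    if "W" in evidence:
        sample["W"] = evidence["W"]
        weight *= p_w if evidence["W"] else 1 - p_w
    else:
        sample["W"] = random.random() < p_w
    return sample, weight

def lw_query(query, evidence, n=100000):
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        s, w = weighted_sample(evidence)
        totals[s[query]] += w                             # sum weights per query value
    z = totals[True] + totals[False]
    return {v: t / z for v, t in totals.items()}

print(lw_query("C", {"S": True, "W": True}))              # estimate of P(C | +s, +w)
```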

Likelihood Weighting
• Likelihood weighting is good
  – We have taken evidence into account as we generate the sample
  – E.g. here, W's value will get picked based on the evidence values of S, R
  – More of our samples will reflect the state of the world suggested by the evidence
• Likelihood weighting doesn't solve all our problems
  – Evidence influences the choice of downstream variables, but not upstream ones (C isn't more likely to get a value matching the evidence)
• We would like to consider evidence when we sample every variable


Summary

• Sampling can be really effective
  – The dominating approach to inference in BNs
• Approaches:
  – Forward (/Prior) Sampling
  – Rejection Sampling
  – Likelihood Weighted Sampling

Bayesian decision making

• Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E

• Inference problem: given some evidence E = e, what is P(X | e)?

• Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1,e1), …, (xn,en)}

Learning in Bayes Nets

• Task 1: Given the network structure and given data (observed settings for the variables) learn the CPTs for the Bayes Net.

• Task 2: Given only the data (and possibly a prior over Bayes Nets), learn the entire Bayes Net (both Net structure and CPTs).

Parameter learning
• Suppose we know the network structure (but not the parameters), and have a training set of complete observations

Training set:
  Sample   C   S   R   W
    1      T   F   T   T
    2      F   T   F   T
    3      T   F   F   F
    4      T   T   T   T
    5      F   T   F   T
    6      T   F   T   F
    …      …   …   …   …

[Network: C → S, C → R, (S, R) → W, with all CPT entries unknown ("?")]

Parameter learning
• Suppose we know the network structure (but not the parameters), and have a training set of complete observations
  – P(X | Parents(X)) is given by the observed frequencies of the different values of X for each combination of parent values
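A minimal counting sketch for one CPT, P(W | S, R), from the complete observations above (the column order C, S, R, W follows the table; the code itself is illustrative):

```python
# Maximum-likelihood CPT estimation by counting parent/child co-occurrences.
from collections import Counter

data = [  # (C, S, R, W) rows as in the training set above
    ("T", "F", "T", "T"), ("F", "T", "F", "T"), ("T", "F", "F", "F"),
    ("T", "T", "T", "T"), ("F", "T", "F", "T"), ("T", "F", "T", "F"),
]

parent_counts, joint_counts = Counter(), Counter()
for c, s, r, w in data:
    parent_counts[(s, r)] += 1
    joint_counts[(s, r, w)] += 1

# P(W = w | S = s, R = r) = count(s, r, w) / count(s, r)
cpt = {key: joint_counts[key] / parent_counts[key[:2]] for key in joint_counts}
print(cpt)
```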

Parameter learning
• Incomplete observations
• Expectation maximization (EM) algorithm for dealing with missing data

Training set:
  Sample   C   S   R   W
    1      ?   F   T   T
    2      ?   T   F   T
    3      ?   F   F   F
    4      ?   T   T   T
    5      ?   T   F   T
    6      ?   F   T   F
    …      …   …   …   …

[Network: C → S, C → R, (S, R) → W, with all CPT entries unknown ("?"); C is never observed]

Learning from incomplete data

• Observe real-world data
  – Some variables are unobserved
  – Trick: Fill them in
• EM algorithm
  – E = Expectation
  – M = Maximization

(Training set as above, with C unobserved in every sample.)

EM-Algorithm

• Initialize CPTs (e.g. randomly, uniformly)
• Repeat until convergence
  – E-Step:
    • Compute P(hidden variable = x | observed values in sample i, current CPTs) for all samples i and values x (inference)
  – M-Step:
    • Update the CPT tables from the resulting expected counts

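The E/M loop above can be written compactly. Below is an illustrative Python sketch for the training set above, where only C is hidden; the CPT layout, the random initialization, and the fixed iteration count are assumptions, not from the slides:

```python
import random

# EM for the sprinkler-style network C -> S, C -> R, (S, R) -> W with C unobserved.
# Parameters are P(variable = True | parents); all variables are Boolean.
data = [  # (S, R, W) rows; C is missing in every sample
    (False, True, True), (True, False, True), (False, False, False),
    (True, True, True), (True, False, True), (False, True, False),
]

random.seed(0)                      # random init avoids the symmetric fixed point
p_c = random.random()
p_s = {True: random.random(), False: random.random()}   # P(+s | C)
p_r = {True: random.random(), False: random.random()}   # P(+r | C)
p_w = {(s, r): 0.5 for s in (True, False) for r in (True, False)}  # P(+w | S, R)

def bern(p, v):
    return p if v else 1 - p

for _ in range(100):                                     # "repeat until convergence"
    # E-step: q_i = P(C = True | s_i, r_i, w_i) under the current CPTs.
    q = []
    for s, r, w in data:
        like = {c: bern(p_c, c) * bern(p_s[c], s) * bern(p_r[c], r) * bern(p_w[(s, r)], w)
                for c in (True, False)}
        q.append(like[True] / (like[True] + like[False]))

    # M-step: re-estimate CPTs from expected counts.
    p_c = sum(q) / len(data)
    for c, weights in ((True, q), (False, [1 - qi for qi in q])):
        total = sum(weights)
        p_s[c] = sum(wt for wt, (s, r, w) in zip(weights, data) if s) / total
        p_r[c] = sum(wt for wt, (s, r, w) in zip(weights, data) if r) / total
    for key in p_w:                                      # W's parents are fully observed,
        rows = [w for s, r, w in data if (s, r) == key]  # so use plain relative frequencies
        if rows:
            p_w[key] = sum(rows) / len(rows)

print(p_c, p_s, p_r, p_w)
```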

Parameter learning

• What if the network structure is unknown?
  – Structure learning algorithms exist, but they are pretty complicated…

(Training set as above, with all variables observed.)

[Which network structure over C, S, R, W generated the data?]

Summary: Bayesian networks

• Joint Distribution
• Independence
• Inference
• Parameters
• Learning

Naïve Bayes

Bayesian inference, Naïve Bayes model

http://xkcd.com/1236/

Bayes Rule

• The product rule gives us two ways to factor a joint probability:

  $$P(A, B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$$

• Therefore,

  $$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

• Why is this useful?
  – Key tool for probabilistic inference: can get diagnostic probability from causal probability
    • E.g., P(Cavity | Toothache) from P(Toothache | Cavity)

Rev. Thomas Bayes (1702-1761)

Bayes Rule example

• Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?


$$P(\text{Rain} \mid \text{Predict}) = \frac{P(\text{Predict} \mid \text{Rain})\, P(\text{Rain})}{P(\text{Predict})}
 = \frac{P(\text{Predict} \mid \text{Rain})\, P(\text{Rain})}{P(\text{Predict} \mid \text{Rain})\, P(\text{Rain}) + P(\text{Predict} \mid \neg\text{Rain})\, P(\neg\text{Rain})}$$

$$= \frac{0.9 \times 0.014}{0.9 \times 0.014 + 0.1 \times 0.986} = \frac{0.0126}{0.0126 + 0.0986} \approx 0.111$$
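A quick numeric check (illustrative), using the unrounded prior 5/365:

```python
# Bayes rule for the wedding example.
p_rain = 5 / 365
p_pred_given_rain, p_pred_given_dry = 0.9, 0.1
posterior = (p_pred_given_rain * p_rain) / (
    p_pred_given_rain * p_rain + p_pred_given_dry * (1 - p_rain))
print(round(posterior, 3))   # 0.111
```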

Probabilistic inference
• Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e
• Examples: X = {spam, not spam}, e = email message
            X = {zebra, giraffe, hippo}, e = image features

Bayesian decision theory
• Let x be the value predicted by the agent and x* be the true value of X.
• The agent has a loss function, which is 0 if x = x* and 1 otherwise
• Expected loss:

  $$\sum_x L(x, x^*)\, P(x \mid e) = \sum_{x \neq x^*} P(x \mid e) = 1 - P(x^* \mid e)$$

• What is the estimate of X that minimizes the expected loss?
  – The one that has the greatest posterior probability P(x | e)
  – This is called the Maximum a Posteriori (MAP) decision

MAP decision
• Value x of X that has the highest posterior probability given the evidence E = e:

  $$x^* = \arg\max_x P(X = x \mid E = e) = \arg\max_x \frac{P(E = e \mid X = x)\, P(X = x)}{P(E = e)} = \arg\max_x P(E = e \mid X = x)\, P(X = x)$$

  $$\underbrace{P(x \mid e)}_{\text{posterior}} \;\propto\; \underbrace{P(e \mid x)}_{\text{likelihood}}\; \underbrace{P(x)}_{\text{prior}}$$

• Maximum likelihood (ML) decision:

  $$x^* = \arg\max_x P(e \mid x)$$

Naïve Bayes model
• Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X
• MAP decision involves estimating

  $$P(X \mid E_1, \ldots, E_n) \;\propto\; P(E_1, \ldots, E_n \mid X)\, P(X)$$

  – If each feature Ei can take on k values, how many entries are in the (conditional) joint probability table P(E1, …, En | X = x)?

Naïve Bayes model
• Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X
• MAP decision involves estimating

  $$P(X \mid E_1, \ldots, E_n) \;\propto\; P(E_1, \ldots, E_n \mid X)\, P(X)$$

• We can make the simplifying assumption that the different features are conditionally independent given the hypothesis:

  $$P(E_1, \ldots, E_n \mid X) = \prod_{i=1}^{n} P(E_i \mid X)$$

  – If each feature can take on k values, what is the complexity of storing the resulting distributions?

Exchangeability

De Finetti's theorem on exchangeability (the "bag of words" theorem): the joint probability distribution underlying the data is invariant to permutation of the observations.

Naïve Bayes model
• Posterior:

  $$P(X = x \mid E_1 = e_1, \ldots, E_n = e_n) \;\propto\; \underbrace{P(X = x)}_{\text{prior}}\, \underbrace{\prod_{i=1}^{n} P(E_i = e_i \mid X = x)}_{\text{likelihood}}$$

• MAP decision:

  $$x^* = \arg\max_x P(x \mid e) = \arg\max_x P(x) \prod_{i=1}^{n} P(e_i \mid x)$$

A Simple Example – Naïve Bayes

C – class variable, F – features

We only specify (parameters):
• the prior over class labels, P(C)
• how each feature depends on the class, P(F | C)

Slide from Dan Klein

Spam filter
• MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)


• We have P(spam | message) ∝ P(message | spam) P(spam) and P(¬spam | message) ∝ P(message | ¬spam) P(¬spam)

• To enable classification, we need to be able to estimate the likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam)


Naïve Bayes Representation
• Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam)
• Likelihood: bag of words representation
  – The message is a sequence of words (w1, …, wn)
  – The order of the words in the message is not important
  – Each word is conditionally independent of the others given message class (spam or not spam)
  – Thus, the problem is reduced to estimating marginal likelihoods of individual words P(wi | spam) and P(wi | ¬spam)

$$P(\text{message} \mid \text{spam}) = P(w_1, \ldots, w_n \mid \text{spam}) = \prod_{i=1}^{n} P(w_i \mid \text{spam})$$

Summary: Decision rule

• General MAP rule for Naïve Bayes:

  $$\underbrace{P(x \mid e)}_{\text{posterior}} \;\propto\; \underbrace{P(x)}_{\text{prior}}\, \underbrace{\prod_{i=1}^{n} P(e_i \mid x)}_{\text{likelihood}}, \qquad x^* = \arg\max_x P(x) \prod_{i=1}^{n} P(e_i \mid x)$$

• For the spam filter: $P(\text{spam} \mid w_1, \ldots, w_n) \;\propto\; P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})$

• Thus, the filter should classify the message as spam if

  $$P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}) \;>\; P(\neg\text{spam}) \prod_{i=1}^{n} P(w_i \mid \neg\text{spam})$$
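Putting the pieces together, a toy bag-of-words Naïve Bayes filter might look like the sketch below; the training data and the add-one (Laplace) smoothing are illustrative choices, not from the slides:

```python
# Minimal bag-of-words Naive Bayes spam filter sketch.
import math
from collections import Counter

train = [
    ("win money now", "spam"), ("cheap money offer", "spam"),
    ("meeting schedule today", "ham"), ("lunch today with the team", "ham"),
]

class_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_counts}
for text, label in train:
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, c):
    # log P(c) + sum_i log P(w_i | c), with add-one smoothing for unseen words
    logp = math.log(class_counts[c] / len(train))
    total = sum(word_counts[c].values())
    for w in text.split():
        logp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return logp

def classify(text):
    return max(class_counts, key=lambda c: log_posterior(text, c))

print(classify("cheap money today"))   # -> 'spam' on this toy training set
```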

[Slides from Dan Klein: parameter estimates for the spam filter]
• P(spam), P(ham): percentage of documents in the training set labeled as spam/ham
• P(w | spam): in the documents labeled as spam, occurrence percentage of each word (e.g. # times "the" occurred / # total words)
• P(w | ham): in the documents labeled as ham, occurrence percentage of each word (e.g. # times "the" occurred / # total words)
