Intro to Machine Learning Parameter Estimation for Bayes Nets Naïve Bayes


Intro to Machine Learning

Parameter Estimation for Bayes Nets
Naïve Bayes

Recall: Key Components of Intelligent Agents

Representation Language: Graph, Bayes Nets

Inference Mechanism: A*, variable elimination, Gibbs sampling

Learning Mechanism: For today!

Evaluation Metric

Machine Learning

Determining a model for how something works, based on examples of it working (data).

This is a very general problem that has many applications.

Many big companies today are getting rich by doing this well.

Quiz: Companies doing ML

For each company below, think of a type of data that the company can use to learn something useful.

Company What kind of data?

What can the company learn from the data?

Google

Amazon

Netflix

Facebook

Answer: Companies doing ML

For each company below, think of a type of data that the company can use to learn something useful.

Company What kind of data? What can the company learn from the data?

Google Email Spam vs. not spam (ham)

Amazon Who buys what How to recommend products to people

Netflix Who views what How to recommend videos to people

Facebook Who knows whom How to display ads effectively

ML in Research Systems

• DARPA Grand Challenge

Some of my own research: Learning to understand English sentences

“Secretary of Energy Steven Chu announced Friday that he was resigning pending the confirmation of a successor.”

System can predict:
- announced is an action
- Secretary of Energy Steven Chu is a person who is doing the action
- Friday is a date/time describing when the action happened
- he was resigning pending the confirmation of a successor is the thing being announced

Very few AI systems today have no learning component.

Example: Parameter Estimation in BNs

Recall this BN from before.

Let’s pretend now that none of the parameters were given to you.

Happy?

Sunny? Raise?

S P(S)

+s 0.7

R P(R)

+r 0.01

H S R P(H|S,R)

+h +s +r 1.0

+h +s -r 0.7

+h -s +r 0.9

+h -s -r 0.1

Example: Parameter Estimation in BNs

How can we figure out what these parameters should be?

The ML answer:
1. Collect some data
2. Find parameters that explain this data

Happy?

Sunny? Raise?

S P(S)

+s ?

R P(R)

+r ?

H S R P(H|S,R)

+h +s +r ?

+h +s -r ?

+h -s +r ?

+h -s -r ?

Example: Parameter Estimation in BNs

Example Data
1. +s, -r, -h
2. +s, +r, +h
3. +s, -r, +h
4. -s, -r, -h
5. +s, -r, +h
6. -s, -r, -h

Happy?

Sunny? Raise?

S P(S)

+s ?

R P(R)

+r ?

H S R P(H|S,R)

+h +s +r ?

+h +s -r ?

+h -s +r ?

+h -s -r ?

Quiz: Parameter Estimation in BNs

Example Data
1. +s, -r, -h
2. +s, +r, +h
3. +s, -r, +h
4. -s, -r, -h
5. +s, -r, +h
6. -s, -r, -h

Given the data above, what would you estimate for
P(+s) =
P(+r) =
P(+h | +s, -r) =

Happy?

Sunny? Raise?

S P(S)

+s ?

R P(R)

+r ?

H S R P(H|S,R)

+h +s +r ?

+h +s -r ?

+h -s +r ?

+h -s -r ?

Answer: Parameter Estimation in BNs

Example Data
1. +s, -r, -h
2. +s, +r, +h
3. +s, -r, +h
4. -s, -r, -h
5. +s, -r, +h
6. -s, -r, -h

Given the data above, what would you estimate for
P(+s) = 4 / 6 = 0.67
P(+r) = 1 / 6 = 0.167
P(+h | +s, -r) = 2 / 3 = 0.67

Happy?

Sunny? Raise?

S P(S)

+s ?

R P(R)

+r ?

H S R P(H|S,R)

+h +s +r ?

+h +s -r ?

+h -s +r ?

+h -s -r ?

Maximum Likelihood Parameter Estimation

To estimate a parameter P(X1=a1, …, XN=aN | Y1=b1, …, YM=bM)

Maximum Likelihood Estimation (MLE) Algorithm:
1. Cjoint = Count how many times (X1=a1, …, XN=aN, Y1=b1, …, YM=bM) appears in the dataset.
2. Cmarginal = Count how many times (Y1=b1, …, YM=bM) appears in the dataset.
3. Set the parameter = Cjoint / Cmarginal.
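As a sketch, the MLE algorithm above is just a counting routine. The encoding below is illustrative, using the happiness example from earlier (with the first data point read as +s, -r, -h, which is what makes the quiz answer P(+h | +s, -r) = 2/3 come out right):

```python
from fractions import Fraction

# The six training examples from the happiness BN (1 = "+", 0 = "-").
data = [
    {"s": 1, "r": 0, "h": 0},  # 1. +s, -r, -h
    {"s": 1, "r": 1, "h": 1},  # 2. +s, +r, +h
    {"s": 1, "r": 0, "h": 1},  # 3. +s, -r, +h
    {"s": 0, "r": 0, "h": 0},  # 4. -s, -r, -h
    {"s": 1, "r": 0, "h": 1},  # 5. +s, -r, +h
    {"s": 0, "r": 0, "h": 0},  # 6. -s, -r, -h
]

def mle(data, query, given=None):
    """Estimate P(query | given) = Cjoint / Cmarginal by counting."""
    given = given or {}

    def matches(ex, assignment):
        return all(ex[v] == val for v, val in assignment.items())

    c_marginal = sum(matches(ex, given) for ex in data)
    c_joint = sum(matches(ex, {**query, **given}) for ex in data)
    return Fraction(c_joint, c_marginal)

print(mle(data, {"s": 1}))                    # P(+s)          -> 2/3
print(mle(data, {"r": 1}))                    # P(+r)          -> 1/6
print(mle(data, {"h": 1}, {"s": 1, "r": 0}))  # P(+h | +s, -r) -> 2/3
```

Using `Fraction` keeps the estimates exact, so they can be compared directly against the hand-computed answers.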

Quiz: MLE

What’s the difference between MLE and rejection sampling?

Answer: MLE

What’s the difference between MLE and rejection sampling?

The parameter estimation procedure is the same, but rejection sampling gets its samples by generating them from the Bayes Net. This requires knowing the parameters of the BN.

MLE gets its samples from some external source.

Where does data come from?

This is a fundamental practical consideration for machine learning.

The answer: wherever you can get it the easiest.

Some examples:

- For medical diagnosis ML systems, the system needs examples of X-ray images that are labeled with a diagnosis, e.g. “bone broken” or “bone not broken”. Typically, this data must be obtained from a “human expert”, in this case a doctor trained in radiology. These people’s time is EXPENSIVE, so there’s usually not a lot of this data available.

- For speech recognition ML systems, the system needs examples of speech recordings (audio files), labeled with the corresponding English words. You can pay users of Amazon’s Mechanical Turk a couple of pennies per example to label these audio files.

- “Language models” are systems that are really important for processing human language. These systems try to predict the next word in a sequence of words. These systems need examples of English sentences. There are billions of such sentences available on the Web and other places; to get these, you just need to write some software to crawl the Web and grab the sentences.

Likelihood

Likelihood is a term to refer to the following probability:

P(D | M), where D is your data, and M is your model (in our case, a Bayes Net).

When the data consists of multiple examples, most often (but not always) ML assumes that these examples are independent.

This means we can re-write the likelihood like this:
P(d1, …, dk | M) = P(d1 | M) P(d2 | M) ⋯ P(dk | M)

Quiz: Likelihood

Example Data
1. +s, -r, +h
2. +s, +r, +h

What is the likelihood of this data, given this BN?

Happy?

Sunny? Raise?

S P(S)

+s 0.7

R P(R)

+r 0.01

H S R P(H|S,R)

+h +s +r 1.0

+h +s -r 0.7

+h -s +r 0.9

+h -s -r 0.1

Answer: Likelihood

Example Data
1. +s, -r, +h
2. +s, +r, +h

What is the likelihood of this data, given this BN?

Likelihood is P(D | BN)
= P(d1 | BN) * P(d2 | BN)
= P(+s, -r, +h) * P(+s, +r, +h)
= P(+s)P(-r)P(+h | +s, -r) * P(+s)P(+r)P(+h | +s, +r)
= .7 * .99 * .7 * .7 * .01 * 1 = .0034
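The same computation can be checked with a short script. The CPT encoding below is an illustrative sketch of the BN above:

```python
# CPTs from the BN above: P(+s), P(+r), and P(+h | s, r).
P_S, P_R = 0.7, 0.01
P_H = {(1, 1): 1.0, (1, 0): 0.7, (0, 1): 0.9, (0, 0): 0.1}

def prob(s, r, h):
    """P(s, r, h) factored along the BN: P(s) P(r) P(h | s, r)."""
    ps = P_S if s else 1 - P_S
    pr = P_R if r else 1 - P_R
    ph = P_H[(s, r)] if h else 1 - P_H[(s, r)]
    return ps * pr * ph

# Likelihood of the two examples, assuming they are independent.
likelihood = prob(1, 0, 1) * prob(1, 1, 1)
print(round(likelihood, 4))  # 0.0034
```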

Happy?

Sunny? Raise?

S P(S)

+s 0.7

R P(R)

+r 0.01

H S R P(H|S,R)

+h +s +r 1.0

+h +s -r 0.7

+h -s +r 0.9

+h -s -r 0.1

Maximum Likelihood

“Maximum Likelihood Estimation” is called that because the parameters it finds for dataset D are the parameters that make P(D | BN) biggest.

Data
1. +s, -r, -h
2. +s, +r, +h
3. +s, -r, +h
4. -s, -r, -h
5. +s, -r, +h
6. -s, -r, -h

Let m be the maximum likelihood estimate for P(S).

P(D | BN) = m * P(-r) * P(-h | +s, -r)
* m * P(+r) * P(+h | +s, +r)
* m * P(-r) * P(+h | +s, -r)
* (1-m) * P(-r) * P(-h | -s, -r)
* m * P(-r) * P(+h | +s, -r)
* (1-m) * P(-r) * P(-h | -s, -r)

Maximum Likelihood

Mathematical trick: Finding the biggest point of f(m) is equivalent to finding the biggest point of log f(m).

P(D | BN) = m * P(-r) * P(-h | +s, -r)
* m * P(+r) * P(+h | +s, +r)
* m * P(-r) * P(+h | +s, -r)
* (1-m) * P(-r) * P(-h | -s, -r)
* m * P(-r) * P(+h | +s, -r)
* (1-m) * P(-r) * P(-h | -s, -r)

log P(D | BN) = log m + log P(-r) + log P(-h | +s, -r)
+ log m + log P(+r) + log P(+h | +s, +r)
+ log m + log P(-r) + log P(+h | +s, -r)
+ log(1-m) + log P(-r) + log P(-h | -s, -r)
+ log m + log P(-r) + log P(+h | +s, -r)
+ log(1-m) + log P(-r) + log P(-h | -s, -r)

Maximum Likelihood

To find the largest point of P(D | BN), we’ll take the derivative. Only the log m and log(1-m) terms depend on m: each +s example contributes a log m, and each -s example contributes a log(1-m). So:

d log P(D | BN) / dm = 1/m + 1/m + 1/m + 1/m − 1/(1−m) − 1/(1−m)

Maximum Likelihood

To find the largest point of P(D | BN), we’ll set the derivative equal to zero:

d log P(D | BN) / dm = 4/m − 2/(1−m) = 0
4/m = 2/(1−m)
4 − 4m = 2m
m = 4/6 = 2/3

Notice: This is the same value that you got by doing the MLE algorithm!
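You can also confirm this numerically: the only m-dependent part of P(D | BN) is m^4 (1-m)^2 (one factor of m per +s example, one factor of (1-m) per -s example). A quick grid search, as a sanity check:

```python
def f(m):
    # The m-dependent part of the likelihood:
    # four +s examples contribute m, two -s examples contribute (1 - m).
    return m ** 4 * (1 - m) ** 2

# Evaluate on a fine grid over (0, 1) and pick the maximizer.
grid = [i / 10000 for i in range(1, 10000)]
best = max(grid, key=f)
print(best)  # close to 2/3
```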

More typical ML example: Spam Detection

Dear Anita, We love our customers! To show our appreciation, Pita Delite is happy to announce the $5.99 Meal Deal. For only $5.99, get a sandwich (hot or cold), one side and a drink at Pita Delite. Offer valid in the Greensboro and High Point locations until February 28, 2013. We hope to see you soon! The Pita Delite Team PS Like us Facebook and follow us on Twitter for special offers.

...and here's a gift from us to you. Enter the discount code LOVE2LOVE during the checkout and get 10% off your purchase.

I'm sorry for how late this email is. I was planning on getting youinformation sooner but I have been very busy. Here is my resume, itshould be enough information about me that would be useful in a letterof recommendation. Thank you for doing this.

Labels for the three emails above: SPAM, SPAM, HAM!

Email users supply labeled data

Inbox | Report Spam?

From: Mom, Subject: coming over for dinner tonight?

From: student, Subject: help with assignment

From: Pita Delite, Subject: $5.99 Deal Meal from Pita Delite

From: Cash Loan Providers, Subject: Big Bills and No Way to pay them?

From: AAAI-13, Subject: tentative assignment


Building a classifier, Step 1

Text is complicated, and systems aren’t good enough yet to be able to understand it.

Step 1 is to simplify the X variable to something that is computationally easy to handle.

“Bag of Words” Representation

1. Construct a dictionary, which contains the set of distinct words in all of your examples.

...and here's a gift from us to you. Enter the discount code LOVE2LOVE during the checkout and get 10% off your purchase.

I'm sorry for how late this email is. I was planning on getting youinformation sooner but I have been busy. …

Dear Anita, We love our customers! To show our appreciation, Pita Delite is happy to announce the $5.99 Meal Deal. …

Dictionary: dear, anita, we, love, our, customers, to, show, appreciation, pita, delite, is, happy, announce, the, $5.99, meal, deal, and, here's, a, gift, from, us, you, enter, discount, code, love2love, during, checkout, get, 10%, off, your, purchase, i'm, sorry, for, how, late, this, email, i, was, planning, on, getting, information, sooner, but, have, been, busy

“Bag of Words” Representation

2. For each email, for each word w in the dictionary, count how many times w appears in the email.

Dear Anita, We love our customers! To show our appreciation, Pita Delite is happy to announce the $5.99 Meal Deal. …

dear we to show … gift 10% … sorry late

1 1 2 1 0 0 0 0

“Bag of Words” Representation

2. For each email, for each word w in the dictionary, count how many times w appears in the email.

dear we to show … gift 10% … sorry late

1 1 2 1 0 0 0 0

0 0 1 0 1 1 0 0

...and here's a gift from us to you. Enter the discount code LOVE2LOVE during the checkout and get 10% off your purchase.

“Bag of Words” Representation

2. For each email, for each word w in the dictionary, count how many times w appears in the email.

dear we to show … gift 10% … sorry late

1 1 2 1 0 0 0 0

0 0 1 0 1 1 0 0

0 0 0 0 0 0 1 1

I'm sorry for how late this email is. I was planning on getting youinformation sooner but I have been busy. …

“Bag of Words” Representation

dear we to show … gift 10% … sorry late Spam?

1 1 2 1 0 0 0 0 +spam

0 0 1 0 1 1 0 0 +spam

0 0 0 0 0 0 1 1 +ham

X now consists of a number of numerical “features” or “attributes”, X1 up to XN.

We’ll use these features to construct the classifier.

You can think of each of these features as an observable random variable.
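The two steps can be sketched in a few lines. This is an illustrative implementation using the three quiz messages that appear below (note the dictionary here comes out in alphabetical order rather than the slide's column order):

```python
messages = [
    "i love sports",
    "sports fans love energy drink",
    "you will love this energy drink",
]

# Step 1: build the dictionary of distinct words across all examples.
vocab = sorted({w for m in messages for w in m.split()})

# Step 2: for each message, count how often each dictionary word appears.
def bag_of_words(message):
    words = message.split()
    return [words.count(w) for w in vocab]

print(vocab)
for m in messages:
    print(bag_of_words(m))
```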

Quiz: Bag of words

Below are three (contrived) email messages. Construct a bag of words representation for each.

Message

i love sports

sports fans love energy drink

you will love this energy drink

Quiz: Bag of words (answer)

Below are three (contrived) email messages. Construct a bag of words representation for each.

Message

i love sports

sports fans love energy drink

you will love this energy drink

i love sports fans energy drink you will this Spam?

1 1 1 0 0 0 0 0 0

0 1 1 1 1 1 0 0 0

0 1 0 0 1 1 1 1 1

Quiz: Bayesian spam classifier

If you had to come up with a Bayes Net to predict which of these messages was Spam, what would it look like?

i love sports fans energy drink you will this Spam?

1 1 1 0 0 0 0 0 0 +ham

0 1 1 1 1 1 0 0 0 +spam

0 1 0 0 1 1 1 1 1 +spam

Answer: Bayesian spam classifier

If you had to come up with a Bayes Net to predict which of these messages was Spam, what would it look like?

Lots of possible answers for this, I’ll show a common kind of BN used for this next.

i love sports fans energy drink you will this Spam?

1 1 1 0 0 0 0 0 0 +ham

0 1 1 1 1 1 0 0 0 +spam

0 1 0 0 1 1 1 1 1 +spam

Building a classifier, Step 2

Once you’ve got a set of features for your examples, it’s time to decide on what type of classifier you’d like to use.

Technically, this is called choosing a hypothesis space – a set (or “space”) of possible classifiers (or “hypotheses”).

Bayes Nets can make fine classifiers. However, the space of ALL Bayes Nets is too big for building a good spam detector. We’re going to restrict our attention to a special class of Bayes Nets called Naïve Bayes models.

Naïve Bayes Classifier

Naïve Bayes is a simple and widely-used model in ML for many different problems.

It is a Bayes Net with one parent node and N children. The children are typically observable, and the parent is typically unobservable.

Y

X1  X2  …  XN

Notice the conditional independence assumption:Each Xi is conditionally independent of every Xj, given Y.

Learning a Naïve Bayes Classifier

Parameter estimation for NBCs is the same as for other BNs.

To simplify our problem, we’ll assume all Xi variables are boolean (1 or 0).

Quiz: Learning a Naïve Bayes Classifier

How many parameters do we need to learn for our NBC for spam detection?

If we use MLE, what parameter would be learned for P(+spam)?

How about for P(+energy | +spam)?

How about for P(+will | +ham)?

i love sports fans energy drink you will this Spam?

1 1 1 0 0 0 0 0 0 +ham

0 1 1 1 1 1 0 0 0 +spam

0 1 0 0 1 1 1 1 1 +spam

Answer: Learning a Naïve Bayes Classifier

How many parameters do we need to learn for our NBC for spam detection? 19 = 1 (+spam) + 9 (+word | +spam) + 9 (+word | +ham)

If we use MLE, what parameter would be learned for P(+spam)? 2/3
How about for P(+energy | +spam)? 2/2 = 1.0
How about for P(+will | +ham)? 0/1 = 0 (the single ham example does not contain “will”)
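These counts can be reproduced with a small training routine. The names below are illustrative; the feature matrix is the table from the slide:

```python
WORDS = ["i", "love", "sports", "fans", "energy", "drink", "you", "will", "this"]

# (feature vector, label) rows from the table above.
ROWS = [
    ([1, 1, 1, 0, 0, 0, 0, 0, 0], "ham"),
    ([0, 1, 1, 1, 1, 1, 0, 0, 0], "spam"),
    ([0, 1, 0, 0, 1, 1, 1, 1, 1], "spam"),
]

def train_mle(rows):
    """MLE for a Naive Bayes classifier: P(+spam) and each P(+word | class)."""
    prior = sum(label == "spam" for _, label in rows) / len(rows)
    cond = {}
    for label in ("spam", "ham"):
        n = sum(l == label for _, l in rows)
        for j, w in enumerate(WORDS):
            cond[(w, label)] = sum(x[j] for x, l in rows if l == label) / n
    return prior, cond

prior, cond = train_mle(ROWS)
print(prior)                     # P(+spam) = 2/3
print(cond[("energy", "spam")])  # 1.0
print(cond[("i", "spam")])       # 0.0
```

Note the zero for P(+i | +spam): it falls out of plain MLE on just three examples.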

i love sports fans energy drink you will this Spam?

1 1 1 0 0 0 0 0 0 +ham

0 1 1 1 1 1 0 0 0 +spam

0 1 0 0 1 1 1 1 1 +spam

Quiz: Prediction with an NBC

Spam? P(S)

+spam 0.67

+ham 0.33

Word: i love sports fans energy drink you will this

P(+word|+spam): 0 1 0.5 0.5 1 1 0.5 0.5 0.5

P(+word|+ham): 1 1 1 0 0 0 0 0 0

(Naïve Bayes graph: Spam → i, love, …, this)

What is P(Spam | “sports fans love this”)?

Answer: Prediction with an NBC

Spam? P(S)

+spam 0.67

+ham 0.33

Word: i love sports fans energy drink you will this

P(+word|+spam): 0 1 0.5 0.5 1 1 0.5 0.5 0.5

P(+word|+ham): 1 1 1 0 0 0 0 0 0


What is P(+spam | “sports fans love this”)?
= P(+spam, “sports fans love this”) / P(“sports fans love this”)
= P(+spam) * P(sports | +spam) * P(fans | +spam) * P(love | +spam) * P(this | +spam)
/ [ P(+spam) * P(sports | +spam) * P(fans | +spam) * P(love | +spam) * P(this | +spam)
+ P(+ham) * P(sports | +ham) * P(fans | +ham) * P(love | +ham) * P(this | +ham) ]

Answer: Prediction with an NBC

Spam? P(S)

+spam 0.67

+ham 0.33

Word: i love sports fans energy drink you will this

P(+word|+spam): 0 1 0.5 0.5 1 1 0.5 0.5 0.5

P(+word|+ham): 1 1 1 0 0 0 0 0 0


What is P(+spam | “sports fans love this”)?
= P(+spam) * P(+sports | +spam) * P(+fans | +spam) * P(+love | +spam) * P(+this | +spam)
/ [ P(+spam) * P(+sports | +spam) * P(+fans | +spam) * P(+love | +spam) * P(+this | +spam)
+ P(+ham) * P(+sports | +ham) * P(+fans | +ham) * P(+love | +ham) * P(+this | +ham) ]
= .67 * .5 * .5 * 1 * .5 / [.67 * .5 * .5 * 1 * .5 + .33 * 1 * 0 * 1 * 0]
= 1.0

Note: the true answer would also include terms for P(-i | +spam), P(-energy | +spam), P(-drink | +spam), etc. I left them out for brevity.
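The computation above can be scripted. As on the slide, this sketch keeps only the terms for words present in the message (dropping the P(-word | class) factors):

```python
WORDS = ["i", "love", "sports", "fans", "energy", "drink", "you", "will", "this"]
PRIOR = {"spam": 0.67, "ham": 0.33}
P_WORD = {  # P(+word | class), from the tables above
    "spam": [0, 1, 0.5, 0.5, 1, 1, 0.5, 0.5, 0.5],
    "ham":  [1, 1, 1, 0, 0, 0, 0, 0, 0],
}

def p_spam(message_words):
    """P(+spam | message), using only the present-word factors."""
    score = {}
    for c in ("spam", "ham"):
        score[c] = PRIOR[c]
        for w in message_words:
            score[c] *= P_WORD[c][WORDS.index(w)]
    # Normalize over the two classes (the denominator in Bayes' rule).
    return score["spam"] / (score["spam"] + score["ham"])

print(p_spam(["sports", "fans", "love", "this"]))  # 1.0
```

The ham score is driven to zero by the P(+fans | +ham) = 0 and P(+this | +ham) = 0 factors, which is why the answer is exactly 1.0.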

Overfitting

Overfitting occurs when a statistical model (aka, a “classifier” in ML) describes random error or noise instead of the underlying relationship.

Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.

A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.

Overfitting our NBC

Our model has overfit. For instance, it believes there is ZERO CHANCE of seeing “I” in a spam message.

This is true in the 3 training messages, but it’s too strong of a conclusion to make from just 3 training examples.

It leads to poor predictions on new examples, such as P(+spam | “I love energy drink”)

Spam? P(S)

+spam 0.67

+ham 0.33

Word: i love sports fans energy drink you will this

P(+word|+spam): 0 1 0.5 0.5 1 1 0.5 0.5 0.5

P(+word|+ham): 1 1 1 0 0 0 0 0 0


Laplace Smoothing

For a binary variable X, the MLE from N examples is:

P(+x) = Count(+x) / N

Laplace-smoothed estimate: pretend we start with 2K (fake) examples, half with +x and half with –x:

P(+x) = (Count(+x) + K) / (N + 2K)

Quiz: Laplace smoothing

Let K=1.

Assume our training data contains

1 example, of which 1 is +spam. P(+spam) = ?
10 examples, 4 of which are +spam. P(+spam) = ?
100 examples, 40 of which are +spam. P(+spam) = ?
1000 examples, 400 of which are +spam. P(+spam) = ?

Answers: Laplace smoothing

Let K=1.

Assume our training data contains

1 example, of which 1 is Spam. P(Spam)=(Count(spam)+1) / (N+2) = (1+1) / (1+2)= 2/3

10 examples, 4 of which are Spam. P(Spam)=(4+1) / (10+2) = 5/12 = 0.41666667

100 examples, 40 of which are Spam. P(Spam)= (40+1) / (100+2) = 41/102 = 0.401961

1000 examples, 400 of which are Spam. P(Spam)= (400+1) / (1000+2) = 401/1002 ≈ 0.4002

As the number of training examples increases, the Laplace smoothing has a smaller and smaller effect. It’s only when there’s not much training data that it has a big effect.
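The four answers above fit in one small helper (a sketch; the function name is illustrative):

```python
def laplace(count, n, k=1):
    """Laplace-smoothed estimate of P(+x) for a binary variable:
    pretend we saw k extra +x examples and k extra -x examples."""
    return (count + k) / (n + 2 * k)

for count, n in [(1, 1), (4, 10), (40, 100), (400, 1000)]:
    print(f"{count}/{n} spam -> P(+spam) = {laplace(count, n):.4f}")
```

As N grows with the same spam fraction, the printed estimates approach the unsmoothed ratio 0.4, matching the observation above.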

Quiz: Laplace Smoothing

Spam? P(S)

+spam

+ham

Word: i love sports fans energy drink you will this

P(+word|+spam):

P(+word|+ham):

Fill in the parameters using Laplace smoothing, with K=1.

i love sports fans energy drink you will this Spam?

1 1 1 0 0 0 0 0 0 +ham

0 1 1 1 1 1 0 0 0 +spam

0 1 0 0 1 1 1 1 1 +spam

Answers: Laplace Smoothing

Spam? P(S)

+spam .6

+ham .4

Word: i love sports fans energy drink you will this

P(+word|+spam): .25 .75 .5 .5 .75 .75 .5 .5 .5

P(+word|+ham): .67 .67 .67 .33 .33 .33 .33 .33 .33

Fill in the parameters using Laplace smoothing, with K=1.

i love sports fans energy drink you will this Spam?

1 1 1 0 0 0 0 0 0 +ham

0 1 1 1 1 1 0 0 0 +spam

0 1 0 0 1 1 1 1 1 +spam

Quiz: Laplace Smoothing

Spam? P(S)

+spam .6

+ham .4

Word: i love sports fans energy drink you will this

P(+word|+spam): .25 .75 .5 .5 .75 .75 .5 .5 .5

P(+word|+ham): .67 .67 .67 .33 .33 .33 .33 .33 .33


What is P(+spam | “sports fans love this”)?

Answer: Laplace Smoothing

Spam? P(S)

+spam .6

+ham .4

Word: i love sports fans energy drink you will this

P(+word|+spam): .25 .75 .5 .5 .75 .75 .5 .5 .5

P(+word|+ham): .67 .67 .67 .33 .33 .33 .33 .33 .33


What is P(+spam | “sports fans love this”)?
= P(+spam) * P(+sports | +spam) * P(+fans | +spam) * P(+love | +spam) * P(+this | +spam)
/ [ P(+spam) * P(+sports | +spam) * P(+fans | +spam) * P(+love | +spam) * P(+this | +spam)
+ P(+ham) * P(+sports | +ham) * P(+fans | +ham) * P(+love | +ham) * P(+this | +ham) ]
= .6 * .5 * .5 * .75 * .5 / [.6 * .5 * .5 * .75 * .5 + .4 * .67 * .33 * .67 * .33]
≈ .056 / (.056 + .020) ≈ .74 (not 1.0, like before)

Note: the true answer would also include terms for P(-i | +spam), P(-energy | +spam), P(-drink | +spam), etc. I left them out for brevity.

Types of Learning

The techniques we have discussed so far are examples of a particular kind of learning:

Supervised: the training examples include the correct labels or outputs.
Vs. Unsupervised (or semi-supervised, or distantly-supervised, …): none (or some, or only part, …) of the labels in the training data are known.

Parameter Estimation: We only tried to learn the parameters in the BN, not the structure of the BN graph.
Vs. Structure learning: The BN graph is not given as an input, and the learning algorithm’s job is to figure out what the graph should look like.

The distinctions below aren’t actually about the learning algorithm itself, but rather about the type of model being learned:

Classification: the output is a discrete value, like Happy or not Happy, or Spam or Ham.
Vs. Regression: the output is a real number.

Generative: The model of the data represents a full joint distribution over all relevant variables.
Vs. Discriminative: The model assumes some fixed subset of the variables will always be “inputs” or “evidence”, and it creates a distribution for the remaining variables conditioned on the evidence variables.

Parametric vs. Nonparametric: I will explain this later.

We won’t talk much about structure learning, but we will cover some other kinds of learning (regression, unsupervised, discriminative, nonparametric, …) in later lectures.