Slide 1

Aug 25th, 2001
Copyright © 2001, Andrew W. Moore

Probabilistic Machine Learning

Brigham S. Anderson

School of Computer Science

Carnegie Mellon University

www.cs.cmu.edu/~brigham

brigham@cmu.edu

2

ML: Some Successful Applications

• Learning to recognize spoken words (speech recognition);

• Text categorization (SPAM, newsgroups);

• Learning to play world-class chess, backgammon and checkers;

• Handwriting recognition;

• Learning to classify new astronomical data;

• Learning to detect cancerous tissues (e.g. colon polyp detection).

3

Machine Learning Application Areas

• Science: astronomy, bioinformatics, drug discovery, …

• Business: advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-commerce, targeted marketing, health care, …

• Web: search engines, bots, …

• Government: law enforcement, profiling tax cheaters, anti-terror(?)

4

Classification Application: Assessing Credit Risk

• Situation: A person applies for a loan.

• Task: Should the bank approve the loan?

• Banks develop credit models using a variety of machine learning methods.

• The proliferation of mortgages and credit cards is the result of being able to successfully predict whether a person is likely to default on a loan.

• Widely deployed in many countries.

5

Prob. Table Anomaly Detector

• Suppose we have the following model:

P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

• We’re trying to detect anomalous cars.

• If the next example we see is <good,high>, how anomalous is it?

6

Prob. Table Anomaly Detector

How likely is <good, high>?

likelihood(good, high) = P(good, high) = 0.04

P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48
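To make the lookup concrete, here is a minimal Python sketch of this table-based anomaly detector (the dictionary layout and function name are mine, not from the slides):

```python
# Joint table P(Mpg, Horse) from the slide above.
joint = {
    ("good", "low"):  0.36,
    ("good", "high"): 0.04,
    ("bad",  "low"):  0.12,
    ("bad",  "high"): 0.48,
}

def likelihood(mpg, horse):
    """Return P(mpg, horse); a small value flags the car as anomalous."""
    return joint[(mpg, horse)]

print(likelihood("good", "high"))  # 0.04 -> quite unlikely, hence anomalous
```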

7

Bayes Net Anomaly Detector

How likely is a <good, high, fast> example?

P(good, high, fast) = P(good) P(high | good) P(fast | high)
                    = (0.4)(0.11)(0.89)
                    ≈ 0.039

Bayes net: Mpg → Horse → Accel

P(Mpg):
  P(good) = 0.4
  P(bad)  = 0.6

P(Horse | Mpg):
  P(low | good)  = 0.89
  P(low | bad)   = 0.21
  P(high | good) = 0.11
  P(high | bad)  = 0.79

P(Accel | Horse):
  P(slow | low)  = 0.95
  P(slow | high) = 0.11
  P(fast | low)  = 0.05
  P(fast | high) = 0.89
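A rough sketch of the same computation in Python, reading the three CPTs straight off the slide (variable names are my own):

```python
# Chain-rule factorization for the Mpg -> Horse -> Accel network.
p_mpg   = {"good": 0.4, "bad": 0.6}                                  # P(Mpg)
p_horse = {("low", "good"): 0.89, ("low", "bad"): 0.21,
           ("high", "good"): 0.11, ("high", "bad"): 0.79}            # P(Horse | Mpg)
p_accel = {("slow", "low"): 0.95, ("slow", "high"): 0.11,
           ("fast", "low"): 0.05, ("fast", "high"): 0.89}            # P(Accel | Horse)

def joint(mpg, horse, accel):
    """P(mpg, horse, accel) = P(mpg) * P(horse | mpg) * P(accel | horse)."""
    return p_mpg[mpg] * p_horse[(horse, mpg)] * p_accel[(accel, horse)]

print(round(joint("good", "high", "fast"), 3))  # 0.039
```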

8

Probability Model Uses

Classifier:        data point x  →  P(C | x)

Anomaly Detector:  data point x  →  P(x)

Inference Engine:  evidence e1  →  P(E2 | e1), for missing variables E2

9

Bayes Classifiers

• A formidable and sworn enemy of decision trees

Classifier: data point x → P(C | x)

10

Dead-Simple Bayes Classifier Example

• Suppose we have the following model:

P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

• We’re trying to classify cars as Mpg = “good” or “bad”

• If the next example we see is Horse = “low”, how do we classify it?

11

Dead-Simple Bayes Classifier Example

How do we classify <Horse = low>?

P(good | low) = P(good, low) / P(low)
              = P(good, low) / (P(good, low) + P(bad, low))
              = 0.36 / (0.36 + 0.12)
              = 0.75

P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

P(good | low) = 0.75, so we classify the example as “good”.
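The same arithmetic as a short sketch (the helper name is mine): marginalize the joint table to get P(low), then divide.

```python
# Classify <Horse = low> using the joint table P(Mpg, Horse).
joint = {("good", "low"): 0.36, ("good", "high"): 0.04,
         ("bad",  "low"): 0.12, ("bad",  "high"): 0.48}

def posterior_mpg(mpg, horse):
    """P(mpg | horse) = P(mpg, horse) / sum over m of P(m, horse)."""
    p_horse = sum(p for (m, h), p in joint.items() if h == horse)
    return joint[(mpg, horse)] / p_horse

print(posterior_mpg("good", "low"))  # 0.75 -> classify as "good"
```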

12

Bayes Classifiers

• That was just inference!

• In fact, virtually all machine learning tasks are a form of inference

• Anomaly detection: P(x)
• Classification: P(Class | x)
• Regression: P(Y | x)
• Model learning: P(Model | dataset)
• Feature selection: P(Model | dataset)

13

Suppose we get a <Horse = low, Accel = fast> example?

P(good | low, fast) = P(good, low, fast) / P(low, fast)
                    = P(good) P(low | good) P(fast | low) / P(low, fast)
                    = (0.4)(0.89)(0.05) / P(low, fast)
                    = 0.0178 / P(low, fast)
                    = 0.0178 / (P(good, low, fast) + P(bad, low, fast))
                    ≈ 0.75

Bayes net: Mpg → Horse → Accel

P(Mpg):
  P(good) = 0.4
  P(bad)  = 0.6

P(Horse | Mpg):
  P(low | good)  = 0.89
  P(low | bad)   = 0.21
  P(high | good) = 0.11
  P(high | bad)  = 0.79

P(Accel | Horse):
  P(slow | low)  = 0.95
  P(slow | high) = 0.11
  P(fast | low)  = 0.05
  P(fast | high) = 0.89

Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…

14

Suppose we get a <Horse = low, Accel = fast> example?

P(good | low, fast) ≈ 0.75, so we classify the example as “good”.

Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…

…but that seems somehow familiar…

Wasn’t that the same answer as P(good | low)?
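A quick numerical check of that suspicion, using the CPTs from the slide (function names are mine). With these rounded CPTs both posteriors come out at about 0.74, the “not exactly 0.75” the note mentions, and they are identical because the P(fast | low) factor cancels out of the ratio:

```python
p_mpg   = {"good": 0.4, "bad": 0.6}
p_horse = {("low", "good"): 0.89, ("low", "bad"): 0.21,
           ("high", "good"): 0.11, ("high", "bad"): 0.79}
p_accel = {("slow", "low"): 0.95, ("slow", "high"): 0.11,
           ("fast", "low"): 0.05, ("fast", "high"): 0.89}

def p_good_given_horse(horse):
    num = {m: p_mpg[m] * p_horse[(horse, m)] for m in p_mpg}
    return num["good"] / sum(num.values())

def p_good_given_horse_accel(horse, accel):
    num = {m: p_mpg[m] * p_horse[(horse, m)] * p_accel[(accel, horse)] for m in p_mpg}
    return num["good"] / sum(num.values())

print(round(p_good_given_horse("low"), 3))                # 0.739
print(round(p_good_given_horse_accel("low", "fast"), 3))  # 0.739 -- the same answer
```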

15

Suppose we get a <Horse = low, Accel = fast> example?

Yes: in this network Accel depends on Mpg only through Horse, so once we know Horse = low, observing Accel = fast adds no further information about Mpg, and P(good | low, fast) = P(good | low).

(Bayes net Mpg → Horse → Accel, with the same CPTs as above.)

16

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.

• Assume there are m input attributes called X1, X2, …, Xm.

• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.

• Define DSi = Records in which Y=vi

• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.

17

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.

• Assume there are m input attributes called X1, X2, …, Xm.

• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.

• Define DSi = Records in which Y=vi

• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.

• Mi estimates P(X1, X2, … Xm | Y=vi )

18

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.

• Assume there are m input attributes called X1, X2, …, Xm.

• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.

• Define DSi = Records in which Y=vi

• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.

• Mi estimates P(X1, X2, … Xm | Y=vi )

• Idea: when a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, …, Xm | Y = vi) most likely:

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v)

Is this a good idea?

19

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.

• Assume there are m input attributes called X1, X2, …, Xm.

• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.

• Define DSi = Records in which Y=vi

• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.

• Mi estimates P(X1, X2, … Xm | Y=vi )

• Idea: when a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, …, Xm | Y = vi) most likely:

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v)

Is this a good idea?

This is a Maximum Likelihood classifier.

It can get silly if some Ys are very unlikely

20

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.

• Assume there are m input attributes called X1, X2, …, Xm.

• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.

• Define DSi = Records in which Y=vi

• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.

• Mi estimates P(X1, X2, … Xm | Y=vi )

• Idea: when a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(Y = vi | X1, X2, …, Xm) most likely:

Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)

Is this a good idea?

Much Better Idea

21

Terminology

• MLE (Maximum Likelihood Estimator):

  Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v)

• MAP (Maximum A-Posteriori Estimator):

  Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)
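A toy sketch of why the MAP rule is the “much better idea” when some values of Y are very unlikely (the numbers below are illustrative, not from the slides):

```python
# MLE vs. MAP with one very rare class.
prior      = {"common": 0.99, "rare": 0.01}   # P(Y = v)
likelihood = {"common": 0.10, "rare": 0.30}   # P(X = u | Y = v) for one observed input u

mle_choice = max(likelihood, key=likelihood.get)
map_choice = max(prior, key=lambda v: likelihood[v] * prior[v])

print(mle_choice)  # 'rare'   -- MLE ignores how improbable the class is a priori
print(map_choice)  # 'common' -- MAP weighs the likelihood by the prior P(Y = v)
```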

22

Getting what we need

Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)

23

Getting a posterior probability

P(Y = v | X1 = u1, …, Xm = um)

  = P(X1 = u1, …, Xm = um | Y = v) P(Y = v) / P(X1 = u1, …, Xm = um)

  = P(X1 = u1, …, Xm = um | Y = v) P(Y = v) / Σ_{j=1..nY} P(X1 = u1, …, Xm = um | Y = vj) P(Y = vj)

24

Bayes Classifiers in a nutshell

1. Learn the distribution over inputs for each value of Y.

2. This gives P(X1, X2, …, Xm | Y = vi).

3. Estimate P(Y = vi) as the fraction of records with Y = vi.

4. For a new prediction:

   Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)
            = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

25

Bayes Classifiers in a nutshell

1. Learn the distribution over inputs for each value of Y. We can use our favorite Density Estimator here. Right now we have three options:

   • Probability Table
   • Naïve Density
   • Bayes Net

2. This gives P(X1, X2, …, Xm | Y = vi).

3. Estimate P(Y = vi) as the fraction of records with Y = vi.

4. For a new prediction:

   Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)
            = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

26

Joint Density Bayes Classifier

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

In the case of the joint Bayes Classifier this degenerates to a very simple rule:

Ypredict = the most common value of Y among records in which X1 = u1, X2 = u2, …. Xm = um.

Note that if no records have the exact set of inputs X1 = u1, X2 = u2, …. Xm = um, then P(X1, X2, … Xm | Y=vi ) = 0 for all values of Y.

In that case we just have to guess Y’s value
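A minimal sketch of that degenerate rule (toy records and names of my own):

```python
from collections import Counter

def joint_bc_predict(records, x):
    """Most common Y among records whose inputs exactly match x; None if no record matches."""
    matches = [y for xs, y in records if xs == x]
    if not matches:            # no matching record: P(x | Y = v) = 0 for every v
        return None            # ...so all we can do is guess
    return Counter(matches).most_common(1)[0][0]

records = [(("low", "fast"), "good"), (("low", "fast"), "good"), (("high", "slow"), "bad")]
print(joint_bc_predict(records, ("low", "fast")))   # 'good'
print(joint_bc_predict(records, ("high", "fast")))  # None -> guess
```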

27

Joint BC Results: “Logical”

The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.

The Classifier learned by “Joint BC”

28

Joint BC Results: “All Irrelevant”

The “all irrelevant” dataset consists of 40,000 records and 15 boolean attributes called a, b, c, d, …, o, where the attributes are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.

29

30

BC Results: “MPG”: 392 records

The Classifier learned by “Naive BC”

31

Joint Distribution

[Figure: joint distribution over Mpg, Horsepower, Acceleration, and Maker]

32

Joint Distribution

P(Mpg, Horse) = P(Mpg) * P(Horse|Mpg)

Recall: a joint distribution can be decomposed via the chain rule…

Note that this takes the same amount of information to create. We “gain” nothing from this decomposition.

33

Naive Distribution

[Figure: naive structure with Mpg as the single parent of Cylinders, Horsepower, Weight, Maker, Modelyear, and Acceleration]

P(Mpg), P(Cylinders | Mpg), P(Horsepower | Mpg), P(Weight | Mpg), P(Maker | Mpg), P(Modelyear | Mpg), P(Acceleration | Mpg)

34

Naïve Bayes Classifier

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

In the case of the naïve Bayes Classifier this can be simplified:

Ypredict = argmax_v P(Y = v) · Π_{j=1..m} P(Xj = uj | Y = v)

35

Naïve Bayes Classifier

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

In the case of the naïve Bayes Classifier this can be simplified:

Ypredict = argmax_v P(Y = v) · Π_{j=1..m} P(Xj = uj | Y = v)

Technical Hint: if you have 10,000 input attributes, that product will underflow in floating point math. You should use logs:

Ypredict = argmax_v [ log P(Y = v) + Σ_{j=1..m} log P(Xj = uj | Y = v) ]
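A short sketch of that log-domain decision rule (the CPT numbers below are illustrative placeholders, not taken from the MPG example):

```python
import math

priors = {"good": 0.4, "bad": 0.6}                   # P(Y = v)
cpts = [                                             # one table per attribute j: P(Xj = uj | Y = v)
    {("low", "good"): 0.89, ("low", "bad"): 0.21,
     ("high", "good"): 0.11, ("high", "bad"): 0.79},
    {("fast", "good"): 0.10, ("fast", "bad"): 0.70,
     ("slow", "good"): 0.90, ("slow", "bad"): 0.30},
]

def naive_bc_predict(x):
    """argmax_v [ log P(Y = v) + sum_j log P(Xj = xj | Y = v) ] -- sums of logs never underflow."""
    def score(v):
        return math.log(priors[v]) + sum(math.log(cpt[(xj, v)]) for xj, cpt in zip(x, cpts))
    return max(priors, key=score)

print(naive_bc_predict(("low", "slow")))  # 'good' under these illustrative numbers
```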

36

BC Results: “XOR”

The “XOR” dataset consists of 40,000 records and 2 boolean inputs called a and b, generated 50-50 randomly as 0 or 1. c (output) = a XOR b.

The Classifier learned by “Naive BC”

The Classifier learned by “Joint BC”

37

Naive BC Results: “Logical”

The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.

The Classifier learned by “Naive BC”

38

Naive BC Results: “Logical”

The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.

The Classifier learned by “Joint BC”

This result surprised Andrew until he had thought about it a little.

39

Naïve BC Results: “All Irrelevant”

The “all irrelevant” dataset consists of 40,000 records and 15 boolean attributes called a, b, c, d, …, o, where the attributes are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.

The Classifier learned by “Naive BC”

40

BC Results: “MPG”: 392 records

The Classifier learned by “Naive BC”

41

BC Results: “MPG”: 40 records

42

More Facts About Bayes Classifiers

• Many other density estimators can be slotted in*.

• Density estimation can be performed with real-valued inputs*.

• Bayes Classifiers can be built with real-valued inputs*.

• Rather Technical Complaint: Bayes Classifiers don’t try to be maximally discriminative---they merely try to honestly model what’s going on*.

• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words “Dirichlet Prior”) can help*.

• Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!

*See future Andrew Lectures

43

What you should know

• Probability
  • Fundamentals of Probability and Bayes Rule
  • What’s a Joint Distribution
  • How to do inference (i.e. P(E1 | E2)) once you have a JD

• Density Estimation
  • What is DE and what is it good for
  • How to learn a Joint DE
  • How to learn a naïve DE

44

What you should know

• Bayes Classifiers
  • How to build one
  • How to predict with a BC
  • Contrast between naïve and joint BCs

45

Interesting Questions

• Suppose you were evaluating NaiveBC, JointBC, and Decision Trees

• Invent a problem where only NaiveBC would do well
• Invent a problem where only Dtree would do well
• Invent a problem where only JointBC would do well
• Invent a problem where only NaiveBC would do poorly
• Invent a problem where only Dtree would do poorly
• Invent a problem where only JointBC would do poorly

46

Venn Diagram

47

For more information

• Two nice books:
  • L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
  • J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning.

• Dozens of nice papers, including:
  • Wray Buntine. Learning Classification Trees. Statistics and Computing (1992), Vol. 2, pages 63-73.
  • Kearns and Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. STOC: ACM Symposium on Theory of Computing, 1996.

• Dozens of software implementations available on the web for free and commercially for prices ranging between $50 and $300,000.

48

Probability Model Uses

Classifier:        input attributes  →  P(C | E)

Anomaly Detector:  data point x  →  P(x | M)

Inference Engine:  subset evidence e1  →  P(E2 | e1), for variables E2

Clusterer:         data set  →  clusters of points

49

How to Build a Bayes Classifier

Data Set  →  P(I, A, R, C)

This function simulates a four-dimensional lookup table of the probability of each possible Industry/Analyte/Result/Class combination.

Each record has a class of either “normal” or “outbreak”.

50

How to Build a Bayes Classifier

Data Set is split by class into:

  Data Set (Outbreaks)  →  P(I, A, R | outbreak)
  Data Set (Normals)    →  P(I, A, R | normal)

51

How to Build a Bayes Classifier

Suppose that a new test result arrives…

<meat, salmonella, negative>

P(meat, salmonella, negative, normal) = 0.19

P(meat, salmonella, negative, outbreak) = 0.005

0.19 / 0.005 = 38.0

Class = “normal”!

52

How to Build a Bayes Classifier

Next test:

<Seafood, Vibrio, Positive>

P(seafood, vibrio, positive, normal) = 0.02

P(seafood, vibrio, positive, outbreak) = 0.07

0.02 / 0.07 = 0.29

Class = “outbreak”!
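Both decisions amount to comparing the two joint probabilities, for example as a ratio (a small sketch using the numbers from these two slides):

```python
def classify(p_normal, p_outbreak):
    """Pick whichever class gives the new record the higher joint probability."""
    ratio = p_normal / p_outbreak
    return ("normal" if ratio > 1 else "outbreak"), ratio

print(classify(0.19, 0.005))  # ('normal', 38.0)
print(classify(0.02, 0.07))   # ('outbreak', ~0.29)
```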
