Slide 1

Aug 25th, 2001
Copyright © 2001, Andrew W. Moore

Probabilistic Machine Learning

Brigham S. Anderson

School of Computer Science

Carnegie Mellon University

www.cs.cmu.edu/~brigham

brigham@cmu.edu

2

ML: Some Successful Applications

• Learning to recognize spoken words (speech recognition);

• Text categorization (SPAM, newsgroups);

• Learning to play world-class chess, backgammon and checkers;

• Handwriting recognition;

• Learning to classify new astronomical data;

• Learning to detect cancerous tissues (e.g. colon polyp detection).

3

Machine Learning Application Areas

• Science: astronomy, bioinformatics, drug discovery, …

• Business: advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-commerce, targeted marketing, health care, …

• Web: search engines, bots, …

• Government: law enforcement, profiling tax cheaters, anti-terror(?)

4

Classification Application: Assessing Credit Risk

• Situation: A person applies for a loan.

• Task: Should the bank approve the loan?

• Banks develop credit models using a variety of machine learning methods.

• The proliferation of mortgages and credit cards is the result of being able to successfully predict whether a person is likely to default on a loan.

• Widely deployed in many countries.

5

Prob. Table Anomaly Detector

• Suppose we have the following model:

P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

• We’re trying to detect anomalous cars.

• If the next example we see is <good,high>, how anomalous is it?

6

Prob. Table Anomaly Detector

How likely is <good, high>?

likelihood(good, high) = P(good, high) = 0.04

P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48
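To make the lookup concrete, here is a minimal Python sketch of this table-based anomaly detector (the dictionary layout and function name are mine, not from the slides):

```python
# Joint table P(Mpg, Horse) from the slide above.
joint = {
    ("good", "low"):  0.36,
    ("good", "high"): 0.04,
    ("bad",  "low"):  0.12,
    ("bad",  "high"): 0.48,
}

def likelihood(mpg, horse):
    """Return P(mpg, horse); a small value flags the car as anomalous."""
    return joint[(mpg, horse)]

print(likelihood("good", "high"))  # 0.04 -> quite unlikely, hence anomalous
```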

7

Bayes Net Anomaly Detector

How likely is a <good, high, fast> example?

P(good, high, fast) = P(good) P(high | good) P(fast | high)
                    = (0.4)(0.11)(0.89)
                    ≈ 0.039

Bayes net: Mpg → Horse → Accel

P(Mpg):
  P(good) = 0.4
  P(bad)  = 0.6

P(Horse | Mpg):
  P(low | good)  = 0.89
  P(low | bad)   = 0.21
  P(high | good) = 0.11
  P(high | bad)  = 0.79

P(Accel | Horse):
  P(slow | low)  = 0.95
  P(slow | high) = 0.11
  P(fast | low)  = 0.05
  P(fast | high) = 0.89
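A rough sketch of the same computation in Python, reading the three CPTs straight off the slide (variable names are my own):

```python
# Chain-rule factorization for the Mpg -> Horse -> Accel network.
p_mpg   = {"good": 0.4, "bad": 0.6}                                  # P(Mpg)
p_horse = {("low", "good"): 0.89, ("low", "bad"): 0.21,
           ("high", "good"): 0.11, ("high", "bad"): 0.79}            # P(Horse | Mpg)
p_accel = {("slow", "low"): 0.95, ("slow", "high"): 0.11,
           ("fast", "low"): 0.05, ("fast", "high"): 0.89}            # P(Accel | Horse)

def joint(mpg, horse, accel):
    """P(mpg, horse, accel) = P(mpg) * P(horse | mpg) * P(accel | horse)."""
    return p_mpg[mpg] * p_horse[(horse, mpg)] * p_accel[(accel, horse)]

print(round(joint("good", "high", "fast"), 3))  # 0.039
```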

8

Probability Model Uses

Classifier:        data point x  →  P(C | x)

Anomaly Detector:  data point x  →  P(x)

Inference Engine:  evidence e1  →  P(E2 | e1), for missing variables E2

9

Bayes Classifiers

• A formidable and sworn enemy of decision trees

Classifier: data point x → P(C | x)

10

Dead-Simple Bayes Classifier Example

• Suppose we have the following model:

P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

• We’re trying to classify cars as Mpg = “good” or “bad”

• If the next example we see is Horse = “low”, how do we classify it?

11

Dead-Simple Bayes Classifier Example

How do we classify <Horse = low>?

P(good | low) = P(good, low) / P(low)
              = P(good, low) / (P(good, low) + P(bad, low))
              = 0.36 / (0.36 + 0.12)
              = 0.75

P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

P(good | low) = 0.75, so we classify the example as “good”.
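The same arithmetic as a short sketch (the helper name is mine): marginalize the joint table to get P(low), then divide.

```python
# Classify <Horse = low> using the joint table P(Mpg, Horse).
joint = {("good", "low"): 0.36, ("good", "high"): 0.04,
         ("bad",  "low"): 0.12, ("bad",  "high"): 0.48}

def posterior_mpg(mpg, horse):
    """P(mpg | horse) = P(mpg, horse) / sum over m of P(m, horse)."""
    p_horse = sum(p for (m, h), p in joint.items() if h == horse)
    return joint[(mpg, horse)] / p_horse

print(posterior_mpg("good", "low"))  # 0.75 -> classify as "good"
```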

12

Bayes Classifiers

• That was just inference!

• In fact, virtually all machine learning tasks are a form of inference

• Anomaly detection: P(x)
• Classification: P(Class | x)
• Regression: P(Y | x)
• Model learning: P(Model | dataset)
• Feature selection: P(Model | dataset)

13

Suppose we get a <Horse = low, Accel = fast> example?

P(good | low, fast) = P(good, low, fast) / P(low, fast)
                    = P(good) P(low | good) P(fast | low) / P(low, fast)
                    = (0.4)(0.89)(0.05) / P(low, fast)
                    = 0.0178 / P(low, fast)
                    = 0.0178 / (P(good, low, fast) + P(bad, low, fast))
                    ≈ 0.75

Bayes net: Mpg → Horse → Accel

P(Mpg):
  P(good) = 0.4
  P(bad)  = 0.6

P(Horse | Mpg):
  P(low | good)  = 0.89
  P(low | bad)   = 0.21
  P(high | good) = 0.11
  P(high | bad)  = 0.79

P(Accel | Horse):
  P(slow | low)  = 0.95
  P(slow | high) = 0.11
  P(fast | low)  = 0.05
  P(fast | high) = 0.89

Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…

14

Suppose we get a <Horse = low, Accel = fast> example?

P(good | low, fast) ≈ 0.75, so we classify the example as “good”.

Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…

…but that seems somehow familiar…

Wasn’t that the same answer as P(good | low)?
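A quick numerical check of that suspicion, using the CPTs from the slide (function names are mine). With these rounded CPTs both posteriors come out at about 0.74, the “not exactly 0.75” the note mentions, and they are identical because the P(fast | low) factor cancels out of the ratio:

```python
p_mpg   = {"good": 0.4, "bad": 0.6}
p_horse = {("low", "good"): 0.89, ("low", "bad"): 0.21,
           ("high", "good"): 0.11, ("high", "bad"): 0.79}
p_accel = {("slow", "low"): 0.95, ("slow", "high"): 0.11,
           ("fast", "low"): 0.05, ("fast", "high"): 0.89}

def p_good_given_horse(horse):
    num = {m: p_mpg[m] * p_horse[(horse, m)] for m in p_mpg}
    return num["good"] / sum(num.values())

def p_good_given_horse_accel(horse, accel):
    num = {m: p_mpg[m] * p_horse[(horse, m)] * p_accel[(accel, horse)] for m in p_mpg}
    return num["good"] / sum(num.values())

print(round(p_good_given_horse("low"), 3))                # 0.739
print(round(p_good_given_horse_accel("low", "fast"), 3))  # 0.739 -- the same answer
```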

15

Suppose we get a <Horse = low, Accel = fast> example?

Yes: in this network Accel depends on Mpg only through Horse, so once we know Horse = low, observing Accel = fast adds no further information about Mpg, and P(good | low, fast) = P(good | low).

(Bayes net Mpg → Horse → Accel, with the same CPTs as above.)

16

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.

• Assume there are m input attributes called X1, X2, …, Xm.

• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.

• Define DSi = Records in which Y=vi

• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.

17

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.

• Assume there are m input attributes called X1, X2, …, Xm.

• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.

• Define DSi = Records in which Y=vi

• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.

• Mi estimates P(X1, X2, … Xm | Y=vi )

18

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.

• Assume there are m input attributes called X1, X2, …, Xm.

• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.

• Define DSi = Records in which Y=vi

• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.

• Mi estimates P(X1, X2, … Xm | Y=vi )

• Idea: when a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, …, Xm | Y = vi) most likely:

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v)

Is this a good idea?

19

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.

• Assume there are m input attributes called X1, X2, …, Xm.

• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.

• Define DSi = Records in which Y=vi

• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.

• Mi estimates P(X1, X2, … Xm | Y=vi )

• Idea: when a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, …, Xm | Y = vi) most likely:

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v)

Is this a good idea?

This is a Maximum Likelihood classifier.

It can get silly if some Ys are very unlikely

20

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.

• Assume there are m input attributes called X1, X2, …, Xm.

• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.

• Define DSi = Records in which Y=vi

• For each DSi , learn Density Estimator Mi to model the input distribution among the Y=vi records.

• Mi estimates P(X1, X2, … Xm | Y=vi )

• Idea: when a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(Y = vi | X1, X2, …, Xm) most likely:

Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)

Is this a good idea?

Much Better Idea

21

Terminology

• MLE (Maximum Likelihood Estimator):

  Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v)

• MAP (Maximum A-Posteriori Estimator):

  Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)
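A toy sketch of why the MAP rule is the “much better idea” when some values of Y are very unlikely (the numbers below are illustrative, not from the slides):

```python
# MLE vs. MAP with one very rare class.
prior      = {"common": 0.99, "rare": 0.01}   # P(Y = v)
likelihood = {"common": 0.10, "rare": 0.30}   # P(X = u | Y = v) for one observed input u

mle_choice = max(likelihood, key=likelihood.get)
map_choice = max(prior, key=lambda v: likelihood[v] * prior[v])

print(mle_choice)  # 'rare'   -- MLE ignores how improbable the class is a priori
print(map_choice)  # 'common' -- MAP weighs the likelihood by the prior P(Y = v)
```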

22

Getting what we need

Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)

23

Getting a posterior probability

P(Y = v | X1 = u1, …, Xm = um)

  = P(X1 = u1, …, Xm = um | Y = v) P(Y = v) / P(X1 = u1, …, Xm = um)

  = P(X1 = u1, …, Xm = um | Y = v) P(Y = v) / Σ_{j=1..nY} P(X1 = u1, …, Xm = um | Y = vj) P(Y = vj)

24

Bayes Classifiers in a nutshell

1. Learn the distribution over inputs for each value of Y.

2. This gives P(X1, X2, …, Xm | Y = vi).

3. Estimate P(Y = vi) as the fraction of records with Y = vi.

4. For a new prediction:

   Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)
            = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

25

Bayes Classifiers in a nutshell

1. Learn the distribution over inputs for each value of Y. We can use our favorite Density Estimator here. Right now we have three options:

   • Probability Table
   • Naïve Density
   • Bayes Net

2. This gives P(X1, X2, …, Xm | Y = vi).

3. Estimate P(Y = vi) as the fraction of records with Y = vi.

4. For a new prediction:

   Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)
            = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

26

Joint Density Bayes Classifier

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

In the case of the joint Bayes Classifier this degenerates to a very simple rule:

Ypredict = the most common value of Y among records in which X1 = u1, X2 = u2, …. Xm = um.

Note that if no records have the exact set of inputs X1 = u1, X2 = u2, …. Xm = um, then P(X1, X2, … Xm | Y=vi ) = 0 for all values of Y.

In that case we just have to guess Y’s value
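A minimal sketch of that degenerate rule (toy records and names of my own):

```python
from collections import Counter

def joint_bc_predict(records, x):
    """Most common Y among records whose inputs exactly match x; None if no record matches."""
    matches = [y for xs, y in records if xs == x]
    if not matches:            # no matching record: P(x | Y = v) = 0 for every v
        return None            # ...so all we can do is guess
    return Counter(matches).most_common(1)[0][0]

records = [(("low", "fast"), "good"), (("low", "fast"), "good"), (("high", "slow"), "bad")]
print(joint_bc_predict(records, ("low", "fast")))   # 'good'
print(joint_bc_predict(records, ("high", "fast")))  # None -> guess
```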

27

Joint BC Results: “Logical”

The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.

The Classifier learned by “Joint BC”

28

Joint BC Results: “All Irrelevant”

The “all irrelevant” dataset consists of 40,000 records and 15 boolean attributes called a, b, c, d, …, o, where the attributes are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.

29

30

BC Results: “MPG”: 392 records

The Classifier learned by “Naive BC”

31

Joint Distribution

[Figure: joint distribution over Mpg, Horsepower, Acceleration, and Maker]

32

Joint Distribution

P(Mpg, Horse) = P(Mpg) * P(Horse|Mpg)

Recall: a joint distribution can be decomposed via the chain rule…

Note that this takes the same amount of information to create. We “gain” nothing from this decomposition.

33

Naive Distribution

[Figure: naive structure with Mpg as the single parent of Cylinders, Horsepower, Weight, Maker, Modelyear, and Acceleration]

P(Mpg), P(Cylinders | Mpg), P(Horsepower | Mpg), P(Weight | Mpg), P(Maker | Mpg), P(Modelyear | Mpg), P(Acceleration | Mpg)

34

Naïve Bayes Classifier

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

In the case of the naïve Bayes Classifier this can be simplified:

Ypredict = argmax_v P(Y = v) · Π_{j=1..m} P(Xj = uj | Y = v)

35

Naïve Bayes Classifier

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

In the case of the naïve Bayes Classifier this can be simplified:

Ypredict = argmax_v P(Y = v) · Π_{j=1..m} P(Xj = uj | Y = v)

Technical Hint: if you have 10,000 input attributes, that product will underflow in floating point math. You should use logs:

Ypredict = argmax_v [ log P(Y = v) + Σ_{j=1..m} log P(Xj = uj | Y = v) ]
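A short sketch of that log-domain decision rule (the CPT numbers below are illustrative placeholders, not taken from the MPG example):

```python
import math

priors = {"good": 0.4, "bad": 0.6}                   # P(Y = v)
cpts = [                                             # one table per attribute j: P(Xj = uj | Y = v)
    {("low", "good"): 0.89, ("low", "bad"): 0.21,
     ("high", "good"): 0.11, ("high", "bad"): 0.79},
    {("fast", "good"): 0.10, ("fast", "bad"): 0.70,
     ("slow", "good"): 0.90, ("slow", "bad"): 0.30},
]

def naive_bc_predict(x):
    """argmax_v [ log P(Y = v) + sum_j log P(Xj = xj | Y = v) ] -- sums of logs never underflow."""
    def score(v):
        return math.log(priors[v]) + sum(math.log(cpt[(xj, v)]) for xj, cpt in zip(x, cpts))
    return max(priors, key=score)

print(naive_bc_predict(("low", "slow")))  # 'good' under these illustrative numbers
```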

36

BC Results: “XOR”

The “XOR” dataset consists of 40,000 records and 2 boolean inputs called a and b, generated 50-50 randomly as 0 or 1. c (output) = a XOR b.

The Classifier learned by “Naive BC”

The Classifier learned by “Joint BC”

37

Naive BC Results: “Logical”

The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.

The Classifier learned by “Naive BC”

38

Naive BC Results: “Logical”

The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.

The Classifier learned by “Joint BC”

This result surprised Andrew until he had thought about it a little.

39

Naïve BC Results: “All Irrelevant”

The “all irrelevant” dataset consists of 40,000 records and 15 boolean attributes called a, b, c, d, …, o, where the attributes are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.

The Classifier learned by “Naive BC”

40

BC Results: “MPG”: 392 records

The Classifier learned by “Naive BC”

41

BC Results: “MPG”: 40 records

42

More Facts About Bayes Classifiers

• Many other density estimators can be slotted in*.

• Density estimation can be performed with real-valued inputs*.

• Bayes Classifiers can be built with real-valued inputs*.

• Rather Technical Complaint: Bayes Classifiers don’t try to be maximally discriminative---they merely try to honestly model what’s going on*.

• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words “Dirichlet Prior”) can help*.

• Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!

*See future Andrew Lectures

43

What you should know

• Probability
  • Fundamentals of Probability and Bayes Rule
  • What’s a Joint Distribution
  • How to do inference (i.e. P(E1 | E2)) once you have a JD

• Density Estimation
  • What is DE and what is it good for
  • How to learn a Joint DE
  • How to learn a naïve DE

44

What you should know

• Bayes Classifiers
  • How to build one
  • How to predict with a BC
  • Contrast between naïve and joint BCs

45

Interesting Questions

• Suppose you were evaluating NaiveBC, JointBC, and Decision Trees

• Invent a problem where only NaiveBC would do well
• Invent a problem where only Dtree would do well
• Invent a problem where only JointBC would do well
• Invent a problem where only NaiveBC would do poorly
• Invent a problem where only Dtree would do poorly
• Invent a problem where only JointBC would do poorly

46

Venn Diagram

47

For more information

• Two nice books:
  • L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
  • J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning.

• Dozens of nice papers, including:
  • Wray Buntine. Learning Classification Trees. Statistics and Computing (1992), Vol. 2, pages 63-73.
  • Kearns and Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. STOC: ACM Symposium on Theory of Computing, 1996.

• Dozens of software implementations available on the web for free and commercially for prices ranging between $50 and $300,000.

48

Probability Model Uses

Classifier:        input attributes  →  P(C | E)

Anomaly Detector:  data point x  →  P(x | M)

Inference Engine:  subset evidence e1  →  P(E2 | e1), for variables E2

Clusterer:         data set  →  clusters of points

49

How to Build a Bayes Classifier

Data Set  →  P(I, A, R, C)

This function simulates a four-dimensional lookup table of the probability of each possible Industry/Analyte/Result/Class combination.

Each record has a class of either “normal” or “outbreak”.

50

How to Build a Bayes Classifier

Data Set is split by class into:

  Data Set (Outbreaks)  →  P(I, A, R | outbreak)
  Data Set (Normals)    →  P(I, A, R | normal)

51

How to Build a Bayes Classifier

Suppose that a new test result arrives…

<meat, salmonella, negative>

P(meat, salmonella, negative, normal) = 0.19

P(meat, salmonella, negative, outbreak) = 0.005

0.19 / 0.005 = 38.0

Class = “normal”!

52

How to Build a Bayes Classifier

Next test:

<Seafood, Vibrio, Positive>

P(seafood, vibrio, positive, normal) = 0.02

P(seafood, vibrio, positive, outbreak) = 0.07

0.02 / 0.07 = 0.29

Class = “outbreak”!
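Both decisions amount to comparing the two joint probabilities, for example as a ratio (a small sketch using the numbers from these two slides):

```python
def classify(p_normal, p_outbreak):
    """Pick whichever class gives the new record the higher joint probability."""
    ratio = p_normal / p_outbreak
    return ("normal" if ratio > 1 else "outbreak"), ratio

print(classify(0.19, 0.005))  # ('normal', 38.0)
print(classify(0.02, 0.07))   # ('outbreak', ~0.29)
```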
