EECS 440: Machine Learning
Soumya Ray
http://engr.case.edu/ray_soumya/eecs440_fall13/
Office: Olin 516
Office hours: Th, Fri 1:30-2:30 or by appointment
Text: Machine Learning by Tom Mitchell
Today
Bayesian Learning
Read Mitchell Chapter 6, plus extra material on the website
Naïve Bayes
Simplest generative classifier for discrete data
$$p(\mathbf{x}_i, y_i) = p(\mathbf{X} = \mathbf{x}_i \mid Y = y_i)\, p(Y = y_i)$$

$$= p(x_{i1}, \ldots, x_{in} \mid Y = y_i)\, p(Y = y_i)$$

$$= \prod_j p(X_{ij} = x_{ij} \mid Y = y_i)\, p(Y = y_i)$$
Naïve Bayes parameters: instead of storing probabilities for each example, we will only store these conditional probabilities and use this formula to calculate the probability for an example.

Naïve Bayes assumption: attributes are conditionally independent given the class.
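To make the factorization concrete, here is a minimal Python sketch; the parameter layout (`prior`, `cond`) is an assumption for illustration, not from the lecture:

```python
def nb_joint(x, y, prior, cond):
    """Joint probability p(x, y) under the naive Bayes assumption.

    prior[y]      = p(Y = y)
    cond[j][y][v] = p(X_j = v | Y = y)  -- the only stored parameters
    Conditional independence lets the attributes factor into a product.
    """
    p = prior[y]
    for j, v in enumerate(x):
        p *= cond[j][y][v]
    return p
```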
ML Hypothesis
If every hypothesis in H has equal prior probability, only the first term matters

This gives the maximum likelihood (ML) hypothesis
$$h_{ML} = \operatorname*{arg\,max}_{h \in H} \Pr(D \mid h)$$
Naïve Bayes Parameter MLEs
$$p(X_i = 1 \mid Y = 1) = \frac{p(X_i = 1, Y = 1)}{p(Y = 1)} = \frac{\#\text{observed examples with } X_i = 1 \text{ and } Y = 1}{\#\text{observed examples with } Y = 1}$$

$$p(Y = 1) = \frac{\#\text{observed examples with } Y = 1}{\#\text{observed examples}}$$
Example
Has-fur? Long-Teeth? Scary? Lion?
Animal1 Yes No No No
Animal2 No Yes Yes No
Animal3 Yes Yes Yes Yes
p(Has-fur=Yes|Lion)=?, p(Has-fur=Yes|Not-Lion)=?
p(Long-Teeth=Yes|Lion)=?, p(Long-Teeth=Yes|Not-Lion)=?
p(Scary=Yes|Lion)=?, p(Scary=Yes|Not-Lion)=?
p(Lion)=?
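As a sanity check on these questions, a small sketch that computes the MLE answers by counting rows of the table above (the other attributes work the same way):

```python
# Rows: (has_fur, long_teeth, scary, lion), from the table above.
data = [
    (1, 0, 0, 0),  # Animal1
    (0, 1, 1, 0),  # Animal2
    (1, 1, 1, 1),  # Animal3
]

def mle(attr_idx, lion_value):
    """p(attribute = Yes | Lion = lion_value) by counting."""
    rows = [r for r in data if r[3] == lion_value]
    return sum(r[attr_idx] for r in rows) / len(rows)

print(mle(0, 1))                            # p(Has-fur=Yes | Lion)     = 1.0
print(mle(0, 0))                            # p(Has-fur=Yes | Not-Lion) = 0.5
print(sum(r[3] for r in data) / len(data))  # p(Lion) = 1/3
```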
Smoothing probability estimates
What happens if a certain value for a variable is not in our set of examples, for a certain class?

Suppose we're trying to classify lions and we've never seen a lion cub, so

$$p(\mathrm{Scary} = \mathrm{false} \mid \mathrm{Lion}) = 0$$

When we see a cub, its probability of being a lion will be zero by our Naïve Bayes formula, even if it has long teeth and fur.

It's a good idea to smooth our probability estimates to avoid this.
m-Estimates
$$p(X_i = x_i \mid Y = y) = \frac{(\#\text{examples with } X_i = x_i \text{ and } Y = y) + mp}{(\#\text{examples with } Y = y) + m}$$

p is our prior estimate of the probability

m is called the equivalent sample size, which determines the importance of p relative to the observations

If the variable has v values, the specific case m = v, p = 1/v is called Laplace smoothing
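A minimal sketch of the m-estimate as a plain function (function and variable names are mine, for illustration); Laplace smoothing falls out as the special case m = v, p = 1/v:

```python
def m_estimate(n_match, n_class, m, p):
    """Smoothed estimate of p(X_i = x_i | Y = y).

    n_match: #examples with X_i = x_i and Y = y
    n_class: #examples with Y = y
    m, p:    equivalent sample size and prior estimate
    """
    return (n_match + m * p) / (n_class + m)

def laplace(n_match, n_class, v):
    """Laplace smoothing: the special case m = v, p = 1/v."""
    return m_estimate(n_match, n_class, m=v, p=1.0 / v)

# With no smoothing, 0 scary lion cubs out of 1 lion gives probability 0;
# with Laplace smoothing over v = 2 values we get (0 + 1) / (1 + 2).
print(laplace(0, 1, v=2))  # 0.333...
```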
Nominal Attributes
Need to estimate the parameters $p(X_i = v_k \mid Y = y)$
Can use maximum likelihood estimates:
$$p(X_i = v_k \mid Y = y) = \frac{p(X_i = v_k, Y = y)}{p(Y = y)} = \frac{\#\text{examples with } X_i = v_k \text{ and } Y = y}{\#\text{examples with } Y = y}$$
Continuous Attributes
If $X_i$ is a continuous attribute, we can model $p(X_i \mid y)$ as a Gaussian distribution (Gaussian naïve Bayes)

MLEs:
$$p(X_i \mid y) = N(\mu_{i|y}, \sigma_{i|y})$$

$$\mu_{i|y} = \frac{\sum_{k \in \text{examples}} x_{ik}\, I(y_k = y)}{\sum_{k \in \text{examples}} I(y_k = y)} \qquad \sigma^2_{i|y} = \frac{\sum_{k \in \text{examples}} (x_{ik} - \mu_{i|y})^2\, I(y_k = y)}{\sum_{k \in \text{examples}} I(y_k = y)}$$
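A sketch of these MLEs using NumPy, where the boolean class mask plays the role of the indicator $I(y_k = y)$ (array layout is assumed for illustration):

```python
import numpy as np

def gaussian_mle(x, y, label):
    """Per-class mean and variance MLEs for one continuous attribute.

    x: attribute values for all examples; y: class labels.
    Equivalent to the indicator-weighted sums above.
    """
    mask = (y == label)              # I(y_k = label)
    mu = x[mask].mean()
    var = ((x[mask] - mu) ** 2).mean()
    return mu, var

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([0, 0, 0, 1])
print(gaussian_mle(x, y, 0))  # (2.0, 0.666...)
```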
Naïve Bayes Geometry

What does the decision surface of the naïve Bayes classifier look like?

An example is classified positive iff $p(\mathbf{x}, y=1) > p(\mathbf{x}, y=0)$
$$\frac{p(\mathbf{x}, y=1)}{p(\mathbf{x}, y=0)} > 1 \iff \frac{\prod_i p(x_i \mid y=1)\, p(y=1)}{\prod_i p(x_i \mid y=0)\, p(y=0)} > 1$$
Naïve Bayes Geometry
Classify an example as positive if
$$\frac{\prod_i p(x_i \mid y=1)\, p(y=1)}{\prod_i p(x_i \mid y=0)\, p(y=0)} > 1$$

Taking logs:

$$\ln \frac{p(y=1)}{p(y=0)} + \sum_i \ln \frac{p(x_i \mid y=1)}{p(x_i \mid y=0)} > \ln 1 = 0$$
Naïve Bayes Geometry
$$\ln \frac{p(y=1)}{p(y=0)} + \sum_i \ln \frac{p(x_i \mid y=1)}{p(x_i \mid y=0)} > 0$$

For nominal attributes, write each term using the indicator function $I(\cdot)$:

$$\ln \frac{p(y=1)}{p(y=0)} + \sum_{i,v} \ln \frac{p(X_i = v \mid y=1)}{p(X_i = v \mid y=0)}\, I(X_i = v) > 0$$

$$b + \sum_{i,v} w_{i,v}\, I(X_i = v) > 0, \quad \text{where } b = \ln \frac{p(y=1)}{p(y=0)}, \quad w_{i,v} = \ln \frac{p(X_i = v \mid y=1)}{p(X_i = v \mid y=0)}$$

So naïve Bayes implements a linear decision boundary with specific parameters.
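To make the correspondence explicit, a sketch that extracts $b$ and the weights $w_{i,v}$ from already-estimated naïve Bayes parameters (the dictionary layout is an assumption for illustration):

```python
import math

def nb_linear_params(prior, cond, values):
    """Convert NB parameters into the rule b + sum_{i,v} w[i,v] I(X_i = v) > 0.

    prior[y]       = p(y)
    cond[(i, v)][y] = p(X_i = v | y)
    values[i]      = possible values of attribute i
    """
    b = math.log(prior[1] / prior[0])
    w = {(i, v): math.log(cond[(i, v)][1] / cond[(i, v)][0])
         for i in values for v in values[i]}
    return b, w

def predict(x, b, w):
    """Classify positive iff the linear score exceeds 0."""
    score = b + sum(w[(i, v)] for i, v in enumerate(x))
    return int(score > 0)
```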
Naïve Bayes and Text Classification
Used very successfully to categorize documents
Is this document about sports or finance?
Is this email spam or ham?
Given a vocabulary, each attribute $X_i$ is the presence/absence of word $i$ in the document
Ignores word order
Bag-of-words approach
Text Classification contd.
Smoothed parameter estimates
Called Multivariate Bernoulli model
$$p(word_k \text{ present} \mid Y = y) = \frac{(\#\text{documents with } word_k \text{ present and } Y = y) + 1}{(\#\text{documents with } Y = y) + 2}$$

Each per-word parameter is a Bernoulli distribution.
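A sketch of these smoothed estimates over a toy bag-of-words corpus (the corpus and vocabulary here are made up for illustration; the +1/+2 smoothing is the Laplace case for a two-valued variable):

```python
def bernoulli_estimates(docs, labels, vocab, y):
    """Smoothed p(word present | Y = y) for each vocabulary word.

    docs: list of sets of words; labels: class of each document.
    """
    class_docs = [d for d, l in zip(docs, labels) if l == y]
    return {w: (sum(w in d for d in class_docs) + 1) / (len(class_docs) + 2)
            for w in vocab}

docs = [{"ball", "score"}, {"stock", "price"}, {"score", "win"}]
labels = ["sports", "finance", "sports"]
print(bernoulli_estimates(docs, labels, {"score", "stock"}, "sports"))
# score -> 0.75, stock -> 0.25 (dict order may vary)
```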
Tree Augmented Naïve Bayes
Can augment the model so that there is a tree structure over the attributes

In this case, the structure is also unknown

Given a training sample, an algorithm exists to learn the optimal structure

Makes fewer independence assumptions than NB, giving better classification performance
Tree Augmented Naïve Bayes
Create a complete graph

Nodes are attributes

Edges are weighted by $I(X; Y \mid C)$

Find the maximal weighted spanning tree of this graph

Can show this is the tree structure that maximizes likelihood (see paper on website)
$$I(X; Y \mid C) = \sum_{x, y, c} P(x, y, c) \log \frac{P(x, y \mid c)}{P(x \mid c)\, P(y \mid c)}$$

This is the class-conditional mutual information.
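A sketch of this edge weight computed from an empirical joint distribution table (the array layout `pxyc[x, y, c] = P(x, y, c)` is an assumption for illustration):

```python
import numpy as np

def cond_mutual_info(pxyc):
    """I(X; Y | C) from a joint table pxyc[x, y, c] = P(x, y, c)."""
    pc = pxyc.sum(axis=(0, 1))   # P(c)
    pxc = pxyc.sum(axis=1)       # P(x, c)
    pyc = pxyc.sum(axis=0)       # P(y, c)
    total = 0.0
    for x, y, c in np.ndindex(pxyc.shape):
        if pxyc[x, y, c] > 0:
            # P(x,y|c) / (P(x|c) P(y|c)) simplifies to P(x,y,c) P(c) / (P(x,c) P(y,c))
            total += pxyc[x, y, c] * np.log(
                pxyc[x, y, c] * pc[c] / (pxc[x, c] * pyc[y, c]))
    return total
```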
Logistic Regression
Simplest discriminative model

Models the log odds as a linear function
$$\log \frac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} = \mathbf{w} \cdot \mathbf{x} + b$$

$$p(Y=1 \mid \mathbf{x}) = \left[1 - p(Y=1 \mid \mathbf{x})\right] e^{\mathbf{w} \cdot \mathbf{x} + b}$$

$$p(Y=1 \mid \mathbf{x}) \left(1 + e^{\mathbf{w} \cdot \mathbf{x} + b}\right) = e^{\mathbf{w} \cdot \mathbf{x} + b}$$

$$p(Y=1 \mid \mathbf{x}) = \frac{e^{\mathbf{w} \cdot \mathbf{x} + b}}{1 + e^{\mathbf{w} \cdot \mathbf{x} + b}} = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x} + b)}}$$
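The final expression is the logistic sigmoid; a minimal sketch of the resulting model:

```python
import numpy as np

def p_y1(x, w, b):
    """p(Y = 1 | x) = 1 / (1 + exp(-(w.x + b))), the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
```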
Estimating parameters
Use MLE: optimize the log conditional likelihood of the data
$$\mathbf{w}, b = \operatorname*{arg\,max} \prod_i p(y_i \mid \mathbf{x}_i) \quad \text{(the conditional likelihood)}$$

$$= \operatorname*{arg\,max} \left[ \sum_{i \in pos} \log p(Y=1 \mid \mathbf{x}_i) + \sum_{i \in neg} \log p(Y=0 \mid \mathbf{x}_i) \right]$$

$$= \operatorname*{arg\,max} \left[ \sum_{i \in pos} \log \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x}_i + b)}} + \sum_{i \in neg} \log \left(1 - \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x}_i + b)}}\right) \right]$$
Estimating parameters
Can use gradient descent, Newton's method, etc. (a sketch of the gradient option follows below)

In practice, also use overfitting control via a penalty on ||w||

Very robust method, works extremely well in many practical situations, very easy to code

Often a good idea to try this first!
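A minimal sketch of gradient ascent on the L2-penalized log conditional likelihood (learning rate, penalty weight, and iteration count are illustrative choices, not from the lecture):

```python
import numpy as np

def fit_logistic(X, y, lam=0.1, lr=0.1, iters=1000):
    """Gradient ascent on the L2-penalized log conditional likelihood.

    X: (n, d) examples; y: 0/1 labels; lam: weight on the ||w||^2 penalty.
    The gradient of the log-likelihood in w is X^T (y - p).
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # p(Y=1 | x_i) for all i
        w += lr * (X.T @ (y - p) / n - lam * w)  # penalized gradient step
        b += lr * np.mean(y - p)
    return w, b
```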
Logistic Regression Geometry
Classify as positive iff:
$$\frac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} > 1$$

$$\text{or if } \log \frac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} > 0$$

$$\text{But } \log \frac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} = \mathbf{w} \cdot \mathbf{x} + b$$

So classify as positive iff $\mathbf{w} \cdot \mathbf{x} + b > 0$.

So like NB, LR also implements a linear decision boundary---but what's the difference?
Relationship to Naïve Bayes

For certain values of w, b, logistic regression will implement the same decision surface as naïve Bayes

Both are linear discriminants, but LR does not make the independence assumptions of NB

More robust than NB, especially in the presence of irrelevant attributes

Also handles continuous attributes nicely

But (as with all discriminative models) there is no easy way to handle missing data
Generative and Discriminative Pairs
[Figure: accuracy vs. training sample size, comparing a generative model, a discriminative model, and a generative model with the correct model class]