EECS 440: Machine Learning
Soumya Ray
http://engr.case.edu/ray_soumya/eecs440_fall13/
Office: Olin 516
Office hours: Th, Fri 1:30-2:30 or by appointment
Text: Machine Learning by Tom Mitchell
Today
Bayesian Learning
Read Mitchell Chapter 6, plus extra material on the website
Naïve Bayes
Simplest generative classifier for discrete data
$$p(\mathbf{x}_i, y_i) = p(\mathbf{X} = \mathbf{x}_i \mid Y = y_i)\, p(Y = y_i)$$

$$= p(x_{i1}, \ldots, x_{in} \mid Y = y_i)\, p(Y = y_i)$$

$$= \prod_j p(X_{ij} = x_{ij} \mid Y = y_i)\, p(Y = y_i)$$
Naïve Bayes parameters: instead of storing probabilities for each example, we will only store these conditional probabilities and use this formula to calculate the probability for an example.

Naïve Bayes assumption: attributes are conditionally independent given the class.
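To make the factorization concrete, here is a minimal Python sketch; the parameter layout (`prior`, `cond`) is an assumption for illustration, not from the lecture:

```python
def nb_joint(x, y, prior, cond):
    """Joint probability p(x, y) under the naive Bayes assumption.

    prior[y]      = p(Y = y)
    cond[j][y][v] = p(X_j = v | Y = y)  -- the only stored parameters
    Conditional independence lets the attributes factor into a product.
    """
    p = prior[y]
    for j, v in enumerate(x):
        p *= cond[j][y][v]
    return p
```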
ML Hypothesis
If every hypothesis in H has equal prior probability, only the first term matters

This gives the maximum likelihood (ML) hypothesis
$$h_{ML} = \operatorname*{arg\,max}_{h \in H} \Pr(D \mid h)$$
Naïve Bayes Parameter MLEs
$$p(X_i = 1 \mid Y = 1) = \frac{p(X_i = 1, Y = 1)}{p(Y = 1)} = \frac{\#\text{observed examples with } X_i = 1 \text{ and } Y = 1}{\#\text{observed examples with } Y = 1}$$

$$p(Y = 1) = \frac{\#\text{observed examples with } Y = 1}{\#\text{observed examples}}$$
Example
Has-fur? Long-Teeth? Scary? Lion?
Animal1 Yes No No No
Animal2 No Yes Yes No
Animal3 Yes Yes Yes Yes
p(Has-fur=Yes|Lion)=?, p(Has-fur=Yes|Not-Lion)=?
p(Long-Teeth=Yes|Lion)=?, p(Long-Teeth=Yes|Not-Lion)=?
p(Scary=Yes|Lion)=?, p(Scary=Yes|Not-Lion)=?
p(Lion)=?
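As a sanity check on these questions, a small sketch that computes the MLE answers by counting rows of the table above (the other attributes work the same way):

```python
# Rows: (has_fur, long_teeth, scary, lion), from the table above.
data = [
    (1, 0, 0, 0),  # Animal1
    (0, 1, 1, 0),  # Animal2
    (1, 1, 1, 1),  # Animal3
]

def mle(attr_idx, lion_value):
    """p(attribute = Yes | Lion = lion_value) by counting."""
    rows = [r for r in data if r[3] == lion_value]
    return sum(r[attr_idx] for r in rows) / len(rows)

print(mle(0, 1))                            # p(Has-fur=Yes | Lion)     = 1.0
print(mle(0, 0))                            # p(Has-fur=Yes | Not-Lion) = 0.5
print(sum(r[3] for r in data) / len(data))  # p(Lion) = 1/3
```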
Smoothing probability estimates
What happens if a certain value for a variable is not in our set of examples, for a certain class?

Suppose we're trying to classify lions and we've never seen a lion cub, so

$$p(\mathrm{Scary} = \mathrm{false} \mid \mathrm{Lion}) = 0$$

When we see a cub, its probability of being a lion will be zero by our Naïve Bayes formula, even if it has long teeth and fur.

It's a good idea to smooth our probability estimates to avoid this.
m-Estimates
$$p(X_i = x_i \mid Y = y) = \frac{(\#\text{examples with } X_i = x_i \text{ and } Y = y) + mp}{(\#\text{examples with } Y = y) + m}$$

p is our prior estimate of the probability

m is called the equivalent sample size, which determines the importance of p relative to the observations

If the variable has v values, the specific case m = v, p = 1/v is called Laplace smoothing
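A minimal sketch of the m-estimate as a plain function (function and variable names are mine, for illustration); Laplace smoothing falls out as the special case m = v, p = 1/v:

```python
def m_estimate(n_match, n_class, m, p):
    """Smoothed estimate of p(X_i = x_i | Y = y).

    n_match: #examples with X_i = x_i and Y = y
    n_class: #examples with Y = y
    m, p:    equivalent sample size and prior estimate
    """
    return (n_match + m * p) / (n_class + m)

def laplace(n_match, n_class, v):
    """Laplace smoothing: the special case m = v, p = 1/v."""
    return m_estimate(n_match, n_class, m=v, p=1.0 / v)

# With no smoothing, 0 scary lion cubs out of 1 lion gives probability 0;
# with Laplace smoothing over v = 2 values we get (0 + 1) / (1 + 2).
print(laplace(0, 1, v=2))  # 0.333...
```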
Nominal Attributes
Need to estimate the parameters $p(X_i = v_k \mid Y = y)$
Can use maximum likelihood estimates:
$$p(X_i = v_k \mid Y = y) = \frac{p(X_i = v_k, Y = y)}{p(Y = y)} = \frac{\#\text{examples with } X_i = v_k \text{ and } Y = y}{\#\text{examples with } Y = y}$$
Continuous Attributes
If $X_i$ is a continuous attribute, we can model $p(X_i \mid y)$ as a Gaussian distribution (Gaussian naïve Bayes)

MLEs:
$$p(X_i \mid y) = N(\mu_{i|y}, \sigma_{i|y})$$

$$\mu_{i|y} = \frac{\sum_{k \in \text{examples}} x_{ik}\, I(y_k = y)}{\sum_{k \in \text{examples}} I(y_k = y)} \qquad \sigma^2_{i|y} = \frac{\sum_{k \in \text{examples}} (x_{ik} - \mu_{i|y})^2\, I(y_k = y)}{\sum_{k \in \text{examples}} I(y_k = y)}$$
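A sketch of these MLEs using NumPy, where the boolean class mask plays the role of the indicator $I(y_k = y)$ (array layout is assumed for illustration):

```python
import numpy as np

def gaussian_mle(x, y, label):
    """Per-class mean and variance MLEs for one continuous attribute.

    x: attribute values for all examples; y: class labels.
    Equivalent to the indicator-weighted sums above.
    """
    mask = (y == label)              # I(y_k = label)
    mu = x[mask].mean()
    var = ((x[mask] - mu) ** 2).mean()
    return mu, var

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([0, 0, 0, 1])
print(gaussian_mle(x, y, 0))  # (2.0, 0.666...)
```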
Naïve Bayes Geometry

What does the decision surface of the naïve Bayes classifier look like?

An example is classified positive iff $p(\mathbf{x}, y=1) > p(\mathbf{x}, y=0)$
$$\frac{p(\mathbf{x}, y=1)}{p(\mathbf{x}, y=0)} > 1 \iff \frac{\prod_i p(x_i \mid y=1)\, p(y=1)}{\prod_i p(x_i \mid y=0)\, p(y=0)} > 1$$
Naïve Bayes Geometry
Classify an example as positive if
$$\frac{\prod_i p(x_i \mid y=1)\, p(y=1)}{\prod_i p(x_i \mid y=0)\, p(y=0)} > 1$$

Taking logs:

$$\ln \frac{p(y=1)}{p(y=0)} + \sum_i \ln \frac{p(x_i \mid y=1)}{p(x_i \mid y=0)} > \ln 1 = 0$$
Naïve Bayes Geometry
$$\ln \frac{p(y=1)}{p(y=0)} + \sum_i \ln \frac{p(x_i \mid y=1)}{p(x_i \mid y=0)} > 0$$

For nominal attributes, write each term using the indicator function $I(\cdot)$:

$$\ln \frac{p(y=1)}{p(y=0)} + \sum_{i,v} \ln \frac{p(X_i = v \mid y=1)}{p(X_i = v \mid y=0)}\, I(X_i = v) > 0$$

$$b + \sum_{i,v} w_{i,v}\, I(X_i = v) > 0, \quad \text{where } b = \ln \frac{p(y=1)}{p(y=0)}, \quad w_{i,v} = \ln \frac{p(X_i = v \mid y=1)}{p(X_i = v \mid y=0)}$$

So naïve Bayes implements a linear decision boundary with specific parameters.
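To make the correspondence explicit, a sketch that extracts $b$ and the weights $w_{i,v}$ from already-estimated naïve Bayes parameters (the dictionary layout is an assumption for illustration):

```python
import math

def nb_linear_params(prior, cond, values):
    """Convert NB parameters into the rule b + sum_{i,v} w[i,v] I(X_i = v) > 0.

    prior[y]       = p(y)
    cond[(i, v)][y] = p(X_i = v | y)
    values[i]      = possible values of attribute i
    """
    b = math.log(prior[1] / prior[0])
    w = {(i, v): math.log(cond[(i, v)][1] / cond[(i, v)][0])
         for i in values for v in values[i]}
    return b, w

def predict(x, b, w):
    """Classify positive iff the linear score exceeds 0."""
    score = b + sum(w[(i, v)] for i, v in enumerate(x))
    return int(score > 0)
```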
Naïve Bayes and Text Classification
Used very successfully to categorize documents
Is this document about sports or finance?
Is this email spam or ham?
Given a vocabulary, each attribute $X_i$ is the presence/absence of word $i$ in the document
Ignores word order
Bag-of-words approach
Text Classification contd.
Smoothed parameter estimates
Called Multivariate Bernoulli model
$$p(word_k \text{ present} \mid Y = y) = \frac{(\#\text{documents with } word_k \text{ present and } Y = y) + 1}{(\#\text{documents with } Y = y) + 2}$$

Each per-word parameter is a Bernoulli distribution.
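A sketch of these smoothed estimates over a toy bag-of-words corpus (the corpus and vocabulary here are made up for illustration; the +1/+2 smoothing is the Laplace case for a two-valued variable):

```python
def bernoulli_estimates(docs, labels, vocab, y):
    """Smoothed p(word present | Y = y) for each vocabulary word.

    docs: list of sets of words; labels: class of each document.
    """
    class_docs = [d for d, l in zip(docs, labels) if l == y]
    return {w: (sum(w in d for d in class_docs) + 1) / (len(class_docs) + 2)
            for w in vocab}

docs = [{"ball", "score"}, {"stock", "price"}, {"score", "win"}]
labels = ["sports", "finance", "sports"]
print(bernoulli_estimates(docs, labels, {"score", "stock"}, "sports"))
# score -> 0.75, stock -> 0.25 (dict order may vary)
```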
Tree Augmented Naïve Bayes
Can augment the model so that there is a tree structure over the attributes

In this case, the structure is also unknown

Given a training sample, an algorithm exists to learn the optimal structure

Makes fewer independence assumptions than NB, giving better classification performance
Tree Augmented Naïve Bayes
Create a complete graph

Nodes are attributes

Edges are weighted by $I(X; Y \mid C)$

Find the maximal weighted spanning tree of this graph

Can show this is the tree structure that maximizes likelihood (see paper on website)
$$I(X; Y \mid C) = \sum_{x, y, c} P(x, y, c) \log \frac{P(x, y \mid c)}{P(x \mid c)\, P(y \mid c)}$$

This is the class-conditional mutual information.
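A sketch of this edge weight computed from an empirical joint distribution table (the array layout `pxyc[x, y, c] = P(x, y, c)` is an assumption for illustration):

```python
import numpy as np

def cond_mutual_info(pxyc):
    """I(X; Y | C) from a joint table pxyc[x, y, c] = P(x, y, c)."""
    pc = pxyc.sum(axis=(0, 1))   # P(c)
    pxc = pxyc.sum(axis=1)       # P(x, c)
    pyc = pxyc.sum(axis=0)       # P(y, c)
    total = 0.0
    for x, y, c in np.ndindex(pxyc.shape):
        if pxyc[x, y, c] > 0:
            # P(x,y|c) / (P(x|c) P(y|c)) simplifies to P(x,y,c) P(c) / (P(x,c) P(y,c))
            total += pxyc[x, y, c] * np.log(
                pxyc[x, y, c] * pc[c] / (pxc[x, c] * pyc[y, c]))
    return total
```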
Logistic Regression
Simplest discriminative model

Models the log odds as a linear function
$$\log \frac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} = \mathbf{w} \cdot \mathbf{x} + b$$

$$p(Y=1 \mid \mathbf{x}) = \left[1 - p(Y=1 \mid \mathbf{x})\right] e^{\mathbf{w} \cdot \mathbf{x} + b}$$

$$p(Y=1 \mid \mathbf{x}) \left(1 + e^{\mathbf{w} \cdot \mathbf{x} + b}\right) = e^{\mathbf{w} \cdot \mathbf{x} + b}$$

$$p(Y=1 \mid \mathbf{x}) = \frac{e^{\mathbf{w} \cdot \mathbf{x} + b}}{1 + e^{\mathbf{w} \cdot \mathbf{x} + b}} = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x} + b)}}$$
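The final expression is the logistic sigmoid; a minimal sketch of the resulting model:

```python
import numpy as np

def p_y1(x, w, b):
    """p(Y = 1 | x) = 1 / (1 + exp(-(w.x + b))), the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
```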
Estimating parameters
Use MLE: optimize the log conditional likelihood of the data
$$\mathbf{w}, b = \operatorname*{arg\,max} \prod_i p(y_i \mid \mathbf{x}_i) \quad \text{(the conditional likelihood)}$$

$$= \operatorname*{arg\,max} \left[ \sum_{i \in pos} \log p(Y=1 \mid \mathbf{x}_i) + \sum_{i \in neg} \log p(Y=0 \mid \mathbf{x}_i) \right]$$

$$= \operatorname*{arg\,max} \left[ \sum_{i \in pos} \log \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x}_i + b)}} + \sum_{i \in neg} \log \left(1 - \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x}_i + b)}}\right) \right]$$
Estimating parameters
Can use gradient descent, Newton's method, etc. (a sketch of the gradient option follows below)

In practice, also use overfitting control via a penalty on ||w||

Very robust method, works extremely well in many practical situations, very easy to code

Often a good idea to try this first!
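A minimal sketch of gradient ascent on the L2-penalized log conditional likelihood (learning rate, penalty weight, and iteration count are illustrative choices, not from the lecture):

```python
import numpy as np

def fit_logistic(X, y, lam=0.1, lr=0.1, iters=1000):
    """Gradient ascent on the L2-penalized log conditional likelihood.

    X: (n, d) examples; y: 0/1 labels; lam: weight on the ||w||^2 penalty.
    The gradient of the log-likelihood in w is X^T (y - p).
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # p(Y=1 | x_i) for all i
        w += lr * (X.T @ (y - p) / n - lam * w)  # penalized gradient step
        b += lr * np.mean(y - p)
    return w, b
```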
Logistic Regression Geometry
Classify as positive iff:
$$\frac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} > 1$$

$$\text{or if } \log \frac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} > 0$$

$$\text{But } \log \frac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} = \mathbf{w} \cdot \mathbf{x} + b$$

So classify as positive iff $\mathbf{w} \cdot \mathbf{x} + b > 0$.

So like NB, LR also implements a linear decision boundary---but what's the difference?
Relationship to Naïve Bayes

For certain values of w, b, logistic regression will implement the same decision surface as naïve Bayes

Both are linear discriminants, but LR does not make the independence assumptions of NB

More robust than NB, especially in the presence of irrelevant attributes

Also handles continuous attributes nicely

But (as with all discriminative models) there is no easy way to handle missing data
Generative and Discriminative Pairs
[Figure: accuracy vs. training sample size, comparing a generative model, a discriminative model, and a generative model with the correct model class]