Naïve Bayes
Bayesian Reasoning
• Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of interest are governed by probability distributions and that optimal decisions can be made by reasoning about these probabilities together with observed data.
Probabilistic Learning
• In ML, we are often interested in determining the best hypothesis from some space H, given the observed training data D.
• One way to specify what is meant by the best hypothesis is to say that we demand the most probable hypothesis, given the data D together with any initial knowledge about the prior probabilities of the various hypotheses in H.
Bayes Theorem
• Bayes theorem is the cornerstone of Bayesian learning methods
• It provides a way of calculating the posterior probability P(h | D) from the prior probability P(h), the likelihood P(D | h), and the evidence P(D), as follows:
P(h | D) = P(D | h) P(h) / P(D)
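As a minimal sketch (the function and argument names here are illustrative, not from the slides), the theorem translates directly into code:

```python
def posterior(prior_h, likelihood_D_given_h, evidence_D):
    """P(h | D) = P(D | h) * P(h) / P(D)."""
    return likelihood_D_given_h * prior_h / evidence_D
```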
Using Bayes Theorem (I)
• Suppose I wish to know whether someone is telling the truth or lying about some issue X
o The available data is from a lie detector with two possible outcomes: truthful and liar
o I also have prior knowledge that, over the entire population, 21% lie about X
o Finally, I know the lie detector is imperfect: it returns truthful in only 85% of the cases where people actually told the truth and liar in only 93% of the cases where people were actually lying
Using Bayes Theorem (II)
• P(lies about X) = 0.21
• P(liar | lies about X) = 0.93
• P(liar | tells the truth about X) = 0.15
• P(tells the truth about X) = 0.79
• P(truthful | lies about X) = 0.07
• P(truthful | tells the truth about X) = 0.85
Using Bayes Theorem (III)
• Suppose a new person is asked about X and the lie detector returns liar
• Should we conclude the person is indeed lying about X or not?
• What we need is to compare:
o P(lies about X | liar)
o P(tells the truth about X | liar)
Using Bayes Theorem (IV)
• By Bayes Theorem:
o P(lies about X | liar) = [P(liar | lies about X) × P(lies about X)] / P(liar)
o P(tells the truth about X | liar) = [P(liar | tells the truth about X) × P(tells the truth about X)] / P(liar)
• All probabilities are given explicitly, except for P(liar), which is easily computed (theorem of total probability):
o P(liar) = P(liar | lies about X) × P(lies about X) + P(liar | tells the truth about X) × P(tells the truth about X)
Using Bayes Theorem (V)
• Computing, we get:
o P(liar) = 0.93 × 0.21 + 0.15 × 0.79 = 0.314
o P(lies about X | liar) = [0.93 × 0.21] / 0.314 = 0.622
o P(tells the truth about X | liar) = [0.15 × 0.79] / 0.314 = 0.378
• And we would conclude that the person was indeed lying about X
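The whole worked example fits in a few lines of Python; a small sketch using the numbers above (the variable names are mine):

```python
# Priors and detector characteristics from the example
p_lies = 0.21
p_truth = 1 - p_lies                       # 0.79
p_liar_given_lies = 0.93
p_liar_given_truth = 0.15

# Theorem of total probability: P(liar)
p_liar = p_liar_given_lies * p_lies + p_liar_given_truth * p_truth   # ~0.314

# Bayes theorem for the two competing hypotheses
p_lies_given_liar = p_liar_given_lies * p_lies / p_liar      # ~0.622
p_truth_given_liar = p_liar_given_truth * p_truth / p_liar   # ~0.378
```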
Intuition
• How did we make our decision?
o We chose the maximally probable, or maximum a posteriori (MAP), hypothesis, namely:
h_MAP = argmax_{h ∈ H} P(h | D)
      = argmax_{h ∈ H} P(D | h) P(h) / P(D)
      = argmax_{h ∈ H} P(D | h) P(h)
Brute-force MAP Learning
• For each hypothesis h ∈ H:
o Calculate P(h | D) // using Bayes Theorem
• Return h_MAP = argmax_{h ∈ H} P(h | D)
• Guaranteed “best”, BUT often impractical for large hypothesis spaces: mainly used as a standard to gauge the performance of other learners
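A brute-force MAP learner can be sketched as below, assuming the caller supplies prior(h) and likelihood(data, h) functions (these names are illustrative); P(D) is the same for every hypothesis, so it drops out of the argmax:

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return h_MAP = argmax_h P(D | h) * P(h).

    Iterates over the whole hypothesis space, which is exactly why the
    method is impractical when H is large.
    """
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))
```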
Remarks
• The Brute-Force MAP learning algorithm answers the question: “Which is the most probable hypothesis given the training data?”
• Often, it is the related question, “Which is the most probable classification of the new query instance given the training data?”, that is most significant.
• In general, the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.
Bayes Optimal Classification (I)
• If the possible classification of the new instance can take on any value v_j from some set V, then the probability P(v_j | D) that the correct classification for the new instance is v_j is just:

P(v_j | D) = Σ_{h_i ∈ H} P(v_j | h_i) P(h_i | D)
Clearly, the optimal classification of the new instance is the value v_j for which P(v_j | D) is maximum, which gives rise to the following algorithm to classify query instances.
Bayes Optimal Classification (II)
• Return argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j | h_i) P(h_i | D)
No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average, since it maximizes the probability that the new instance is classified correctly, given the available data, hypothesis space and prior probabilities over the hypotheses.
The algorithm, however, is impractical for large hypothesis spaces.
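A sketch of the Bayes optimal classifier, again with caller-supplied probability functions (the names are hypothetical):

```python
def bayes_optimal_classify(values, hypotheses, p_v_given_h, posterior, data):
    """Return argmax_{v in V} sum_{h in H} P(v | h) * P(h | D).

    p_v_given_h(v, h) supplies P(v | h); posterior(h, data) supplies P(h | D).
    The inner sum over all of H is what makes this impractical for large
    hypothesis spaces.
    """
    return max(values,
               key=lambda v: sum(p_v_given_h(v, h) * posterior(h, data)
                                 for h in hypotheses))
```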
Naïve Bayes Learning (I)
• The naive Bayes learner is a practical Bayesian learning method.
• It applies to learning tasks where instances are conjunctions of attribute values and the target function takes its values from some finite set V.
• The Bayesian approach consists in assigning to a new query instance the most probable target value, v_MAP, given the attribute values a_1, …, a_n that describe the instance, i.e.,
v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, …, a_n)
Naïve Bayes Learning (II)
• Using Bayes theorem, this can be reformulated as:
v_MAP = argmax_{v_j ∈ V} P(a_1, …, a_n | v_j) P(v_j) / P(a_1, …, a_n)
      = argmax_{v_j ∈ V} P(a_1, …, a_n | v_j) P(v_j)
Finally, we make the further simplifying assumption that the attribute values are conditionally independent given the target value. Hence, one can write the conjunctive conditional probability as a product of simple conditional probabilities.
Naïve Bayes Learning (III)
• Return argmax_{v_j ∈ V} P(v_j) Π_{i=1}^{n} P(a_i | v_j)
The naive Bayes learning method involves a learning step in which the various P(v_j) and P(a_i | v_j) terms are estimated, based on their frequencies over the training data.
These estimates are then used in the above formula to classify each new query instance.
Whenever the assumption of conditional independence is satisfied, the naive Bayes classification is identical to the MAP classification.
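A minimal sketch of both steps, assuming discrete attributes and instances given as tuples of attribute values (the function names are mine, not from the slides):

```python
from collections import Counter, defaultdict

def train_nb(instances, labels):
    """Learning step: estimate P(v_j) and P(a_i | v_j) by their
    frequencies over the training data."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {v: c / n for v, c in class_counts.items()}
    cond = defaultdict(Counter)                  # cond[(i, v)][a] = count
    for x, v in zip(instances, labels):
        for i, a in enumerate(x):
            cond[(i, v)][a] += 1
    likelihoods = {(i, v): {a: c / class_counts[v] for a, c in ctr.items()}
                   for (i, v), ctr in cond.items()}
    return priors, likelihoods

def classify_nb(x, priors, likelihoods):
    """Classification step: return argmax_v P(v) * prod_i P(a_i | v).
    Attribute/class pairs never seen in training get probability 0."""
    def score(v):
        p = priors[v]
        for i, a in enumerate(x):
            p *= likelihoods.get((i, v), {}).get(a, 0.0)
        return p
    return max(priors, key=score)
```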
Illustration (I)
Risk Assessment for Loan Applications
Client # Credit History Debt Level Collateral Income Level RISK LEVEL
1 Bad High None Low HIGH
2 Unknown High None Medium HIGH
3 Unknown Low None Medium MODERATE
4 Unknown Low None Low HIGH
5 Unknown Low None High LOW
6 Unknown Low Adequate High LOW
7 Bad Low None Low HIGH
8 Bad Low Adequate High MODERATE
9 Good Low None High LOW
10 Good High Adequate High LOW
11 Good High None Low HIGH
12 Good High None Medium MODERATE
13 Good High None High LOW
14 Bad High None Medium HIGH
Illustration (II)
Class priors, from the 14 training instances: P(High) = 6/14 = 0.43, P(Moderate) = 3/14 = 0.21, P(Low) = 5/14 = 0.36

Conditional probabilities P(attribute value | Risk Level):

Credit History | High | Moderate | Low
Unknown        | 0.33 | 0.33     | 0.40
Bad            | 0.50 | 0.33     | 0.00
Good           | 0.17 | 0.33     | 0.60

Debt Level | High | Moderate | Low
High       | 0.67 | 0.33     | 0.40
Low        | 0.33 | 0.67     | 0.60

Collateral | High | Moderate | Low
None       | 1.00 | 0.67     | 0.60
Adequate   | 0.00 | 0.33     | 0.40

Income Level | High | Moderate | Low
High         | 0.00 | 0.33     | 1.00
Medium       | 0.33 | 0.67     | 0.00
Low          | 0.67 | 0.00     | 0.00

Consider the query instance (Bad, Low, Adequate, Medium):
  High 0.00%, Moderate 1.06%, Low 0.00% → Prediction: Moderate

Consider the query instance (Bad, High, None, Low), already seen in training:
  High 9.52%, Moderate 0.00%, Low 0.00% → Prediction: High
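For reference, feeding the loan table into the train_nb/classify_nb sketch above reproduces both predictions:

```python
data = [
    ("Bad", "High", "None", "Low", "HIGH"),
    ("Unknown", "High", "None", "Medium", "HIGH"),
    ("Unknown", "Low", "None", "Medium", "MODERATE"),
    ("Unknown", "Low", "None", "Low", "HIGH"),
    ("Unknown", "Low", "None", "High", "LOW"),
    ("Unknown", "Low", "Adequate", "High", "LOW"),
    ("Bad", "Low", "None", "Low", "HIGH"),
    ("Bad", "Low", "Adequate", "High", "MODERATE"),
    ("Good", "Low", "None", "High", "LOW"),
    ("Good", "High", "Adequate", "High", "LOW"),
    ("Good", "High", "None", "Low", "HIGH"),
    ("Good", "High", "None", "Medium", "MODERATE"),
    ("Good", "High", "None", "High", "LOW"),
    ("Bad", "High", "None", "Medium", "HIGH"),
]
instances = [row[:4] for row in data]
labels = [row[4] for row in data]
priors, likelihoods = train_nb(instances, labels)

print(classify_nb(("Bad", "Low", "Adequate", "Medium"), priors, likelihoods))
# MODERATE (unnormalized score ~0.0106, i.e., 1.06%)
print(classify_nb(("Bad", "High", "None", "Low"), priors, likelihoods))
# HIGH (unnormalized score ~0.0952, i.e., 9.52%)
```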
How is NB Incremental?
• No training instances are stored
• The model consists of summary statistics that are sufficient to compute predictions
• Adding a new training instance only affects these summary statistics, which may be updated incrementally
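A sketch of the incremental update, reusing the count structures from the earlier train_nb sketch:

```python
from collections import Counter, defaultdict

def update_nb(x, v, class_counts, cond_counts):
    """Absorb one new labeled instance (x, v) into the summary statistics.

    class_counts is a Counter of class labels; cond_counts[(i, v)] is a
    Counter of values for attribute i under class v. No instance is stored;
    P(v) and P(a_i | v) can be re-derived from the counts on demand.
    """
    class_counts[v] += 1
    for i, a in enumerate(x):
        cond_counts[(i, v)][a] += 1
```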
Estimating Probabilities
• We have so far estimated P(X=x | Y=y) by the fraction n_{x|y}/n_y, where n_y is the number of instances for which Y=y and n_{x|y} is the number of these for which X=x
• This is a problem when n_{x|y} is small
o E.g., assume P(X=x | Y=y) = 0.05 and the training set is such that n_y = 5. Then it is highly probable that n_{x|y} = 0
o The fraction is thus an underestimate of the actual probability
o It will dominate the Bayes classifier for all new queries with X=x
m-estimate
• Replace n_{x|y}/n_y by:

(n_{x|y} + m·p) / (n_y + m)
Where p is our prior estimate of the probability we wish to determine and m is a constant
o Typically, p = 1/k (where k is the number of possible values of X)
o m acts as a weight (similar to adding m virtual instances distributed according to p)
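As a one-line sketch (the function and argument names are mine):

```python
def m_estimate(n_xy, n_y, k, m=1.0):
    """m-estimate of P(X=x | Y=y): (n_x|y + m*p) / (n_y + m), with the
    uniform prior p = 1/k over the k possible values of X."""
    p = 1.0 / k
    return (n_xy + m * p) / (n_y + m)
```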
Revisiting Conditional Independence
• Definition: X is conditionally independent of Y given Z iff P(X | Y, Z) = P(X | Z)
• NB assumes that all attributes are conditionally independent, given the class. Hence:
P(A_1, A_2, …, A_n | V) = P(A_1 | A_2, …, A_n, V) P(A_2, …, A_n | V)
                        = P(A_1 | V) P(A_2, …, A_n | V)
                        = P(A_1 | V) P(A_2 | A_3, …, A_n, V) P(A_3, …, A_n | V)
                        = P(A_1 | V) P(A_2 | V) P(A_3, …, A_n | V)
                        = …
                        = Π_{i=1}^{n} P(A_i | V)
What if?
• In many cases, the NB assumption is overly restrictive
• What we need is a way of handling independence or dependence over subsets of attributes
o Joint probability distribution
  • Defined over Y_1 × Y_2 × … × Y_n
  • Specifies the probability of each variable binding
Bayesian Belief Network
• Directed acyclic graph:
o Nodes represent variables in the joint space
o Arcs represent the assertion that the variable is conditionally independent of its non-descendants in the network, given its immediate predecessors in the network
o A conditional probability table is also given for each variable: P(V | immediate predecessors)
• Refer to section 6.11
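A minimal sketch of how such a network can be represented and queried. The two-variable network and its CPT entries below are invented purely for illustration; the factorization used is the standard one, where the joint probability is the product, over all variables, of P(variable | immediate predecessors):

```python
# Hypothetical network: Storm -> Lightning (structure and numbers invented)
parents = {"Storm": (), "Lightning": ("Storm",)}
cpt = {
    "Storm": {(): {True: 0.2, False: 0.8}},
    "Lightning": {(True,): {True: 0.9, False: 0.1},
                  (False,): {True: 0.05, False: 0.95}},
}

def joint_probability(assignment):
    """P(x_1, ..., x_n) = prod_i P(x_i | immediate predecessors of x_i)."""
    p = 1.0
    for var, pars in parents.items():
        key = tuple(assignment[q] for q in pars)
        p *= cpt[var][key][assignment[var]]
    return p

print(joint_probability({"Storm": True, "Lightning": True}))   # 0.2 * 0.9 = 0.18
```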