Administrative
• HW 1 out today. Please start early!
• Office hours
  • Chen: Wed 4pm-5pm
  • Shih-Yang: Fri 3pm-4pm
  • Location: Whittemore 266
Linear Regression
• Model representation
  $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
• Cost function
  $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Gradient descent for linear regression
  Repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Features and polynomial regression
Can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: $\theta = (X^\top X)^{-1} X^\top y$
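The gradient-descent update above translates directly to NumPy. A minimal sketch, assuming a design matrix whose first column is all ones; `alpha` and `num_iters` are illustrative choices, not from the slides:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix whose first column is all ones (x_0 = 1).
    y: (m,) vector of targets.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        errors = X @ theta - y               # h_theta(x^(i)) - y^(i), all i at once
        theta -= alpha * (X.T @ errors) / m  # simultaneous update of every theta_j
    return theta
```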
$x_0$ | Size in feet² ($x_1$) | Number of bedrooms ($x_2$) | Number of floors ($x_3$) | Age of home in years ($x_4$) | Price in $1000's ($y$)
1 | 2104 | 5 | 1 | 45 | 460
1 | 1416 | 3 | 2 | 40 | 232
1 | 1534 | 3 | 2 | 30 | 315
1 | 852 | 2 | 1 | 36 | 178
… | … | … | … | … | …

$y = \begin{bmatrix} 460 \\ 232 \\ 315 \\ 178 \end{bmatrix}$, and $\theta = (X^\top X)^{-1} X^\top y$

Slide credit: Andrew Ng
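As a quick sketch, the normal equation applied to the four rows above (assuming NumPy). Note that with 5 parameters and only 4 examples, $X^\top X$ is singular, so the sketch uses a least-squares solver rather than an explicit inverse:

```python
import numpy as np

# Design matrix: [x0, x1 (size), x2 (bedrooms), x3 (floors), x4 (age)]
X = np.array([
    [1, 2104, 5, 1, 45],
    [1, 1416, 3, 2, 40],
    [1, 1534, 3, 2, 30],
    [1,  852, 2, 1, 36],
], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# theta = (X^T X)^{-1} X^T y; lstsq returns the minimum-norm least-squares
# solution, which also covers the singular case here (5 params, 4 examples).
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)
```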
Least square solution
• $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2m} \left\| X\theta - y \right\|_2^2$
• Set the gradient to zero: $\frac{\partial}{\partial \theta} J(\theta) = 0$
• $\theta = (X^\top X)^{-1} X^\top y$
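Writing the cost as $J(\theta) = \frac{1}{2m}(X\theta - y)^\top (X\theta - y)$, the zero-gradient condition expands by a standard matrix-calculus step (not spelled out on the slide):

$\frac{\partial}{\partial \theta} J(\theta) = \frac{1}{m} X^\top (X\theta - y) = 0 \;\Rightarrow\; (X^\top X)\theta = X^\top y \;\Rightarrow\; \theta = (X^\top X)^{-1} X^\top y,$

provided $X^\top X$ is invertible.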
Justification/interpretation 1
•Geometric interpretation
$X = \begin{bmatrix} 1 & \leftarrow x^{(1)\top} \rightarrow \\ 1 & \leftarrow x^{(2)\top} \rightarrow \\ \vdots & \vdots \\ 1 & \leftarrow x^{(m)\top} \rightarrow \end{bmatrix} = \begin{bmatrix} \uparrow & \uparrow & & \uparrow \\ z_0 & z_1 & \cdots & z_n \\ \downarrow & \downarrow & & \downarrow \end{bmatrix}$

• $X\theta$ ranges over the column space of $X$, i.e., $\mathrm{span}(\{z_0, z_1, \cdots, z_n\})$
• The residual $X\theta - y$ is orthogonal to the column space of $X$
• $X^\top (X\theta - y) = 0 \;\rightarrow\; (X^\top X)\theta = X^\top y$
[Figure: $y$ projected onto the column space of $X$; $X\theta$ is the projection and $X\theta - y$ is the residual.]
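A small numerical check of this orthogonality, as a sketch assuming NumPy with made-up toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))   # 20 examples, 4 features (toy data)
y = rng.normal(size=20)

theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual X @ theta - y should be orthogonal to every column of X:
print(X.T @ (X @ theta - y))   # ~ zeros, up to floating-point error
```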
Justification/interpretation 2
• Probabilistic model
  • Assume a linear model with Gaussian errors:
    $p_\theta(y^{(i)} \mid x^{(i)}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 \right)$
• Solving maximum likelihood:
  $\arg\max_\theta \prod_{i=1}^{m} p_\theta(y^{(i)} \mid x^{(i)}) = \arg\max_\theta \log \prod_{i=1}^{m} p_\theta(y^{(i)} \mid x^{(i)}) = \arg\min_\theta \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2$
Image credit: CS 446@UIUC
Justification/interpretation 3
• Loss minimization
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^{m} L_{\mathrm{ls}}\left( h_\theta(x^{(i)}),\, y^{(i)} \right)$
• $L_{\mathrm{ls}}(y, \hat{y}) = \frac{1}{2} \| y - \hat{y} \|_2^2$: least squares loss
• Empirical Risk Minimization (ERM): minimize the average loss $\frac{1}{m} \sum_{i=1}^{m} L_{\mathrm{ls}}(y^{(i)}, \hat{y}^{(i)})$ over the training set
With $m$ training examples and $n$ features:
Gradient Descent
• Need to choose $\alpha$
• Need many iterations
• Works well even when $n$ is large

Normal Equation
• No need to choose $\alpha$
• Don't need to iterate
• Need to compute $(X^\top X)^{-1}$
• Slow if $n$ is very large
Slide credit: Andrew Ng
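A rough sketch of the tradeoff in code (assuming NumPy; the sizes and step size are arbitrary illustrations):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
m, n = 5000, 2000                      # many examples, many features
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

# Normal equation: one O(n^3) solve of (X^T X) theta = X^T y
t0 = time.perf_counter()
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)
print("normal equation:", time.perf_counter() - t0, "s")

# Gradient descent: O(m n) per iteration, but may need many iterations
t0 = time.perf_counter()
theta_gd = np.zeros(n)
alpha = 1e-4                           # illustrative step size
for _ in range(100):
    theta_gd -= alpha * X.T @ (X @ theta_gd - y) / m
print("100 GD iterations:", time.perf_counter() - t0, "s")
```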
Things to remember
• Model representation
  $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
• Cost function
  $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Gradient descent for linear regression
  Repeat until convergence { $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ }
• Features and polynomial regression
Can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: $\theta = (X^\top X)^{-1} X^\top y$
Today’s plan
• Probability basics
• Estimating parameters from data
  • Maximum likelihood (ML)
  • Maximum a posteriori estimation (MAP)
• Naïve Bayes
Random variables
• Outcome space $S$
  • The space of possible outcomes
• Random variables
  • Functions that map outcomes to real numbers
• Event $E$
  • A subset of $S$
Visualizing probability $P(A)$
[Figure: a sample space of total area 1, split into the region where $A$ is true (a blue circle) and the region where $A$ is false.]
$P(A)$ = area of the blue circle
Visualizing conditional probability
[Figure: overlapping regions $A$ and $B$ inside the sample space; the intersection is $A \wedge B$.]
$P(A \mid B) = P(A, B) / P(B)$
Corollary, the chain rule: $P(A, B) = P(A \mid B)\, P(B)$
Bayes rule
$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\, P(A)}{P(B)}$
[Figure: the same overlapping regions $A$, $B$, and $A \wedge B$; portrait of Thomas Bayes.]
Corollary, the chain rule: $P(A, B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$
Other forms of Bayes rule
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$

$P(A \mid B, X) = \frac{P(B \mid A, X)\, P(A, X)}{P(B, X)}$

$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \neg A)\, P(\neg A)}$
Applying Bayes rule
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \neg A)\, P(\neg A)}$
• $A$ = you have the flu; $B$ = you just coughed
• Assume: $P(A) = 0.05$, $P(B \mid A) = 0.8$, $P(B \mid \neg A) = 0.2$
• What is P(flu | cough) = $P(A \mid B)$?
$P(A \mid B) = \frac{0.8 \times 0.05}{0.8 \times 0.05 + 0.2 \times 0.95} \approx 0.17$
Slide credit: Tom Mitchell
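The same computation as a tiny script (plain Python, using the assumed numbers above):

```python
# Posterior P(flu | cough) via Bayes rule with the assumed numbers
p_flu = 0.05          # P(A): prior probability of flu
p_cough_flu = 0.8     # P(B|A)
p_cough_noflu = 0.2   # P(B|~A)

evidence = p_cough_flu * p_flu + p_cough_noflu * (1 - p_flu)  # P(B)
posterior = p_cough_flu * p_flu / evidence
print(f"P(flu | cough) = {posterior:.3f}")  # ~0.174
```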
Joint distribution
• Making a joint distribution of $M$ variables:
  1. Make a truth table listing all combinations of values
  2. For each combination of values, say how probable it is
  3. The probabilities must sum to 1
A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
Slide credit: Tom Mitchell
Using joint distribution
• Can ask for the probability of any logical expression involving these variables:
  $P(E) = \sum_{\text{rows matching } E} P(\text{row})$
  $P(E_1 \mid E_2) = \frac{\sum_{\text{rows matching } E_1 \text{ and } E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$
Slide credit: Tom Mitchell
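A sketch of both queries against the joint table above (plain Python; `prob` and `cond_prob` are illustrative helper names):

```python
# Joint distribution over (A, B, C) from the truth table above
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E): sum P(row) over rows where event(row) is true."""
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(e1, e2):
    """P(E1|E2): rows matching both, divided by rows matching E2."""
    return prob(lambda r: e1(r) and e2(r)) / prob(e2)

print(prob(lambda r: r[0] == 1))                            # P(A) = 0.50
print(cond_prob(lambda r: r[0] == 1, lambda r: r[1] == 1))  # P(A|B) = 0.70
```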
The solution: learn $P(Y \mid X)$ directly?
• Main problem: learning $P(Y \mid X)$ may require more data than we have
• Say we learn a joint distribution over 100 attributes:
  • Number of rows in the table: $2^{100} \geq 10^{30}$
  • Number of people on earth: about $10^9$
Slide credit: Tom Mitchell
What should we do?
1. Be smart about how we estimate probabilities from sparse data
   • Maximum likelihood estimates (MLE)
   • Maximum a posteriori estimates (MAP)
2. Be smart about how to represent joint distributions
   • Bayes networks, graphical models (more on this later)
Slide credit: Tom Mitchell
Today’s plan
• Probability basics
• Estimating parameters from data
  • Maximum likelihood (ML)
  • Maximum a posteriori (MAP)
• Naïve Bayes
Estimating the probability
• Flip the coin repeatedly, observing:
  • It turns up heads $\alpha_1$ times
  • It turns up tails $\alpha_0$ times
• Your estimate for $P(X = 1)$ is?
• Case A: 100 flips: 51 heads ($X = 1$), 49 tails ($X = 0$). $P(X = 1) = ?$
• Case B: 3 flips: 2 heads ($X = 1$), 1 tail ($X = 0$). $P(X = 1) = ?$
[Image: coin faces labeled $X = 1$ (heads) and $X = 0$ (tails)]
Slide credit: Tom Mitchell
Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes the probability of the observed data
  $\theta_{\mathrm{MLE}} = \arg\max_\theta P(\text{Data} \mid \theta)$
• Maximum a posteriori estimation (MAP): choose $\theta$ that is most probable given the prior probability and the data
  $\theta_{\mathrm{MAP}} = \arg\max_\theta P(\theta \mid \text{Data}) = \arg\max_\theta \frac{P(\text{Data} \mid \theta)\, P(\theta)}{P(\text{Data})}$
Slide credit: Tom Mitchell
Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes $P(\text{Data} \mid \theta)$
  $\theta_{\mathrm{MLE}} = \frac{\alpha_1}{\alpha_1 + \alpha_0}$
• Maximum a posteriori estimation (MAP): choose $\theta$ that maximizes $P(\theta \mid \text{Data})$
  $\theta_{\mathrm{MAP}} = \frac{\alpha_1 + \#\text{hallucinated 1s}}{(\alpha_1 + \#\text{hallucinated 1s}) + (\alpha_0 + \#\text{hallucinated 0s})}$
Slide credit: Tom Mitchell
Maximum likelihood estimate
• Each flip yields a Boolean value for $X$:
  $X \sim \mathrm{Bernoulli}$: $P(X) = \theta^X (1 - \theta)^{1 - X}$, i.e., $P(X = 1) = \theta$ and $P(X = 0) = 1 - \theta$
• A data set $D$ of independent, identically distributed (iid) flips produces $\alpha_1$ ones and $\alpha_0$ zeros:
  $P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$
  $\hat{\theta} = \arg\max_\theta P(D \mid \theta) = \frac{\alpha_1}{\alpha_1 + \alpha_0}$
Slide credit: Tom Mitchell
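The argmax follows by differentiating the log-likelihood, a standard step not spelled out on the slide:

$\frac{d}{d\theta} \log P(D \mid \theta) = \frac{d}{d\theta} \left[ \alpha_1 \log \theta + \alpha_0 \log(1 - \theta) \right] = \frac{\alpha_1}{\theta} - \frac{\alpha_0}{1 - \theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{\alpha_1}{\alpha_1 + \alpha_0}$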
Beta prior distribution $P(\theta)$
• $P(\theta) = \mathrm{Beta}(\beta_1, \beta_0) = \frac{1}{B(\beta_1, \beta_0)}\, \theta^{\beta_1 - 1} (1 - \theta)^{\beta_0 - 1}$
Slide credit: Tom Mitchell
Maximum a posteriori (MAP) estimate
• A data set $D$ of iid flips produces $\alpha_1$ ones and $\alpha_0$ zeros:
  $P(D \mid \theta) = P(\alpha_1, \alpha_0 \mid \theta) = \theta^{\alpha_1} (1 - \theta)^{\alpha_0}$
• Assume a prior (conjugate prior: closed-form representation of the posterior):
  $P(\theta) = \mathrm{Beta}(\beta_1, \beta_0) = \frac{1}{B(\beta_1, \beta_0)}\, \theta^{\beta_1 - 1} (1 - \theta)^{\beta_0 - 1}$
  $\hat{\theta} = \arg\max_\theta P(D \mid \theta)\, P(\theta) = \frac{\alpha_1 + \beta_1 - 1}{(\alpha_1 + \beta_1 - 1) + (\alpha_0 + \beta_0 - 1)}$
Slide credit: Tom Mitchell
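A minimal sketch contrasting MLE and MAP on the earlier coin cases (plain Python; the Beta hyperparameters $\beta_1 = \beta_0 = 3$ are an illustrative choice):

```python
def mle(n_heads, n_tails):
    """MLE for a Bernoulli parameter: alpha1 / (alpha1 + alpha0)."""
    return n_heads / (n_heads + n_tails)

def map_estimate(n_heads, n_tails, beta1=3, beta0=3):
    """MAP with a Beta(beta1, beta0) prior: 'hallucinated' examples."""
    return (n_heads + beta1 - 1) / ((n_heads + beta1 - 1) + (n_tails + beta0 - 1))

# Case A: 100 flips -- the prior barely matters
print(mle(51, 49), map_estimate(51, 49))   # 0.51 vs ~0.510
# Case B: 3 flips -- the prior pulls the estimate toward 0.5
print(mle(2, 1), map_estimate(2, 1))       # ~0.667 vs ~0.571
```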
Some terminology
• Likelihood function: $P(\text{Data} \mid \theta)$
• Prior: $P(\theta)$
• Posterior: $P(\theta \mid \text{Data})$
• Conjugate prior: a prior $P(\theta)$ is the conjugate prior for a likelihood function $P(\text{Data} \mid \theta)$ if the prior $P(\theta)$ and the posterior $P(\theta \mid \text{Data})$ have the same form
• Example (coin flip problem):
  • Prior $P(\theta)$: $\mathrm{Beta}(\beta_1, \beta_0)$; likelihood $P(\text{Data} \mid \theta)$: binomial $\theta^{\alpha_1} (1 - \theta)^{\alpha_0}$
  • Posterior $P(\theta \mid \text{Data})$: $\mathrm{Beta}(\alpha_1 + \beta_1, \alpha_0 + \beta_0)$
Slide credit: Tom Mitchell
How many parameters?
• Suppose $X = [X_1, \cdots, X_n]$, where the $X_i$ and $Y$ are Boolean random variables
• To estimate $P(Y \mid X_1, \cdots, X_n)$ we need one parameter $P(Y = 1 \mid x)$ for each of the $2^n$ settings of $x$:
  • When $n = 2$ (Gender, Hours-worked): $2^2 = 4$ parameters
  • When $n = 30$: $2^{30} \approx 10^9$ parameters
Slide credit: Tom Mitchell
Can we reduce parameters using Bayes rule?
$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$
• How many parameters for $P(X_1, \cdots, X_n \mid Y)$? $(2^n - 1) \times 2$
• How many parameters for $P(Y)$? $1$
Slide credit: Tom Mitchell
Today’s plan
• Probability basics
• Estimating parameters from data
  • Maximum likelihood (ML)
  • Maximum a posteriori estimation (MAP)
• Naïve Bayes
Naïve Bayes
• Assumption:
  $P(X_1, \cdots, X_n \mid Y) = \prod_{j=1}^{n} P(X_j \mid Y)$
• i.e., $X_i$ and $X_j$ are conditionally independent given $Y$ for $i \neq j$
Slide credit: Tom Mitchell
Conditional independence
• Definition: $X$ is conditionally independent of $Y$ given $Z$ if the probability distribution governing $X$ is independent of the value of $Y$, given the value of $Z$:
  $\forall i, j, k \quad P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$
  More compactly: $P(X \mid Y, Z) = P(X \mid Z)$
• Example: $P(\text{Thunder} \mid \text{Rain}, \text{Lightning}) = P(\text{Thunder} \mid \text{Lightning})$
Slide credit: Tom Mitchell
Applying conditional independence
• Naïve Bayes assumes the $X_i$ are conditionally independent given $Y$, e.g., $P(X_1 \mid X_2, Y) = P(X_1 \mid Y)$
  $P(X_1, X_2 \mid Y) = P(X_1 \mid X_2, Y)\, P(X_2 \mid Y) = P(X_1 \mid Y)\, P(X_2 \mid Y)$
• General form: $P(X_1, \cdots, X_n \mid Y) = \prod_{j=1}^{n} P(X_j \mid Y)$
• How many parameters to describe $P(X_1, \cdots, X_n \mid Y)$ and $P(Y)$?
  • Without the conditional independence assumption: $(2^n - 1) \times 2$ for the conditional, plus $1$ for $P(Y)$
  • With the conditional independence assumption: $2n$ for the conditional, plus $1$ for $P(Y)$
Slide credit: Tom Mitchell
Naïve Bayes classifier
• Bayes rule:
  $P(Y = y_k \mid X_1, \cdots, X_n) = \frac{P(Y = y_k)\, P(X_1, \cdots, X_n \mid Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1, \cdots, X_n \mid Y = y_j)}$
• Assume conditional independence among the $X_i$'s:
  $P(Y = y_k \mid X_1, \cdots, X_n) = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}$
• Pick the most probable $Y$:
  $Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
Slide credit: Tom Mitchell
Naïve Bayes algorithm – discrete $X_i$
• For each value $y_k$:
  Estimate $\pi_k = P(Y = y_k)$
  For each value $x_{ij}$ of each attribute $X_i$: estimate $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$
• Classify $X^{\mathrm{test}}$:
  $Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{\mathrm{test}} \mid Y = y_k)$
  $Y \leftarrow \arg\max_{y_k} \pi_k \prod_i \theta_{ijk}$
Estimating parameters: discrete $Y$, $X_i$
• Maximum likelihood estimates (MLE):
  $\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|}$
  $\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$
Slide credit: Tom Mitchell
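A compact sketch of the discrete algorithm using exactly these counts (plain Python; the toy data and helper names are made up for illustration):

```python
from collections import Counter

def train_nb(examples):
    """examples: list of (x, y) with x a tuple of discrete attribute values.
    Returns MLE estimates pi[y] and theta[(i, x_ij, y)]."""
    class_counts = Counter(y for _, y in examples)
    feat_counts = Counter((i, v, y) for x, y in examples for i, v in enumerate(x))
    m = len(examples)
    pi = {y: c / m for y, c in class_counts.items()}
    theta = {key: c / class_counts[key[2]] for key, c in feat_counts.items()}
    return pi, theta

def classify(x, pi, theta):
    """Y <- argmax_yk pi_k * prod_i theta_ijk (unseen values get 0)."""
    def score(y):
        p = pi[y]
        for i, v in enumerate(x):
            p *= theta.get((i, v, y), 0.0)
        return p
    return max(pi, key=score)

# Toy data: x = (attribute 1, attribute 2), y = class label
data = [((1, 1), "a"), ((1, 0), "a"), ((0, 1), "b"), ((0, 0), "b")]
pi, theta = train_nb(data)
print(classify((1, 1), pi, theta))  # -> "a"
```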
Example
• $F = 1$ iff you live in Fox Ridge
• $S = 1$ iff you watched the Super Bowl last night
• $D = 1$ iff you drive to VT
• $G = 1$ iff you went to the gym in the last month
Parameters to estimate: $P(F = 1)$ and $P(F = 0)$; $P(S = 1 \mid F = 1)$, $P(S = 1 \mid F = 0)$, $P(S = 0 \mid F = 1)$, $P(S = 0 \mid F = 0)$; and likewise for $D$ and $G$.
$P(F \mid S, D, G) \propto P(F)\, P(S \mid F)\, P(D \mid F)\, P(G \mid F)$
Naïve Bayes: Subtlety #1
• Often the $X_i$ are not really conditionally independent
• Naïve Bayes often works pretty well anyway
  • Often gives the right classification, even when not the right probability [Domingos & Pazzani, 1996]
• What is the effect on the estimated $P(Y \mid X)$? What if we have two copies, $X_i = X_k$?
  $P(Y = y_k \mid X_1, \cdots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
Slide credit: Tom Mitchell
Naïve Bayes: Subtlety #2
• The MLE estimate for $P(X_i \mid Y = y_k)$ might be zero (for example, $X_i$ = birthdate, with value $X_i$ = Feb_4_1995)
• Why worry about just one parameter out of many?
  $P(Y = y_k \mid X_1, \cdots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
  A single zero factor drives the whole product to zero.
• What can we do to address this? MAP estimates (adding "imaginary" examples)
Slide credit: Tom Mitchell
Estimating parameters: discrete $Y$, $X_i$
• Maximum likelihood estimates (MLE):
  $\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|}$
  $\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$
• MAP estimates (Dirichlet priors):
  $\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\} + (\beta_k - 1)}{|D| + \sum_m (\beta_m - 1)}$
  $\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\} + (\beta_k - 1)}{\#D\{Y = y_k\} + \sum_m (\beta_m - 1)}$
Slide credit: Tom Mitchell
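With a symmetric prior, setting every $\beta = 2$ recovers add-one (Laplace) smoothing. A one-function sketch of the smoothed attribute estimate (the counts are hypothetical):

```python
def theta_map(count_xy, count_y, n_values, beta=2):
    """MAP estimate of P(X_i = x_ij | Y = y_k) with a symmetric Dirichlet prior.
    beta = 2 for every value is add-one smoothing; n_values is the number of
    values attribute X_i can take."""
    return (count_xy + (beta - 1)) / (count_y + n_values * (beta - 1))

# A never-observed value no longer gets probability zero:
print(theta_map(count_xy=0, count_y=10, n_values=2))  # 1/12 instead of 0
```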
What if we have continuous $X_i$?
• Gaussian Naïve Bayes (GNB): assume
  $P(X_i = x \mid Y = y_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left( -\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2} \right)$
• Possible additional assumptions on $\sigma_{ik}$:
  • independent of $Y$ ($\sigma_i$)
  • independent of $X_i$ ($\sigma_k$)
  • independent of both $X_i$ and $Y$ ($\sigma$)
Slide credit: Tom Mitchell
Naïve Bayes algorithm – continuous $X_i$
• For each value $y_k$:
  Estimate $\pi_k = P(Y = y_k)$
  For each attribute $X_i$: estimate the class-conditional mean $\mu_{ik}$ and variance $\sigma_{ik}$
• Classify $X^{\mathrm{test}}$:
  $Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{\mathrm{test}} \mid Y = y_k)$
  $Y \leftarrow \arg\max_{y_k} \pi_k \prod_i \mathrm{Normal}(X_i^{\mathrm{test}}; \mu_{ik}, \sigma_{ik})$
Slide credit: Tom Mitchell
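A compact sketch of GNB with per-class, per-attribute means and variances (assuming NumPy; the toy data is made up):

```python
import numpy as np

def train_gnb(X, y):
    """Estimate pi_k, mu_ik, sigma_ik for each class k and attribute i."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(X),       # pi_k
                     Xk.mean(axis=0),        # mu_ik
                     Xk.std(axis=0) + 1e-9)  # sigma_ik (jitter avoids /0)
    return params

def classify_gnb(x, params):
    """Y <- argmax_k log pi_k + sum_i log Normal(x_i; mu_ik, sigma_ik),
    dropping the constant -log sqrt(2*pi) term (same for every class)."""
    def log_score(k):
        pi, mu, sigma = params[k]
        return np.log(pi) - np.sum(np.log(sigma) + (x - mu) ** 2 / (2 * sigma ** 2))
    return max(params, key=log_score)

X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [3.2, 0.4]])
y = np.array([0, 0, 1, 1])
print(classify_gnb(np.array([1.1, 1.9]), train_gnb(X, y)))  # -> 0
```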
Things to remember
• Probability basics
• Estimating parameters from data
  • Maximum likelihood (ML): maximize $P(\text{Data} \mid \theta)$
  • Maximum a posteriori estimation (MAP): maximize $P(\theta \mid \text{Data})$
• Naïve Bayes:
  $P(Y = y_k \mid X_1, \cdots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$