STATISTICAL LEARNING (FROM DATA TO DISTRIBUTIONS)
AGENDA
Learning a discrete probability distribution from data
Maximum likelihood estimation (MLE)
Maximum a posteriori (MAP) estimation
MOTIVATION
Previously studied how to infer characteristics of a distribution, given a fully-specified Bayes net
This lecture: where does the specification of the Bayes net come from? Specifically: estimating the parameters of the
CPTs from fully-specified data
LEARNING IN GENERAL
Agent has made observations (data)
Now must make sense of them (hypotheses)
Why?
Hypotheses alone may be important (e.g., in basic science)
For inference (e.g., forecasting)
To take sensible actions (decision making)
A basic component of economics, social and hard sciences, engineering, …
MACHINE LEARNING VS. STATISTICS
Machine Learning ≈ automated statistics
This lecture:
  Statistical learning (aka Bayesian learning)
  Maximum likelihood (ML) learning
  Maximum a posteriori (MAP) learning
  Learning Bayes Nets (R&N 20.1-3)
Future lectures: try to do more with even less data
  Decision tree learning
  Neural nets
  Support vector machines
  …
PUTTING GENERAL PRINCIPLES TO PRACTICE
Note: this lecture will cover general principles of statistical learning on toy problems
Grounded in some of the most theoretically principled approaches to learning
But the techniques are far more general
Practical application to larger problems requires a bit of mathematical savvy and “elbow grease”
BEAUTIFUL RESULTS IN MATH GIVE RISE TO SIMPLE IDEAS
OR
HOW TO JUMP THROUGH HOOPS IN ORDER TO JUSTIFY SIMPLE IDEAS …
CANDY EXAMPLE
• Candy comes in 2 flavors, cherry and lime, with identical wrappers
• Manufacturer makes 5 indistinguishable bags
• Suppose we draw …
• What bag are we holding? What flavor will we draw next?
h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%
BAYESIAN LEARNING
Main idea: Compute the probability of each hypothesis, given the data
Data d: Hypotheses: h1,…,h5
h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%
We want P(hi|d)…
but all we have is P(d|hi)!
USING BAYES’ RULE
P(hi|d) = α P(d|hi) P(hi) is the posterior
(Recall, 1/α = P(d) = Σi P(d|hi) P(hi))
P(d|hi) is the likelihood
P(hi) is the hypothesis prior
LIKELIHOOD AND PRIOR
Likelihood is the probability of observing the data, given the hypothesis model
Hypothesis prior is the probability of a hypothesis, before having observed any data
COMPUTING THE POSTERIOR
Assume draws are independent
Let P(h1),…,P(h5) = (0.1, 0.2, 0.4, 0.2, 0.1)
d = { 10 × lime }
P(d|h1) = 0, P(d|h2) = 0.25^10, P(d|h3) = 0.5^10, P(d|h4) = 0.75^10, P(d|h5) = 1^10
P(d|h1)P(h1) = 0, P(d|h2)P(h2) ≈ 2e-7, P(d|h3)P(h3) ≈ 4e-4, P(d|h4)P(h4) ≈ 0.011, P(d|h5)P(h5) = 0.1
Sum = 1/α ≈ 0.1114
P(h1|d) = 0, P(h2|d) = 0.00, P(h3|d) = 0.00, P(h4|d) = 0.10, P(h5|d) = 0.90
POSTERIOR HYPOTHESES
PREDICTING THE NEXT DRAW
P(X|d) = Σi P(X|hi,d) P(hi|d)
       = Σi P(X|hi) P(hi|d)
P(h1|d) = 0, P(h2|d) = 0.00, P(h3|d) = 0.00, P(h4|d) = 0.10, P(h5|d) = 0.90
[Diagram: hypothesis node H with children D (observed draws) and X (next draw)]
P(X|h1) = 0, P(X|h2) = 0.25, P(X|h3) = 0.5, P(X|h4) = 0.75, P(X|h5) = 1
Probability that next candy drawn is a lime
P(X|d) = 0.975
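Below is a minimal Python sketch of the calculation on the last two slides, assuming the prior (0.1, 0.2, 0.4, 0.2, 0.1) and ten observed limes from the example; variable names are illustrative.

# P(lime | h_i) for the five hypothesized bags, and the hypothesis prior from the slide
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]
prior     = [0.1, 0.2, 0.4, 0.2, 0.1]
n_limes   = 10                                    # observed data: 10 limes in a row

# Likelihood P(d | h_i) = P(lime | h_i)^10 under the i.i.d. assumption
likelihood = [p ** n_limes for p in lime_prob]

# Posterior P(h_i | d) = alpha * P(d | h_i) * P(h_i)
unnormalized = [l * p for l, p in zip(likelihood, prior)]
alpha = 1.0 / sum(unnormalized)                   # 1/alpha = P(d) ≈ 0.11
posterior = [alpha * u for u in unnormalized]

# Prediction: P(next is lime | d) = sum_i P(lime | h_i) P(h_i | d)
p_next_lime = sum(px * ph for px, ph in zip(lime_prob, posterior))
print([round(p, 3) for p in posterior])           # ≈ [0, 0.0, 0.003, 0.101, 0.896]
print(round(p_next_lime, 3))                      # ≈ 0.97 (the slide reports 0.975)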
P(NEXT CANDY IS LIME | D)
PROPERTIES OF BAYESIAN LEARNING
If exactly one hypothesis is correct, then the posterior probability of the correct hypothesis will tend toward 1 as more data is observed
The effect of the prior distribution decreases as more data is observed
HYPOTHESIS SPACES OFTEN INTRACTABLE
To learn a probability distribution, a hypothesis would have to be a joint probability table over state variables
2^n entries => hypothesis space is (2^n − 1)-dimensional!
2^(2^n) deterministic hypotheses
6 boolean variables => over 10^19 deterministic hypotheses
And what the heck would a prior be?
LEARNING COIN FLIPS
Let the unknown fraction of cherries be θ (hypothesis)
Probability of drawing a cherry is θ
Suppose draws are independent and identically distributed (i.i.d.)
Observe that c out of N draws are cherries (data)
LEARNING COIN FLIPS
Let the unknown fraction of cherries be θ (hypothesis)
Intuition: c/N might be a good hypothesis (or it might not, depending on the draw!)
MAXIMUM LIKELIHOOD
Likelihood of data d = {d1,…,dN} given θ:
P(d|θ) = Πj P(dj|θ) = θ^c (1−θ)^(N−c)
(i.i.d. assumption; gather the c cherry terms together, then the N−c lime terms)
MAXIMUM LIKELIHOOD
[Plots: P(data|θ) vs. θ with P(d|θ) = θ^c (1−θ)^(N−c), for 1/1, 2/2, 2/3, 2/4, 2/5, 10/20, and 50/100 cherries observed]
MAXIMUM LIKELIHOOD
Peaks of likelihood function seem to hover around the fraction of cherries…
Sharpness indicates some notion of certainty…
[Plot: P(data|θ) vs. θ for 50/100 cherries observed]
MAXIMUM LIKELIHOOD
Let P(d|θ) be the likelihood function
The quantity argmaxθ P(d|θ) is known as the maximum likelihood estimate (MLE)
MAXIMUM LIKELIHOOD
The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained (log is a monotonically increasing function)
So computing the MLE is the same as maximizing the log likelihood… but easier:
  Multiplications turn into additions
  We don’t have to deal with such tiny numbers
MAXIMUM LIKELIHOOD
The maximum of P(d|θ) is obtained at the same place that the maximum of log P(d|θ) is obtained (log is a monotonically increasing function)
l(θ) = log P(d|θ) = log[ θ^c (1−θ)^(N−c) ]
     = log[ θ^c ] + log[ (1−θ)^(N−c) ]
     = c log θ + (N−c) log(1−θ)
At a maximum of a function, its derivative is 0
So dl/dθ (θ) = 0 at the maximum likelihood estimate
  => 0 = c/θ − (N−c)/(1−θ)
  => θ = c/N
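A quick numerical check of this closed form, using a toy grid search (purely illustrative; the values of c and N are made up):

import math

c, N = 2, 5                                     # e.g., 2 cherries observed in 5 draws

def log_likelihood(theta):
    return c * math.log(theta) + (N - c) * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]       # avoid theta = 0 and theta = 1
theta_ml = max(grid, key=log_likelihood)
print(theta_ml, c / N)                          # both ≈ 0.4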
OTHER CLOSED-FORM MLE RESULTS
Multi-valued variables: take fraction of counts for each value
Continuous Gaussian distributions: take average value as mean, standard deviation of data as standard deviation
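For the Gaussian case above, a tiny sketch with made-up data (note that the ML standard deviation divides by N, not N−1):

data = [2.1, 1.9, 2.4, 2.0, 2.6]                # made-up observations

n = len(data)
mu_ml = sum(data) / n                           # ML mean = sample average
sigma_ml = (sum((x - mu_ml) ** 2 for x in data) / n) ** 0.5   # ML std: divide by n, not n-1
print(mu_ml, sigma_ml)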
MAXIMUM LIKELIHOOD FOR BN
For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matching parent values
[BN: Earthquake → Alarm ← Burglar]
N = 1000 samples; E: 500, B: 200
P(E) = 0.5, P(B) = 0.2
A|E,B: 19/20, A|¬E,B: 188/200, A|E,¬B: 170/500, A|¬E,¬B: 1/380

E B P(A|E,B)
T T 0.95
F T 0.94
T F 0.34
F F 0.003
BAYES NET ML ALGORITHM
Input: Bayes net with nodes X1,…,Xn, dataset D = (d[1],…,d[N])
Each d[i] = (x1[i],…,xn[i]) is a sample of the full state of the world
For each node X with parents Y1,…,Yk:
  For all y1 ∈ Val(Y1),…, yk ∈ Val(Yk):
    For all x ∈ Val(X), count the number of times (X=x, Y1=y1,…, Yk=yk) is observed in D; let this be mx
    Count the number of times (Y1=y1,…, Yk=yk) is observed in D; let this be m (note m = Σx mx)
    Set P(x|y1,…,yk) = mx / m for all x ∈ Val(X)
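A minimal Python sketch of this counting procedure, assuming fully observed binary data given as a list of dicts; the function name, data format, and the commented usage are illustrative assumptions, not from the lecture.

from collections import Counter
from itertools import product

def ml_cpt(data, child, parents, values=(0, 1)):
    """Estimate P(child | parents) by relative frequencies in fully observed data."""
    cpt = {}
    for pa in product(values, repeat=len(parents)):
        # rows matching this parent assignment (the count m in the slides)
        rows = [d for d in data if all(d[p] == v for p, v in zip(parents, pa))]
        m = len(rows)
        # m_x: count of each child value among those rows
        m_x = Counter(d[child] for d in rows)
        for x in values:
            cpt[(x,) + pa] = m_x[x] / m if m > 0 else None  # undefined when no matching data
    return cpt

# Hypothetical usage on the alarm example (data format assumed):
# data = [{'E': 0, 'B': 1, 'A': 1}, {'E': 1, 'B': 0, 'A': 0}, ...]
# cpt_A = ml_cpt(data, 'A', ('E', 'B'))   # keys are (a, e, b) tuples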
MAXIMUM LIKELIHOOD PROPERTIES
As the number of data points approaches infinity, the MLE approaches the true parameter values
With little data, MLEs can vary wildly
MAXIMUM LIKELIHOOD IN CANDY BAG EXAMPLE
hML = argmaxhi P(d|hi)
Prediction: P(X|d) ≈ P(X|hML)
For this data, hML is undefined before any draws and h5 afterwards
[Plot: P(X|hML) vs. P(X|d) as draws accumulate]
BACK TO COIN FLIPS
The MLE is easy to compute… but what about those small sample sizes?
Motivation: you hand me a coin from your pocket; 1 flip turns up tails. What’s the MLE?
A particularly acute problem for BN nodes with many parents!
(the data fragmentation problem)
MAXIMUM A POSTERIORI ESTIMATION
Maximum a posteriori (MAP) estimation Idea: use the hypothesis prior to get a better
initial estimate than ML, without resorting to full Bayesian estimation “Most coins I’ve seen have been fair coins, so I
won’t let the first few tails sway my estimate much”
“Now that I’ve seen 100 tails in a row, I’m pretty sure it’s not a fair coin anymore”
MAXIMUM A POSTERIORI
P(θ|d) is the posterior probability of the hypothesis, given the data
argmaxθ P(θ|d) is known as the maximum a posteriori (MAP) estimate
Posterior of hypothesis θ given data d = {d1,…,dN}: P(θ|d) = (1/α) P(d|θ) P(θ)
Maximizing over θ doesn’t affect α
So the MAP estimate is argmaxθ P(d|θ) P(θ)
MAXIMUM A POSTERIORI
hMAP = argmaxhi P(hi|d)
Prediction: P(X|d) ≈ P(X|hMAP)
For this data, hMAP goes h3 → h4 → h5 as more limes are observed
[Plot: P(X|hMAP) vs. P(X|d) as draws accumulate]
ADVANTAGES OF MAP AND MLE OVER BAYESIAN ESTIMATION
Involve an optimization rather than a large summation
  Local search techniques apply
For some types of distributions, there are closed-form solutions that are easily computed
BACK TO COIN FLIPS
Need some prior distribution P(θ): P(θ|d) ∝ P(d|θ) P(θ) = θ^c (1−θ)^(N−c) P(θ)
Define, for all θ, the probability that I believe in θ
[Plot: a prior density P(θ) over θ ∈ [0, 1]]
MAP ESTIMATE
Could maximize θ^c (1−θ)^(N−c) P(θ) using some optimization method
It turns out that for some families of P(θ), the MAP estimate is easy to compute
[Plot: Beta-distribution priors P(θ) over θ ∈ [0, 1]]
BETA DISTRIBUTION
Beta_{α,β}(θ) = γ θ^(α−1) (1−θ)^(β−1)
α, β are hyperparameters > 0
γ is a normalization constant
α = β = 1 is the uniform distribution
POSTERIOR WITH BETA PRIOR
Posterior ∝ θ^c (1−θ)^(N−c) P(θ) = γ θ^(c+α−1) (1−θ)^(N−c+β−1)
          = Beta_{α+c, β+N−c}(θ)
Prediction = mean E[θ] = (c+α)/(N+α+β)
POSTERIOR WITH BETA PRIOR
What does this mean?
The prior specifies a “virtual count” of a = α−1 heads, b = β−1 tails
See heads: increment a; see tails: increment b
The MAP estimate is a/(a+b)
The effect of the prior diminishes with more data
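A tiny numeric sketch of this bookkeeping; the specific α, β, and coin-flip data below are made up for illustration.

alpha, beta = 3.0, 3.0        # prior: virtual counts a = 2 "heads", b = 2 "tails"
c, N = 1, 1                   # observed: 1 head in 1 flip

post_alpha = alpha + c                      # posterior is Beta(alpha + c, beta + N - c)
post_beta  = beta + (N - c)

map_estimate    = (post_alpha - 1) / (post_alpha + post_beta - 2)   # a/(a+b) with updated virtual counts
predictive_mean = post_alpha / (post_alpha + post_beta)             # E[theta] = (c+alpha)/(N+alpha+beta)
print(map_estimate, predictive_mean)        # 0.6 and ≈0.571: pulled toward 0.5 by the prior, unlike the MLE of 1.0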
MAP FOR BN
For any BN, the MAP parameters of any CPT can be derived from the fraction of observed values in the data plus the prior’s virtual counts
[BN: Earthquake → Alarm ← Burglar]
N = 1000 + 1000 virtual; E: 500 + 100, B: 200 + 200
P(E) = 0.5 → 0.3, P(B) = 0.2 → 0.2
A|E,B: (19+100)/(20+100), A|¬E,B: (188+180)/(200+200), A|E,¬B: (170+100)/(500+400), A|¬E,¬B: (1+3)/(380+300)

E B P(A|E,B): ML → MAP
T T 0.95 → 0.99
F T 0.94 → 0.92
T F 0.34 → 0.30
F F 0.003 → 0.006
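The earlier ml_cpt sketch can be extended the same way; the uniform `virtual` pseudo-count argument below is a simplifying assumption (a symmetric Dirichlet-style prior), not necessarily the lecture’s choice.

from collections import Counter
from itertools import product

def map_cpt(data, child, parents, values=(0, 1), virtual=1.0):
    """Like ml_cpt above, but adds `virtual` pseudo-counts to every child value."""
    cpt = {}
    for pa in product(values, repeat=len(parents)):
        rows = [d for d in data if all(d[p] == v for p, v in zip(parents, pa))]
        counts = Counter(d[child] for d in rows)
        total = len(rows) + virtual * len(values)       # real counts + virtual counts
        for x in values:
            cpt[(x,) + pa] = (counts[x] + virtual) / total
    return cpt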
EXTENSIONS OF BETA PRIORS
Multiple discrete values: Dirichlet prior
  The mathematical expression is more complex, but in practice it still takes the form of “virtual counts”
Mean, standard deviation for Gaussian distributions: Gamma prior
Conjugate priors preserve the representation of prior and posterior distributions, but do not necessarily exist for general distributions
RECAP
Bayesian learning: infer the entire distribution over hypotheses
  Robust, but expensive for large problems
Maximum Likelihood (ML): optimize the likelihood
  Good for large datasets, not robust for small ones
Maximum A Posteriori (MAP): weight the likelihood by a hypothesis prior during optimization
  A compromise solution for small datasets
  Like Bayesian learning, raises the question of how to specify the prior
HOMEWORK
Read R&N 18.1-3
APPLICATION TO STRUCTURE LEARNING
So far we’ve assumed a BN structure, which assumes we have enough domain knowledge to declare strict independence relationships
Problem: how to learn a sparse probabilistic model whose structure is unknown?
BAYESIAN STRUCTURE LEARNING
But what if the Bayesian network has unknown structure?
Are Earthquake and Burglar independent? P(E,B) = P(E)P(B)?
[Diagram: Earthquake and Burglar nodes with a candidate edge marked “?”; table of observed (B, E) samples: (0,0), (1,1), (0,1), (1,0), …]
BAYESIAN STRUCTURE LEARNING
Hypothesis 1: dependent
  P(E,B) as a probability table has 3 free parameters
Hypothesis 2: independent
  The individual tables P(E), P(B) have 2 total free parameters
Maximum likelihood will always prefer hypothesis 1! (a quick numerical check follows below)
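As a small check of this claim, here is a Python comparison on made-up (B, E) samples: the ML-fit joint table can never score lower than the ML-fit independent model, because the product of marginals is itself one possible joint table.

import math

# Made-up (B, E) samples; the point is only to compare the two hypotheses' fits.
samples = [(0, 0), (1, 1), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1)]
N = len(samples)

def log_lik(prob_of):
    return sum(math.log(prob_of(s)) for s in samples)

# Hypothesis 1 (dependent): ML joint table P(B, E) from counts (3 free parameters)
joint = {be: samples.count(be) / N for be in set(samples)}
ll_dependent = log_lik(lambda s: joint[s])

# Hypothesis 2 (independent): ML marginals P(B), P(E) (2 free parameters)
pB = sum(b for b, _ in samples) / N
pE = sum(e for _, e in samples) / N
ll_independent = log_lik(lambda s: (pB if s[0] else 1 - pB) * (pE if s[1] else 1 - pE))

print(ll_dependent, ll_independent)   # the dependent model always scores at least as high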
BAYESIAN STRUCTURE LEARNING
Occam’s razor: Prefer simpler explanations over complex ones
Penalize free parameters in probability tables
ITERATIVE GREEDY ALGORITHM
Fit a simple model (ML), compute likelihood
Repeat:
  Pick a candidate edge to add to the BN
  Compute new ML probabilities
  If the new likelihood exceeds the old likelihood + threshold, add the edge
  Otherwise, repeat with a different edge
(Usually log likelihood)
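Below is a rough, self-contained Python sketch of this greedy loop under simplifying assumptions (binary variables, fully observed data as a list of dicts, no acyclicity check, a fixed improvement threshold on the log likelihood); it is an illustration rather than the lecture’s exact algorithm.

import math
from collections import Counter
from itertools import product

def node_log_lik(data, child, parents):
    """Sum of log P_ML(x | parent values) over the data, for one node."""
    ll = 0.0
    for pa in product((0, 1), repeat=len(parents)):
        rows = [d[child] for d in data
                if all(d[p] == v for p, v in zip(parents, pa))]
        for x, m_x in Counter(rows).items():
            ll += m_x * math.log(m_x / len(rows))
        # parent configurations with no data contribute nothing
    return ll

def greedy_structure(data, variables, threshold=1.0):
    parents = {v: [] for v in variables}                         # start with no edges
    score = sum(node_log_lik(data, v, parents[v]) for v in variables)
    improved = True
    while improved:
        improved = False
        for u in variables:
            for v in variables:
                if u == v or u in parents[v]:
                    continue                                     # note: no cycle check here
                gain = (node_log_lik(data, v, parents[v] + [u])
                        - node_log_lik(data, v, parents[v]))
                if gain > threshold:                             # add the edge only if it helps enough
                    parents[v].append(u)
                    score += gain
                    improved = True
    return parents, score

# Hypothetical usage (data format assumed):
# data = [{'E': 0, 'B': 1, 'A': 1}, ...]
# print(greedy_structure(data, ['E', 'B', 'A']))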
OTHER TOPICS IN LEARNING DISTRIBUTIONS
Learning continuous distributions using kernel density estimation / nearest neighbor smoothing
Expectation Maximization for hidden variables
  Learning Bayes nets with hidden variables
  Learning HMMs from observations
  Learning Gaussian Mixture Models of continuous data
NEXT TIME
Introduction to machine learning (R&N 18.1-2)