Parameter Estimation using likelihood functions
Tutorial #1
This class has been cut and slightly edited from Nir Friedman’s full course of 12 lectures which is available at www.cs.huji.ac.il/~pmai. Changes made by Dan Geiger and Ydo Wexler.
Example: Binomial Experiment
When tossed, a thumbtack can land in one of two positions: Head or Tail.
[Figure: a thumbtack in the Head and Tail landing positions]
We denote by $\theta$ the (unknown) probability P(H).
Estimation task: Given a sequence of toss samples x[1], x[2], …, x[M] we want to estimate the probabilities P(H) = $\theta$ and P(T) = $1 - \theta$.
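To make the estimation task concrete, here is a minimal Python sketch (not from the original slides) that simulates M thumbtack tosses for an assumed true $\theta$ and estimates P(H) by the observed frequency of heads; `theta_true` and the sample size are made up for illustration.

```python
# A minimal sketch: simulate M thumbtack tosses with a true but
# "unknown" theta, then estimate P(H) by the frequency of heads.
import random

random.seed(0)
theta_true = 0.3          # unknown in practice; fixed here for the demo
M = 1000                  # number of toss samples x[1..M]
samples = ['H' if random.random() < theta_true else 'T' for _ in range(M)]

theta_hat = samples.count('H') / M
print(f"estimated P(H) = {theta_hat:.3f}, P(T) = {1 - theta_hat:.3f}")
```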
Statistical Parameter Fitting
Consider instances x[1], x[2], …, x[M] such that:
• The set of values that x can take is known
• Each is sampled from the same distribution
• Each is sampled independently of the rest
That is, the instances are i.i.d. samples.
The task is to find a vector of parameters $\Theta$ that has generated the given data. This parameter vector can be used to predict future data.
The Likelihood Function
How good is a particular $\theta$?
It depends on how likely it is to generate the observed data
The likelihood of the data given $\theta$ is
$$L_D(\theta) = P(D \mid \theta) = \prod_m P(x[m] \mid \theta)$$
The likelihood for the sequence H,T,T,H,H is
$$L_D(\theta) = \theta (1-\theta) (1-\theta) \theta \theta = \theta^3 (1-\theta)^2$$
[Plot: the likelihood $L_D(\theta)$ as a function of $\theta \in [0,1]$]
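A short sketch of this computation: evaluating $L_D(\theta)$ for the sequence H,T,T,H,H on a few grid points. The helper `likelihood` is illustrative, not from the slides.

```python
# Sketch: evaluate the likelihood of the sequence H,T,T,H,H over a
# grid of theta values, matching L(theta) = theta^3 * (1 - theta)^2.
def likelihood(theta, sequence):
    L = 1.0
    for x in sequence:
        L *= theta if x == 'H' else (1.0 - theta)
    return L

for theta in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    print(f"theta = {theta:.1f}  L(theta) = {likelihood(theta, 'HTTHH'):.4f}")
```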
Sufficient Statistics
To compute the likelihood in the thumbtack example we only require $N_H$ and $N_T$ (the number of heads and the number of tails).
$N_H$ and $N_T$ are sufficient statistics for the binomial distribution:
$$L_D(\theta) = \theta^{N_H} (1-\theta)^{N_T}$$
Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood
[Diagram: many different datasets mapping to the same statistics]
Formally, s(D) is a sufficient statistic if, for any two datasets D and D',
$$s(D) = s(D') \;\Rightarrow\; L_D(\theta) = L_{D'}(\theta)$$
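A quick illustration of this definition, assuming the thumbtack model: two datasets with different orderings but the same counts $(N_H, N_T)$ produce identical likelihood values.

```python
# Sketch: two different datasets with the same sufficient statistics
# (N_H, N_T) = (3, 2) yield identical likelihood functions, even
# though the likelihood is computed sample by sample.
def likelihood(theta, seq):
    L = 1.0
    for x in seq:                      # product over individual samples
        L *= theta if x == 'H' else (1.0 - theta)
    return L

D1, D2 = 'HTTHH', 'HHHTT'              # same counts, different orderings
for theta in (0.3, 0.5, 0.7):
    assert abs(likelihood(theta, D1) - likelihood(theta, D2)) < 1e-12
print("s(D1) == s(D2) implies L_D1(theta) == L_D2(theta)")
```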
Maximum Likelihood Estimation
MLE Principle:
Choose parameters that maximize the likelihood function
This is one of the most commonly used estimators in statistics
It is intuitively appealing.
One usually maximizes the log-likelihood function, defined as $l_D(\theta) = \log_e L_D(\theta)$.
Example: MLE in Binomial Data
Applying the MLE principle, we maximize the log-likelihood
$$l_D(\theta) = N_H \log \theta + N_T \log (1 - \theta)$$
Setting the derivative with respect to $\theta$ to zero gives
$$\hat{\theta} = \frac{N_H}{N_H + N_T}$$
Example: $(N_H, N_T) = (3, 2)$, so the MLE estimate is 3/5 = 0.6.
[Plot: $L_D(\theta)$ for $\theta \in [0,1]$, maximized at 0.6]
(Which coincides with what one would expect)
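As a sanity check (a sketch, not part of the original tutorial), one can compare the closed-form estimate against a brute-force grid search over the log-likelihood:

```python
# Sketch: check that the closed form N_H / (N_H + N_T) matches a grid
# search over the log-likelihood for (N_H, N_T) = (3, 2).
import math

N_H, N_T = 3, 2
def log_likelihood(theta):
    return N_H * math.log(theta) + N_T * math.log(1.0 - theta)

grid = [i / 10000 for i in range(1, 10000)]   # avoid theta = 0 and 1
theta_numeric = max(grid, key=log_likelihood)
theta_closed = N_H / (N_H + N_T)
print(theta_numeric, theta_closed)            # both 0.6
```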
From Binomial to Multinomial
Suppose X can take the values 1, 2, …, K (for example, a die has 6 sides).
We want to learn the parameters $\theta_1, \theta_2, \ldots, \theta_K$.
Sufficient statistics: $N_1, N_2, \ldots, N_K$ - the number of times each outcome is observed.
Likelihood function:
$$L_D(\theta) = \prod_{k=1}^{K} \theta_k^{N_k}$$
MLE:
$$\hat{\theta}_k = \frac{N_k}{N}, \quad \text{where } N = \sum_k N_k$$
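A minimal sketch of the multinomial MLE for a 6-sided die; the rolls below are made-up data, not from the slides.

```python
# Sketch: multinomial MLE for a die. theta_k = N_k / N, where N_k
# counts how often outcome k was observed among N rolls.
from collections import Counter

rolls = [1, 6, 3, 6, 2, 6, 4, 1, 6, 5, 3, 6]
N = len(rolls)
counts = Counter(rolls)
theta_hat = {k: counts.get(k, 0) / N for k in range(1, 7)}
print(theta_hat)   # outcome 6 gets probability 5/12
```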
Example: Multinomial
Let $x_1 x_2 \cdots x_n$ be a protein sequence.
We want to learn the parameters $q_1, q_2, \ldots, q_{20}$ corresponding to the frequencies of the 20 amino acids.
Sufficient statistics: $N_1, N_2, \ldots, N_{20}$ - the number of times each amino acid is observed in the sequence.
Likelihood function:
$$L_D(q) = \prod_{k=1}^{20} q_k^{N_k}$$
MLE:
$$\hat{q}_k = \frac{N_k}{n}$$
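A sketch of the same computation on a short, made-up peptide string; real applications would use full sequences from a database such as SWISS-PROT.

```python
# Sketch: MLE of amino acid frequencies q_k = N_k / n from a toy,
# made-up protein sequence (standard one-letter amino acid codes).
from collections import Counter

seq = "MKVLAAGLLLALAVSGGKA"
n = len(seq)
q_hat = {aa: count / n for aa, count in Counter(seq).items()}
for aa, q in sorted(q_hat.items()):
    print(f"q({aa}) = {q:.3f}")
```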
Is MLE all we need?
Suppose that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack. Would you bet on heads for the next toss?
Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin. Would you place the same bet?
Solution: the Bayesian approach, which incorporates your subjective prior knowledge. E.g., you may know a priori that some amino acids have the same frequencies and some have low frequencies. How would one use this information?
Bayes’ rule
Bayes' rule:
$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$
where $P(x, y) = P(x \mid y)\, P(y)$ and
$$P(x) = \sum_y P(x \mid y)\, P(y)$$
It holds because
$$P(x) = \sum_y P(x, y) = \sum_y P(x \mid y)\, P(y)$$
Example: Dishonest Casino
A casino uses 2 kinds of dice:
• 99% are fair
• 1% are loaded: 6 comes up 50% of the time
We pick a die at random and roll it 3 times. We get 3 consecutive sixes.
What is the probability the die is loaded?
Dishonest Casino (cont.)
The solution is based on Bayes' rule,
$$P(\text{loaded} \mid 3\text{ sixes}) = \frac{P(3\text{ sixes} \mid \text{loaded})\, P(\text{loaded})}{P(3\text{ sixes})}$$
and the fact that while P(loaded | 3 sixes) is not known, the other three terms in Bayes' rule are known, namely:
$P(3\text{ sixes} \mid \text{loaded}) = (0.5)^3$
$P(\text{loaded}) = 0.01$
$P(3\text{ sixes}) = P(3\text{ sixes} \mid \text{loaded})\, P(\text{loaded}) + P(3\text{ sixes} \mid \text{not loaded})\, (1 - P(\text{loaded}))$
Dishonest Casino (cont.)
$$P(\text{loaded} \mid 3\text{ sixes}) = \frac{P(3\text{ sixes} \mid \text{loaded})\, P(\text{loaded})}{P(3\text{ sixes} \mid \text{loaded})\, P(\text{loaded}) + P(3\text{ sixes} \mid \text{fair})\, P(\text{fair})} = \frac{0.5^3 \cdot 0.01}{0.5^3 \cdot 0.01 + \left(\tfrac{1}{6}\right)^3 \cdot 0.99} \approx 0.21$$
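The same calculation as a few lines of Python (a sketch reproducing the numbers above):

```python
# Sketch: the dishonest-casino calculation via Bayes' rule.
p_loaded = 0.01
p_3sixes_loaded = 0.5 ** 3
p_3sixes_fair = (1 / 6) ** 3

p_3sixes = p_3sixes_loaded * p_loaded + p_3sixes_fair * (1 - p_loaded)
p_loaded_given_3sixes = p_3sixes_loaded * p_loaded / p_3sixes
print(f"P(loaded | 3 sixes) = {p_loaded_given_3sixes:.2f}")   # ~0.21
```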
Biological Example: Proteins
Extracellular proteins have a slightly different amino acid composition than intracellular proteins.
From a large enough protein database (SWISS-PROT), we can get the following:
p(ext) - the probability that any new sequence is extracellular
p(int) - the probability that any new sequence is intracellular
$q_{a_i}^{ext} = p(a_i \mid ext)$ - the frequency of amino acid $a_i$ for extracellular proteins
$q_{a_i}^{int} = p(a_i \mid int)$ - the frequency of amino acid $a_i$ for intracellular proteins
Biological Example: Proteins (cont.)
What is the probability that a given new protein sequence $x = x_1 x_2 \cdots x_n$ is extracellular?
Assuming that every sequence is either extracellular or intracellular (but not both), we can write
$$p(ext) = 1 - p(int)$$
Thus,
$$p(x) = p(ext)\, p(x \mid ext) + p(int)\, p(x \mid int)$$
Biological Example: Proteins (cont.)
Using conditional probability (and assuming the amino acids are independent given the class) we get
$$p(x \mid ext) = \prod_i p(x_i \mid ext), \qquad p(x \mid int) = \prod_i p(x_i \mid int)$$
• The probabilities p(int), p(ext) are called the prior probabilities.
• The probability P(ext|x) is called the posterior probability.
By Bayes' theorem,
$$P(ext \mid x) = \frac{p(ext) \prod_i p(x_i \mid ext)}{p(ext) \prod_i p(x_i \mid ext) + p(int) \prod_i p(x_i \mid int)}$$
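A sketch of this posterior computation. The priors and the frequency tables below are hypothetical placeholders over a toy 3-letter alphabet, not SWISS-PROT values; the products are accumulated in log space to avoid underflow on long sequences.

```python
# Sketch: compute P(ext | x) under the independence assumption.
# q_ext and q_int are made-up placeholder frequency tables.
import math

p_ext, p_int = 0.5, 0.5                        # placeholder priors
q_ext = {'A': 0.10, 'C': 0.05, 'K': 0.85}      # hypothetical p(a | ext)
q_int = {'A': 0.30, 'C': 0.40, 'K': 0.30}      # hypothetical p(a | int)

def posterior_ext(x):
    log_ext = math.log(p_ext) + sum(math.log(q_ext[a]) for a in x)
    log_int = math.log(p_int) + sum(math.log(q_int[a]) for a in x)
    m = max(log_ext, log_int)                  # log-sum-exp for stability
    return math.exp(log_ext - m) / (math.exp(log_ext - m) + math.exp(log_int - m))

print(f"P(ext | 'KKACK') = {posterior_ext('KKACK'):.3f}")
```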
Bayesian Parameter Estimation (step by step)
1. View the unknown parameter as a random variable.
2. Use probability to quantify the uncertainty about the unknown parameter.
3. Updating the parameter follows from the rules of probability, via Bayes' rule:
$$P(\theta \mid Data) \propto P(Data \mid \theta)\, P(\theta)$$
(posterior of $\theta$ $\propto$ likelihood $\times$ prior of $\theta$)
The updated parameter is set to be the expected value of the posterior $P(\theta \mid Data)$.
Example: Binomial Data Revisited
Prior: say, a uniform distribution for $\theta$ in [0,1], so $P(\theta) = 1$ (say, due to seeing one head ($\alpha_h = 1$) and one tail ($\alpha_t = 1$) before seeing the data).
Then for data $(N_H, N_T) = (4, 1)$:
MLE for $\theta$ is 4/5 = 0.8.
Bayesian estimation gives, via a Dirichlet integral,
$$\hat{\theta} = \int \theta\, P(\theta \mid D)\, d\theta = \frac{N_H + \alpha_h}{(N_H + \alpha_h) + (N_T + \alpha_t)} = \frac{4 + 1}{(4 + 1) + (1 + 1)} = \frac{5}{7} \approx 0.7142$$
[Plot: the posterior $P(\theta \mid D)$ over $\theta \in [0,1]$]
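A sketch reproducing this comparison; the pseudocounts $\alpha_h = \alpha_t = 1$ encode the uniform prior:

```python
# Sketch: binomial data (N_H, N_T) = (4, 1) with a uniform prior
# encoded as pseudocounts alpha_h = alpha_t = 1. The posterior-mean
# estimate is (4+1) / ((4+1) + (1+1)) = 5/7.
N_H, N_T = 4, 1
alpha_h, alpha_t = 1, 1

theta_mle = N_H / (N_H + N_T)
theta_bayes = (N_H + alpha_h) / ((N_H + alpha_h) + (N_T + alpha_t))
print(f"MLE = {theta_mle:.4f}, Bayesian = {theta_bayes:.4f}")  # 0.8 vs 0.7143
```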
The multinomial case
The hyper-parameters $\alpha_1, \ldots, \alpha_K$ can be thought of as "imaginary" counts from our prior experience.
Imaginary sample size: $N' = \alpha_1 + \cdots + \alpha_K$
The larger the imaginary sample size, the more confident we are in our prior and the more it influences the outcome.
Bayesian Estimation vs. MLE:
$$\hat{\theta}_k = \frac{N_k + \alpha_k}{N + N'} \;\text{(Bayesian)} \qquad\qquad \hat{\theta}_k = \frac{N_k}{N} \;\text{(MLE)}$$
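A sketch comparing the two estimators on the made-up die rolls from before, assuming imaginary counts $\alpha_k = 2$ per face (imaginary sample size 12):

```python
# Sketch: Bayesian vs. MLE estimates for a die with pseudocounts.
from collections import Counter

rolls = [1, 6, 3, 6, 2, 6, 4, 1, 6, 5, 3, 6]
alpha = {k: 2 for k in range(1, 7)}            # imaginary counts
N, N_prime = len(rolls), sum(alpha.values())   # real and imaginary sizes
counts = Counter(rolls)

for k in range(1, 7):
    mle = counts.get(k, 0) / N
    bayes = (counts.get(k, 0) + alpha[k]) / (N + N_prime)
    print(f"k={k}: MLE={mle:.3f}  Bayesian={bayes:.3f}")
```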
Protein Example Continued (estimation of $\theta_{ext}$)
• Let's assume that from the biological literature we know that 40% of the proteins are extracellular and 60% are intracellular:
$\theta_{ext} = p(ext) = 0.4$
• Recall that, by Bayes' theorem, we had
$$p(ext \mid x) = \frac{p(ext) \prod_i p(x_i \mid ext)}{p(ext) \prod_i p(x_i \mid ext) + p(int) \prod_i p(x_i \mid int)}$$
• Now, given a protein sequence x, we have all the information we need to calculate p(ext | x).
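A usage sketch that plugs the prior p(ext) = 0.4 into the formula; `q_ext` and `q_int` remain the hypothetical placeholder tables from the earlier sketch, not real amino-acid frequencies.

```python
# Usage sketch: posterior p(ext | x) with the literature prior
# p(ext) = 0.4, so p(int) = 0.6. Frequency tables are made up.
import math

p_ext, p_int = 0.4, 0.6
q_ext = {'A': 0.10, 'C': 0.05, 'K': 0.85}
q_int = {'A': 0.30, 'C': 0.40, 'K': 0.30}

x = "KKACK"
num = p_ext * math.prod(q_ext[a] for a in x)
den = num + p_int * math.prod(q_int[a] for a in x)
print(f"P(ext | {x!r}) = {num / den:.3f}")
```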