Introduction to Bayesian statistics Yves Moreau


Page 1

Introduction to Bayesian statistics

Yves Moreau

Page 2

Overview

The Cox-Jaynes axioms
Bayes' rule
Probabilistic models
  Maximum likelihood
  Maximum a posteriori
Bayesian inference
Multinomial and Dirichlet distributions
Estimation of frequency matrices
  Pseudocounts
  Dirichlet mixture

Page 3

The Cox-Jaynes axioms and Bayes’ rule

Page 4

Probability vs. belief

What is a probability?

Frequentist point of view
Probabilities are the limits of frequency counts (coin, die) and histograms (height of people)
Such definitions are somewhat circular because of the dependency on the Central Limit Theorem

Measure theory point of view
Probabilities satisfy Kolmogorov's σ-algebra axioms
This rigorous definition fits well within measure and integration theory
But the definition is ad hoc, made to fit within this framework

Page 5

Bayesian point of view
Probabilities are models of the uncertainty regarding propositions within a given domain

Induction vs. deduction
Deduction: IF (A ⇒ B AND A = TRUE) THEN B = TRUE
Induction: IF (A ⇒ B AND B = TRUE) THEN A becomes more plausible

Probabilities satisfy Bayes' rule

Page 6

The Cox-Jaynes axioms

The Cox-Jaynes axioms allow us to build up a large probabilistic framework from minimal assumptions

Firstly, some concepts:
A is a proposition: A is TRUE or FALSE
D is a domain: the information available about the current situation
Belief (A = TRUE | D): the belief that we have regarding the proposition, given the domain knowledge

Page 7

Secondly, some assumptions

1. Suppose we can compare beliefs: (A|D) > (B|D) means that A is more plausible than B given the domain, and suppose the comparison is transitive:

$$\text{IF } (A\,|\,\mathcal{D}) \ge (B\,|\,\mathcal{D}) \text{ AND } (B\,|\,\mathcal{D}) \ge (C\,|\,\mathcal{D}), \text{ THEN } (A\,|\,\mathcal{D}) \ge (C\,|\,\mathcal{D})$$

We then have an ordering relation, so a belief $(\cdot\,|\,\mathcal{D})$ can be represented by a number

Page 8

2. Suppose there exists a fixed relation between the belief in a proposition and the belief in the negation of this proposition:

$$(\bar{A}\,|\,\mathcal{D}) = f\big((A\,|\,\mathcal{D})\big)$$

3. Suppose there exists a fixed relation between, on the one hand, the belief in the conjunction of two propositions and, on the other hand, the belief in the first proposition and the belief in the second proposition given the first:

$$(A, B\,|\,\mathcal{D}) = g\big((A\,|\,\mathcal{D}),\,(B\,|\,A, \mathcal{D})\big)$$

Page 9

Bayes’ rule

THEN it can be shown (after rescaling of the beliefs) that

$$P(A\,|\,\mathcal{D}) + P(\bar{A}\,|\,\mathcal{D}) = 1$$

$$P(A, B\,|\,\mathcal{D}) = P(B\,|\,A, \mathcal{D})\,P(A\,|\,\mathcal{D})$$

and thus Bayes' rule:

$$P(B\,|\,A, \mathcal{D}) = \frac{P(A\,|\,B, \mathcal{D})\,P(B\,|\,\mathcal{D})}{P(A\,|\,\mathcal{D})}$$

If we accept the Cox-Jaynes axioms, we can always apply Bayes' rule, independently of the specific definition of the probabilities

Page 10

Bayes’ rule

Bayes' rule will be our main tool for building probabilistic models and for estimating them

Bayes' rule holds not only for statements (TRUE/FALSE) but also for random variables (discrete or continuous)

Bayes’ rule holds for specific realizations of the random variables as well as for the whole distribution

$$p(Y=y\,|\,X=x, \mathcal{D}) = \frac{p(X=x\,|\,Y=y, \mathcal{D})\;p(Y=y\,|\,\mathcal{D})}{p(X=x\,|\,\mathcal{D})}$$

$$p(Y\,|\,X, \mathcal{D}) = \frac{p(X\,|\,Y, \mathcal{D})\;p(Y\,|\,\mathcal{D})}{p(X\,|\,\mathcal{D})}$$

Page 11

Importance of the domain D

The domain D is a flexible concept that encapsulates the background information that is relevant for the problem

It is important to set up the problem within the right domain. Example:

Diagnosis of Tay-Sachs disease: a rare disease that appears more frequently among Ashkenazi Jews. With the same symptoms, the probability of the disease will be smaller if we are in a hospital in Brussels than if we are in Mount Sinai Hospital in New York.

If we try to build a model with all the patients in the world, this model will not be more efficient

Writing D for the disease, S for the symptoms, and Ashk for Ashkenazi ancestry:

$$P(D\,|\,S, \mathcal{D}_{BE}) < P(D\,|\,S, \mathcal{D}_{NY})$$

$$P(D\,|\,S, \mathcal{D}_{NY}) \neq P(D\,|\,S, \mathcal{D}_{World})$$

$$P(D\,|\,S, \text{Ashk}, \mathcal{D}_{NY}) = P(D\,|\,S, \text{Ashk}, \mathcal{D}_{World})$$

Page 12

Probabilistic models and inference

Page 13

Probabilistic models

We have a domain $\mathcal{D}$
We have observations D
We have a model M with parameters $\theta$

Example 1
Domain $\mathcal{D}$: the genome of a given organism
Data D: a DNA sequence S = 'ACCTGATCACCCT'
Model M: the sequences are generated by a discrete distribution over the alphabet {A,C,G,T}
Parameters: $\theta = (\theta_A, \theta_C, \theta_G, \theta_T)$, with $\theta_A + \theta_C + \theta_G + \theta_T = 1$

Page 14

Example 2
Domain $\mathcal{D}$: all European people
Data D: the heights of people from a given group
Model M: height is normally distributed, $N(m, \sigma)$
Parameters $\theta$: the mean $m$ and the standard deviation $\sigma$

Page 15

Generative models

It is often possible to set up a model of the likelihood of the data. For example, for the DNA sequence:

$$P(S\,|\,M, \theta) = \prod_{i=1}^{L} \theta_{S_i}$$

More sophisticated models are possible: HMMs, Gibbs sampling for motif finding, Bayesian networks

We want to find the model that describes our observations
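As a minimal sketch of this likelihood (the sequence is from the slides; the nucleotide frequencies are hypothetical, since estimating them is the subject of the following slides):

```python
import numpy as np

# Hypothetical nucleotide frequencies (must sum to 1); in practice they are
# estimated from data, as discussed in the slides that follow.
theta = {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}

S = 'ACCTGATCACCCT'

# P(S | M, theta) = prod_i theta_{S_i}, computed in log space to avoid
# numerical underflow on long sequences.
log_lik = sum(np.log(theta[s]) for s in S)
print(f"P(S | M, theta) = {np.exp(log_lik):.3e}")
```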

Page 16

Maximum likelihood

Maximum likelihood (ML):

$$\theta^{ML} = \arg\max_{\theta} P(D\,|\,\theta, M)$$

Consistent: if the observations were generated by the model M with parameters $\theta^*$, then $\theta^{ML}$ will converge to $\theta^*$ as the number of observations goes to infinity (see the sketch below)

Note that the data might not be generated by any instance of the model

If the data set is small, there might be a large difference between $\theta^{ML}$ and $\theta^*$
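A quick numerical check of the consistency property, with an arbitrary "true" multinomial distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([0.1, 0.2, 0.3, 0.4])   # arbitrary 'true' parameters

for N in (10, 100, 10_000, 1_000_000):
    counts = rng.multinomial(N, theta_true)
    theta_ml = counts / N                     # ML estimate: relative frequencies
    print(N, np.round(theta_ml, 3))
# theta_ml approaches theta_true as N grows; for small N it can be far off.
```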

Page 17

Maximum a posteriori probability

Maximum a posteriori probability (MAP):

$$\theta^{MAP} = \arg\max_{\theta} P(\theta\,|\,D, M)$$

Bayes' rule:

$$P(\theta\,|\,D, M) = \frac{P(D\,|\,\theta, M)\,P(\theta\,|\,M)}{P(D\,|\,M)}$$

(posterior = likelihood of the data × prior / evidence; the prior carries the a priori knowledge, and the denominator $P(D\,|\,M)$ plays no role in the optimization over $\theta$)

Thus

$$\theta^{MAP} = \arg\max_{\theta} P(D\,|\,\theta, M)\,P(\theta\,|\,M)$$

Page 18

Posterior mean estimate

Posterior mean estimate

$$\theta^{PME} = \int \theta \cdot P(\theta\,|\,D, M)\,d\theta$$

Page 19

Distributions over parameters

Let us look carefully at $P(\theta\,|\,M)$ (or at $P(\theta\,|\,D, M)$)

$P(\theta\,|\,M)$ is a probability distribution over the PARAMETERS

We have to handle distributions over both observations and parameters at the same time

Example: the distribution of the height of people, with likelihood $P(D\,|\,\theta, M)$ and prior $P(\theta\,|\,M)$

[Figure: the height distribution $p(L) = N(m, \sigma)$ over heights 150-200; the prior $p(m)$ over the mean height (150-200); and the prior $p(\sigma)$ over the standard deviation of the height (5-15).]

Page 20

Bayesian inference

If we want to update the probability of the parameters with new observations D

1. Choose a reasonable prior

2. Add the information from the data

3. Get the updated distributions of the parameters

(We often work with logarithms)

$$P(\theta\,|\,D, M) = \frac{P(D\,|\,\theta, M)\,P(\theta\,|\,M)}{P(D\,|\,M)} = \frac{P(D\,|\,\theta, M)\,P(\theta\,|\,M)}{\int P(D\,|\,\theta', M)\,P(\theta'\,|\,M)\,d\theta'}$$

Page 21

Bayesian inference

Example

[Figure: the prior $p(m\,|\,M)$ over the mean height (150-200); the posterior $p(m\,|\,B, M)$ after measuring 100 Belgian men; and the posterior $p(m\,|\,H, M)$ after measuring 100 Dutch men.]
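A minimal sketch of this update for a normal mean with known standard deviation (the conjugate normal-normal update; all numbers here are invented for illustration):

```python
import numpy as np

def update_mean(prior_m, prior_s, data, sigma=7.0):
    """Conjugate update of a N(prior_m, prior_s) prior on the mean m,
    given data with known observation noise sigma."""
    n = len(data)
    prec = 1.0 / prior_s**2 + n / sigma**2            # posterior precision
    post_s = np.sqrt(1.0 / prec)
    post_m = (prior_m / prior_s**2 + np.sum(data) / sigma**2) / prec
    return post_m, post_s

rng = np.random.default_rng(1)
prior_m, prior_s = 175.0, 10.0                        # vague prior on the mean height
belgian = rng.normal(178, 7, size=100)                # hypothetical measurements
dutch = rng.normal(183, 7, size=100)

print(update_mean(prior_m, prior_s, belgian))         # posterior p(m | B, M)
print(update_mean(prior_m, prior_s, dutch))           # posterior p(m | H, M)
```

With 100 observations, each posterior is much narrower than the prior and centered near the corresponding sample mean.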

Page 22

Marginalization

A major technique for working with probabilistic models is to introduce or remove a variable through marginalization wherever appropriate

If a variable Y can take only K mutually exclusive outcomes, we have

$$\sum_{k=1}^{K} P(Y=k) = 1, \qquad \sum_{k=1}^{K} P(X, Y=k) = P(X)$$

If the variables are continuous:

$$\int_{y} p(X, Y=y)\,dy = p(X)$$
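A small numerical check on a discrete joint distribution (the table values are arbitrary):

```python
import numpy as np

# Arbitrary joint distribution P(X, Y) over 2 values of X and 3 values of Y.
P_XY = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

P_X = P_XY.sum(axis=1)   # marginalize out Y: sum over its K outcomes
P_Y = P_XY.sum(axis=0)   # marginalize out X

print(P_X, P_X.sum())    # [0.5 0.5] 1.0
print(P_Y, P_Y.sum())    # [0.3 0.3 0.4] 1.0
```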

Page 23

Multinomial and Dirichlet distributions

Page 24

Multinomial distribution

Discrete distribution with K mutually exclusive outcomes, with probabilities $\theta_i$:

$$P(X=i) = \theta_i,\quad i = 1, \dots, K, \qquad \text{with } \theta = (\theta_1, \dots, \theta_K),\ 0 \le \theta_i \le 1,\ \sum_{i=1}^{K} \theta_i = 1$$

Example: die (K=6), DNA sequence (K=4), amino acid sequence (K=20)

For K=2 we have a Bernoulli variable (giving rise to a binomial distribution)

Page 25

The multinomial distribution gives the probability of the number of times that the different outcomes were observed:

$$P(N_1 = n_1, N_2 = n_2, \dots, N_K = n_K\,;\,\theta) = \frac{1}{M(n_1, \dots, n_K)} \prod_{i=1}^{K} \theta_i^{n_i}$$

with normalization factor

$$M(n_1, \dots, n_K) = \frac{\prod_{k=1}^{K} n_k!}{\left(\sum_{k=1}^{K} n_k\right)!}$$

The multinomial distribution is the natural distribution for the modeling of biological sequences
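A direct transcription of this formula, in log space via the gamma function for numerical stability (the counts and frequencies are arbitrary):

```python
import numpy as np
from scipy.special import gammaln

def multinomial_logpmf(n, theta):
    """log P(N = n; theta) for the multinomial distribution."""
    n, theta = np.asarray(n), np.asarray(theta)
    # log(1/M) = log N! - sum_k log n_k!
    log_inv_M = gammaln(n.sum() + 1) - gammaln(n + 1).sum()
    return log_inv_M + (n * np.log(theta)).sum()

counts = [3, 2, 2, 6]                    # e.g., counts of A, C, G, T
theta = [0.25, 0.25, 0.25, 0.25]
print(np.exp(multinomial_logpmf(counts, theta)))
```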

Page 26

Dirichlet distribution

Distribution over the region of the parameter space where

$$0 \le \theta_i \le 1,\ i = 1, \dots, K \qquad\text{and}\qquad \sum_{i=1}^{K} \theta_i = 1$$

The distribution has parameters $\alpha_i > 0,\ i = 1, \dots, K$

The Dirichlet distribution gives the probability of $\theta$:

$$\mathcal{D}(\theta\,;\,\alpha) = \frac{1}{Z(\alpha)} \prod_{i=1}^{K} \theta_i^{(\alpha_i - 1)}, \qquad Z(\alpha) = \int \prod_{i=1}^{K} \theta_i^{(\alpha_i - 1)}\,d\theta = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}$$

The distribution is like a 'dice factory': every draw is itself a probability vector
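The 'dice factory' intuition in code (the concentration parameters are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = [2.0, 2.0, 2.0, 2.0]          # arbitrary Dirichlet parameters (K=4)

# Each sample is itself a probability vector over {A, C, G, T}: a new 'die'.
for theta in rng.dirichlet(alpha, size=3):
    print(np.round(theta, 3), theta.sum())
```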

Page 27

Page 28

Dirichlet distribution

$Z(\alpha)$ is a normalization factor such that $\int \mathcal{D}(\theta\,;\,\alpha)\,d\theta = 1$; $\Gamma$ is the gamma function, the generalization of the factorial to real numbers:

$$\Gamma(n) = (n-1)!\ \text{ for integer } n, \qquad \Gamma(x+1) = x\,\Gamma(x)$$

The Dirichlet distribution is the natural prior for sequence analysis because it is conjugate to the multinomial distribution: if we have a Dirichlet prior and we update this prior with multinomial observations, the posterior will also have the form of a Dirichlet distribution. This is computationally very attractive.

Page 29

Page 30

Estimation of frequency matrices

Estimation on the basis of counts, e.g., the Position-Specific Scoring Matrix in PSI-BLAST

Example: matrix model of a local motif, estimated from aligned instances such as

GACGTGCTCGAG
CGCGTGAACGTG
CACGTG
...

[Figure: a count matrix with one row per nucleotide (A, C, G, T) and one column per motif position.]

Count the number of instances in each column

Page 31

If there are many aligned sites ($N \gg 1$), we can estimate the frequencies as

$$\theta_A = n_A/N,\quad \theta_C = n_C/N,\quad \theta_G = n_G/N,\quad \theta_T = n_T/N$$

This is the maximum likelihood estimate for $\theta$:

$$\theta^{ML} = \arg\max_{\theta} P(n\,|\,\theta), \qquad \text{with } P(n\,|\,\theta) = P(n_A, n_C, n_G, n_T\,|\,\theta_A, \theta_C, \theta_G, \theta_T) \text{ multinomial}$$
Page 32

Proof

We want to show that

$$P(n\,|\,\theta^{ML}) \ge P(n\,|\,\theta) \ \text{ for all } \theta$$

This is equivalent to

$$\log\big(P(n\,|\,\theta^{ML})\,/\,P(n\,|\,\theta)\big) \ge 0$$

Further,

$$\log\frac{P(n\,|\,\theta^{ML})}{P(n\,|\,\theta)} = \log\prod_i \left(\frac{\theta_i^{ML}}{\theta_i}\right)^{n_i} \quad\text{(definition of multinomial)}$$

$$= \sum_i n_i \log\frac{\theta_i^{ML}}{\theta_i} = N \sum_i \theta_i^{ML} \log\frac{\theta_i^{ML}}{\theta_i} \ge 0 \quad\text{(property of relative entropy)}$$

where we used $n_i = N\,\theta_i^{ML}$.
Page 33

Pseudocounts

If we have a limited number of counts, the maximum likelihood estimate will not be reliable (e.g., for symbols not observed in the data)

In such a situation, we can combine the observations with prior knowledge

Suppose we use a Dirichlet prior $\mathcal{D}(\theta\,;\,\alpha)$; let us compute the Bayesian update:

$$P(\theta\,|\,n) = \frac{P(n\,|\,\theta)\,\mathcal{D}(\theta\,;\,\alpha)}{P(n)}$$

Page 34

Bayesian update:

$$P(\theta\,|\,n) = \frac{P(n\,|\,\theta)\,\mathcal{D}(\theta\,;\,\alpha)}{P(n)} = \frac{1}{P(n)} \cdot \frac{1}{M(n)} \prod_{i=1}^{K} \theta_i^{n_i} \cdot \frac{1}{Z(\alpha)} \prod_{i=1}^{K} \theta_i^{(\alpha_i - 1)} = \frac{Z(\alpha + n)}{P(n)\,M(n)\,Z(\alpha)}\,\mathcal{D}(\theta\,;\,\alpha + n)$$

The factor $\dfrac{Z(\alpha + n)}{P(n)\,M(n)\,Z(\alpha)} = 1$ because both distributions are normalized, so

$$P(\theta\,|\,n) = \mathcal{D}(\theta\,;\,\alpha + n)$$

Computation of the posterior mean estimate, via the normalization integral $Z(\cdot)$:

$$\theta_i^{PME} = \int \theta_i\,\mathcal{D}(\theta\,;\,\alpha + n)\,d\theta = \frac{1}{Z(\alpha + n)} \int \theta_i \prod_k \theta_k^{(\alpha_k + n_k - 1)}\,d\theta = \frac{Z(\alpha + n + \delta_i)}{Z(\alpha + n)} = \frac{n_i + \alpha_i}{N + A}$$

with $\delta_i = (0, \dots, 0, 1, 0, \dots, 0)$ (a one in position $i$), $N = \sum_i n_i$ and $A = \sum_i \alpha_i$.

Page 35

Pseudocounts

The prior contributes to the estimation through pseudo-observations (pseudocounts):

$$\theta_i^{PME} = \frac{n_i + \alpha_i}{N + A}, \qquad \text{with } A = \sum_i \alpha_i$$

If few observations are available, then the prior plays an important role

If many observations are available, then the pseudocounts play a negligible role
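The estimator in code (the counts and the prior strength are chosen arbitrarily):

```python
import numpy as np

def pme(counts, alpha):
    """Posterior mean estimate with Dirichlet pseudocounts."""
    counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
    return (counts + alpha) / (counts.sum() + alpha.sum())

counts = np.array([0, 2, 7, 1])           # note: 'A' was never observed
alpha = np.ones(4)                        # uniform prior: one pseudocount per symbol

print(counts / counts.sum())              # ML: probability 0 for 'A'
print(pme(counts, alpha))                 # PME: small but nonzero for 'A'
```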

Page 36

Dirichlet mixture

Sometimes the observations are generated by a heterogeneous process (e.g., hydrophobic vs. hydrophilic domains in proteins)

In such situations, we should use different priors depending on the context

But we do not necessarily know the context beforehand

A possibility is the use of a Dirichlet mixture: the frequency parameter $\theta$ can be generated from $m$ different sources $S$, each with its own Dirichlet parameters $\alpha^k$:

$$P(\theta) = \sum_{k} P(\theta\,|\,S=k)\,P(S=k) = \sum_{k} q_k\,\mathcal{D}(\theta\,;\,\alpha^k)$$

Page 37

Dirichlet mixture

Posterior:

$$P(\theta\,|\,n) = \sum_{k} P(S=k\,|\,n)\,P(\theta\,|\,S=k, n) \quad\text{(disjunction)}$$

$$P(\theta\,|\,S=k, n) = \mathcal{D}(\theta\,;\,\alpha^k + n) \quad\text{(pseudocounts)}$$

Via Bayes' rule:

$$P(S=k\,|\,n) = \frac{P(n\,|\,S=k)\,P(S=k)}{\sum_{l} P(n\,|\,S=l)\,P(S=l)} = \frac{q_k\,P(n\,|\,S=k)}{\sum_{l} q_l\,P(n\,|\,S=l)}$$

$$P(S=k\,|\,n) = \frac{q_k\,Z(\alpha^k + n)\,/\,Z(\alpha^k)}{\sum_{l} q_l\,Z(\alpha^l + n)\,/\,Z(\alpha^l)}$$

Page 38

Dirichlet mixture

Posterior mean estimate:

$$\theta_i^{PME} = \sum_{k} P(S=k\,|\,n)\;\frac{n_i + \alpha_i^k}{N + A^k}, \qquad P(S=k\,|\,n) = \frac{q_k\,Z(\alpha^k + n)\,/\,Z(\alpha^k)}{\sum_{l} q_l\,Z(\alpha^l + n)\,/\,Z(\alpha^l)}$$

The different components of the Dirichlet mixture are first considered as separate pseudocount estimates $(n_i + \alpha_i^k)/(N + A^k)$

These components are then combined with weights $P(S=k\,|\,n)$ that depend on the likelihood of each Dirichlet component (see the sketch below)
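A compact sketch of this two-step computation; the mixture weights, component parameters, and counts are invented for illustration:

```python
import numpy as np
from scipy.special import gammaln

def log_Z(alpha):
    """log of the Dirichlet normalization Z(alpha) = prod_i Gamma(a_i) / Gamma(sum_i a_i)."""
    return gammaln(alpha).sum() - gammaln(alpha.sum())

def mixture_pme(n, q, alphas):
    """Posterior mean estimate under a Dirichlet mixture prior."""
    n = np.asarray(n, float)
    # log P(S=k | n) up to a constant: log q_k + log Z(alpha^k + n) - log Z(alpha^k)
    logw = np.array([np.log(qk) + log_Z(a + n) - log_Z(a)
                     for qk, a in zip(q, alphas)])
    w = np.exp(logw - logw.max())
    w /= w.sum()                                        # normalized P(S=k | n)
    # Combine the per-component pseudocount estimates with these weights
    comps = np.array([(n + a) / (n.sum() + a.sum()) for a in alphas])
    return w @ comps

n = np.array([1.0, 0.0, 0.0, 9.0])                      # observed counts
q = [0.5, 0.5]                                          # hypothetical mixture weights
alphas = [np.full(4, 1.0), np.array([5., 1., 1., 5.])]  # hypothetical components
print(mixture_pme(n, q, alphas))
```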

Page 39

Summary

The Cox-Jaynes axioms
Bayes' rule
Probabilistic models
  Maximum likelihood
  Maximum a posteriori
Bayesian inference
Multinomial and Dirichlet distributions
Estimation of frequency matrices
  Pseudocounts
  Dirichlet mixture