
Algebraic Statistics for Computational Biology

Lior Pachter and Bernd Sturmfels

Ch.5: Parametric Inference R. Mihaescu

Presentation: Angelina Vidali

Algebraic & Geometric Algorithms in Molecular Biology

Instructor: I. Emiris


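Tropical arithmetic

Convenient algebraic structure for stating dynamic programming algorithms: the tropical semiring $(\mathbb{R} \cup \{\infty\}, \oplus, \odot)$, where

$$x \oplus y := \min(x, y), \qquad x \odot y := x + y$$

Natural higher-dimensional generalization: the polytope algebra $(\mathcal{P}_d, \oplus, \odot)$, where

$$P \oplus Q := \mathrm{conv}(P \cup Q) \ \text{(convex hull)}, \qquad P \odot Q := P + Q \ \text{(Minkowski sum)}$$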
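The two tropical operations can be tried out directly. The following is a minimal sketch, assuming plain Python floats as semiring elements; the helper names `t_add`, `t_mul` and `t_matvec` are illustrative and do not come from the slides.

```python
import math

INF = math.inf  # neutral element of tropical addition: min(x, INF) == x


def t_add(x, y):
    """Tropical sum: the minimum of x and y."""
    return min(x, y)


def t_mul(x, y):
    """Tropical product: the ordinary sum of x and y."""
    return x + y


def t_matvec(A, v):
    """Tropical matrix-vector product: entry i is min_j (A[i][j] + v[j])."""
    result = []
    for row in A:
        acc = INF
        for a, b in zip(row, v):
            acc = t_add(acc, t_mul(a, b))
        result.append(acc)
    return result


A = [[0, 3], [2, 1]]
v = [5, 1]
print(t_matvec(A, v))  # [4, 2]

# Tropical multiplication distributes over tropical addition.
print(t_mul(3, t_add(4, 7)) == t_add(t_mul(3, 4), t_mul(3, 7)))  # True
```

A dynamic programming recursion that uses only sums and products can be rerun with these two functions in place of `+` and `*`, which is the sense in which the semiring is convenient.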


Inference

From observed random variables Y1 = σ1,…,Yn = σn we want to infer values for the hidden random variables X1,…,Xm.

Observation: σ1,…,σn: known biological data.

Hidden variables X1,…,Xm: unknown biological data, e.g.:

• How do two sequences align?

Model parameters give transition probabilities $p_{h\sigma}$: hidden state h → observed state σ.

MAP estimation: given an observation σ1,…,σn, which is the most probable explanation X1 = h1,…,Xm = hm?


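Hidden Markov Model (HMM)

Observation: σ1,…,σn

We want to compute an explanation for the observation: the sequence h1,…,hm which yields the maximum a posteriori probability (MAP):

$$\max_{h_1,\dots,h_m} P(X_1 = h_1, \dots, X_m = h_m,\; Y_1 = \sigma_1, \dots, Y_n = \sigma_n)$$

We can efficiently compute the marginal probabilities:

$$p_{\sigma_1 \cdots \sigma_n} = \sum_{h_1,\dots,h_m} P(X_1 = h_1, \dots, X_m = h_m,\; Y_1 = \sigma_1, \dots, Y_n = \sigma_n)$$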


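Computation of the marginal probabilities:

$$p_\sigma = \sum_{h_1,\dots,h_n} p_{h_1\sigma_1}\, p'_{h_1h_2}\, p_{h_2\sigma_2} \cdots p'_{h_{n-1}h_n}\, p_{h_n\sigma_n}$$

Here the $p'_{h_ih_{i+1}}$ are the transition probabilities of the Markov chain on the hidden states, and the $p_{h_i\sigma_i}$ are the independent output probabilities.

$p_\sigma$ has the decomposition

$$p_\sigma = \sum_{h_n} p_{h_n\sigma_n} \Bigl( \sum_{h_{n-1}} p'_{h_{n-1}h_n}\, p_{h_{n-1}\sigma_{n-1}} \Bigl( \cdots \Bigl( \sum_{h_1} p'_{h_1h_2}\, p_{h_1\sigma_1} \Bigr) \cdots \Bigr) \Bigr)$$

which gives the “Forward algorithm”.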
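The nested sums can be evaluated from the inside out. The sketch below is one way to do that, assuming states and symbols are encoded as integers, `p_trans[h][h2]` holds the transition probabilities $p'$ and `p_out[h][s]` the output probabilities $p$. These names and the numeric values are made up for illustration. The sum is computed exactly as written on the slide, without an initial-distribution factor.

```python
from itertools import product


def forward(p_trans, p_out, sigma):
    """Evaluate p_sigma by the nested sums, using O(n * l^2) operations."""
    l = len(p_trans)
    # alpha[h] collects all partial products ending in hidden state h.
    alpha = [p_out[h][sigma[0]] for h in range(l)]
    for s in sigma[1:]:
        alpha = [
            p_out[h2][s] * sum(alpha[h1] * p_trans[h1][h2] for h1 in range(l))
            for h2 in range(l)
        ]
    return sum(alpha)


def brute_force(p_trans, p_out, sigma):
    """Evaluate the same sum term by term over all l^n hidden sequences."""
    l, n = len(p_trans), len(sigma)
    total = 0.0
    for h in product(range(l), repeat=n):
        term = p_out[h[0]][sigma[0]]
        for t in range(1, n):
            term *= p_trans[h[t - 1]][h[t]] * p_out[h[t]][sigma[t]]
        total += term
    return total


# Hypothetical two-state model with two output symbols.
p_trans = [[0.7, 0.3], [0.4, 0.6]]
p_out = [[0.9, 0.1], [0.2, 0.8]]
sigma = [0, 1, 1, 0]

print(abs(forward(p_trans, p_out, sigma) - brute_force(p_trans, p_out, sigma)) < 1e-12)
```

Both functions return the same number; the difference is that `forward` reuses partial sums, while `brute_force` enumerates every hidden sequence.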


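Viterbi algorithm

The Viterbi algorithm is the tropicalization of the problem of computing $p_\sigma$.

Tropicalization: $u_{ij} = -\log(p'_{ij})$, $v_{ij} = -\log(p_{ij})$

$$\min_{h_1,\dots,h_n} \bigl( v_{h_1\sigma_1} + u_{h_1h_2} + v_{h_2\sigma_2} + \cdots + u_{h_{n-1}h_n} + v_{h_n\sigma_n} \bigr)$$

We can now efficiently find an explanation h1,…,hm for the observation σ1,…,σn using the recursion:

$$\min_{h_n} \Bigl( v_{h_n\sigma_n} + \min_{h_{n-1}} \Bigl( u_{h_{n-1}h_n} + v_{h_{n-1}\sigma_{n-1}} + \cdots + \min_{h_1} \bigl( u_{h_1h_2} + v_{h_1\sigma_1} \bigr) \cdots \Bigr) \Bigr)$$

It is again the Forward algorithm, now carried out in tropical arithmetic.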
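The recursion can be sketched in code by replacing sums with minima and products with sums, and recording which state attains each minimum. The version below is a sketch under the same assumptions as before: integer-coded states and symbols, `u[h1][h2]` for the tropicalized transition parameters and `v[h][s]` for the tropicalized output parameters, all with illustrative values.

```python
import math
from itertools import product


def path_cost(u, v, sigma, h):
    """Tropical monomial of one hidden sequence h: a sum of v and u terms."""
    cost = v[h[0]][sigma[0]]
    for t in range(1, len(sigma)):
        cost += u[h[t - 1]][h[t]] + v[h[t]][sigma[t]]
    return cost


def viterbi(u, v, sigma):
    """Return (minimal cost, one hidden sequence attaining it)."""
    l = len(u)
    cost = [v[h][sigma[0]] for h in range(l)]
    back = []  # back[t][h2] = best predecessor of state h2 at step t + 1
    for s in sigma[1:]:
        new_cost, ptr = [], []
        for h2 in range(l):
            best = min(range(l), key=lambda h1: cost[h1] + u[h1][h2])
            new_cost.append(cost[best] + u[best][h2] + v[h2][s])
            ptr.append(best)
        cost = new_cost
        back.append(ptr)
    h = min(range(l), key=lambda k: cost[k])
    best_cost = cost[h]
    path = [h]
    for ptr in reversed(back):  # follow the back-pointers to recover the path
        h = ptr[h]
        path.append(h)
    path.reverse()
    return best_cost, path


# Hypothetical two-state model, tropicalized entry by entry.
p_trans = [[0.7, 0.3], [0.4, 0.6]]
p_out = [[0.9, 0.1], [0.2, 0.8]]
u = [[-math.log(x) for x in row] for row in p_trans]
v = [[-math.log(x) for x in row] for row in p_out]
sigma = [0, 1, 1, 0]

best_cost, path = viterbi(u, v, sigma)
exhaustive = min(path_cost(u, v, sigma, h) for h in product(range(2), repeat=len(sigma)))
print(abs(best_cost - exhaustive) < 1e-12)
```

The returned path is one explanation; when several hidden sequences tie for the minimum, this sketch reports only one of them.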


Pair Hidden Markov Model (pHMM)

The algebraic statistical model for sequence alignment, known as the pair hidden Markov model, is the image of the map

[formula missing in the source]

where $\mathcal{A}_{n,m}$ is the set of all alignments of the sequences $\sigma^1$, $\sigma^2$ of lengths n and m.


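• The Needleman-Wunsch algorithm for finding the shortest path in the alignment graph is the tropicalization of the pair hidden Markov model for sequence alignment.

Example: n = 5, m = 4, with sequences gttta and gtgc. One alignment:

gttta-
gt--gc

[Figure: alignment graph with g t t t a along one axis and g t g c along the other]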
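The shortest-path computation in the alignment graph can be sketched as a min-plus dynamic program. The edge weights used here (0 for a match, 1 for a mismatch, 1 for a gap) are an assumption chosen for illustration; the slides do not fix a scoring scheme.

```python
def needleman_wunsch(s1, s2, mismatch=1, gap=1):
    """Minimal alignment cost, computed with min (tropical sum) and + (tropical product)."""
    n, m = len(s1), len(s2)
    # D[i][j] = cheapest path from the corner (0, 0) to node (i, j) of the alignment graph.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else mismatch
            D[i][j] = min(
                D[i - 1][j - 1] + sub,  # diagonal edge: match or mismatch
                D[i - 1][j] + gap,      # gap in the second sequence
                D[i][j - 1] + gap,      # gap in the first sequence
            )
    return D[n][m]


print(needleman_wunsch("gttta", "gtgc"))  # 3
```

With these unit weights the result is the edit distance between the two strings. Other weights correspond to other choices of the tropicalized model parameters.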


The polytope propagation algorithm

• Tropical sum-product algorithm in general fashion.

f is the density function for a statistical model:

$$f(p) = \sum_{i=1}^{d} p_1^{e_{i1}} \cdots p_k^{e_{ik}}$$

From the d monomials, find the one that attains

$$\max_{h_1,\dots,h_m} P(X_1 = h_1, \dots, X_m = h_m,\; Y_1 = \sigma_1, \dots, Y_n = \sigma_n)$$

Solution:

• Tropicalization: $w_i = -\log p_i$

• Computation in the polytope algebra
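To make "computation in the polytope algebra" concrete, the sketch below evaluates a small polynomial with each variable replaced by a point, `+` replaced by the convex hull of the union and `*` replaced by the Minkowski sum. It is a two-dimensional toy under stated assumptions: polytopes are stored as vertex lists, the helper names `hull`, `p_add`, `p_mul` and `monomial` are illustrative, and the polynomial is expanded monomial by monomial for clarity, whereas the algorithm on the slide would follow the sum-product recursion instead.

```python
def cross(o, a, b):
    """Twice the signed area of the triangle o, a, b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull(points):
    """Vertices of the convex hull in the plane (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def p_add(P, Q):
    """Polytope sum: convex hull of the union."""
    return hull(list(P) + list(Q))


def p_mul(P, Q):
    """Polytope product: Minkowski sum."""
    return hull([(a[0] + b[0], a[1] + b[1]) for a in P for b in Q])


ONE = [(0, 0)]                    # neutral element of the polytope product
P1, P2 = [(1, 0)], [(0, 1)]       # the variables p1 and p2 as single points


def monomial(e1, e2):
    """Polytope of p1^e1 * p2^e2, built by repeated products."""
    P = ONE
    for _ in range(e1):
        P = p_mul(P, P1)
    for _ in range(e2):
        P = p_mul(P, P2)
    return P


# f = p1^3 + p1^2 p2^2 + p1 p2^2 + p1 + p2^4, evaluated in the polytope algebra.
newton = []  # the empty polytope is neutral for the polytope sum
for e in [(3, 0), (2, 2), (1, 2), (1, 0), (0, 4)]:
    newton = p_add(newton, monomial(*e))

print(sorted(newton))  # [(0, 4), (1, 0), (2, 2), (3, 0)]
```

The result is the Newton polytope of the polynomial; the exponent vector (1, 2) drops out because it is not a vertex.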


Density function for a statistical model:

$$f(p_1,p_2) = p_1^3 + p_1^2p_2^2 + p_1p_2^2 + p_1 + p_2^4$$

• Find an explanation

• Find the index j of the monomial with maximal value

Tropicalization: $w_i = -\log p_i$

• Find the index j of the monomial that minimizes the function $e_j \cdot w$:

$$\min(3w_1,\; 2w_1 + 2w_2,\; w_1 + 2w_2,\; w_1,\; 4w_2)$$
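The equivalence between the largest monomial and the smallest linear form can be checked numerically. This is a small sketch using the exponent vectors of the example polynomial; the function names and the two parameter choices are illustrative assumptions.

```python
import math

# Exponent vectors e_j of p1^3, p1^2 p2^2, p1 p2^2, p1, p2^4 (indices 0..4).
EXPONENTS = [(3, 0), (2, 2), (1, 2), (1, 0), (0, 4)]


def largest_monomial(p):
    """Index j of the monomial with maximal value at the parameters p."""
    return max(
        range(len(EXPONENTS)),
        key=lambda j: math.prod(x ** e for x, e in zip(p, EXPONENTS[j])),
    )


def tropical_minimizer(p):
    """Index j minimizing e_j . w, where w_i = -log p_i."""
    w = [-math.log(x) for x in p]
    return min(
        range(len(EXPONENTS)),
        key=lambda j: sum(e * wi for e, wi in zip(EXPONENTS[j], w)),
    )


for p in [(0.5, 0.9), (0.9, 0.5)]:
    print(p, largest_monomial(p), tropical_minimizer(p))
# (0.5, 0.9) 4 4
# (0.9, 0.5) 3 3
```

Since $-\log$ is decreasing and turns products into sums, the two indices agree whenever the maximum is attained by a single monomial.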


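Explanations are vertices of the Newton polytope of f

$$f(p_1,p_2) = p_1^3 + p_1^2p_2^2 + p_1p_2^2 + p_1 + p_2^4$$

We find a point for each exponent vector of a monomial.

[Figure: the Newton polytope of f, with points labeled by monomials such as $p_1^3$ and $p_1$]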
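As a worked check, derived here from the exponents of f rather than read off the slide's figure, the five exponent vectors and the resulting vertices are:

```latex
\begin{aligned}
&\text{exponent vectors: } (3,0),\ (2,2),\ (1,2),\ (1,0),\ (0,4) \\
&(1,2) = \tfrac{1}{3}(1,0) + \tfrac{1}{3}(2,2) + \tfrac{1}{3}(0,4) \\
&\text{vertices of the Newton polytope: } (1,0),\ (3,0),\ (2,2),\ (0,4)
\end{aligned}
```

The point (1,2) is a convex combination of three other points, so for every w the value $w_1 + 2w_2$ is an average of three other values and can never be the unique minimum. The monomial $p_1p_2^2$ is therefore never the unique explanation.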


Normal fan

• The normal fan partitions the parameter space into regions such that the explanation(s) for all sets of parameters in a given region are given by the polytope vertex (or face) associated with that region.


Parametric MAP estimation problem

• Local: given a choice of parameters, determine the set of all parameters with the same MAP estimate.

• Solution: Computation of the normal cone of the Newton polytope at the corresponding vertex.

• Global: find a partition of the space of parameters such that any two parameters lie in the same part iff they yield the same MAP estimate.

• Solution: Computation of the normal fan of the Newton polytope.
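Both problems reduce to linear inequalities once the vertices are known. The sketch below uses the four vertices of the Newton polytope of the example polynomial $p_1^3 + p_1^2p_2^2 + p_1p_2^2 + p_1 + p_2^4$ and assumes the minimization convention of the tropicalization: w lies in the normal cone of a vertex v when $v \cdot w \le u \cdot w$ for every other vertex u. The function names are illustrative.

```python
# Vertices of the Newton polytope of p1^3 + p1^2 p2^2 + p1 p2^2 + p1 + p2^4.
VERTICES = [(1, 0), (3, 0), (2, 2), (0, 4)]


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def explanation(w):
    """Vertex minimizing e . w (the MAP estimate for tropical parameters w)."""
    return min(VERTICES, key=lambda e: dot(e, w))


def in_normal_cone(v, w):
    """Local problem: does w satisfy v . w <= u . w for all vertices u?"""
    return all(dot(v, w) <= dot(u, w) for u in VERTICES)


def same_map_estimate(w1, w2):
    """Global problem, pointwise: do w1 and w2 select the same vertex?"""
    return explanation(w1) == explanation(w2)


print(explanation((1, 1)))                 # (1, 0)
print(explanation((1, 0.1)))               # (0, 4)
print(in_normal_cone((1, 0), (2, 1)))      # True
print(same_map_estimate((1, 1), (2, 1)))   # True
```

The normal cone of a vertex is the solution set of the inequalities tested in `in_normal_cone`, and the collection of these cones over all vertices and faces is the normal fan. Parameters on a boundary between two cones have more than one explanation, and `explanation` then returns only the first minimizer.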