A simple algorithm to infer gene duplication and speciation events on a gene tree...

Preview:

Citation preview

A simple algorithm to infer gene duplication and speciation

events on a gene tree

生物資訊相關演算法 期末報告

學生:陳智豪 王秀綾 王緯誠 江志民 侯藹玲時間: 1 月 21 日 地點:中研院資訊所

C. M. Zmasek and S. R. Eddy, ( 2001 )Bioinformatics, 17(9): 821--828,

INTRODUCTION

The enormous amount of sequence data currently produced by the various genome project.

Many proteins belong to large superfamilies that consist of subfamilies with different biological function complicates.

Superfamily and subfamilySuperfamily 以物種的來源區分,也就是說,雖然是不同蛋白質間的胺基酸序列相似性程度不高,但是它們的結構與功能相近,顯示它們可能有共同的演化來源,便視為同一個Superfamily 。Subfamily 是用演化的相關性來歸類,通常蛋白質之間的胺基酸序列有大於 30% 的相同,便可視為有明顯的演化關連而屬於同一個subfamily 。值得注意的是,胺基酸序列的同質性高並不等於結構和/或功能的相似 。

Method for automated sequence function prediction

Pairwise sequence similarity such as BLAST (basic local alignment search tool).

Analyses using profile search algorithms such as HMMER.

Protein family databases such as Pfam and InterPro.

What is “phylogenomics”

To use this multiple alignment to infer amore specific function,as input for a phylogenetic tree analysis, and from the placement of the new sequence in the tree of known sequences.

Why using phylogenomics ? In many cases , the identification of homologs is not sufficient to make specific functional predictions , because not all homologs have the same function.

Paralogous vs orthologous

Two genes are said to be paralogous if they are derived from a duplication event, but orthologous if they are derived from a speciation event.

Paralogous vs orthologous

Paralogous vs orthologous

Gene trees vs species tree

COG database

Although the COG method is clear a major advance in identifying orthologous groups of genes , it is limited in its power because clustering is a way of classifying levels of similarity and is not an accurate method of inferring evolutionary relationships.

Phylogenomics and COG database

Phylogenomics : 是利用多基因序列比對,在找出基因

在演化樹中所在的位置。COG database :

基因序列分類方式是依據演化關係,而間接推論基因序列的相似度。

Algorithm

A simple algorithm to infer gene duplication and speciation on agene tree

Report:Wang Wei-Cheng

Gene duplication can be trivially inferred when a species contains two or more homologs belonging to the same gene family(fig1.G1)

Duplication

Duplication

G1

Hu

man

a H

um

an

rHu

man

bN

ematod

e a N

ematod

e rN

ematod

e bY

east a Y

east rY

east b

a subfamily b subfamily r subfamily

Trivial case

Trivial case

Algorithm

Due to gene loss or incomplete sampling of genes in partially sequenced genomes , not all duplications are detectable by simple redundancy in a gene tree(fig1.G2)

Duplication

G2

Hu

man

a N

ematod

e rN

ematod

e bY

east a

a subfamily b subfamily r subfamily

Trivial caseDuplication

nontrivial case

Algorithm

Notation define: G:gene tree S:species tree For any gene g in G,γ(g) sub-tree of G from

g For any specie s in S,σ(s) sub-tree of S

from s

G S

γ(g) σ(s)

Algorithm

Definition1: M:Mapping function from G to S…

Algorithm

Definition1: (G,S) is a rooted binary tree (gene,species)

1.   g ∊ G , let γ(g) a set of species occur from g.2.   s ∊ S , let σ(s) a set of species occur from s.3.   g ∊ G, x=M(g) ∊S, x satisfying

a.smallest (lowest)b.γ(g) ∊σ(x)……….= σ(M(g))  G S

σ(x)γ(g)

x

σ(x)γ(g)

g

γ(g) ∊σ(x)

Algorithm

•Example of γ(g) & σ(s) :

G1

Hu

man

a H

um

an

rHu

man

bN

ematod

e a N

ematod

e rN

ematod

e b Y

east r

Hu

ma

n Nem

atod

e Yea

st

Sg1

g3

g2 s2

s1γ(g2)={Nematode , Human} σ(s2)={Human, Nematode}

γ(g1)={Nematode , Human, Yeast }

σ(s2)={Human, Nematode, Yeast}

Algorithm

•Definition2:{g1,g2} is g’s child, if g is duplication if and only if M(g)=M(g1) or M(g)=M(g2)

Duplication

Duplication

G1

Hu

man

a H

um

an

rHu

man

bN

ematod

e a N

ematod

e rN

ematod

e bY

east a Y

east rY

east b

a subfamily b subfamily r subfamily

g= g1= g2= M(g)= M(g1)= M(g2)= duplication

Hu

ma

n Nem

atod

e Yea

st

S

Algorithm

•If we know all M(),this task take linear time O(n),traversal all tree with n genes.•Page first implement M(),he use brute force find all γ(g) and σ(s) ,then compare them , it takes O(n3)•To speed up, observe M(g)=LCA( M(g1) , M(g2) )

Algorithm

•Input:Rooted binary gene tree G,rooted binary species tree S of all species in G.•Output:G with “duplication” or “speciation” assigned to each of its internal nodes.•Initialization:

1.Number nodes of S in preorder traversal 2.For each external node g of G,set M(g) to the number of external node in S with the matching species name.•Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

Algorithm

•Input:Rooted binary gene tree G,rooted binary species tree S of all species in G.

A C B D A B C D

S1G1

Algorithm

A C B D A B C D

g3

g2

g1 S1G1

s3

s2

s1

•Output:G with “duplication” or “speciation” assigned to each of its internal nodes.

<duplication>

< speciation >

< speciation >

Algorithm

Initialization: 1.Number nodes of S in preorder traversal

2.For each external node g of G,set M(g) to the number of external node in

S with the matching species name.

A C B D A B C D

g3

g2

g1 S1G1

s3

s2

s1

Algorithm

Initialization: 1.Number nodes of S in preorder traversal

2.For each external node g of G,set M(g) to the number of external node in

S with the matching species name.

A C B D

g3

g2

g1G1

1

2

3

4 5 6 7

Algorithm

S1

s3

s2

s1 1

2

3

4 5 6 7

A B C D

Initialization: 1.Number nodes of S in preorder traversal

2.For each external node g of G,set M(g) to the number of external node in

S with the matching species name.

A C B D A B C D

g3

g2

g1 S1G1

s3

s2

s11

2

3

4 5 6 7

1

2

3

4 5 6 7

Algorithm

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

A C B D

Algorithm

Status:

Postorder

Starting node is ……. 3

g

4 2 7 1635

internal internal internal

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Starting node is ……. g = g1= g2=

33345

g

g2g1

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

a=M(g1)=M( )= = 4 b=M(g2)=M( )= = 6

45

g

g2g1

64

M(g1)

a

b

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

M(g2)

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

a=M(g1)=M( )= = 4 b=M(g2)=M( )= = 6

(4!=6) AND (4<6) then….

45

g

g2g1

64

a

b

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

a=M(g1)=M( )= = 4 b=M(g2)=M( )= = 6

(4!=6) AND (4<6) then…. b=parent(b)=parent( )= =2

45

g

64

6 2

a

b

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

a=M(g1)=M( )= = 4 b=parent(b)=parent( )= =2

(4!=2) AND (4>2) then….

4

g

64

2

a

b

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

a=M(g1)=M( )= = 4 b=parent(b)=parent( )= =2

(4!=2) AND (4>2) then…. a=parent(a)=parent( )= =3

4

g

64

2

a

b

4 3

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

a=parent(a)=parent( )= =3 b=parent(b)=parent( )= =2

(3!=2) AND (3>2) then….

g

6 2

a

b4 3

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

a=parent(a)=parent( )= =3 b=parent(b)=parent( )= =2

(3!=2) AND (3>2) then…. a=parent(a)=parent( )= =2

g

6 2a

b4 3

3 2

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

a=parent(a)=parent( )= =3 b=parent(b)=parent( )= =2

(2= =2) then….

g

6 2a

b3 2

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

a=parent(a)=parent( )= =3 b=parent(b)=parent( )= =2

(2= =2) then….Set M(g)=M( )=a =

g

6 2a

b3 2

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

3 2

M(g)

g2

g1

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

M(g)= M( )= M(g1)=M( )= M(g2)=M( )=

M(g)!=M(g1) AND M(g)!=M(g2)

g

46

2

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciationM(g)

g2

g1

M(g1)M(g2)

543

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

g is a speciation

Set speciation tag.3

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Next node is 2

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

g

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Next node is g= g1= g2=

2

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

g

63

2

g2

g1

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

g

g2

g1

a=M(g1)=M( )= = 2 b=M(g2)=M( )= = 5

36 5

2

ab

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

g

g2

g1

a=M(g1)=M( )= = 2 b=M(g2)=M( )= = 5

(2!=5) AND (2<5) then…. b=parent(b)=parent( )= =3

36 5

2

ab

5 3

b

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

g

a=M(g1)=M( )= = 2 b =parent(b)=parent( )= =3

(2!=3) AND (2<3) then….

352

ab 3

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

g

a=M(g1)=M( )= = 2 b =parent(b)=parent( )= =3

(2!=3) AND (2<3) then…. b=parent(b)=parent( )= =2

352

ab 3

3 2

b

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

g

a=M(g1)=M( )= = 2 b =parent(b)=parent( )= =2

(2= =2) then….

3 2

a 3 2

b

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

g

a=M(g1)=M( )= = 2 b =parent(b)=parent( )= =2

(2= =2) then…. set M(g)=M( )=a= =2

3 2

a 3 2

b

3 2

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

g

M(g)= M( )= M(g1)=M( )= M(g2)=M( )=

M(g)= =M(g1) AND M(g)!=M(g2)

2 225

36

g2

g1

Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

g

g is a duplication

Set duplication tag.2

< duplication >Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

g

Next node is 1

< duplication >Find LCA

A B C D

g3

g2

g1

S1

G1

s3

s2

1

2

3

4 5 6 7

1

2

3

4 5 6 7

A C B D

Algorithm

Status:

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

< speciation >

Next node is

by the same algorithm we got…….

1

< duplication >

< speciation >

Find LCA

Result…

g3

g2

g1

s3

s2

s11

2

3

4 5 6 7

1

2

3

4 5 6 7

G1 S1

< speciation >

< duplication >

< speciation >

Algorithm

Algorithmcomplexity analysis and implementation

A simple algorithm to infer gene duplication and speciation on agene tree

Report:Wang Shiou-Ling

Complexity Analysis

Input size: n, the number of genes species tree

Space complexity : O(n)Assumption: all of tree are binary tree

O(n), nodes of species treeStoring 2 trees(gene tree and species tree)

Internal nodes +leaves=(n-1)+n=2n-1Space complexity:O(4n) O(n)

Storing auxiliary variables (a,b):constant

Complexity AnalysisTime complexity : Given M only in O(n), but what about calculating MBrute force for overall time complexity:O(n3)Traverse node g=1:n

for node g=1:nfor node s=1:n

if γ(g) is in σ(s) then

assign M(g)=s

O(n2)

Complexity AnalysisLCA AlgorithmTime complexity would be reduced to O(n2)Initialization: O(n)

Initializing M for leaves O(n), using Hash Table to look up species name.Initializing S: O(n).

Recursion: W(n2)Best case: O(n)Worst case: balanced S, unbalanced S

Case Study

Case A:O(1) for M(g3) assignment

Start:M(A)=4 ,M(B)=5a=4, b=5While:1st a=4,b=5 b>a,b=3

1

2

3

4 5 6 7

g1

g2

g3

4 5 6 7

Case StudyCase A:O(1) for M(g3) assignment

2nd a=4,b=3 a>b,a=3a=3,b=3Break!M(g3)=3M(g3) M(A) M(B)Speciation

1

2

3

4 5 6 7

g1

g2

g3

4 5 6 7

Case StudyCase A:O(1) for M(g2) assignment

Start:M(g3)=3 ,M(C)=6a=3,b=6While:1st a=3,b=6 b>a,b=2

1

2

3

4 5 6 7

g1

g2

g3

4 5 6 7

Case StudyCase A:O(1) for M(g2) assignment

2nd a=3,b=2 a>b,a=2a=2,b=2Break!M(g2)=2

M(g2) M(A) M(B)SpeciationSo as g1

1

2

3

4 5 6 7

g1

g2

g3

4 5 6 7

Case StudyCase A: Another G for finding M in O(1)

1

2

3

4 5 6 74567

g=g3Start:M(A)=4 ,M(B)=5a=4, b=5While:1st a=4,b=5 b>a,b=3

Case Study

Case A: Another GDefinition : topology of G and S

1

2

3

4 5 6 74567

2nd a=4,b=3 a>b,a=3a=3,b=3Break!M(g3)=3M(g3) M(A) M(B)SpeciationSo as g2,g1

Case StudyCase B:O(log n) for M(g3) assignment

g1

g2

g3

1

2

3 45

6 7

Start:M(A)=3 ,M(C)=6a=3,b=6While:1st a=3,b=6b>a,b=52nd a=3,b=5b>a,b=1

3 46 7

Balanced S

Case Study

Case B:O(log n) for M(g3) assignment

g1

g2

g3

1

3 45

6 7

3rd a=3,b=1a>b,a=2,14th a=2,b=1a>b,a=1a=1,b=1Break!M(g3)=1

M(g3) M(A) M(B)speciation

3 46 7

2

Case StudyCase B:O(log n) for M(g2) assignment

Start:M(g3)=1 ,M(B)=4a=1,b=4While:1st a=1,b=4b>a,b=22nd a=1,b=2b>a,b=1a=1,b=1Break!M(g2)=1M(g2)=M(g3)duplication

g1

g2

g3

1

3 45

6773 46

2

So as g1

Case Study

Case C:O(n) for M(g) assignment

g1

g2

g3

1

2

3

4 5 6 7

•Unbalanced S Observation:For every gene in G should climb up to the root.So time complexity=O(n)

4567

Start:M(D)=7 ,M(C)=6a=7,b=6While:1st a=7,b=6a>b,a=1

Case Study

Case C:O(n) for M(g3) assignment

g1

g2

g3

1

3

4 5 6 7

2nd a=1,b=6b>a,b=23rd a=1,b=2b>a,b=24th a=1,b=2b>a,b=1a=1,b=1 Break!M(g3)=1M(g3) M(D) M(C)speciation

4567

2

Case Study

Case C:O(n) for M(g2) assignment

g1

g2

g3

1

3

4 5 6 7

Start:M(g3)=1 ,M(B)=6a=1,b=6While:1st a=1,b=6b>a,b=22nd a=1,b=2b>a,b=1a=1,b=1Break!M(g2)=1M(g2)=M(g3)Duplication

4567

2

Case Study

Case C:O(n) for M(g1) assignment

g1

g2

g3

1

4 5 6 7

Start:M(g2)=1 ,M(A)=4a=1,b=4While:1st a=1,b=4b>a,b=32nd a=1,b=3b>a,b=2a=1,b=1Break!M(g1)=1M(g1)=M(g2)Duplication

4567

23

g1

g2

g3

1

2

3

4 5 6 77 6 5 4

3

2

1

Tracing parent

Improvement

Little Trick:Would not have crossly Mapping

If one of the children maps to root…

mapping while initialization

Table

LCA

2 3 4 5 6 7

1 1 1 1 1 1 1

2 2 2 2 2 1

3 3 3 2 1

4 3 2 1

5 2 1

6 1

ImprovementPreprocessing

Find LCA in O(1):

Schieber and Vishkin /ja’ja’By direct arithmetic.

Preprocessing in O(n).

Calculating M in in O(nα(n,n)):α(n,n): inverse of Ackermann function

Eulenstein algorithm:Using data structure similar to disjoint-set forest.

Implementation

A Tree Viewer (ATV)

duplications

speciation

numbers bootstrap

values

numbers EC numbers

Implementation

Material:

gene tree: fibrinogen beta and gamma chain

Pfam AC:PF00147

species tree:the Tree of Life project

Run

ImplementationBoth algorithms were implemented in Java.

SDI (Speciation vs Duplication Inference)Eulenstein’s algorithm

PreprocessingDeleting external nodes in S that have no genes in G

Timings reportedAverage of three runs on a single processor 500MHz P-III system running Red Hat Linux 6.0 and Sun Microsystems’ Java 1.2 SDK for Linux

Results – Synthetic DataSynthetic Data Sets– exercise the worst-case behavior.

Synthesized gene trees with n genes

M(g) for every internal node would map to the root of the corresponding species tree with n species.

The situation in Fig. 3B and 3C.

Balanced SO(n logn)

Unbalanced SO(n2)

worst-case behavior

Syn. Data with Balanced S Using SDI algorithm

Syn. Data with UnBalanced S

Using SDI algorithm

Syn. Data with Balanced/UnBalanced S

Using Eulenstein’s algorithm

Results – Synthetic DataFor a balanced species tree, Fig. 3B, both algorithms have running times that scale nearly linearly in tree size. O(n logn)For maximally unbalanced species tree, Fig. 3C, we confirm our algorithm, SDI, worst case O(n2) behavior.Over about n=550 genes and species, our implementation of Eulenstein’s algorithm outperforms SDI.If only the calculation of M(g) is compared (excluding all preprocessing and initialization steps), Eulenstein’s algorithm outperforms SDI for n larger than about 200 taxa.

Results – Real DataReal Data

2478 multiple sequence alignments from the ‘full’ alignments (as opposed to the smaller ‘seed’ alignments) in the protein family database Pfam (release 5.5; Bateman et al., 2000)Alignments were removed

not originating from the curated SWISS-PROT database (Bairoch and Apweiler, 2000) not from species in our species tree (see below)

Alignments were discardedWith fewer than four or more than 1000 sequences

Leaving 1750 alignments

Results – Real DataColumns containing one or more gap symbols were removed from the alignment if the resulting alignment after this filtering was at least 100 amino acids in length.Construct the Gene Tree

Pairwise distances were calculated based on the Dayhoff PAM matrix.Using the program PROTDIST from Felsenstein’s PHYLIP (1993) A neighbor-joining tree was constructed using the program NEIGHBOR from PHYLIP.

Midpoint rooting method (Swofford et al., 1996)

Results – Real DataConstruct the Species Tree

A single master species tree was compiled manually, containing 200 of the most commonly encountered species in Pfam.The topology of this species tree is based on the taxonomy database at NCBI (http://www.ncbi.nlm.nih.gov/Taxonomy/tax.html/), the Tree of Life project (Madison and Madison, at http://phylogeny.arizona.edu/tree/phylogeny.html )This tree is available at http://www.genetics.wustl.edu/eddy/forester/

Real Data using Eulenstein’s algorithm

Real Data using SDI algorithm

Results – Real DataThe average case behavior of SDI algorithm on real data sets is approximately O(n).

Worst case is not realized.

Analysis exampleThe fibrinogen beta and gamma chain Pfam family is presented in figure 5.The fibrinogen sequence family contains fibrinogen alpha, beta and gamma chains (sequences with FIBA, FIBB, FIBG prefixes).

Analysis exampleEach chain type appears on the tree as a paralogous subtree.A special case is FIBH_HUMAN (fibrinogen gamma-B chain)

It appears to be the result of alternative splicing of the human gamma chain gene.

Sequences with TENA prefixes (such as Tenascins)The fibrinogen family also contains various proteins probably involved in adhesion, which share the fibrinogen-like domain with the fibrinogen sequences.

Analysis exampleInteresting case—FIBX_MOUSE

A mouse enzyme with prothrombinase activityIs similar to fibrinogen beta and gamma chains (Parr er al. 1995)

The node connecting FIBX_MOUSE to the rest of the tree is inferred to be a duplication event.Since the placement of FIBX_MOUSE contradicts the species tree and hence FIBX_MOUSE is inferred to be paralogous to the fibrinogen beta chain subfamily (FIBB).In contrast, BLAST analysis of the FIBX_MOUSE sequence could easily have misannotated it as the mouse fibrinogen beta chain.

Motivation: Orthologous sequences are more

reliable predictors of a new protein’s function than paralogous sequences

Goal: Automate phylogenomics using

explicit phylogenetic inference.

A simple algorithm to infer gene duplication and

speciation events on a gene tree

Performance

Difficulties for practical useRootedBiological correct

Reliability

Discussion

Performance

The comparison of asymptotic worst-case running time may be misleadingOur algorithm is O(n2)Empirically outperforms Eulenstein’s (1998) - more complex, asymptotic bound close to O(n)

Worst case : pathological M(g) for every internal node points to the rootNo two genes from the same species, no. in S is O(n), S is maximally unbalancedIn real data, O(n)

Performance

The improved asymptotic bound will not be worth the cost of the extra complexity nor the extra computational overhead

Practical use

Use SDI as part of a system for

automating phylogenomics

( forester ) .

Assumption : the gene tree and

species tree are both properly

rooted and biological correct.

Rooted

Rooted properly

Molecular clock

•No. of substitution time back to common ancestor , constant rate

•Dubious in sequences family

•In paralogous sequences family, depend on duplication inference

•Minimize the dissimilarity between gene tree and species tree

Outgroup

Molecular Clock

Reliability

Problematic : duplication to predict function

Multifurcations : lack of resolution

Limitation of algorithm

Concept of orthology and paralogy

Reliability sampling

BootstrapMCMC ( Markov Chain Monte Carlo )Integrate orthology assignments over tree spaceProbability, confidence valueRank the inferred orthologyAlso help to root

….

Bootstrap -- resampling

Recommended