A simple algorithm to infer gene duplication and speciation events on a gene tree...

A simple algorithm to infer gene duplication and speciation

events on a gene tree

生物資訊相關演算法期末報告

學生：陳智豪王秀綾王緯誠江志民侯藹玲時間： 1 月 21 日地點：中研院資訊所

C. M. Zmasek and S. R. Eddy, （ 2001 ）Bioinformatics, 17(9): 821--828,

INTRODUCTION

The enormous amount of sequence data currently produced by the various genome project.

Many proteins belong to large superfamilies that consist of subfamilies with different biological function complicates.

Superfamily and subfamilySuperfamily 以物種的來源區分，也就是說，雖然是不同蛋白質間的胺基酸序列相似性程度不高，但是它們的結構與功能相近，顯示它們可能有共同的演化來源，便視為同一個Superfamily 。Subfamily 是用演化的相關性來歸類，通常蛋白質之間的胺基酸序列有大於 30% 的相同，便可視為有明顯的演化關連而屬於同一個subfamily 。值得注意的是，胺基酸序列的同質性高並不等於結構和／或功能的相似。

Method for automated sequence function prediction

Pairwise sequence similarity such as BLAST (basic local alignment search tool).

Analyses using profile search algorithms such as HMMER.

Protein family databases such as Pfam and InterPro.

What is “phylogenomics”

To use this multiple alignment to infer amore specific function,as input for a phylogenetic tree analysis, and from the placement of the new sequence in the tree of known sequences.

Why using phylogenomics ？ In many cases , the identification of homologs is not sufficient to make specific functional predictions , because not all homologs have the same function.

Paralogous vs orthologous

Two genes are said to be paralogous if they are derived from a duplication event, but orthologous if they are derived from a speciation event.

Paralogous vs orthologous

Gene trees vs species tree

COG database

Although the COG method is clear a major advance in identifying orthologous groups of genes , it is limited in its power because clustering is a way of classifying levels of similarity and is not an accurate method of inferring evolutionary relationships.

Phylogenomics and COG database

Phylogenomics ：是利用多基因序列比對，在找出基因

在演化樹中所在的位置。COG database ：

基因序列分類方式是依據演化關係，而間接推論基因序列的相似度。

Algorithm

A simple algorithm to infer gene duplication and speciation on agene tree

Report:Wang Wei-Cheng

Gene duplication can be trivially inferred when a species contains two or more homologs belonging to the same gene family(fig1.G1)

Duplication

ematod

east a Y

east rY

east b

a subfamily b subfamily r subfamily

Trivial case

Algorithm

Due to gene loss or incomplete sampling of genes in partially sequenced genomes , not all duplications are detectable by simple redundancy in a gene tree(fig1.G2)

Duplication

ematod

east a

Trivial caseDuplication

nontrivial case

Algorithm

Notation define: G:gene tree S:species tree For any gene g in G,γ(g) sub-tree of G from

g For any specie s in S,σ(s) sub-tree of S

from s

γ(g) σ(s)

Algorithm

Definition1: M:Mapping function from G to S…

Algorithm

Definition1: (G,S) is a rooted binary tree (gene,species)

1. g ∊ G , let γ(g) a set of species occur from g.2. s ∊ S , let σ(s) a set of species occur from s.3. g ∊ G, x=M(g) ∊S, x satisfying

a.smallest (lowest)b.γ(g) ∊σ(x)……….= σ(M(g)) G S

σ(x)γ(g)

γ(g) ∊σ(x)

Algorithm

•Example of γ(g) & σ(s) :

ematod

east r

s1γ(g2)={Nematode , Human} σ(s2)={Human, Nematode}

γ(g1)={Nematode , Human, Yeast }

σ(s2)={Human, Nematode, Yeast}

Algorithm

•Definition2:{g1,g2} is g’s child, if g is duplication if and only if M(g)=M(g1) or M(g)=M(g2)

Duplication

ematod

east a Y

east rY

east b

g= g1= g2= M(g)= M(g1)= M(g2)= duplication

Algorithm

•If we know all M(),this task take linear time O(n),traversal all tree with n genes.•Page first implement M(),he use brute force find all γ(g) and σ(s) ,then compare them , it takes O(n3)•To speed up, observe M(g)=LCA( M(g1) , M(g2) )

Algorithm

•Input:Rooted binary gene tree G,rooted binary species tree S of all species in G.•Output:G with “duplication” or “speciation” assigned to each of its internal nodes.•Initialization:

1.Number nodes of S in preorder traversal 2.For each external node g of G,set M(g) to the number of external node in S with the matching species name.•Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

Algorithm

•Input:Rooted binary gene tree G,rooted binary species tree S of all species in G.

A C B D A B C D

Algorithm

A C B D A B C D

g1 S1G1

•Output:G with “duplication” or “speciation” assigned to each of its internal nodes.

< speciation >

Algorithm

Initialization: 1.Number nodes of S in preorder traversal

2.For each external node g of G,set M(g) to the number of external node in

S with the matching species name.

A C B D A B C D

g1 S1G1

Algorithm

A C B D

4 5 6 7

Algorithm

4 5 6 7

A B C D

A C B D A B C D

g1 S1G1

4 5 6 7

Algorithm

A B C D

4 5 6 7

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciation

A C B D

Algorithm

Status:

Postorder

Starting node is ……. 3

4 2 7 1635

internal internal internal

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

Starting node is ……. g = g1= g2=

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

a=M(g1)=M( )= = 4 b=M(g2)=M( )= = 6

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

a=M(g1)=M( )= = 4 b=M(g2)=M( )= = 6

(4!=6) AND (4<6) then….

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

a=M(g1)=M( )= = 4 b=M(g2)=M( )= = 6

(4!=6) AND (4<6) then…. b=parent(b)=parent( )= =2

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

a=M(g1)=M( )= = 4 b=parent(b)=parent( )= =2

(4!=2) AND (4>2) then….

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

a=M(g1)=M( )= = 4 b=parent(b)=parent( )= =2

(4!=2) AND (4>2) then…. a=parent(a)=parent( )= =3

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

a=parent(a)=parent( )= =3 b=parent(b)=parent( )= =2

(3!=2) AND (3>2) then….

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

(3!=2) AND (3>2) then…. a=parent(a)=parent( )= =2

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

(2= =2) then….

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

(2= =2) then….Set M(g)=M( )=a =

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

M(g)= M( )= M(g1)=M( )= M(g2)=M( )=

M(g)!=M(g1) AND M(g)!=M(g2)

Recursion: Visit each internal node g of G in postorder traversal set (a,b)=(M(g1) , M(g2)) while (a!=b) { if (a>b) a= parent(a) else b=parent(b) } set M(g)=a set M(g) = if (M(g)==M(g1)) or(M(g)==M(g2)) g is a duplication else g is a speciationM(g)

M(g1)M(g2)

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

g is a speciation

Set speciation tag.3

< speciation >

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

Next node is 2

< speciation >

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

Next node is g= g1= g2=

< speciation >

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

< speciation >

a=M(g1)=M( )= = 2 b=M(g2)=M( )= = 5

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

< speciation >

a=M(g1)=M( )= = 2 b=M(g2)=M( )= = 5

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

< speciation >

a=M(g1)=M( )= = 2 b =parent(b)=parent( )= =3

(2!=3) AND (2<3) then….

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

< speciation >

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

< speciation >

(2= =2) then….

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

< speciation >

(2= =2) then…. set M(g)=M( )=a= =2

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

< speciation >

M(g)= M( )= M(g1)=M( )= M(g2)=M( )=

M(g)= =M(g1) AND M(g)!=M(g2)

Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

< speciation >

g is a duplication

Set duplication tag.2

< duplication >Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

< speciation >

Next node is 1

< duplication >Find LCA

A B C D

4 5 6 7

A C B D

Algorithm

Status:

< speciation >

Next node is

by the same algorithm we got…….

< duplication >

< speciation >

Find LCA

Result…

4 5 6 7

< speciation >

< duplication >

< speciation >

Algorithm

Algorithmcomplexity analysis and implementation

A simple algorithm to infer gene duplication and speciation on agene tree

Report:Wang Shiou-Ling

Complexity Analysis

Input size: n, the number of genes species tree

Space complexity : O(n)Assumption: all of tree are binary tree

O(n), nodes of species treeStoring 2 trees(gene tree and species tree)

Internal nodes +leaves=(n-1)+n=2n-1Space complexity:O(4n) O(n)

Storing auxiliary variables (a,b):constant

Complexity AnalysisTime complexity : Given M only in O(n), but what about calculating MBrute force for overall time complexity:O(n3)Traverse node g=1:n

for node g=1:nfor node s=1:n

if γ(g) is in σ(s) then

assign M(g)=s

Complexity AnalysisLCA AlgorithmTime complexity would be reduced to O(n2)Initialization: O(n)

Initializing M for leaves O(n), using Hash Table to look up species name.Initializing S: O(n).

Recursion: W(n2)Best case: O(n)Worst case: balanced S, unbalanced S

Case Study

Case A:O(1) for M(g3) assignment

Start:M(A)=4 ,M(B)=5a=4, b=5While:1st a=4,b=5 b>a,b=3

4 5 6 7

Case StudyCase A:O(1) for M(g3) assignment

2nd a=4,b=3 a>b,a=3a=3,b=3Break!M(g3)=3M(g3) M(A) M(B)Speciation

4 5 6 7

Start:M(g3)=3 ,M(C)=6a=3,b=6While:1st a=3,b=6 b>a,b=2

4 5 6 7

2nd a=3,b=2 a>b,a=2a=2,b=2Break!M(g2)=2

M(g2) M(A) M(B)SpeciationSo as g1

4 5 6 7

Case StudyCase A: Another G for finding M in O(1)

4 5 6 74567

g=g3Start:M(A)=4 ,M(B)=5a=4, b=5While:1st a=4,b=5 b>a,b=3

Case Study

Case A: Another GDefinition : topology of G and S

4 5 6 74567

2nd a=4,b=3 a>b,a=3a=3,b=3Break!M(g3)=3M(g3) M(A) M(B)SpeciationSo as g2,g1

Case StudyCase B:O(log n) for M(g3) assignment

Start:M(A)=3 ,M(C)=6a=3,b=6While:1st a=3,b=6b>a,b=52nd a=3,b=5b>a,b=1

3 46 7

Balanced S

Case Study

Case B:O(log n) for M(g3) assignment

3rd a=3,b=1a>b,a=2,14th a=2,b=1a>b,a=1a=1,b=1Break!M(g3)=1

M(g3) M(A) M(B)speciation

3 46 7

Case StudyCase B:O(log n) for M(g2) assignment

Start:M(g3)=1 ,M(B)=4a=1,b=4While:1st a=1,b=4b>a,b=22nd a=1,b=2b>a,b=1a=1,b=1Break!M(g2)=1M(g2)=M(g3)duplication

6773 46

So as g1

Case Study

Case C:O(n) for M(g) assignment

4 5 6 7

•Unbalanced S Observation:For every gene in G should climb up to the root.So time complexity=O(n)

Start:M(D)=7 ,M(C)=6a=7,b=6While:1st a=7,b=6a>b,a=1

Case Study

Case C:O(n) for M(g3) assignment

4 5 6 7

2nd a=1,b=6b>a,b=23rd a=1,b=2b>a,b=24th a=1,b=2b>a,b=1a=1,b=1 Break!M(g3)=1M(g3) M(D) M(C)speciation

Case Study

4 5 6 7

Start:M(g3)=1 ,M(B)=6a=1,b=6While:1st a=1,b=6b>a,b=22nd a=1,b=2b>a,b=1a=1,b=1Break!M(g2)=1M(g2)=M(g3)Duplication

Case Study

4 5 6 7

Start:M(g2)=1 ,M(A)=4a=1,b=4While:1st a=1,b=4b>a,b=32nd a=1,b=3b>a,b=2a=1,b=1Break!M(g1)=1M(g1)=M(g2)Duplication

4 5 6 77 6 5 4

Tracing parent

Improvement

Little Trick:Would not have crossly Mapping

If one of the children maps to root…

mapping while initialization

2 3 4 5 6 7

1 1 1 1 1 1 1

2 2 2 2 2 1

3 3 3 2 1

4 3 2 1

ImprovementPreprocessing

Find LCA in O(1):

Schieber and Vishkin /ja’ja’By direct arithmetic.

Preprocessing in O(n).

Calculating M in in O(nα(n,n)):α(n,n): inverse of Ackermann function

Eulenstein algorithm:Using data structure similar to disjoint-set forest.

Implementation

A Tree Viewer (ATV)

duplications

speciation

numbers bootstrap

values

numbers EC numbers

Implementation

Material:

gene tree: fibrinogen beta and gamma chain

Pfam AC:PF00147

species tree:the Tree of Life project

ImplementationBoth algorithms were implemented in Java.

SDI (Speciation vs Duplication Inference)Eulenstein’s algorithm

PreprocessingDeleting external nodes in S that have no genes in G

Timings reportedAverage of three runs on a single processor 500MHz P-III system running Red Hat Linux 6.0 and Sun Microsystems’ Java 1.2 SDK for Linux

Results – Synthetic DataSynthetic Data Sets– exercise the worst-case behavior.

Synthesized gene trees with n genes

M(g) for every internal node would map to the root of the corresponding species tree with n species.

The situation in Fig. 3B and 3C.

Balanced SO(n logn)

Unbalanced SO(n2)

worst-case behavior

Syn. Data with Balanced S Using SDI algorithm

Syn. Data with UnBalanced S

Using SDI algorithm

Syn. Data with Balanced/UnBalanced S

Using Eulenstein’s algorithm

Results – Synthetic DataFor a balanced species tree, Fig. 3B, both algorithms have running times that scale nearly linearly in tree size. O(n logn)For maximally unbalanced species tree, Fig. 3C, we confirm our algorithm, SDI, worst case O(n2) behavior.Over about n=550 genes and species, our implementation of Eulenstein’s algorithm outperforms SDI.If only the calculation of M(g) is compared (excluding all preprocessing and initialization steps), Eulenstein’s algorithm outperforms SDI for n larger than about 200 taxa.

Results – Real DataReal Data

2478 multiple sequence alignments from the ‘full’ alignments (as opposed to the smaller ‘seed’ alignments) in the protein family database Pfam (release 5.5; Bateman et al., 2000)Alignments were removed

not originating from the curated SWISS-PROT database (Bairoch and Apweiler, 2000) not from species in our species tree (see below)

Alignments were discardedWith fewer than four or more than 1000 sequences

Leaving 1750 alignments

Results – Real DataColumns containing one or more gap symbols were removed from the alignment if the resulting alignment after this filtering was at least 100 amino acids in length.Construct the Gene Tree

Pairwise distances were calculated based on the Dayhoff PAM matrix.Using the program PROTDIST from Felsenstein’s PHYLIP (1993) A neighbor-joining tree was constructed using the program NEIGHBOR from PHYLIP.

Midpoint rooting method (Swofford et al., 1996)

Results – Real DataConstruct the Species Tree

A single master species tree was compiled manually, containing 200 of the most commonly encountered species in Pfam.The topology of this species tree is based on the taxonomy database at NCBI (http://www.ncbi.nlm.nih.gov/Taxonomy/tax.html/), the Tree of Life project (Madison and Madison, at http://phylogeny.arizona.edu/tree/phylogeny.html )This tree is available at http://www.genetics.wustl.edu/eddy/forester/

Real Data using Eulenstein’s algorithm

Real Data using SDI algorithm

Results – Real DataThe average case behavior of SDI algorithm on real data sets is approximately O(n).

Worst case is not realized.

Analysis exampleThe fibrinogen beta and gamma chain Pfam family is presented in figure 5.The fibrinogen sequence family contains fibrinogen alpha, beta and gamma chains (sequences with FIBA, FIBB, FIBG prefixes).

Analysis exampleEach chain type appears on the tree as a paralogous subtree.A special case is FIBH_HUMAN (fibrinogen gamma-B chain)

It appears to be the result of alternative splicing of the human gamma chain gene.

Sequences with TENA prefixes (such as Tenascins)The fibrinogen family also contains various proteins probably involved in adhesion, which share the fibrinogen-like domain with the fibrinogen sequences.

Analysis exampleInteresting case—FIBX_MOUSE

A mouse enzyme with prothrombinase activityIs similar to fibrinogen beta and gamma chains (Parr er al. 1995)

The node connecting FIBX_MOUSE to the rest of the tree is inferred to be a duplication event.Since the placement of FIBX_MOUSE contradicts the species tree and hence FIBX_MOUSE is inferred to be paralogous to the fibrinogen beta chain subfamily (FIBB).In contrast, BLAST analysis of the FIBX_MOUSE sequence could easily have misannotated it as the mouse fibrinogen beta chain.

Motivation: Orthologous sequences are more

reliable predictors of a new protein’s function than paralogous sequences

Goal: Automate phylogenomics using

explicit phylogenetic inference.

A simple algorithm to infer gene duplication and

speciation events on a gene tree

Performance

Difficulties for practical useRootedBiological correct

Reliability

Discussion

Performance

The comparison of asymptotic worst-case running time may be misleadingOur algorithm is O(n2)Empirically outperforms Eulenstein’s (1998) - more complex, asymptotic bound close to O(n)

Worst case : pathological M(g) for every internal node points to the rootNo two genes from the same species, no. in S is O(n), S is maximally unbalancedIn real data, O(n)

Performance

The improved asymptotic bound will not be worth the cost of the extra complexity nor the extra computational overhead

Practical use

Use SDI as part of a system for

automating phylogenomics

（ forester ） .

Assumption : the gene tree and

species tree are both properly

rooted and biological correct.

Rooted

Rooted properly

Molecular clock

•No. of substitution time back to common ancestor , constant rate

•Dubious in sequences family

•In paralogous sequences family, depend on duplication inference

•Minimize the dissimilarity between gene tree and species tree

Outgroup

Molecular Clock

Reliability

Problematic : duplication to predict function

Multifurcations ： lack of resolution

Limitation of algorithm

Concept of orthology and paralogy

Reliability sampling

BootstrapMCMC （ Markov Chain Monte Carlo ）Integrate orthology assignments over tree spaceProbability, confidence valueRank the inferred orthologyAlso help to root

Bootstrap -- resampling

A simple algorithm to infer gene duplication and speciation events on a gene tree...

Documents

S池ゝ1｣ZJ S王ヨン王=(m UCrm R 王棚 T CT∃∃S

1 Gene Expression Overview. 2 Gene Expression Gene Expression The Gene Structure The Gene Structure Protein Synthesis Protein Synthesis

D0053888 王家樺

UNIVERSITI PUTRA MALAYSIA GENESIZE POLYMORPHISM … · gapA gene, LP gene, F-strain-specific DNA fragment gene, CrmB gene, CrmC gene, p47 gene, HMW3-like protein gene and pneumoniae-like

Reconstructing gene networks Analysing the properties of gene networks Gene Networks Using gene expression data to reconstruct gene networks

By 0318 王爽

B：三王コード：チロルチョコ三王コード：チロルチョコチロルミルクヌガー 22g 30×12 チョコレート個： B： c/s：三王コード：263201

Unit 3 – Lesson 2 What Animal Are You? Second period 王莺王莺

伟大作家的三次降生印象王小王原名王瑨王小王wyb.chinawriter.com.cn/attachment/201805/09/9534409e-3e64-4436-90c4-b9... · 等；曾主编《新实力华语作家

2020 07月期八王子 ura...CMYK 文化センター八王子表面文化センター八王子裏面 CMYK 文化センター八王子裏面 CMYK 文化センター八王子裏面

王富民 Cs101212

王卫东 2013.05

教授 : 王明賢學生 : 王沼奇

1002445062 王培臣

Legal and Moral implications of cloning 组员：刘秀玲、杜倩倩、王晶、李丹丹、王娜、王雪宁

2020 14 号 - ujs.edu.cnfxb.ujs.edu.cn/__local/2/B4/62/B56DA9B0D2EAEA97F03... · 王化敦、王红春、王宏、王晓东、王银磊、龙卫华、吕远大、任润生、刘瑞显、张春华、陈健、彭琦

第六屆邀請敬呈富邦文教基金會＿陳藹玲女士

Regulation of Gene Expression - Parkway Schools Bio Fall 2012/18... · Regulation of gene expression trpE gene trpD gene trpC gene trpB gene trpA gene (b) Regulation of enzyme production

Ex2＿B10310011＿王乙雯

Hubert Gene “Gene-o” Shipley 1973 ”GENE-O”!