http://creativecommons.org/licenses/by-sa/2.0/

CIS786, Lecture 8. Usman Roshan. Some of the slides are based upon material by Dennis Livesay and David La of California State University at Pomona.

Page 1: creativecommons/licenses/by-sa/2.0

http://creativecommons.org/licenses/by-sa/2.0/

Page 2: creativecommons/licenses/by-sa/2.0

CIS786, Lecture 8

Usman Roshan

Some of the slides are based upon material by Dennis Livesay and David La of California State University at Pomona

Page 3: creativecommons/licenses/by-sa/2.0

Previously…

Page 4: creativecommons/licenses/by-sa/2.0

Evaluation of multiple sequence alignments

• Compare to benchmark “true” alignments

• Use simulation

• Measure conservation of an alignment

• Measure accuracy of phylogenetic trees

• How well does it align motifs?
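One common way to compare an alignment to a benchmark "true" alignment is a sum-of-pairs style score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The slides do not name a specific measure, so the choice of this score, the function names, and the toy sequences below are assumptions made for illustration only.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs placed in the same column.

    `alignment` maps a sequence name to its gapped row ('-' marks a gap).
    Each pair is ((name_a, index_a), (name_b, index_b)), where the indices
    count residues in the ungapped sequence.
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    counters = {name: 0 for name in names}
    pairs = set()
    for col in range(length):
        present = []
        for name in names:
            if alignment[name][col] != "-":
                present.append((name, counters[name]))
                counters[name] += 1
        # names are visited in sorted order, so every pair is stored canonically
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(test, reference):
    """Fraction of reference residue pairs that the test alignment recovers."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Toy data (hypothetical, not from the lecture)
reference = {"s1": "AC-GT", "s2": "ACTGT", "s3": "A--GT"}
test = {"s1": "ACG-T", "s2": "ACTGT", "s3": "A-G-T"}
print(sp_score(reference, reference))
print(sp_score(test, reference))
```

A score of 1 means every reference pair is recovered; lower values mean the test alignment places some homologous residues in different columns.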

Page 5: creativecommons/licenses/by-sa/2.0

ROSE

• Evolve sequences under an i.i.d. Markov model

• Root sequence: probabilities given by a probability vector (for proteins the default is the Dayhoff et al. values)

• Substitutions
  – Edge lengths are integers
  – Probability matrix M is given as input (default is PAM1*)
  – For an edge of length b, the probability of x → y is given by (M^b)_xy, the (x, y) entry of the b-th power of M

• Insertions and deletions:
  – Insertions and deletions follow the same probabilistic model
  – For each edge, the probability to insert is i_ins
  – The length of an insertion is given by a discrete probability distribution (normally exponential)
  – For an edge of length b this is repeated b times

• Model tree can be specified as input
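The substitution step can be sketched as follows: raise M to the b-th power, then replace each site independently according to the row of M^b for its current symbol. This is only an illustration of the model described on the slide, not ROSE's own code. The 4-state matrix is a made-up toy (it is not PAM1), and the helper names are hypothetical. Insertions and deletions are left out.

```python
import random


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, b):
    """Return M^b for a non-negative integer b (identity for b = 0)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(b):
        result = mat_mul(result, m)
    return result


def evolve(sequence, alphabet, m, b, rng):
    """Substitute each site independently along an edge of integer length b."""
    mb = mat_pow(m, b)
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    symbols = list(alphabet)
    return "".join(
        rng.choices(symbols, weights=mb[index[x]])[0] for x in sequence
    )


# Toy single-step matrix (hypothetical values, NOT PAM1); each row sums to 1
ALPHABET = "ACGT"
M = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]

rng = random.Random(0)
print(evolve("ACGTACGTAC", ALPHABET, M, 5, rng))
```

Because the sites are i.i.d., each position is drawn separately. A longer edge (larger b) moves the rows of M^b further from the identity, so more sites change.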

Page 6: creativecommons/licenses/by-sa/2.0

Phylogeny accuracy

Page 7: creativecommons/licenses/by-sa/2.0

Running time

Page 8: creativecommons/licenses/by-sa/2.0

Evaluating alignments using motif detection

• Let’s evaluate alignments by searching for motifs

• If alignment X reveals more functional motifs than Y using technique Z then X is better than Y w.r.t. Z

• Motifs could be functional sites in proteins or functional regions in non-coding DNA

Page 9: creativecommons/licenses/by-sa/2.0

What is a “Functional Site”?

• Defining what constitutes a “functional site” is not trivial

• Residues that include and cluster around known functionality are clear candidates for functional sites

• We define a functional site as catalytic residues, binding sites, and the regions that cluster around them.

Page 10: creativecommons/licenses/by-sa/2.0

Functional Sites (FS)

Page 11: creativecommons/licenses/by-sa/2.0

Phylogenetic motifs

• PMs are short sequence fragments that conserve the overall familial phylogeny

• Are they functional?

• How do we detect them?

Page 12: creativecommons/licenses/by-sa/2.0

Map PMs to the Structure

Map

2DBL DVVMTQIPLSLPVNLGDQASISCRSSQSLIHSNGNTYLHWYLQKPGQSPKLLMYKVSNRF
1NCA DIVMTQSPKFMSTSVGDRVTITCKASQ-----DVSTAVVWYQQKPGQSPKLLIYWASTRH
2JEL DVLMTQTPLSLPVSLGDQASISCRSSQSIVHGNGNTYLEWYLQKPGQSPKLLIYKISNRF
2IGF DVLMTQTPLSLPVSLGDQASISCRSNQTILLSDGDTYLEWYLQKPGQSPKLLIYKVSNRF
3HFM DIVLTQSPATLSVTPGNSVSLSCRASQS-----IGNNLHWYQQKSHESPRLLIKYASQSI
3HFL DIVLTQSPAIMSASPGEKVTMTCSASSS------VNYMYWYQQKSGTSPKRWIYDTSKLA
1NGP QAVVTQES-ALTTSPGETVTLTCRSSTG--AVTTSNYANWVQEKPDHLFTGLIGGTNNRA

2DBL YGVPDRFSGSGSGTDFTLKISRVEAEDLGIYFCSQSSHVPPTFGGGTKLEIK-RADAAPT
1NCA IGVPDRFAGSGSGTDYTLTISSVQAEDLALYYCQQHYSPPWTFGGGTKLEIK-RADAAPT
2JEL SGVPDRFSGSGSGTDFTLKISRVEAEDLGVYYCFQGSHVPYTFGGGTKLEIK-RADAAPT
2IGF SGVPDRFSGSGSGTDFTLKISRVEAEDLGVYYCFQGSHVPPTFGGGTKLEIK-RADAAPT
3HFM SGIPSRFSGSGSGTDFTLSINSVETEDFGMYFCQQSNSWPYTFGGGTKLEIK-RADAAPT
3HFL SGVPVRFSGSGSGTSYSLTISSMETEDAATYYCQQWGRNP-TFGGGTKLEIK-RADAAPT
1NGP PGVPARFSGSLIGDKAALTITGAQTEDEAIYFCALWYSNHWVFGGGTKLTVLGQPKSSPS

2DBL MSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNR---QIQLVQSGPELKKPGETVKI
1NCA MSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNECQIQLVQSGPELKKPGETVKI
2JEL MSSTLTLTKDEYERHNSYTCEATHKTSDSPIVKSFNRN--QVQLAQSGPELVRPGVSVKI
2IGF MSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNECEVQLVESGGDLVKPGGSLKL
3HFM MSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNECDVQLQESGPSLVKPSQTLSL
3HFL MSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNECXVQLQQSGAELMKPGASVKI
1NGP ASSYLTLTARAWERHSSYSCQVTHEGHT--VEKSLSR---QVQLQQPGAELVKPGASVKL

Set PSZ Threshold
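The slide names a PSZ threshold without defining it. The sketch below assumes that PSZ is a z-score of per-window phylogenetic similarity scores, and that lower scores mean a window's tree is closer to the whole-family tree. Both points are assumptions, as are the function names, the threshold of -1.5, and the toy scores.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Normalize raw per-window scores to z-scores (population std. dev.)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def windows_below_threshold(scores, threshold):
    """Return indices of windows whose z-score is at or below the threshold."""
    return [i for i, z in enumerate(z_scores(scores)) if z <= threshold]


# Hypothetical tree-distance scores, one per sliding window of the alignment
window_scores = [10, 9, 10, 2, 10, 9, 10, 3, 10, 10]
print(windows_below_threshold(window_scores, -1.5))
```

Under these assumptions, windows 3 and 7 stand out as candidate PMs. A stricter (more negative) threshold would keep fewer windows.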

Page 13: creativecommons/licenses/by-sa/2.0

PMs in Various Structures

Page 14: creativecommons/licenses/by-sa/2.0

PMs and Traditional Motifs

Page 15: creativecommons/licenses/by-sa/2.0

TIM

[Figure: plot of Phylogenetic Similarity and False Positive Expectation]

Page 16: creativecommons/licenses/by-sa/2.0

Cytochrome P450

[Figure: plot of Phylogenetic Similarity and False Positive Expectation]

Page 17: creativecommons/licenses/by-sa/2.0

Enolase

[Figure: plot of Phylogenetic Similarity and False Positive Expectation]

Page 18: creativecommons/licenses/by-sa/2.0

Glycerol Kinase

[Figure: plot of Phylogenetic Similarity and False Positive Expectation]

Page 19: creativecommons/licenses/by-sa/2.0

Myoglobin

[Figure: plot of Phylogenetic Similarity and False Positive Expectation]

Page 20: creativecommons/licenses/by-sa/2.0

Evaluating alignments

• For a given alignment compute the PMs

• Determine the number of functional PMs

• Those identifying more functional PMs will be classified as better alignments
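The three steps above can be sketched as a simple count-and-rank. As a simplification, a PM is treated as functional when its window contains at least one known functional residue. That criterion, the function names, and the positions below are hypothetical and are not taken from the lecture.

```python
def count_functional_pms(pm_windows, functional_sites):
    """Count PM windows that contain at least one functional residue.

    `pm_windows` is a list of (start, end) residue positions, inclusive,
    on a reference sequence; `functional_sites` is a set of positions.
    """
    return sum(
        any(start <= site <= end for site in functional_sites)
        for start, end in pm_windows
    )


def rank_alignments(pms_by_alignment, functional_sites):
    """Sort alignments by their number of functional PMs, best first."""
    counts = {
        name: count_functional_pms(windows, functional_sites)
        for name, windows in pms_by_alignment.items()
    }
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical PM windows found on two alignments of the same family
functional_sites = {12, 45, 46, 90}
pms_by_alignment = {
    "X": [(10, 14), (44, 48), (70, 74)],
    "Y": [(20, 24), (88, 92)],
}
print(rank_alignments(pms_by_alignment, functional_sites))
```

Here alignment X yields two functional PMs and Y yields one, so X would be classified as the better alignment with respect to this technique.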

Page 21: creativecommons/licenses/by-sa/2.0

Running time

Page 22: creativecommons/licenses/by-sa/2.0

Functional PMs

PAl = blue, MUSCLE = red, Both = green

(a) = enolase, (b) = ammonia channel, (c) = tri-isomerase, (d) = permease, (e) = cytochrome

Page 23: creativecommons/licenses/by-sa/2.0

Today

• More simulations…

• Comparison of MP and NJ trees on different protein alignments

• Simultaneous alignment and phylogeny reconstruction
  – Starting trees for POY
  – Boosting it with RecIDCM3

Page 24: creativecommons/licenses/by-sa/2.0

NJ vs MP on 50 taxa and 500 mean sequence length

[Figure panels: NJ, MP]

Page 25: creativecommons/licenses/by-sa/2.0

NJ vs MP on 100 taxa and 500 mean sequence length

[Figure panels: NJ, MP]

Page 26: creativecommons/licenses/by-sa/2.0

NJ vs MP on 400 taxa and 500 mean sequence length

[Figure panels: NJ, MP]

Page 27: creativecommons/licenses/by-sa/2.0

MP trees on 800 taxa alignments

Page 28: creativecommons/licenses/by-sa/2.0

Increasing sequence lengths on 50 taxa datasets

[Figure panels: sequence lengths 200, 500, 1000]

Page 29: creativecommons/licenses/by-sa/2.0

Increasing sequence lengths on 400 taxa datasets

[Figure panels: sequence lengths 200, 500, 1000]

Page 30: creativecommons/licenses/by-sa/2.0

Simultaneous alignment and phylogeny reconstruction: POY

• Performs TBR through tree space to search for better tree alignments

• Uses a variant of progressive alignment without profiles
  – Assigns ancestral sequences to internal nodes using MP
  – Removes gaps in ancestral sequences

• Optional median alignment is possible

Page 31: creativecommons/licenses/by-sa/2.0

Starting trees for POY

• Poy-default (greedy method)

• Poy-approxbuild (faster greedy method)

• Heuristic maximum parsimony trees generated on the following alignments using the TNT program (TBR search with one saved tree):
  – ClustalW (fast distance estimation)
  – Muscle1 (default): progressive alignment (BLASTZ scoring matrix)
  – Muscle2 (default): improved iterative progressive alignment (BLASTZ scoring matrix)
  – Muscle1MP: progressive alignment (scoring matrix for parsimony: match=1, mismatch=0, gapopen=gapextend=-1)
  – Muscle2MP: improved iterative progressive alignment (parsimony scoring matrix as above)
  – Muscle1MP (CW-guidetree): Muscle1-MP on the ClustalW guide-tree (fast distance estimation)
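To make the parsimony scoring matrix concrete, the sketch below scores an already-aligned pair of rows with match=1, mismatch=0 and a cost of -1 per gap position. Charging every gap position -1 is one reading of gapopen=gapextend=-1 and is an assumption here; MUSCLE's own gap accounting may differ. The function name is hypothetical.

```python
def parsimony_score(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Score two aligned rows column by column under the parsimony scheme."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a column of two gaps carries no information
        if a == "-" or b == "-":
            score += gap  # assumption: each gap position costs -1
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(parsimony_score("ACG-T", "AC-TT"))
print(parsimony_score("ACGT", "ACGT"))
```

With mismatches free and gaps penalized, this scheme rewards alignments that maximize identical columns, which mirrors what a parsimony search counts on the resulting alignment.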

Page 32: creativecommons/licenses/by-sa/2.0

Simulation study parameters

• Model trees: uniform random distribution and uniformly selected random edge lengths

• Model of evolution: HKY95 with insertion and deletion probabilities selected from a gamma distribution (see ROSE software package)

• Generated data: Settings of 250, 500, and 1000 taxa, mean sequence lengths of 1000 and 2000, and an average branch length of 0.2 were selected. One dataset was produced for each setting.

• Criterion for branch length and sequence length selection: The evolutionary rate was selected such that the starting Poy tree had an error rate between 20% and 30%, neither too hard nor too easy. Mean sequence lengths of 1000 and 2000 are realistic for protein coding sequences.
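The slides do not say how the tree error rate is computed. A common choice in simulation studies is the false negative rate: the fraction of model-tree bipartitions missing from the estimated tree. The sketch below assumes that definition and represents trees as nested tuples of leaf names; the representation, function names, and toy trees are all hypothetical.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions induced by the internal edges."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # fixed leaf used to orient every split
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        if len(below) > 1 and len(other) > 1:
            # store the side without the anchor so equal splits compare equal
            splits.add(other if anchor in below else below)
        return below

    visit(tree)
    return splits


def false_negative_rate(model_tree, estimated_tree):
    """Fraction of model-tree bipartitions absent from the estimated tree."""
    true_splits = bipartitions(model_tree)
    if not true_splits:
        return 0.0
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)


model = ((("A", "B"), "C"), ("D", "E"))
estimate = ((("A", "C"), "B"), ("D", "E"))
print(false_negative_rate(model, estimate))
```

Under this definition, a starting tree with a 20% to 30% error rate would be missing roughly a quarter of the model tree's internal edges.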

Page 33: creativecommons/licenses/by-sa/2.0

Comparison of Poy to MUSCLE and ClustalW under simulation

250 taxa, 941 mean sequence length, 0.2 avg branch length

Page 34: creativecommons/licenses/by-sa/2.0

Comparison of Poy to MUSCLE and ClustalW under simulation

500 taxa, 981 mean sequence length, 0.2 avg branch length

Page 35: creativecommons/licenses/by-sa/2.0

Comparison of Poy to MUSCLE and ClustalW under simulation

1000 taxa, 993 mean sequence length, 0.2 avg branch length

Page 36: creativecommons/licenses/by-sa/2.0

Comparison of Poy to MUSCLE and ClustalW on real data

218 taxa RNA metazoan dataset

Page 37: creativecommons/licenses/by-sa/2.0

Comparison of Poy to MUSCLE and ClustalW on real data

585 taxa RNA archaea dataset

Page 38: creativecommons/licenses/by-sa/2.0

Comparison of Poy to MUSCLE and ClustalW on real data

1040 taxa RNA mitochondria dataset

Page 39: creativecommons/licenses/by-sa/2.0

Comparison of Poy to MUSCLE and ClustalW on real data

1766 taxa RNA metazoa dataset