Introduction to Probabilistic Models for Bioinformatics

SEVEN BRIDGES GENOMICS, LLC

Short introduction to Bioinformatics
What are the Probabilistic Models?
Sequence Alignment
Pairwise Alignment
Multiple Sequence Alignment Models
What is Phylogenetics?
Building Phylogenetic Trees
Other Models
Contact Us

Igor Bogicevic ([email protected])

July 3, 2011

Short introduction to Bioinformatics

- Bioinformatics is the application of statistics and computer science to the field of molecular biology.

- Major research efforts in the field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies, and the modeling of evolution.

- Given the enormous volume of sequence data now available, one of the biggest challenges is not producing the data but understanding it.

What are the Probabilistic Models?

- There are two basic definitions:

  - A statistical analysis tool that estimates, on the basis of past (historical) data, the probability of an event occurring again.

  - A system that simulates the object under consideration and produces different outcomes with different probabilities.

- Simple example: rolling a die.

- A more relevant example: the random sequence model for DNA.

- Biological sequences are strings over a finite alphabet of residues, most commonly either four nucleotides or twenty amino acids.

- Suppose a residue a occurs with probability q_a, independently of the other residues. If a protein or DNA sequence is denoted x_1 ... x_n, then the probability of the whole sequence is:

\[ q_{x_1} q_{x_2} \cdots q_{x_n} = \prod_{i=1}^{n} q_{x_i} \]
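As a minimal sketch of the random sequence model above, the Python snippet below multiplies the per-residue probabilities q_a along a DNA string. The uniform frequencies (0.25 each) and the function names are illustrative assumptions, not values from the slides. The log-probability is computed first because the raw product underflows for long sequences.

```python
import math

# Hypothetical residue frequencies q_a (uniform DNA background); illustrative only.
Q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def sequence_log_probability(x, q=Q):
    """Return log P(x) = sum_i log q[x_i] under the random sequence model."""
    return sum(math.log(q[residue]) for residue in x)

def sequence_probability(x, q=Q):
    """Return P(x) = prod_i q[x_i]; may underflow to 0.0 for very long x."""
    return math.exp(sequence_log_probability(x, q))

if __name__ == "__main__":
    print(sequence_probability("ACGT"))
```

With non-uniform frequencies, the same functions would assign different probabilities to sequences of equal length, which is what makes the model useful as a background (null) model.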

Sequence Alignment

- Sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

- A variety of computational approaches have been applied to the sequence alignment problem, including dynamic programming, heuristic algorithms, and probabilistic methods.

- Common file formats accepted by alignment tools include FASTA and GenBank.

Pairwise Alignment

- Pairwise sequence alignment methods are used to find the best-matching piecewise (local) or global alignments of two query sequences.

- The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods.

- Needleman-Wunsch algorithm (global alignment)

- Smith-Waterman algorithm (local alignment)

- FASTA/BLAST algorithms (k-tuple heuristic methods, often combined with dynamic programming)

- Gap penalties: modeling the cost of a gap in the aligned sequences (linear, affine, etc.)
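The linear and affine gap schemes mentioned in the last item can be sketched as two small scoring functions. The parameter names d (gap-open or per-position cost) and e (gap-extension cost), and their default values, are assumptions chosen for illustration; the slides do not specify them.

```python
def linear_gap_score(length, d=1):
    """Linear scheme: every gap position costs the same, score = -d * length."""
    return -d * length

def affine_gap_score(length, d=4, e=1):
    """Affine scheme: opening a gap costs d, each further position costs e."""
    if length == 0:
        return 0
    return -d - (length - 1) * e

if __name__ == "__main__":
    for g in (1, 2, 5):
        print(g, linear_gap_score(g), affine_gap_score(g))
```

Under the affine scheme one long gap is penalized less than several short gaps of the same total length, which is often considered a better fit for insertion and deletion events that span multiple residues.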

- Example: Smith-Waterman. A matrix H is built as follows:

\[ H(i, 0) = 0, \quad 0 \le i \le m \]

\[ H(0, j) = 0, \quad 0 \le j \le n \]

\[ w(a_i, b_j) = \begin{cases} w(\text{match}) & \text{if } a_i = b_j \\ w(\text{mismatch}) & \text{if } a_i \ne b_j \end{cases} \]

\[ H(i, j) = \max \left\{ \begin{array}{ll} 0 & \\ H(i-1, j-1) + w(a_i, b_j) & \text{Match/Mismatch} \\ H(i-1, j) + w(a_i, -) & \text{Deletion} \\ H(i, j-1) + w(-, b_j) & \text{Insertion} \end{array} \right\}, \quad 1 \le i \le m,\ 1 \le j \le n \]

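- Sequence 1 = ACACACTA, Sequence 2 = AGCACACA

- w(match) = +2

- w(a,-) = w(-,b) = w(mismatch) = -1

Rows of H correspond to Sequence 2 and columns to Sequence 1:

\[
H = \begin{pmatrix}
  & - & A & C & A & C & A & C & T & A \\
- & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
A & 0 & 2 & 1 & 2 & 1 & 2 & 1 & 0 & 2 \\
G & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 1 \\
C & 0 & 0 & 3 & 2 & 3 & 2 & 3 & 2 & 1 \\
A & 0 & 2 & 2 & 5 & 4 & 5 & 4 & 3 & 4 \\
C & 0 & 1 & 4 & 4 & 7 & 6 & 7 & 6 & 5 \\
A & 0 & 2 & 3 & 6 & 6 & 9 & 8 & 7 & 8 \\
C & 0 & 1 & 4 & 5 & 8 & 8 & 11 & 10 & 9 \\
A & 0 & 2 & 3 & 6 & 7 & 10 & 10 & 10 & 12
\end{pmatrix}
\]

- In the example, the highest value corresponds to the cell in position (8,8). The walk back corresponds to (8,8), (7,7), (7,6), (6,5), (5,4), (4,3), (3,2), (2,1), (1,1), and (0,0).

- Resulting alignment: Sequence 1 = A-CACACTA, Sequence 2 = AGCACAC-A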
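The matrix fill and the walk back from this example can be sketched in Python as follows. This is an illustrative sketch with a single linear gap score, not the author's implementation. The function names and the tie-breaking order in the traceback (diagonal, then up, then left) are assumptions. Following the matrix layout, the first argument a is Sequence 2 (rows) and the second argument b is Sequence 1 (columns).

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment matrix H; return H and the best-scoring cell."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + w,   # match/mismatch
                          H[i - 1][j] + gap,     # gap in b (deletion)
                          H[i][j - 1] + gap)     # gap in a (insertion)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return H, best_pos

def traceback(H, a, b, start, match=2, mismatch=-1, gap=-1):
    """Walk back from start until a zero cell or the matrix border is reached."""
    i, j = start
    out_a, out_b, path = [], [], [(i, j)]
    while i > 0 and j > 0 and H[i][j] > 0:
        w = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + w:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
        path.append((i, j))
    return "".join(reversed(out_a)), "".join(reversed(out_b)), path

if __name__ == "__main__":
    seq2, seq1 = "AGCACACA", "ACACACTA"   # rows, columns
    H, pos = smith_waterman(seq2, seq1)
    aligned2, aligned1, path = traceback(H, seq2, seq1, pos)
    print(H[pos[0]][pos[1]], pos)
    print(aligned1)
    print(aligned2)
```

For the two sequences above, the sketch gives the maximum score of 12 at cell (8,8) and the same walk back and alignment as listed in the example.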

Multiple Sequence Alignment Models

- A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, commonly protein, DNA, or RNA.

- Multiple alignments are usually built to find homologous sequences, which point to a shared evolutionary origin and can be used for further phylogenetic analysis.

- Progressive alignment methods: build the MSA through a succession of pairwise alignments.

- Hidden Markov models: the MSA is represented as a directed acyclic graph (DAG); the observed states are the individual alignment columns and the hidden states represent the presumed ancestral sequence.
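To make the hidden Markov model idea more concrete, here is a minimal sketch of the forward algorithm, which sums the probability of an observed sequence over all hidden state paths. It is a generic two-state toy, not a profile HMM for real alignments. The state names and every probability in the tables are hypothetical values chosen only for illustration.

```python
from itertools import product

# Hypothetical toy model: all names and numbers below are illustrative.
STATES = ("conserved", "variable")
START = {"conserved": 0.8, "variable": 0.2}
TRANS = {"conserved": {"conserved": 0.9, "variable": 0.1},
         "variable": {"conserved": 0.6, "variable": 0.4}}
EMIT = {"conserved": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        "variable": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def forward(observations):
    """Return P(observations), summed over all hidden paths (non-empty input)."""
    # Initialisation: start in state s and emit the first symbol.
    f = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    # Recursion: extend every partial path by one transition and one emission.
    for symbol in observations[1:]:
        f = {s: EMIT[s][symbol] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())

if __name__ == "__main__":
    print(forward("AAC"))
    # Sanity check: probabilities of all length-3 sequences sum to about 1.
    print(sum(forward("".join(p)) for p in product("ACGT", repeat=3)))
```

The forward recursion runs in time proportional to the sequence length times the square of the number of states, instead of enumerating every hidden path explicitly.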

What is Phylogenetics?

- Phylogenetics is the study of evolutionary relatedness among groups of organisms (e.g. species, populations), which is inferred from molecular sequencing data and morphological data matrices.

- Evolution is regarded as a branching process, whereby populations are altered over time and may speciate into separate branches, hybridize together, or terminate by extinction. This may be visualized in a phylogenetic tree.

- Ernst Haeckel's recapitulation theory ("ontogeny recapitulates phylogeny") is a hypothesis that in developing from embryo to adult, animals go through stages resembling or representing successive stages in the evolution of their remote ancestors.

Building Phylogenetic Trees

- Phylogenetic trees among a nontrivial number of input sequences are constructed using computational phylogenetics methods.

- A common approach is to search for the maximum-likelihood tree, often within a Bayesian framework, applying an explicit model of evolution to phylogenetic tree estimation.

- Identifying the optimal tree using many of these techniques is NP-hard, so heuristic search and optimization methods are used in combination with tree-scoring functions to identify a reasonably good tree that fits the data.

- The resulting trees do not necessarily represent the species' evolutionary history accurately, as the data on which they are based is noisy; the analysis can be confounded by horizontal gene transfer, hybridisation between species that were not nearest neighbors on the tree before hybridisation took place, convergent evolution, and conserved sequences.
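One way to build intuition for why heuristic search is used is to count the candidate trees. The sketch below uses the standard count of distinct unrooted binary trees on n labeled leaves, (2n-5)!!. It is added here as an illustration of the size of the search space, not as a proof of NP-hardness, and the function name is hypothetical.

```python
def count_unrooted_binary_trees(n_leaves):
    """Number of distinct unrooted binary trees on n labeled leaves: (2n-5)!!"""
    if n_leaves < 3:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count

if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, count_unrooted_binary_trees(n))
```

Already at ten sequences there are more than two million candidate topologies, so scoring every tree quickly becomes impractical as the number of sequences grows.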

Other Models

- Transformational grammars (Chomsky hierarchy)

- RNA structure analysis models (in RNA, the base-pairing interactions tend to be preserved rather than the sequence itself)

Contact Us

- We are hiring!
