“Multiple indexes and multiple alignments”
Presenting: Siddharth Jonathan
Scribing: Susan Tang
DFLW: Neda Nategh
Upcoming:
10/24: “Evolution of Multidomain Proteins” Wissam Kazan
“Human Migrations” Anjalee Sujanani
10/26: “Comparison of Networks Across Species” Chuan Sheng Foo
“Repetitive DNA Detection and Classification” Vijay Krishnan
10/19
CS374 Presentation - Searching Biological Sequence Databases 2
CS374 Algorithms in Biology
Searching Biological Sequence Databases
Siddharth Jonathan
Outline
• Background
• Problem
• Typhon Overview
• Typhon Components
• Results
Background
• Sequence Alignment
• Multiple Alignment Databases
• Probabilistic Profile
• Phylogenetic Tree
Sequence Alignment
• Identifying regions of similarity in the genome, proteins etc.
• Types
  – Global
  – Local
    • Seeded
    • Non-seeded
• Why is it important?
  – Comparative analysis of genomes
  – Producing phylogenetic trees
  – Understanding newly sequenced genomes
Seeds – A Review
A seed P is an ordered list of w positions:
P = {x1, x2, …, xw}
w = weight of P = |P|
s = span of P = xw – x1 + 1
Ex: P = {0, 1, 4, 5}
w = 4
s = 5 – 0 + 1 = 6
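The weight and span definitions above can be checked with a few lines of Python. This is only an illustrative sketch; the function names `seed_weight` and `seed_span` are made up for this note and do not come from the slides.

```python
def seed_weight(seed):
    """Weight w = number of positions in the seed."""
    return len(seed)


def seed_span(seed):
    """Span s = last position - first position + 1."""
    positions = sorted(seed)
    return positions[-1] - positions[0] + 1


seed = [0, 1, 4, 5]
print(seed_weight(seed), seed_span(seed))  # prints: 4 6
```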
Indexing in Seeded Local Alignment algorithms
Gene sequence S: …GATTACCAGATTACCAGATTA…
Seed A = {0,1,2,3}
Index entries: GATT → (S,0), ATTA → (S,1)
The same idea holds for non-contiguous seeds as well!
The average number of seeds indexed per position is called the Budget
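As a rough sketch of the indexing idea, the snippet below slides each seed along a sequence and records where every sampled word occurs. The function name `build_index` and the dictionary layout are assumptions made for illustration, not the data structure of any particular tool; with one seed indexed at every position, the Budget here is 1.

```python
from collections import defaultdict


def build_index(name, sequence, seeds):
    """Map (seed, sampled word) to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seed in seeds:
        span = max(seed) + 1  # seed positions are assumed to start at 0
        for start in range(len(sequence) - span + 1):
            # Sample only the positions named by the seed (works for spaced seeds too).
            word = "".join(sequence[start + x] for x in seed)
            index[(tuple(seed), word)].append((name, start))
    return index


S = "GATTACCAGATTACCAGATTA"
index = build_index("S", S, [(0, 1, 2, 3)])
print(index[((0, 1, 2, 3), "GATT")])  # [('S', 0), ('S', 8), ('S', 16)]
print(index[((0, 1, 2, 3), "ATTA")])  # [('S', 1), ('S', 9), ('S', 17)]
```

Passing a non-contiguous seed such as `(0, 1, 4, 5)` in the `seeds` list works the same way, since only the listed offsets are sampled.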
Seeded Local Alignment Algorithms
• BLAST
• BLAT
• BLASTZ
• Exonerate
• Usage of multiple seeds, spaced seeds
• What do they have in common?
• Indexing!
Multiple alignment
[Figure: multiple alignment of Species 1 and Species 2]
Phylogenetic Tree
Probabilistic Profile
Each cell corresponds to one position in the alignment…
We’ll learn what information it carries very shortly!
Regions
The Problem
Say, we have a database of multiple alignments
So what’s the challenge?
Find local alignments for the query
Candidate seeds
The Problem Statement
Budget
Can we do better?
Make use of the information implicit in the multiple alignment to select which seeds to index at a given position
The Problem Statement - Typhon
Given:
• Probabilistic profile
• Candidate seeds
• Budget
Find: an indexing scheme that indexes only a subset of the candidate seeds at each position
Overall Architecture of Typhon
Step 1: Probabilistic Profile Construction
• 6-tuple for each position in the multiple alignment
• Ppresent – existence probability
• PA, PC, PT, PG – conditional probability that the homologous position contains A, C, T, G, given that a homologous position exists
  – The nucleotide with the highest such value is called the consensus character
• Pid – probability that the corresponding query position has the consensus character
Calculation of Probabilistic Profile
Species   Col 1   Col 2   Col 3
Human     A       T       C
Chimp     A       _       C
Rat       A       T       C
Pig       C       T       C

Column 1 (PPresent = 1 at each leaf): PPresent = 100%, PA = 75%, PC = 25%, PG = 0%, PT = 0%
Propagation of values up the tree to the root is a tricky problem!
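To make the numbers in the example concrete, here is a deliberately simplified sketch that treats every species equally and ignores the tree altogether, so it sidesteps the propagation problem rather than solving it. The function name `column_profile` and the returned dictionary layout are assumptions for illustration only.

```python
def column_profile(column):
    """Unweighted profile of one alignment column; '_' or '-' marks a gap.

    Simplification: every species counts equally and the phylogenetic tree is ignored.
    """
    present = [c for c in column if c not in "_-"]
    p_present = len(present) / len(column)
    p_nuc = {n: (present.count(n) / len(present) if present else 0.0) for n in "ACGT"}
    consensus = max(p_nuc, key=p_nuc.get)
    return {"P_present": p_present, **p_nuc,
            "consensus": consensus, "P_id": p_nuc[consensus]}


# Column with A (Human), A (Chimp), A (Rat), C (Pig)
profile = column_profile("AAAC")
print(profile["P_present"], profile["A"], profile["C"])  # 1.0 0.75 0.25
```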
Calculating probabilistic profile
• PPresent and PN are calculated independently
• PPresent: weighted average of the children's PPresent values
  – Weights proportional to the inverse of the branch length
• PN is calculated through Felsenstein's algorithm with a Kimura matrix
• Pid = max(PN) (calculated at the root)
Region Decomposition
ATTGGAACCCAGGCCA----AATT-GCGCC-----AA-TT------G----C-----ATGG-G-----ATGCCCAAAAAAT
ATTGGAACTCAGGCCA----AATT--CGCC-----AA-T-------G----C-----AT--G------ATGCCCATAAAAT
ATTGGAACCCAGGCCA----AATT-CG--C-----A-TT-------G----T-----A-GGG------ATGCCCAAAAAAT
ATTGGAACCCAGGCCA----A-TTGC-G-C-----AAT-T------G-----C----ATGGGG-----ATGCCCATAAAAT
Region classes (left to right): 1 2 3 2 1
Each region is characterized by a PPresent and a Pid
How do we come up with these regions?
Hidden Markov Models (HMM)
Given an observation sequence
Predict the sequence of Hidden states
Region Decomposition – Simple Method
• Come up with a set of region classes (states)
• Construct an HMM
• Looking at the observation sequence, try to determine the most likely parse
  – Viterbi algorithm
• Problem: the region classes need to be determined at the beginning
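For readers who have not seen it, a generic log-space Viterbi decoder for a discrete HMM looks roughly like the sketch below. The two states, the 'M'/'X' observation alphabet, and all probabilities in the example are invented for illustration and are not Typhon's actual model.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a discrete HMM (log-space)."""
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
              for s in states}]
    back = [{}]
    for obs in observations[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for s in states:
            # Best predecessor state for ending in state s at this position.
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            cur[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][obs])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical two-class model: 'M' = well-conserved column, 'X' = poorly conserved.
states = ("conserved", "variable")
start_p = {"conserved": 0.5, "variable": 0.5}
trans_p = {"conserved": {"conserved": 0.9, "variable": 0.1},
           "variable": {"conserved": 0.1, "variable": 0.9}}
emit_p = {"conserved": {"M": 0.9, "X": 0.1},
          "variable": {"M": 0.3, "X": 0.7}}
path = viterbi("MMMMXXXXMMMM", states, start_p, trans_p, emit_p)
print(path)
```

With these made-up parameters the parse splits the columns into three regions: conserved, variable, conserved.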
Alternative
• Split the Profile into 2 classes at a time
• Use a 2-state HMM
• Stop when the bound on the number of region classes is reached
Region Decomposition with HMM
Step 3: Seed Indexing
What are we trying to do?
[Figure: regions labeled with classes 1, 2, 1, 3; candidate seeds A, B, C, D, E; seed subsets A, DC, BC assigned to the regions]
The Goal
• Maximize expected number of regions matched to a homologue
Seed Assignment
• 2 Approaches:
  – General Method
  – Greedy Approximation
General Method - Terminology
[Table: rows i = region classes, columns j = size of the candidate set; each cell is Object[i][j]]
Calculation of number of matching regions (done for each cell in the previous table)
Expected number of matching regions = PPresent × Phit × |C|
• PPresent – probability that a region matches a homologue
• Phit – conditional probability that the seeds match the region and its homologue, given that it exists
• |C| – number of regions
General Method - Explained
                 Number of Candidate Seeds
                 1                     2                     3                     4                     5
Region Class 1   PPresent*Phit^1*|C|   PPresent*Phit^2*|C|   PPresent*Phit^3*|C|   PPresent*Phit^4*|C|   PPresent*Phit^5*|C|
Region Class 2   PPresent*Phit^1*|C|   PPresent*Phit^2*|C|   PPresent*Phit^3*|C|   PPresent*Phit^4*|C|   PPresent*Phit^5*|C|
Region Class 3   PPresent*Phit^1*|C|   PPresent*Phit^2*|C|   PPresent*Phit^3*|C|   PPresent*Phit^4*|C|   PPresent*Phit^5*|C|
Region Class 4   PPresent*Phit^1*|C|   PPresent*Phit^2*|C|   PPresent*Phit^3*|C|   PPresent*Phit^4*|C|   PPresent*Phit^5*|C|
Some Terminology
• Weight
  – Total length of all regions in a region class * # of seeds indexed at each position
  – Sort of like the Budget for a region
• Value
  – Expected number of regions matched (previous calculation)
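Putting the two definitions together, each Object[i][j] can be tabulated as a (weight, value) pair. The sketch below assumes each region class is described by a small dictionary; the field names (`total_length`, `p_present`, `p_hit`, `num_regions`) and the example numbers are hypothetical.

```python
def build_objects(classes, max_seeds):
    """Return table[i][j-1] = (weight, value) for region class i using j candidate seeds.

    weight = total length of the class's regions * j
    value  = P_present * P_hit(j) * number of regions in the class
    """
    table = []
    for c in classes:
        row = []
        for j in range(1, max_seeds + 1):
            weight = c["total_length"] * j
            value = c["p_present"] * c["p_hit"][j - 1] * c["num_regions"]
            row.append((weight, value))
        table.append(row)
    return table


# One made-up region class: P_hit grows as more seeds are indexed.
example_class = {"total_length": 10, "p_present": 0.5,
                 "p_hit": [0.1, 0.6], "num_regions": 100}
objects = build_objects([example_class], max_seeds=2)
print(objects)
```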
Solving the Seed Assignment Problem
Each cell: Weight, Value

                 Number of Candidate Seeds
                 1        2        3        4        5
Region Class 1   10, 5    20, 30   30, 31   40, 34   50, 40
Region Class 2   15, 8    30, 20   45, 22   60, 24   75, 30
Region Class 3   12, 7    24, 10   36, 32   48, 36   60, 40
Region Class 4   9, 9     18, 10   27, 25   36, 27   45, 30
Budget = 112
Looks Familiar?
• Closely related to the Knapsack Problem, a well-studied problem in Computer Science
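One way to read the general method is as a multiple-choice knapsack: pick at most one object per region class so that total weight stays within the Budget and total value is maximal. The dynamic program below is a sketch under that reading, not necessarily the exact procedure used by Typhon. It uses the Weight, Value table and Budget = 112 from the preceding slides, and it assumes the last cell of Region Class 4 has weight 45 (continuing that row's 9-per-seed pattern).

```python
def best_assignment(table, budget):
    """Choose at most one object per row to maximize value with total weight <= budget.

    table[i][j-1] = (weight, value) of indexing j seeds for region class i.
    Weights must be non-negative integers. Returns (best value, seeds per row).
    """
    # dp[b] = (value, choices) of the best solution over the rows seen so far
    # with total weight at most b.
    dp = [(0, [])] * (budget + 1)
    for row in table:
        new_dp = []
        for b in range(budget + 1):
            best_value, best_choice = dp[b][0], dp[b][1] + [0]  # 0 seeds for this row
            for j, (weight, value) in enumerate(row, start=1):
                if weight <= b and dp[b - weight][0] + value > best_value:
                    best_value = dp[b - weight][0] + value
                    best_choice = dp[b - weight][1] + [j]
            new_dp.append((best_value, best_choice))
        dp = new_dp
    return dp[budget]


TABLE = [
    [(10, 5), (20, 30), (30, 31), (40, 34), (50, 40)],
    [(15, 8), (30, 20), (45, 22), (60, 24), (75, 30)],
    [(12, 7), (24, 10), (36, 32), (48, 36), (60, 40)],
    [(9, 9), (18, 10), (27, 25), (36, 27), (45, 30)],  # last weight assumed to be 45
]
print(best_assignment(TABLE, 112))  # (99, [2, 1, 4, 3])
```

The run time grows with the Budget times the number of table cells, which is part of the motivation for a cheaper approximation.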
Approximate Solution
• Faster
• Space Efficient
• New Terminology:
  – Density of an object = Value/Weight
Approximate Solution – General Intuition
• Select objects in order of decreasing density
• Disallow more than one object per row
Approximate Method in Action
Candidate Set:
• Object[1,1]: Density = V/W = 3
• Object[2,1]: Density = V/W = 2
• Object[3,1]: Density = V/W = 5
• Object[4,1]: Density = V/W = 4
• Object[3,2]: Density = V/W = 6
What are the new values of Weight, Value and Density?
Value = additional number of regions matched
Weight = amount of budget used by this one seed.
And keep track of the Budget!
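The greedy idea can be sketched with a priority queue keyed on marginal density: each candidate is "one more seed for row i", valued by the additional regions matched and weighted by the additional budget used. This is an interpretation of the slides, not a transcription of Typhon's code. The table is the Weight, Value example from the earlier slides, with the last weight of Region Class 4 assumed to be 45, and the choice to skip a row once its next seed no longer fits is also an assumption.

```python
import heapq


def greedy_assignment(table, budget):
    """Greedy approximation: repeatedly add the seed with the highest marginal density."""
    counts = [0] * len(table)        # seeds currently assigned to each row
    remaining, total_value = budget, 0
    heap = []                        # entries: (-density, row, extra weight, extra value)

    def push(i):
        j = counts[i]                # index of the next object in row i
        if j < len(table[i]):
            prev_w, prev_v = table[i][j - 1] if j > 0 else (0, 0)
            w, v = table[i][j]
            extra_w, extra_v = w - prev_w, v - prev_v
            heapq.heappush(heap, (-extra_v / extra_w, i, extra_w, extra_v))

    for i in range(len(table)):
        push(i)
    while heap:
        _, i, extra_w, extra_v = heapq.heappop(heap)
        if extra_w > remaining:
            continue                 # does not fit; this row stops growing
        remaining -= extra_w         # keep track of the Budget
        total_value += extra_v
        counts[i] += 1
        push(i)                      # the row's next seed becomes a candidate
    return total_value, counts


TABLE = [
    [(10, 5), (20, 30), (30, 31), (40, 34), (50, 40)],
    [(15, 8), (30, 20), (45, 22), (60, 24), (75, 30)],
    [(12, 7), (24, 10), (36, 32), (48, 36), (60, 40)],
    [(9, 9), (18, 10), (27, 25), (36, 27), (45, 30)],  # last weight assumed to be 45
]
print(greedy_assignment(TABLE, 112))  # (95, [2, 2, 4, 1])
```

Because each row only ever exposes its next seed, at most one object per row ends up selected, which matches the "one object per row" rule.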
Results
• Considerations
  – Sensitivity
  – Speed
  – Space
Sensitivity Results
• Experimental Setup
• Detection of Hypothetical Homologous Alignments (HHA)
• Typhon vs. STANDARD
Sensitivity Comparison
Effect of Multiple Alignment on Sensitivity
Running time Comparison
• Time spent building the index
  – Typhon takes longer
• Time spent scanning the index
• Typhon is 3-4 times slower at run time, which is reasonable
Scanning time
Conclusion
• Information implicit in multiple alignments helps search sensitivity
• Variable allocation of seeds by region classes helps (Typhon)
• Space and time complexities of Typhon comparable to STANDARD
• Most effective for queries far from each species in the alignment
Questions?
Acknowledgements
• Serafim Batzoglou, George Asimenos, Jason Flannick