
“Multiple indexes and multiple alignments”

Presenting: Siddharth Jonathan

Scribing: Susan Tang

DFLW: Neda Nategh

Upcoming:

10/24: “Evolution of Multidomain Proteins” Wissam Kazan

“Human Migrations” Anjalee Sujanani

10/26: “Comparison of Networks Across Species” Chuan Sheng Foo

“Repetitive DNA Detection and Classification” Vijay Krishnan

10/19

CS374 Presentation - Searching Biological Sequence Databases

CS374 Algorithms in Biology

Searching Biological Sequence Databases

Siddharth Jonathan


Outline

• Background

• Problem

• Typhon Overview

• Typhon Components

• Results


Background

• Sequence Alignment

• Multiple Alignment Databases

• Probabilistic Profile

• Phylogenetic Tree


Sequence Alignment

• Identifying regions of similarity in the genome, proteins etc.

• Types
  – Global
  – Local
    • Seeded
    • Non-seeded

• Why is it important?
  – Comparative analysis of genomes
  – Producing phylogenetic trees
  – Understanding newly sequenced genomes


Seeds – A Review

A seed P is an ordered list of w positions,

i.e. P = {x1, x2, …, xw}

w = weight of P = |P|

s = span of P = xw – x1 + 1

Ex: P = {0, 1, 4, 5}

w = 4

s = 5 – 0 + 1 = 6
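The two definitions can be checked with a few lines of Python. This is only a sketch; the function name `seed_weight_and_span` is made up for illustration and is not from the talk.

```python
def seed_weight_and_span(positions):
    """Return (weight, span) for a seed given as a list of positions."""
    positions = sorted(positions)
    weight = len(positions)                   # w = |P|
    span = positions[-1] - positions[0] + 1   # s = xw - x1 + 1
    return weight, span


# The example seed from the slide: P = {0, 1, 4, 5}
print(seed_weight_and_span([0, 1, 4, 5]))  # (4, 6)
```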


Indexing in Seeded Local Alignment algorithms

…G A T T A C C A G A T T A C C A G A T T A …

Gene Sequence S

Seed A = {0,1,2,3}

GATT → (S, 0)

ATTA → (S, 1)

The same idea holds for non-contiguous seeds as well!

Average number of seeds indexed per position is called the Budget
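A minimal sketch of this kind of index, assuming a plain dictionary from seed words to (sequence name, position) pairs. The function name `build_index` and the dictionary layout are illustrative assumptions, not the data structure of any particular tool. The seed is assumed to start at offset 0.

```python
from collections import defaultdict


def build_index(name, sequence, seed):
    """Map each seed word to the (sequence name, start) pairs where it occurs.

    `seed` is a list of offsets starting at 0, e.g. [0, 1, 2, 3] for a
    contiguous seed or [0, 1, 4, 5] for a spaced seed.
    """
    span = max(seed) + 1
    index = defaultdict(list)
    for start in range(len(sequence) - span + 1):
        # Read only the characters at the seed's offsets.
        word = "".join(sequence[start + offset] for offset in seed)
        index[word].append((name, start))
    return index


index = build_index("S", "GATTACCAGATTACCAGATTA", seed=[0, 1, 2, 3])
print(index["GATT"])  # [('S', 0), ('S', 8), ('S', 16)]
```

Because the word is read only at the seed's offsets, the same function covers non-contiguous seeds. Here one seed is indexed at every position, so the budget in the sense above is 1.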


Seeded Local Alignment Algorithms

• BLAST

• BLAT

• BLASTZ

• Exonerate

• Usage of multiple seeds, spaced seeds

• What do they have in common?

• Indexing!


Multiple alignment

[Figure: multiple alignment of Species 1 and Species 2]


Phylogenetic Tree


Probabilistic Profile

Each cell corresponds to one position in the alignment…

We’ll learn what information it carries very shortly!


Regions


The Problem

Say, we have a database of multiple alignments

So what’s the challenge?

Find local alignments for the query

Candidate seeds


The Problem Statement

Budget

Can we do better?

Make use of information implicit in the multiple alignment for selecting which seeds to index for a given position


The Problem Statement - Typhon

Given

• Probabilistic Profile

• Candidate Seeds

• Budget

Produce

• Indexing scheme that indexes only a subset of candidate seeds at each position


Overall Architecture of Typhon



Step 1: Probabilistic Profile Construction

• A 6-tuple for each position in the multiple alignment:
  – Ppresent – existence probability
  – PA, PC, PT, PG – conditional probability that the homologous position contains A, C, T, G, given that a homologous position exists
  – Pid – probability that the corresponding query position has the consensus character

• The nucleotide with the highest such conditional probability is called the consensus character


Calculation of Probabilistic Profile

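Species   Col 1   Col 2   Col 3
Human     A       T       C
Chimp     A       _       C
Rat       A       T       C
Pig       C       T       C

Values at the leaves: 1, 1, 1, 1

For the first column (A, A, A, C):

PPresent = 100%, PA = 75%, PC = 25%, PG = 0%, PT = 0%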

Propagation of values up the tree to the root is a tricky problem!
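As a sanity check on the numbers in the example, here is a naive sketch that simply counts characters in one alignment column. It ignores the phylogenetic tree entirely, so it is not the tree-based calculation used by Typhon. The function name `naive_column_profile` and the use of `_` as the gap symbol are assumptions for illustration.

```python
from collections import Counter


def naive_column_profile(column, gap="_"):
    """Return (p_present, {nucleotide: probability}) by plain counting."""
    present = [c for c in column if c != gap]
    p_present = len(present) / len(column)
    counts = Counter(present)
    # Conditional probabilities, given that the position exists.
    probs = {n: (counts[n] / len(present) if present else 0.0) for n in "ACGT"}
    return p_present, probs


p_present, probs = naive_column_profile(["A", "A", "A", "C"])
print(p_present, probs)  # 1.0 {'A': 0.75, 'C': 0.25, 'G': 0.0, 'T': 0.0}
```

For a column of A, A, A, C this gives the same 100% / 75% / 25% figures. A tree-aware method would weight the species by their evolutionary relationships rather than counting each one equally.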


Calculating probabilistic profile

• PPresent and PN calculated independently

• PPresent: weighted average of the children's PPresent values
  – Weights proportional to the inverse of the branch length

• PN calculated through Felsenstein's algorithm with a Kimura matrix

• Pid = max(PN) (calculated at the root)
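The PPresent rule can be sketched for a single internal node as follows. The function name `combine_p_present` and the branch lengths in the example are hypothetical. The Felsenstein / Kimura part for PN is not shown.

```python
def combine_p_present(children):
    """Weighted average of the children's PPresent values.

    `children` is a list of (p_present, branch_length) pairs; each weight
    is proportional to the inverse of the branch length.
    """
    weights = [1.0 / length for _, length in children]
    total = sum(weights)
    return sum(w * p for w, (p, _) in zip(weights, children)) / total


# Hypothetical node: a close child that has the position (branch 0.1)
# and a distant child that lacks it (branch 0.3).
print(combine_p_present([(1.0, 0.1), (0.0, 0.3)]))  # close to 0.75

# Pid is the largest of the nucleotide probabilities at the root.
root_pn = {"A": 0.75, "C": 0.25, "G": 0.0, "T": 0.0}
p_id = max(root_pn.values())
```

Applying the function from the leaves upward gives the value at the root. The closer child dominates the average, which is the intent of the inverse-branch-length weights.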





Region Decomposition

ATTGGAACCCAGGCCA----AATT-GCGCC-----AA-TT------G----C-----ATGG-G-----ATGCCCAAAAAAT

ATTGGAACTCAGGCCA----AATT--CGCC-----AA-T-------G----C-----AT--G------ATGCCCATAAAAT

ATTGGAACCCAGGCCA----AATT-CG--C-----A-TT-------G----T-----A-GGG------ATGCCCAAAAAAT

ATTGGAACCCAGGCCA----A-TTGC-G-C-----AAT-T------G-----C----ATGGGG-----ATGCCCATAAAAT

Region classes, left to right: 1 2 3 2 1

Each region is characterized by a PPresent and a Pid

How do we come up with these regions?


Hidden Markov Models (HMM)

Given an observation sequence

Predict the sequence of Hidden states


Region Decomposition – Simple Method

• Come up with a set of region classes (states)

• Construct an HMM

• Looking at the observation sequence, try to determine the most likely parse
  – Viterbi algorithm

• Problem – Need to determine classes at the beginning
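The parsing step can be sketched with a textbook Viterbi implementation. The state names, the "hi" / "lo" observation symbols and all probabilities below are made-up placeholders, not values from Typhon; in practice the observations would be derived from the profile.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence (all probabilities must be > 0)."""
    # score[s] = best log probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            score[s] = (prev[best] + math.log(trans_p[best][s])
                        + math.log(emit_p[s][obs]))
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


states = ["conserved", "gappy"]
start_p = {"conserved": 0.5, "gappy": 0.5}
trans_p = {"conserved": {"conserved": 0.9, "gappy": 0.1},
           "gappy": {"conserved": 0.1, "gappy": 0.9}}
emit_p = {"conserved": {"hi": 0.9, "lo": 0.1},
          "gappy": {"hi": 0.2, "lo": 0.8}}
obs = ["hi", "hi", "hi", "lo", "lo", "lo", "hi", "hi"]
print(viterbi(obs, states, start_p, trans_p, emit_p))
```

Runs of identical states in the returned path correspond to regions. With these toy numbers the parse is three "conserved" columns, three "gappy" columns, then two "conserved" columns.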


Alternative

• Split the Profile into 2 classes at a time

• Use 2 stage HMM

• Stop when the bound on the number of region classes is reached


Region Decomposition with HMM





Step 3: Seed Indexing

What are we trying to do?

[Figure: regions with classes 1, 2, 1, 3 and candidate seeds A, B, C, D, E; seed subsets shown: A; D, C; B, C]


The Goal

• Maximize expected number of regions matched to a homologue


Seed Assignment

• 2 Approaches:
  – General Method
  – Greedy Approximation


General Method - Terminology

Object[i][j]

• i – region class (rows)

• j – size of the candidate set (columns)


Calculation of number of matching regions (done for each cell in the previous table)

Expected number of matching regions = PPresent × Phit × |C|

• PPresent – probability that a region matches a homologue

• Phit – conditional probability that the seeds match the region and its homologue, given that it exists

• |C| – number of regions


General Method - Explained

[Table: rows are Region Class 1 to 4; columns are Number of Candidate Seeds 1 to 5. Each cell holds PPresent * Phit * |C| for that region class and number of candidate seeds, e.g. PPresent * P1hit * |C|, PPresent * P2hit * |C|.]


Some Terminology

• Weight
  – Total length of all regions in a region class * # of seeds indexed at each position
  – Sort of like the Budget for a region

• Value
  – Expected number of regions matched (previous calculation)
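Both quantities are simple products. A minimal sketch follows, with hypothetical function names and made-up input numbers; the hit probability for a given set of seeds is taken as an input here, since the slides do not show how it is computed.

```python
def object_weight(total_region_length, seeds_per_position):
    """Weight: total length of the class's regions times seeds per position."""
    return total_region_length * seeds_per_position


def object_value(p_present, p_hit, num_regions):
    """Value: expected number of regions matched, PPresent * Phit * |C|."""
    return p_present * p_hit * num_regions


# Hypothetical region class: 20 regions covering 10 positions in total,
# with 3 seeds indexed at each position.
print(object_weight(10, 3))         # 30
print(object_value(0.5, 0.5, 20))   # 5.0
```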



Solving the Seed Assignment Problem

Budget = 112

Weight, Value of each object (rows: region classes; columns: number of candidate seeds)

                  1       2       3       4       5
Region Class 1    10,5    20,30   30,31   40,34   50,40
Region Class 2    15,8    30,20   45,22   60,24   75,30
Region Class 3    12,7    24,10   36,32   48,36   60,40
Region Class 4    9,9     18,10   27,25   36,27   5,30


Looks Familiar?

• Closely related to the Knapsack Problem, a well studied problem in Computer Science
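One way to read the connection is as a multiple-choice knapsack: pick at most one object per region class so that the total weight stays within the budget and the total value is as large as possible. The sketch below solves that formulation with a standard dynamic program. It is an illustration under that assumption, not necessarily the exact general method used by Typhon, and the function name `best_seed_assignment` is made up. The (weight, value) pairs are copied as they appear on the slide.

```python
def best_seed_assignment(table, budget):
    """table[i][j-1] is the (weight, value) of indexing j seeds in class i.

    Returns (best total value, picks), where picks[i] is the number of
    seeds chosen for region class i (0 means none).
    """
    # dp[b] = (best value, picks so far) with total weight at most b
    dp = [(0, ()) for _ in range(budget + 1)]
    for row in table:
        new_dp = []
        for b in range(budget + 1):
            best_value, best_picks = dp[b][0], dp[b][1] + (0,)
            for j, (weight, value) in enumerate(row, start=1):
                if weight <= b and dp[b - weight][0] + value > best_value:
                    best_value = dp[b - weight][0] + value
                    best_picks = dp[b - weight][1] + (j,)
            new_dp.append((best_value, best_picks))
        dp = new_dp
    return dp[budget]


# (Weight, Value) pairs copied as they appear on the slide.
TABLE = [
    [(10, 5), (20, 30), (30, 31), (40, 34), (50, 40)],
    [(15, 8), (30, 20), (45, 22), (60, 24), (75, 30)],
    [(12, 7), (24, 10), (36, 32), (48, 36), (60, 40)],
    [(9, 9), (18, 10), (27, 25), (36, 27), (5, 30)],
]
value, picks = best_seed_assignment(TABLE, budget=112)
print(value, picks)
```

The table has one row per region class and one column per candidate-set size, so the running time is proportional to classes × sizes × budget.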


Approximate Solution

• Faster

• Space Efficient

• New Terminology:
  – Density of an object = Value/Weight


Approximate Solution – General Intuition

• Select objects in order of decreasing density

• Disallow more than one object per row


Approximate Method in Action

Candidate Set

• Object[1,1]: Density = V/W = 3

• Object[2,1]: Density = V/W = 2

• Object[3,1]: Density = V/W = 5

• Object[4,1]: Density = V/W = 4

• Object[3,2]: Density = V/W = 6

What are the new values of Weight, Value and Density?

Value = additional number of regions matched

Weight = amount of budget used by this one seed.

And keep track of the Budget!
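The greedy idea can be sketched as follows, under one reading of the slides: the candidate set holds the next object of each row, an object's value and weight are measured as the increase over the object already chosen in that row, and the densest candidate that still fits the budget is taken each time. The function name `greedy_seed_assignment` and the small table are hypothetical.

```python
import heapq


def greedy_seed_assignment(table, budget):
    """table[i][j-1] is the (weight, value) of indexing j seeds in class i.

    Returns (picks, total value, budget used); picks[i] is the number of
    seeds assigned to region class i.
    """
    picks = [0] * len(table)
    used = total = 0
    heap = []  # entries: (-density, row, extra weight, extra value)

    def push_next(i):
        j = picks[i]
        if j >= len(table[i]):
            return  # no larger candidate set left for this row
        weight, value = table[i][j]
        prev_w, prev_v = table[i][j - 1] if j > 0 else (0, 0)
        extra_w, extra_v = weight - prev_w, value - prev_v
        if extra_w > 0:
            heapq.heappush(heap, (-extra_v / extra_w, i, extra_w, extra_v))

    for i in range(len(table)):
        push_next(i)
    while heap:
        _, i, extra_w, extra_v = heapq.heappop(heap)
        if used + extra_w > budget:
            continue  # does not fit; this row stops growing
        used += extra_w
        total += extra_v
        picks[i] += 1
        push_next(i)  # replace the chosen object by the next one in its row
    return picks, total, used


small_table = [
    [(10, 5), (20, 30), (30, 31)],
    [(15, 8), (30, 20)],
]
print(greedy_seed_assignment(small_table, budget=40))  # ([1, 2], 25, 40)
```

The heap keeps selection at logarithmic cost per step and only one candidate per row is stored at a time, which is where the speed and space savings over a full table come from. The result is not guaranteed to be optimal.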


Results

• Considerations
  – Sensitivity
  – Speed
  – Space


Sensitivity Results

• Experimental Setup

• Detection of Hypothetical Homologous Alignments (HHA)

• Typhon Vs Standard


Sensitivity Comparison


Effect of Multiple Alignment on Sensitivity


Running time Comparison

• Time spent building the index
  – Typhon takes longer

• Time spent scanning the index

• Typhon is 3-4 times slower at run time, which is reasonable


Scanning time


Conclusion

• Information implicit in multiple alignments helps search sensitivity

• Variable allocation of seeds by region classes helps (Typhon)

• Space and time complexities of Typhon comparable to STANDARD

• Most effective for queries far from each species in the alignment


Questions?


Acknowledgements

• Serafim Batzoglou, George Asimenos, Jason Flannick
