“Multiple indexes and multiple alignments”

Presenting: Siddharth Jonathan

Scribing: Susan Tang

DFLW: Neda Nategh

Upcoming:

10/24: “Evolution of Multidomain Proteins” Wissam Kazan

“Human Migrations” Anjalee Sujanani

10/26: “Comparison of Networks Across Species” Chuan Sheng Foo

“Repetitive DNA Detection and Classification” Vijay Krishnan

10/19

CS374 Presentation - Searching Biological Sequence Databases 2

CS374: Algorithms in Biology

Searching Biological Sequence Databases

Siddharth Jonathan

Outline

• Background

• Problem

• Typhon Overview

• Typhon Components

• Results

Background

• Sequence Alignment

• Multiple Alignment Databases

• Probabilistic Profile

• Phylogenetic Tree

Sequence Alignment

• Identifying regions of similarity in genomes, proteins, etc.

• Types
  – Global
  – Local
    • Seeded
    • Non-seeded

• Why is it important?
  – Comparative analysis of genomes
  – Producing phylogenetic trees
  – Understanding newly sequenced genomes

Seeds – A Review

A seed P is an ordered list of w positions, i.e. P = {x1, x2, …, xw}

w = weight of P = |P|

s = span of P = xw – x1 + 1

Ex: P = {0, 1, 4, 5}

w = 4

s = 5 – 0 + 1 = 6
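The two definitions above can be checked with a short sketch. The function names `seed_weight` and `seed_span` are illustrative only and do not come from the slides; a seed is assumed to be given as a list of positions.

```python
def seed_weight(seed):
    """Weight w: the number of positions in the seed."""
    return len(seed)


def seed_span(seed):
    """Span s: last position minus first position, plus one."""
    return max(seed) - min(seed) + 1


P = [0, 1, 4, 5]  # the example seed from the slide
print("weight:", seed_weight(P))
print("span:", seed_span(P))
```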

Indexing in Seeded Local Alignment algorithms

Gene Sequence S: …G A T T A C C A G A T T A C C A G A T T A …

Seed A = {0,1,2,3}

Index entries:

GATT → (S, 0)

ATTA → (S, 1)

The same idea holds for non-contiguous seeds as well!

Average number of seeds indexed per position is called the Budget
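As a rough illustration of the indexing step, the sketch below slides one seed along the sequence and records where each extracted word occurs. The function name `build_index` and the dictionary layout are assumptions made for this example, not the data structure of any particular tool; the seed is assumed to start at offset 0. Indexing a single seed at every position, as done here, corresponds to a budget of 1.

```python
from collections import defaultdict


def build_index(name, sequence, seed):
    """Map each word read through the seed to the (sequence, position) pairs where it occurs.

    Assumes the seed's first offset is 0, so its span is max(seed) + 1.
    """
    index = defaultdict(list)
    span = max(seed) + 1
    for start in range(len(sequence) - span + 1):
        # Read only the positions the seed selects; this also covers spaced seeds.
        word = "".join(sequence[start + offset] for offset in seed)
        index[word].append((name, start))
    return index


S = "GATTACCAGATTACCAGATTA"
contiguous = build_index("S", S, [0, 1, 2, 3])
spaced = build_index("S", S, [0, 1, 4, 5])
print(contiguous["GATT"])
print(spaced["GAAC"])
```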

Seeded Local Alignment Algorithms

• BLAST

• BLAT

• BLASTZ

• Exonerate

• Usage of multiple seeds, spaced seeds

• What do they have in common?

• Indexing!

Multiple Alignment (figure: Species 1, Species 2)

Phylogenetic Tree

Probabilistic Profile

Each cell corresponds to one position in the alignment…

We’ll learn what information it carries very shortly!

Regions

The Problem

Say, we have a database of multiple alignments

So what’s the challenge?

Find local alignments for the query

Candidate seeds

The Problem Statement

Budget

Can we do better?

Make use of the information implicit in the multiple alignment to select which seeds to index at a given position

The Problem Statement - Typhon

Given:

• Probabilistic Profile

• Candidate Seeds

• Budget

Find: an indexing scheme that indexes only a subset of the candidate seeds at each position

Overall Architecture of Typhon

Step 1: Probabilistic Profile Construction

• 6-tuple for each position in the multiple alignment

• PPresent – existence probability

• PA, PC, PT, PG – conditional probability that the homologous position contains A, C, T, or G, respectively, given that a homologous position exists. The nucleotide with the highest such value is called the consensus character.

• Pid – probability that the corresponding query position has the consensus character

CS374 Presentation - Searching Biological Sequence Databases 18

Calculation of Probabilistic Profile

Species   Column 1   Column 2   Column 3
Human     A          T          C
Chimp     A          _          C
Rat       A          T          C
Pig       C          T          C

Leaf values: 1, 1, 1, 1

First column: PPresent = 100%, PA = 75%, PC = 25%, PG = 0%, PT = 0%

Propagation of values up the tree to the root is a tricky problem!
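The percentages on this slide can be reproduced for a single column by plain counting. The sketch below is a deliberate simplification: it ignores the phylogenetic tree and treats every species equally, which is not how the profile is propagated to the root. The function name `column_profile` and the gap characters `_` and `-` are assumptions for illustration.

```python
def column_profile(column, gaps="_-"):
    """Unweighted per-column statistics; ignores the tree entirely (a simplification)."""
    present = [c for c in column if c not in gaps]
    p_present = len(present) / len(column)
    # Nucleotide frequencies are conditional on the position being present.
    probs = {n: (present.count(n) / len(present) if present else 0.0) for n in "ACGT"}
    consensus = max(probs, key=probs.get)
    return p_present, probs, consensus


# The column A, A, A, C from the slide.
p_present, probs, consensus = column_profile("AAAC")
print(p_present, probs, consensus)
```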

CS374 Presentation - Searching Biological Sequence Databases 19

Calculating probabilistic profile

• PPresent and PN (N = A, C, G, T) are calculated independently

• PPresent: weighted average of the children’s PPresent values

• Weights are proportional to the inverse of the branch length

• PN is calculated through Felsenstein’s algorithm with a Kimura matrix

• Pid = max(PN) (calculated at the root)
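A minimal sketch of the PPresent propagation and the Pid rule follows. The tree representation (nested dictionaries with `children` as a list of `(child, branch_length)` pairs) is a hypothetical layout chosen for this example, branch lengths are assumed to be positive, and the computation of PN via Felsenstein's algorithm is not shown.

```python
def p_present(node):
    """Weighted average of the children's PPresent, weights = 1 / branch length."""
    if "children" not in node:
        return node["present"]  # leaf: 1.0 if the position exists, 0.0 for a gap
    weights = [1.0 / length for _, length in node["children"]]
    values = [p_present(child) for child, _ in node["children"]]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def p_id(p_n):
    """Pid is the largest of the nucleotide probabilities at the root."""
    return max(p_n.values())


# Hypothetical two-leaf tree: a present leaf at distance 1, a gapped leaf at distance 3.
root = {"children": [({"present": 1.0}, 1.0), ({"present": 0.0}, 3.0)]}
print(p_present(root))
print(p_id({"A": 0.75, "C": 0.25, "G": 0.0, "T": 0.0}))
```

The closer child dominates the average, so a gap in a distant species lowers PPresent less than a gap in a near one.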

Region Decomposition

ATTGGAACCCAGGCCA----AATT-GCGCC-----AA-TT------G----C-----ATGG-G-----ATGCCCAAAAAAT

ATTGGAACTCAGGCCA----AATT--CGCC-----AA-T-------G----C-----AT--G------ATGCCCATAAAAT

ATTGGAACCCAGGCCA----AATT-CG--C-----A-TT-------G----T-----A-GGG------ATGCCCAAAAAAT

ATTGGAACCCAGGCCA----A-TTGC-G-C-----AAT-T------G-----C----ATGGGG-----ATGCCCATAAAAT

Region classes (left to right): 1 2 3 2 1

Each region is characterized by a PPresent and a Pid

How do we come up with these regions?

Hidden Markov Models (HMM)

Given an observation sequence

Predict the sequence of Hidden states

Region Decomposition – Simple Method

• Come up with a set of region classes (states)

• Construct an HMM

• Looking at the observation sequence, try to determine the most likely parse
  – Viterbi algorithm

• Problem – Need to determine classes at the beginning
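The simple method can be sketched with a textbook Viterbi decoder. Everything specific in the example is hypothetical: the two region classes (`conserved`, `gappy`), the observation alphabet (`H` for a high PPresent column, `L` for a low one), and all probabilities are made up for illustration and are not Typhon's actual model.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence (log-space; all probabilities must be > 0)."""
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
              for s in states}]
    back = []
    for obs in observations[1:]:
        prev = score[-1]
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for landing in s at this position.
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            row[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][obs])
            ptr[s] = best
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


states = ["conserved", "gappy"]
start_p = {"conserved": 0.5, "gappy": 0.5}
trans_p = {"conserved": {"conserved": 0.9, "gappy": 0.1},
           "gappy": {"conserved": 0.1, "gappy": 0.9}}
emit_p = {"conserved": {"H": 0.9, "L": 0.1},
          "gappy": {"H": 0.2, "L": 0.8}}
parse = viterbi("HHHHLLLLHHHH", states, start_p, trans_p, emit_p)
print(parse)
```

Consecutive positions with the same state form one region, so the parse directly yields the region decomposition.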

Alternative

• Split the Profile into 2 classes at a time

• Use 2 stage HMM

• Stop when the bound on the number of region classes is reached

Region Decomposition with HMM

Step 3: Seed Indexing

What are we trying to do?

Regions, by class: 1 2 1 3

Candidate Seeds: A, B, C, D, E

(Figure: each region is assigned its own subset of the candidate seeds to index.)

The Goal

• Maximize expected number of regions matched to a homologue

Seed Assignment

• 2 Approaches:
  – General Method
  – Greedy Approximation

General Method - Terminology

Table: one row i per region class, one column j per size of the candidate set. Each cell is an object, Object[i][j].

Calculation of number of matching regions (done for each cell in the previous table)

Expected number of regions matched = $P_{Present} \times P_{hit} \times |C|$

• $P_{Present}$: probability that a region matches a homologue

• $P_{hit}$: conditional probability that the seeds match the region and its homologue, given that it exists

• $|C|$: number of regions
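As a worked instance of the product above, the sketch below fills one row of the table for a single region class. All numbers are invented for illustration, and the function name `value_row` is hypothetical; in particular, the hit probabilities for 1 to 5 seeds are assumed, not measured.

```python
def value_row(p_present, p_hits, num_regions):
    """Expected regions matched for each candidate-set size: PPresent * Phit * |C|."""
    return [p_present * p_hit * num_regions for p_hit in p_hits]


# Hypothetical region class: PPresent = 0.8, 10 regions,
# and assumed hit probabilities when indexing 1..5 seeds.
row = value_row(0.8, [0.30, 0.50, 0.60, 0.65, 0.68], 10)
print(row)
```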

General Method - Explained

Table: one row per region class (Region Class 1 to 4), one column per number of candidate seeds j (1 to 5). The cell for a region class and j candidate seeds holds

$P_{Present} \times P^{j}_{hit} \times |C|$

where $P^{j}_{hit}$ is the hit probability with j candidate seeds.

Some Terminology

• Weight
  – Total length of all regions in a region class × number of seeds indexed at each position
  – Sort of like the Budget for a region class

• Value
  – Expected number of regions matched (previous calculation)

Solving the Seed Assignment Problem

Each cell lists Weight, Value.

                 Number of Candidate Seeds
                 1        2        3        4        5
Region Class 1   10, 5    20, 30   30, 31   40, 34   50, 40
Region Class 2   15, 8    30, 20   45, 22   60, 24   75, 30
Region Class 3   12, 7    24, 10   36, 32   48, 36   60, 40
Region Class 4   9, 9     18, 10   27, 25   36, 27   5, 30

Budget = 112

Look Familiar?

• Closely related to the Knapsack Problem, a well studied problem in Computer Science
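One way to read the general method is as a multiple-choice knapsack: pick at most one object per region class so that the total weight stays within the budget and the total value is maximal. The sketch below is a standard dynamic program under that reading, not necessarily the exact formulation used by Typhon. The (weight, value) pairs and the budget of 112 are copied from the slides as printed, including the final 5, 30 entry; the function name `assign_seeds` is hypothetical, and integer weights are assumed.

```python
# (weight, value) per region class (rows) and number of candidate seeds 1..5 (columns).
TABLE = [
    [(10, 5), (20, 30), (30, 31), (40, 34), (50, 40)],
    [(15, 8), (30, 20), (45, 22), (60, 24), (75, 30)],
    [(12, 7), (24, 10), (36, 32), (48, 36), (60, 40)],
    [(9, 9), (18, 10), (27, 25), (36, 27), (5, 30)],
]


def assign_seeds(table, budget):
    """Return (best value, choice), where choice[i] is the number of seeds
    for region class i and 0 means no seed is indexed for that class."""
    # best[b] = (value, choice) over the classes processed so far, total weight <= b.
    best = [(0, [])] * (budget + 1)
    for row in table:
        new_best = []
        for b in range(budget + 1):
            value, choice = best[b]
            candidate = (value, choice + [0])  # skip this class
            for j, (w, v) in enumerate(row, start=1):
                if w <= b and best[b - w][0] + v > candidate[0]:
                    candidate = (best[b - w][0] + v, best[b - w][1] + [j])
            new_best.append(candidate)
        best = new_best
    return best[budget]


value, choice = assign_seeds(TABLE, 112)
print(value, choice)
```

The table has one entry per budget unit, so time and space grow with the budget, which is part of the motivation for a faster approximation.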

Approximate Solution

• Faster

• Space Efficient

• New Terminology:
  – Density of an object = Value/Weight

Approximate Solution – General Intuition

• Select objects in order of decreasing density

• Disallow more than one object per row

Approximate Method in Action

Candidate Set

• Object[1,1]: Density = V/W = 3

• Object[2,1]: Density = V/W = 2

• Object[3,1]: Density = V/W = 5

• Object[4,1]: Density = V/W = 4

• Object[3,2]: Density = V/W = 6

What are the new values of Weight, Value and Density?

Value = additional number of regions matched

Weight = amount of budget used by this one seed.

And keep track of the Budget!
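The greedy idea can be sketched as follows, under one reading of the slide: each region class offers its next seed as a candidate, scored by the additional value per additional unit of weight, and the densest candidate that still fits the remaining budget is taken. The function name `greedy_assign` and the tiny table are hypothetical; weights within a row are assumed to increase strictly.

```python
def greedy_assign(table, budget):
    """Greedy approximation: repeatedly add the seed with the highest marginal density."""
    levels = [0] * len(table)  # number of seeds currently assigned to each class
    remaining = budget
    total_value = 0
    while True:
        best = None
        for i, row in enumerate(table):
            k = levels[i]
            if k == len(row):
                continue  # this class already uses every candidate seed
            prev_w, prev_v = row[k - 1] if k else (0, 0)
            dw = row[k][0] - prev_w  # budget used by this one additional seed
            dv = row[k][1] - prev_v  # additional regions matched
            if dw <= 0 or dw > remaining:
                continue
            if best is None or dv / dw > best[0]:
                best = (dv / dw, i, dw, dv)
        if best is None:
            return total_value, levels
        _, i, dw, dv = best
        levels[i] += 1
        remaining -= dw
        total_value += dv


# Hypothetical table: two region classes, up to two seeds each, as (weight, value).
tiny = [[(1, 10), (2, 11)], [(1, 5), (2, 6)]]
print(greedy_assign(tiny, 2))
```

Only one candidate per class is live at any time, so the method needs no table over budget values, at the cost of possibly missing the optimal assignment.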

Results

• Considerations
  – Sensitivity
  – Speed
  – Space

Sensitivity Results

• Experimental Setup

• Detection of Hypothetical Homologous Alignments (HHA)

• Typhon vs. STANDARD

Sensitivity Comparison

Effect of Multiple Alignment on Sensitivity

Running time Comparison

• Time spent building the index
  – Typhon takes longer

• Time spent scanning the index

• Typhon is 3-4 times slower at run time, which is reasonable

Scanning time

Conclusion

• Information implicit in multiple alignments helps search sensitivity

• Variable allocation of seeds by region classes helps (Typhon)

• Space and time complexities of Typhon comparable to STANDARD

• Most effective for queries far from each species in the alignment

Questions?

Acknowledgements

• Serafim Batzoglou, George Asimenos, Jason Flannick
