“Multiple indexes and multiple alignments”

Presenting: Siddharth Jonathan

Scribing: Susan Tang

DFLW: Neda Nategh

Upcoming:

10/24: “Evolution of Multidomain Proteins” Wissam Kazan

“Human Migrations” Anjalee Sujanani

10/26: “Comparison of Networks Across Species” Chuan Sheng Foo

“Repetitive DNA Detection and Classification” Vijay Krishnan

10/19

CS374: Algorithms in Biology

Searching Biological Sequence Databases

Siddharth Jonathan

Outline

• Background

• Problem

• Typhon Overview

• Typhon Components

• Results

Background

• Sequence Alignment

• Multiple Alignment Databases

• Probabilistic Profile

• Phylogenetic Tree

Sequence Alignment

• Identifying regions of similarity in the genome, proteins etc.

• Types
  – Global
  – Local
    • Seeded
    • Non-seeded

• Why is it important?
  – Comparative analysis of genomes
  – Producing phylogenetic trees
  – Understanding newly sequenced genomes

Seeds – A Review

A seed P is an ordered list of w positions, i.e. P = {x1, x2, …, xw}

w = weight of P = |P|

s = span of P = xw – x1 + 1

Ex: P = {0, 1, 4, 5}

w = 4

s = 5 – 0 + 1 = 6
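
As a quick check of the definitions above, the weight and span of a seed can be computed directly. This is a minimal Python sketch; the function names `seed_weight` and `seed_span` are illustrative and do not come from the slides.

```python
def seed_weight(seed):
    """Weight w = number of positions in the seed, |P|."""
    return len(seed)


def seed_span(seed):
    """Span s = last position - first position + 1."""
    positions = sorted(seed)
    return positions[-1] - positions[0] + 1


# The example seed from the slide: P = {0, 1, 4, 5}
example_seed = [0, 1, 4, 5]
print(seed_weight(example_seed), seed_span(example_seed))  # 4 6
```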

Indexing in Seeded Local Alignment algorithms

Gene Sequence S: …G A T T A C C A G A T T A C C A G A T T A …

Seed A = {0,1,2,3}

Index entries (seed word → sequence, position):

GATT → S,0

ATTA → S,1

The same idea holds for non-contiguous seeds as well!

The average number of seeds indexed per position is called the budget.
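
The indexing step above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: `build_seed_index` and the dictionary layout are hypothetical and are not the data structure used by any of the tools named in these slides. The sequence is the fragment shown on the slide, so offsets are relative to that fragment.

```python
from collections import defaultdict


def build_seed_index(name, sequence, seed):
    """Map each seed word to the (sequence name, offset) pairs where it occurs."""
    first, last = min(seed), max(seed)
    span = last - first + 1
    index = defaultdict(list)
    for offset in range(len(sequence) - span + 1):
        # Read only the positions named by the seed, relative to this offset.
        word = "".join(sequence[offset + p - first] for p in seed)
        index[word].append((name, offset))
    return index


S = "GATTACCAGATTACCAGATTA"
contiguous = build_seed_index("S", S, (0, 1, 2, 3))  # contiguous seed
spaced = build_seed_index("S", S, (0, 1, 4, 5))      # non-contiguous seed
print(contiguous["GATT"])  # [('S', 0), ('S', 8), ('S', 16)]
print(spaced["GAAC"])      # [('S', 0), ('S', 8)]
```

Here exactly one seed is indexed at every position, so the budget in this sketch is 1.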

Seeded Local Alignment Algorithms

• BLAST

• BLAT

• BLASTZ

• Exonerate

• Usage of multiple seeds, spaced seeds

• What do they have in common?

• Indexing!

Multiple alignment

[Figure: multiple alignment of Species 1 and Species 2]

Phylogenetic Tree

Probabilistic Profile

Each cell corresponds to one position in the alignment…

We’ll learn what information it carries very shortly!

Regions

The Problem

Say, we have a database of multiple alignments

So what’s the challenge?

Find local alignments for the query

Candidate seeds

The Problem Statement

Budget

Can we do better?

Make use of information implicit in the multiple alignment for selecting which seeds to index for a given position

The Problem Statement - Typhon

Given

• Probabilistic Profile

• Candidate Seeds

• Budget

Find: an indexing scheme that indexes only a subset of candidate seeds at each position

Overall Architecture of Typhon

Step 1: Probabilistic Profile Construction

• 6-tuple for each position in the multiple alignment

• PPresent – existence probability

• PA, PC, PT, PG – conditional probability that the homologous position contains A, C, T, or G, respectively, given that a homologous position exists
  – The nucleotide with the highest such value is called the consensus character

• Pid – probability that the corresponding query position has the consensus character

Calculation of Probabilistic Profile

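Alignment columns (one row per species):

Human   A  T  C
Chimp   A  _  C
Rat     A  T  C
Pig     C  T  C

Presence at the leaves for the first column: 1, 1, 1, 1

Profile of the first column:

PPresent = 100%

PA = 75%

PC = 25%

PG = 0%

PT = 0%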

Propagation of values up the tree to the root is a tricky problem!
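
For intuition, the percentages on this slide match plain frequency counts over the four species in the first column. The sketch below computes them that way. It deliberately ignores the phylogenetic tree, so it is only an unweighted stand-in for the real profile, and `column_profile` is a hypothetical name.

```python
def column_profile(column, gap="_"):
    """Unweighted profile of one alignment column (ignores the tree)."""
    present = [c for c in column if c != gap]
    p_present = len(present) / len(column)
    # Nucleotide probabilities are conditional on the position being present.
    p_nuc = {n: (present.count(n) / len(present) if present else 0.0)
             for n in "ACGT"}
    consensus = max(p_nuc, key=p_nuc.get)
    return p_present, p_nuc, consensus


# First column of the slide's alignment: Human, Chimp, Rat, Pig
p_present, p_nuc, consensus = column_profile(["A", "A", "A", "C"])
print(p_present, p_nuc, consensus)
```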

Calculating probabilistic profile

• PPresent and PN calculated independently

• PPresent: weighted average of children’s PPresent values
  – Weights proportional to the inverse of the branch length

• PN calculated through Felsenstein’s algorithm with a Kimura matrix

• Pid = max(PN) (this is calculated at the root)
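
The PPresent rule above can be sketched as a small function. This is a minimal illustration that assumes each child contributes a (PPresent, branch length) pair with a positive branch length; the function name and the example numbers are made up, and the Felsenstein step for PN is not shown.

```python
def p_present_parent(children):
    """Weighted average of the children's PPresent values.

    children: list of (p_present, branch_length) pairs, branch_length > 0.
    Each weight is proportional to 1 / branch_length.
    """
    weights = [1.0 / length for _, length in children]
    total = sum(weights)
    return sum(w * p for w, (p, _) in zip(weights, children)) / total


# Hypothetical node: a close child that is present, a distant child that is not.
print(p_present_parent([(1.0, 0.1), (0.0, 0.3)]))  # close to 0.75
```

The closer child (shorter branch) dominates the average, which is the effect of the inverse branch length weighting.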

Overall Architecture of Typhon

Region Decomposition

ATTGGAACCCAGGCCA----AATT-GCGCC-----AA-TT------G----C-----ATGG-G-----ATGCCCAAAAAAT

ATTGGAACTCAGGCCA----AATT--CGCC-----AA-T-------G----C-----AT--G------ATGCCCATAAAAT

ATTGGAACCCAGGCCA----AATT-CG--C-----A-TT-------G----T-----A-GGG------ATGCCCAAAAAAT

ATTGGAACCCAGGCCA----A-TTGC-G-C-----AAT-T------G-----C----ATGGGG-----ATGCCCATAAAAT

Regions, left to right: 1 2 3 2 1

Each region is characterized by a PPresent and a Pid

How do we come up with these regions?

Hidden Markov Models (HMM)

Given an observation sequence

Predict the sequence of Hidden states

Region Decomposition – Simple Method

• Come up with a set of region classes (states)

• Construct an HMM

• Looking at the observation sequence, try to determine the most likely parse
  – Viterbi algorithm

• Problem – Need to determine classes at the beginning
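
The simple method above can be sketched with a standard log-space Viterbi decoder. Everything specific in this example is an assumption made for illustration: the two region classes ("conserved" and "variable"), the observation alphabet ("H" for a highly conserved column, "L" for a poorly conserved one), and all probabilities. The slides do not give the actual HMM parameters.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    first = observations[0]
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][first]) for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-class model; all numbers are invented for this example.
states = ("conserved", "variable")
start_p = {"conserved": 0.5, "variable": 0.5}
trans_p = {"conserved": {"conserved": 0.9, "variable": 0.1},
           "variable": {"conserved": 0.1, "variable": 0.9}}
emit_p = {"conserved": {"H": 0.9, "L": 0.1},
          "variable": {"H": 0.2, "L": 0.8}}

path = viterbi("HHHLLLHH", states, start_p, trans_p, emit_p)
print(path)
```

Consecutive positions with the same state form one region, and the state is that region's class. The set of classes has to be fixed before decoding, which is the problem the slide points out.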

Alternative

• Split the Profile into 2 classes at a time

• Use 2 stage HMM

• Stop when the bound on the number of region classes is reached

Region Decomposition with HMM

Overall Architecture of Typhon

Step 3: Seed Indexing

What are we trying to do?

[Figure: regions labeled 1, 2, 1, 3 and candidate seeds A, B, C, D, E, with a different subset of the candidate seeds shown for each region]

The Goal

• Maximize expected number of regions matched to a homologue

Seed Assignment

• 2 Approaches:
  – General Method
  – Greedy Approximation

General Method - Terminology

Object[i][j]

• i: region class

• j: size of the candidate set

Calculation of number of matching regions (done for each cell in the previous table)

Expected number of matching regions = PPresent * Phit * |C|

• PPresent: probability that a region matches a homologue

• Phit: conditional probability that the seeds match the region and its homologue, given that it exists

• |C|: number of regions
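
As a worked instance of the product above, the sketch below evaluates it for one region class. The function name and all numbers are hypothetical; the slides do not give concrete values for PPresent, Phit, or |C|.

```python
def expected_matches(p_present, p_hit, num_regions):
    """Expected number of matched regions: PPresent * Phit * |C|."""
    return p_present * p_hit * num_regions


# Hypothetical region class: 40 regions, each has a homologue with
# probability 0.8, and the indexed seeds hit an existing homologue
# with probability 0.5.
value = expected_matches(0.8, 0.5, 40)
print(value)

# One table row: the hit probability grows as more seeds are indexed
# (made-up values for 1 to 5 candidate seeds).
row = [expected_matches(0.8, p, 40) for p in (0.5, 0.6, 0.7, 0.75, 0.8)]
print(row)
```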

General Method - Explained

Rows: Region Class 1, 2, 3, 4

Columns: Number of Candidate Seeds, j = 1, 2, 3, 4, 5

Cell (Region Class i, j candidate seeds): PPresent * P^j_hit * |C|

Some Terminology

• Weight
  – Total length of all regions in a region class * # of seeds indexed at each position
  – Sort of like the Budget for a region

• Value
  – Expected number of regions matched (previous calculation)

Solving the Seed Assignment Problem

Rows: Region Class 1, 2, 3, 4

Columns: Number of Candidate Seeds, j = 1, 2, 3, 4, 5

Cell (Region Class i, j candidate seeds): PPresent * P^j_hit * |C|

Solving the Seed Assignment Problem

Each cell: Weight, Value

                  Number of Candidate Seeds
                  1       2       3       4       5
Region Class 1    10,5    20,30   30,31   40,34   50,40
Region Class 2    15,8    30,20   45,22   60,24   75,30
Region Class 3    12,7    24,10   36,32   48,36   60,40
Region Class 4    9,9     18,10   27,25   36,27   5,30

Solving the Seed Assignment Problem

Budget = 112

Each cell: Weight, Value

                  Number of Candidate Seeds
                  1       2       3       4       5
Region Class 1    10,5    20,30   30,31   40,34   50,40
Region Class 2    15,8    30,20   45,22   60,24   75,30
Region Class 3    12,7    24,10   36,32   48,36   60,40
Region Class 4    9,9     18,10   27,25   36,27   5,30

Looks Familiar?

• Closely related to the Knapsack Problem, a well studied problem in Computer Science
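
One way to make the knapsack connection concrete is a textbook dynamic program for the multiple-choice variant: pick at most one object per region class so that the total weight stays within the budget and the total value is as large as possible. This is a sketch of that standard technique, not necessarily the general method used in Typhon. The "at most one object per row" constraint is an assumption here, and `assign_seeds` is a hypothetical name. The (weight, value) pairs and the budget of 112 are copied from the slide table as-is.

```python
# (weight, value) for 1..5 candidate seeds, one row per region class,
# copied verbatim from the slide table.
SLIDE_TABLE = [
    [(10, 5), (20, 30), (30, 31), (40, 34), (50, 40)],
    [(15, 8), (30, 20), (45, 22), (60, 24), (75, 30)],
    [(12, 7), (24, 10), (36, 32), (48, 36), (60, 40)],
    [(9, 9), (18, 10), (27, 25), (36, 27), (5, 30)],
]


def assign_seeds(table, budget):
    """Return (best value, choices); choices[i] is the number of seeds
    picked for row i, or None if nothing is indexed for that row."""
    # best[b]: best (value, choices) over the rows seen so far with weight <= b
    best = [(0, [])] * (budget + 1)
    for row in table:
        new_best = []
        for b in range(budget + 1):
            value, choices = best[b]
            option = (value, choices + [None])  # skip this row
            for j, (w, v) in enumerate(row, start=1):
                if w <= b:
                    prev_value, prev_choices = best[b - w]
                    if prev_value + v > option[0]:
                        option = (prev_value + v, prev_choices + [j])
            new_best.append(option)
        best = new_best
    return best[budget]


best_value, choices = assign_seeds(SLIDE_TABLE, 112)
print(best_value, choices)
```

The table has one entry per budget unit, which is where the time and space cost of an exact solution comes from.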

Approximate Solution

• Faster

• Space Efficient

• New Terminology:
  – Density of an object = Value/Weight

Approximate Solution – General Intuition

• Select objects in order of decreasing density

• Disallow more than one object per row

Approximate Method in Action

Candidate Set

Object[1,1]   Density = V/W = 3
Object[2,1]   Density = V/W = 2
Object[3,1]   Density = V/W = 5
Object[4,1]   Density = V/W = 4
Object[3,2]   Density = V/W = 6

What are the new values of Weight, Value and Density?

Value = additional number of regions matched

Weight = amount of budget used by this one seed.

And keep track of the Budget!
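
The greedy idea on this slide can be sketched as follows, under stated assumptions. Each region class offers one candidate at a time: its next seed, whose value is the additional number of regions matched and whose weight is the additional budget used. The candidate with the highest density that still fits is taken, and the loop stops when nothing fits. `greedy_assign` is a hypothetical name, the sketch assumes weights grow along each row, and the example uses the first three region classes of the earlier Weight, Value table to stay short.

```python
def greedy_assign(table, budget):
    """Greedy seed assignment by decreasing density (value / weight).

    table[i][j - 1] is the (weight, value) pair for indexing j seeds in
    region class i. Returns the number of seeds chosen per region class.
    """
    chosen = [0] * len(table)
    remaining = budget
    while True:
        best_row, best_density, best_weight = None, 0.0, 0
        for i, row in enumerate(table):
            j = chosen[i]
            if j == len(row):
                continue  # no more seeds to add for this region class
            prev_w, prev_v = row[j - 1] if j > 0 else (0, 0)
            extra_w = row[j][0] - prev_w  # budget used by this one seed
            extra_v = row[j][1] - prev_v  # additional regions matched
            if extra_w <= 0 or extra_w > remaining:
                continue
            density = extra_v / extra_w
            if best_row is None or density > best_density:
                best_row, best_density, best_weight = i, density, extra_w
        if best_row is None:
            return chosen
        chosen[best_row] += 1
        remaining -= best_weight  # keep track of the budget


rows = [
    [(10, 5), (20, 30), (30, 31), (40, 34), (50, 40)],
    [(15, 8), (30, 20), (45, 22), (60, 24), (75, 30)],
    [(12, 7), (24, 10), (36, 32), (48, 36), (60, 40)],
]
result = greedy_assign(rows, 112)
print(result)  # [2, 2, 5]
```

Because each step only compares the current candidates, the method needs no table indexed by budget, which is why it is faster and more space efficient than an exact solution.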

Results

• Considerations
  – Sensitivity
  – Speed
  – Space

Sensitivity Results

• Experimental Setup

• Detection of Hypothetical Homologous Alignments (HHA)

• Typhon Vs Standard

Sensitivity Comparison

Effect of Multiple Alignment on Sensitivity

Running time Comparison

• Time spent building the index
  – Typhon takes longer

• Time spent scanning the index

• Typhon is 3-4 times slower at run time, which is reasonable

Scanning time

Conclusion

• Information implicit in multiple alignments helps search sensitivity

• Variable allocation of seeds by region classes helps (Typhon)

• Space and time complexities of Typhon comparable to STANDARD

• Most effective for queries far from each species in the alignment

Questions?

Acknowledgements

• Serafim Batzoglou, George Asimenos, Jason Flannick