Indexing DNA sequences for local similarity search

Joint work of Angela, Dr. Mamoulis and Dr. Yiu

17/5/2007

Outline

Introduction
  DNA sequences
  Local similarity search
Related works
  BLAST
Prefix-suffix hashing scheme
Experimental results
Conclusion
Future work

DNA sequences

DNA exists in the chromosomes of organisms
The genome is all the DNA in an organism
Composed of 4 nucleotides: A, C, G, T
Humans have 23 pairs of chromosomes that amount to 3 billion bp
Public DNA databases contain the genomes of organisms and their information

DNA similarity

DNA sequences contain special regions, e.g. genes and motifs
Some regions are conserved across species
Similar regions may imply similar functions and structures
Given a sequence being studied (the query), search for similar regions in the database sequences

Similarity measurement

Alphabet: Σ = {A, C, G, T}
Sequence alignment
  Align sequences S and T
  Insert spaces in S and T to form S’ and T’
Scoring matrix σ
  Match/mismatch scoring
  Let x and y be two aligned characters or spaces from the two sequences, x, y ∈ Σ ∪ {space}

\[
\sigma(x, y) =
\begin{cases}
R & \text{if } x = y \text{ and } x \neq \text{space} \\
P & \text{if } x \neq y \\
-\infty & \text{if } x = y = \text{space}
\end{cases}
\]

  where R (reward) is positive and P (penalty) is negative

Gap penalty

Gap = a maximal run of consecutive spaces in an alignment
Affine gap penalty: W_g + q W_s
  where W_g and W_s are constants, W_g ≥ 0, W_s ≥ 0, and q ≥ 1 is the gap length
Penalty of one gap of length q < penalty of q separate deletions/insertions
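As a rough illustration of the scoring scheme above, the following Python sketch encodes σ and the affine gap penalty. The function names, the "-" symbol for a space, and the default values (R = 1, P = -1, W_g = 5, W_s = 2) are assumptions made for this example only.

```python
import math

SPACE = "-"  # symbol used here for an inserted space (an assumption of this sketch)


def sigma(x, y, reward=1, penalty=-1):
    """Score one aligned column; x and y are nucleotides or SPACE."""
    if x == SPACE and y == SPACE:
        return -math.inf  # two aligned spaces are never allowed
    if x == y:
        return reward     # match
    return penalty        # mismatch (x != y)


def affine_gap_penalty(q, w_g=5, w_s=2):
    """Penalty W_g + q * W_s for one gap of length q >= 1."""
    if q < 1:
        raise ValueError("gap length must be at least 1")
    return w_g + q * w_s


# One gap of length 3 costs less than three separate gaps of length 1.
one_long_gap = affine_gap_penalty(3)          # 5 + 3 * 2
three_short_gaps = 3 * affine_gap_penalty(1)  # 3 * (5 + 2)
```

With these illustrative constants the single long gap costs 11 and the three short gaps cost 21, which is the inequality stated on the slide.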

DNA sequence alignments

Global alignment
  Needleman-Wunsch algorithm (1970)
    A C – G T T C A
    A C C G – – G A
Local alignment
  Smith-Waterman algorithm (1981)
    A C C G T A G C A C G T
    – C C A T A – – A C G –
Dynamic programming
  Optimal solution
  Time and space complexity O(mn), where m and n are the lengths of the two sequences

Global alignment

Input: two sequences S and T
Output: alignment of S and T with the highest score
V(i, j): the optimal score to align S[1..i] and T[1..j]

Basis:
  V(0, 0) = 0
  V(i, 0) = V(i-1, 0) + σ(S[i], –)
  V(0, j) = V(0, j-1) + σ(–, T[j])

Recurrence:
  V(i, j) = max {
    V(i-1, j-1) + σ(S[i], T[j]),
    V(i-1, j) + σ(S[i], –),
    V(i, j-1) + σ(–, T[j])
  }
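The recurrence can be sketched in Python as follows. This is a minimal score-only version under stated assumptions: every space has a fixed score (no affine gaps), the boundary cells accumulate that space score, and the function name and default scores are hypothetical.

```python
def global_alignment_score(s, t, reward=1, penalty=-1, space=-1):
    """Best global alignment score of s and t with a fixed score per space."""
    m, n = len(s), len(t)
    # v[i][j] = best score for aligning s[:i] with t[:j]
    v = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        v[i][0] = v[i - 1][0] + space  # s[:i] aligned against spaces only
    for j in range(1, n + 1):
        v[0][j] = v[0][j - 1] + space  # t[:j] aligned against spaces only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            column = reward if s[i - 1] == t[j - 1] else penalty
            v[i][j] = max(
                v[i - 1][j - 1] + column,  # align s[i] with t[j]
                v[i - 1][j] + space,       # align s[i] with a space
                v[i][j - 1] + space,       # align t[j] with a space
            )
    return v[m][n]
```

For example, `global_alignment_score("ACGT", "AGT")` pairs three matching letters and one space. The table takes O(mn) time and space, as noted on the earlier slide.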

Local alignment

Input: two sequences S and T
Output:
  Substring A from S
  Substring B from T
  Score of the optimal (global) alignment of A and B
V(i, j): the optimal score to align a substring of S ending at i with a substring of T ending at j

Basis:
  V(i, 0) = 0
  V(0, j) = 0

Recurrence:
  V(i, j) = max {
    0,
    V(i-1, j-1) + σ(S[i], T[j]),
    V(i-1, j) + σ(S[i], –),
    V(i, j-1) + σ(–, T[j])
  }
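A minimal Python sketch of this local recurrence is shown below. It assumes a fixed score per space rather than affine gaps, and the function name and default scores are hypothetical. It returns the best score together with the 1-based end positions in S and T, from which a traceback could start.

```python
def local_alignment_score(s, t, reward=1, penalty=-1, space=-1):
    """Return (best score, end index in s, end index in t), indices 1-based."""
    m, n = len(s), len(t)
    # v[i][j] = best score of an alignment ending at s[i] and t[j]; row/column 0 stay 0
    v = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            column = reward if s[i - 1] == t[j - 1] else penalty
            v[i][j] = max(
                0,                         # start a new local alignment here
                v[i - 1][j - 1] + column,  # align s[i] with t[j]
                v[i - 1][j] + space,       # align s[i] with a space
                v[i][j - 1] + space,       # align t[j] with a space
            )
            if v[i][j] > best:
                best, best_i, best_j = v[i][j], i, j
    return best, best_i, best_j
```

The only differences from the global version are the extra 0 in the maximum and the zero boundary, which let an alignment start and end anywhere in the two sequences.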

Local similarity search

Input
  Two DNA sequences
Output
  The alignments of the regions from the two sequences that score higher than a score threshold

Database search

Input
  A query sequence and a sequence database
Output
  The local similarity search results between each database sequence and the query sequence
Objective
  Perform local similarity search fast
  Maintain search sensitivity

BLAST

Basic Local Alignment Search Tool
By NCBI (National Center for Biotechnology Information) of the US Government
Finds regions of local similarity between sequences (DNA, RNA or proteins)
Applies heuristics: fast
Applies statistical theory: relatively accurate

Sample BLAST result

Score = 44.4 bits (27), Expect = 0.013
Identities = 37/47 (78%)
Strand = Plus / Minus

Query: 6         caggggtccaggcccccagcccctctcctgggcccctcaccccgcgg 52
                 |||||||||  |||||||||||  ||||| ||| || |   ||||||
Sbjct: 199635477 caggggtccccgcccccagcccagctcctcggcaccccgggccgcgg 199635431

Score = 44.4 bits (27), Expect = 0.013
Identities = 35/43 (81%)
Strand = Plus / Minus

Query: 333       ccccgtttctcggatggaaaaactgaggctccgaaagcagaag 375
                 |||| |||| | ||||||||||||||||| | || ||  ||||
Sbjct: 505025625 ccccatttcacagatggaaaaactgaggcccagagagaggaag 505025583

Sample BLAST result

Matrix: blastn matrix:1 -1
Gap Penalties: Existence: 5, Extension: 2
Number of Sequences: 1
Number of Hits to DB: 2,526,608
Number of extensions: 138741
Number of successful extensions: 27
Number of sequences better than 1.0: 1
Number of HSP's gapped: 27
Number of HSP's successfully gapped: 27
Length of query: 375
Length of database: 880,975,758
Length adjustment: 44
Effective length of query: 331
Effective length of database: 880,975,714
Effective search space: 291602961334
Effective search space used: 291602961334

How BLAST works

Split a search into phases
  Hit generation
  Ungapped extension
  Gapped extension
  Traceback
Configurable parameters
  Word length W
  Match reward R
  Mismatch penalty P
  Cutoff score S
  Dropoff score X
  E-value threshold E

Hit generation

Word hits (length W, default = 11)
Database sequences are compressed:
  A = 00, C = 01, G = 10, T = 11
  Compression factor = 4
Build a lookup table on sliding windows of the query sequence
  4-sliding window of length 8
Scan the compressed database sequence for exact matches present in the lookup table
Extend the exact matches of length 8 to W
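The two ideas on this slide, 2-bit packing and a query lookup table, can be sketched in Python as below. This is a simplified illustration, not BLAST's implementation: the scan walks the uncompressed string one base at a time with a single word length w, and the function names are hypothetical.

```python
from collections import defaultdict

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_2bit(seq):
    """Pack 4 nucleotides per byte; the last byte is zero-padded on the right."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk
        out.append(byte)
    return bytes(out)


def word_hits(query, db, w):
    """Return (query_offset, db_offset) pairs of exact length-w matches."""
    lookup = defaultdict(list)  # word -> offsets of that word in the query
    for q in range(len(query) - w + 1):
        lookup[query[q:q + w]].append(q)
    hits = []
    for d in range(len(db) - w + 1):
        for q in lookup.get(db[d:d + w], ()):
            hits.append((q, d))
    return hits
```

Four bases fit in one byte, which is where the compression factor of 4 comes from. The lookup table is built once per query, so the cost of a search is dominated by the pass over the database.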

Ungapped extension

Extend the word hits in both directions until the score drops by X or more
The extended hit is qualified if it scores higher than the cutoff score S
Example: X = 2, S = 3
  Query:   A T A C G T A C G T A C G T
  DB seq:  G C A C G T A C G C G T

     1 1 1 1 1 1         score = 6
     1 1 1 1 1 1 1       score = 7 (drop -1)
  -2 1 1 1 1 1 1 1       score = 5 (drop 2)
  -2 1 1 1 1 1 1 1 -2    score = 3 (drop 2)

  Extended hit = CACGTACGC

Gapped extension + traceback

Extend the hits in both directions
Allow gaps
Perform restricted dynamic programming on the gapped extended hits

E-value

Low-complexity regions
  About half the human genome is easily recognized as repetitive.
The significance of a hit is evaluated by its E-value
A hit is statistically significant if its score is higher than one obtained from two random sequences
The alignment scores of two random sequences follow the extreme value distribution
The expected number of hits with score at least S is given by

\[
E = K m n \, e^{-\lambda S}
\]

  where m and n are the lengths of the two sequences (here, the query and the database), and K and λ are parameters that depend on the scoring system
The smaller the E-value, the more statistically significant the hit
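The formula can be checked numerically with a short Python sketch. The function names are hypothetical, and K and λ must be supplied by the caller because the slides do not give their values. The bit-score relations used here, S' = (λS − ln K) / ln 2 and E = mn · 2^(−S'), come from the standard Karlin-Altschul statistics rather than from the slides.

```python
import math


def e_value(score, m, n, k, lam):
    """Expected number of hits with raw score >= score: K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


def bit_score(score, k, lam):
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, search_space):
    """E = (m * n) * 2 ** (-S'), where search_space is m * n."""
    return search_space * 2.0 ** (-bits)


# Values taken from the sample BLAST report: 44.4 bits, effective search space 291602961334.
sample_e = e_value_from_bits(44.4, 291602961334)
```

Because the report rounds the bit score to one decimal place, `sample_e` only approximately reproduces the reported Expect value of 0.013.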

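Extreme value distribution

Positively skewed, with a long right tail
Higher probability of a high score than under a normal distribution
[Figure: probability density of the extreme value distribution plotted against the score s; recoverable labels: ln K, λ]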

Prefix-Suffix Hashing Scheme

Goals
  Speed up hit generation and ungapped extension
  Reduce the number of hits so as to reduce the processing costs of the later phases
Design
  Build hashing indexes on database sequences
  The index stores the offsets of the words (length W) of the database sequence
  During a search, for each sliding window of the query sequence, look up the index for the offsets of the hits in the database sequence

Index structure

Word pattern of length W
  Partitioned into a prefix and a suffix
  The prefix and the suffix are each represented by a hash value

\[
H(T) = \sum_{i=0}^{|T|-1} 4^{i} \, V(T[i])
\]

  where V(A) = 0, V(C) = 1, V(G) = 2, V(T) = 3
For each possible prefix
  Lookup file
    For each possible suffix
      Pointer to the actual offsets of the word pattern
      Total number N of offsets
  Entry file
    For each possible suffix
      The N offsets
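The hash function H(T) reads a word as a base-4 number with the first character as the least significant digit. A minimal Python sketch follows; the function names are hypothetical.

```python
V = {"A": 0, "C": 1, "G": 2, "T": 3}


def hash_word(t):
    """H(T) = sum of 4**i * V(T[i]) for i from 0 to len(T) - 1."""
    return sum(4 ** i * V[ch] for i, ch in enumerate(t))


def split_hashes(word, p):
    """Hash the length-p prefix and the remaining suffix of a word separately."""
    return hash_word(word[:p]), hash_word(word[p:])


# An 11-letter word split into a 5-letter prefix and a 6-letter suffix.
hp, hs = split_hashes("AAAACAAAAAC", 5)
```

For a fixed length, distinct strings get distinct hash values, so a prefix hash ranges over 0 to 4^5 − 1 and a suffix hash over 0 to 4^6 − 1. This lets both serve directly as positions in the lookup structure.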

Index structure

[Figure: layout of the index. Each prefix (AAAAA, AAAAC, …) has a lookup file that holds, for each suffix (AAAAAA, AAAAAC, …), a pointer and the number of offsets; each pointer leads to a list of offsets in the entry files. The lookup files are merged into one.]

Build the index

For each sliding window of the database sequence,
  Divide it into a prefix of length P and a suffix of length S
  Store its offset with the prefix and suffix
Flush the offsets to the disk if memory is full
Reorganise the offsets on the disk into the corresponding lookup files and entry files
Merge the lookup files into one

During a search

Divide the query sequence into sliding windows of length W
For each sliding window,
  Compute the hash values of its prefix, HP, and suffix, HS
Sort the sliding windows by their HP, then by their HS
Access the lookup file for HS at the HP block
Access the entry file for the offsets of the hits of the word
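The build and search steps can be sketched together in Python. This is an in-memory stand-in under stated assumptions: nested dictionaries replace the on-disk lookup and entry files, there is no flushing or merging, and the function names are hypothetical.

```python
from collections import defaultdict

V = {"A": 0, "C": 1, "G": 2, "T": 3}


def hash_word(t):
    """H(T) = sum of 4**i * V(T[i])."""
    return sum(4 ** i * V[ch] for i, ch in enumerate(t))


def build_index(db, p, s):
    """Map prefix hash -> suffix hash -> offsets of each length-(p + s) word in db."""
    w = p + s
    index = defaultdict(lambda: defaultdict(list))
    for off in range(len(db) - w + 1):
        word = db[off:off + w]
        index[hash_word(word[:p])][hash_word(word[p:])].append(off)
    return index


def search(index, query, p, s):
    """Return (query_offset, db_offset) hits, visiting windows in (HP, HS) order."""
    w = p + s
    windows = []
    for q in range(len(query) - w + 1):
        word = query[q:q + w]
        windows.append((hash_word(word[:p]), hash_word(word[p:]), q))
    windows.sort()  # by HP, then HS, so lookups for the same prefix are adjacent
    hits = []
    for hp, hs, q in windows:
        for off in index.get(hp, {}).get(hs, ()):
            hits.append((q, off))
    return hits
```

Sorting the query windows by HP and then HS means all lookups that touch the same prefix block happen consecutively. On disk, that would presumably turn scattered reads into more sequential ones.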

Experiments

Database sequence: human chromosomes 1 – 4, 840M bp
Query sequences: randomly selected from human chromosomes
W = 11, P = 5, S = 6
Tasks:
  Compare the order of prefix and suffix
  Compare the hit generation time of the algorithms
    BLAST
    PS-Hash: Prefix-Suffix Hashing Scheme
    HashQuery: build a lookup table on the query sequence and scan the database sequence
    Sequential Scan
  Study the ungapped extension in BLAST

Experimental results

Two sets of index files built
  Prefix as lookup (prefix->suffix)
  Suffix as lookup (suffix->prefix)

                                      prefix->suffix                       suffix->prefix
Query length  Eff. len.  # of hits    total (s)  lookup (s)  entry (s)    total (s)  lookup (s)  entry (s)
490           490        484925       5.39364    0.454373    4.9388       5.66894    0.504918    5.152831
512           70         20752        1.076      0.293433    0.78247      1.20806    0.303494    0.904463
512           512        336367       6.04708    0.477877    5.568728     6.08929    0.497289    5.591531
513           513        580441       5.65264    0.475084    5.16985      6.0363     0.514839    5.520972
490           452        1288149      5.36993    0.463572    4.905935     5.51566    0.497818    5.006489

Eff. len. is the effective search length of the query sequence after filtering.

Experimental results

                         BLAST               PS-Hash               HashQuery  Sequential Scan
Query length  Eff. len.  Hits    Time (s)    Hits      Time (s)    Time (s)   Time (s)
490           490        363141  7.0770      484925    5.39364     40.0346    4506.84
512           70         14949   7.6932      20752     1.076       32.0046    558.989
512           512        194686  6.7870      336367    6.04708     41.6682    4721.28
513           513        248233  6.7912      580441    5.65264     43.1642    4652.43
490           452        350078  13.395      11288149  5.36993     40.7474    3868.43

Analysis

Index files
  Number of word patterns = 4^11 = 4M
  Number of prefix patterns = 4^5 = 1K
  Number of suffix patterns = 4^6 = 4K
  Total size of lookup file = 4^11 * (4 + 4) bytes = 32MB
  Total size of entry files = 840M * 4 bytes ≈ 3GB
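The index size figures can be reproduced with a few lines of Python. The sketch assumes that each "4" in the size formulas is a 4-byte field, that "840M bp" means 840 × 10^6 positions, and that MB and GB are binary units; the variable names are hypothetical.

```python
W, P, S = 11, 5, 6
DB_LENGTH = 840_000_000  # "840M bp", taken here as 840 * 10**6 (assumption)
FIELD_BYTES = 4          # size of one pointer, count or offset (assumption)

word_patterns = 4 ** W
prefix_patterns = 4 ** P
suffix_patterns = 4 ** S

# One pointer plus one count per word pattern in the lookup file.
lookup_bytes = word_patterns * (FIELD_BYTES + FIELD_BYTES)
# One stored offset per database position in the entry files.
entry_bytes = DB_LENGTH * FIELD_BYTES

lookup_mib = lookup_bytes / 2 ** 20
entry_gib = entry_bytes / 2 ** 30
```

Under these assumptions the lookup file is exactly 32 MiB and the entry files come to a little over 3 GiB.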

Analysis

Number of bytes read (w.r.t. the first query)
  BLAST: compressed sequence file = 210MB
  PS-Hash: (# of query sliding windows) * (4 + 4) + (# of hits) * 4 = 1.85MB
  HashQuery: sequence file = 840MB
  Sequential Scan: sequence file = 840MB
PS-Hash accesses only 1/113 of the bytes that BLAST accesses, but its running time is not much faster and in some cases is even slower
  Disk locality
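The PS-Hash figure for the first query can be re-derived as a quick check. The sketch assumes the number of sliding windows is the effective query length minus W plus 1, that each field is 4 bytes, and that MB means 2^20 bytes; the function name is hypothetical.

```python
def ps_hash_bytes(eff_len, hits, w=11, field_bytes=4):
    """Bytes read by PS-Hash: one lookup entry per window plus one offset per hit."""
    windows = eff_len - w + 1
    return windows * (field_bytes + field_bytes) + hits * field_bytes


# First query in the results table: effective length 490, 484925 hits.
first_query_bytes = ps_hash_bytes(490, 484925)
first_query_mib = first_query_bytes / 2 ** 20

# Ratio against the 210MB compressed sequence file read by BLAST.
ratio = 210 * 2 ** 20 / first_query_bytes
```

This gives roughly 1.85 MiB and a ratio of about 113, matching the slide. The byte count is dominated by the hit offsets, not by the lookup entries.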

Experimental results

BLAST ungapped extension
  Database sequence: 840M bp
  Query: 512 bp
  E-value: 10^-15
  Total number of word hits: 194,686
[Figure: plot with a logarithmic vertical axis from 1 to 100000 and a horizontal axis from 0 to 450, showing two series, "success" and "fail"]

Conclusion

Introduced local similarity search
Described BLAST
Proposed Prefix-Suffix Hashing Scheme
Showed experimental results and comparisons

Future work

Optimise the implementation of the Prefix-Suffix Hashing Scheme
Utilise the information on the number of word hits produced by each sliding window of the query sequence
Extend the index to store neighbour information about the word patterns
Derive a useful threshold to restrict the generation of hits for later phase processing
Test on multiple sequences in the database

References

BLAST website: http://www.ncbi.nlm.nih.gov/blast/
The Statistics of Sequence Similarity Scores: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proceedings of the National Academy of Sciences USA 87(6):2264-2268.
Korf, I., Yandell, M. & Bedell, J. (2003) BLAST. Sebastopol, CA: O'Reilly & Associates.
WU-BLAST website: http://blast.wustl.edu/
FSA-BLAST website: http://www.fsa-blast.org/
