q-gram Based Database Searching Using A Suffix Array (QUASAR)
S. Burkhardt
A. Crauser
H-P. Lenhof
Max-Planck Institut f. Informatik, Saarbrücken
Deutsches Krebsforschungszentrum, Heidelberg
E. Rivals
P. Ferragina
M. Vingron
Outline
• Existing Work
• Motivation
• Problem
• Algorithm
• Results
Examples:
• BLAST
• FASTA
Linear Scan (No Index)
Good Sensitivity
Today: New Applications
Examples:
• EST-Clustering
• Large Scale Shotgun Assembly
Low Sensitivity
Multiple Searches
Specialized Algorithms Needed
Pattern P
T C G A T T A C A G T G A A T
Local Alignment, minimum Length w
w = 8
Low Error Rate (<10% Edit Distance)
Database D
G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
• Filter Step: Identify Hotspots
• Scan Step: Scan Hotspots with BLAST
T C G   C G A   G A T   A T T   T T A   T A C
G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
q = 3
# of q-grams: |P| - q + 1
Edit distance at most e: at least t = |P| - q + 1 - qe common q-grams
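The q-gram count and the threshold t can be checked with a few lines of Python. This is a minimal sketch; the function names `qgrams` and `qgram_threshold` are illustrative and do not come from the slides. It uses the window T C G A T T A C with q = 3 and e = 1.

```python
def qgrams(s, q):
    """Return all overlapping substrings of length q, ordered by start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(length, q, e):
    """Lower bound on shared q-grams when the edit distance is at most e.

    Each edit operation can destroy at most q of the (length - q + 1) q-grams.
    """
    return length - q + 1 - q * e


window = "TCGATTAC"                       # first window of the pattern, w = 8
grams = qgrams(window, 3)                 # 8 - 3 + 1 = 6 q-grams
t = qgram_threshold(len(window), 3, 1)    # 8 - 3 + 1 - 3*1 = 3
print(grams, t)
```

With these parameters the threshold is t = 3, the same value the window shifting slide uses.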
• q-gram Filtration
• Block Addressing
• Suffix Array
• Window Shifting
T C G A T T A C
T C G A T T A C A G T G A A T
w = 8
Block counters: 1 0, 2 0, 3 0
G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
T C G A T T A C
Scan Blocks whose counter reaches t
How to find the matching q-grams?
Divide D into Blocks
Count matching q-grams per Block
Block counters: 4 0
G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
T C G A T T A C
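Block counting can be sketched as follows. Two things here are assumptions rather than slide content: the block size of 15 (two blocks for the 30-character example database) and the rule that a q-gram belongs to the block in which it starts. The sketch finds matches with a plain linear scan for clarity; it does not use an index.

```python
from collections import Counter

DATABASE = "GCATTCGATGGACTGGACTAGTGAATCAGT"   # example database D from the slides
WINDOW = "TCGATTAC"                            # first window of the pattern


def count_block_hits(database, window, q, block_size):
    """Count, per database block, the q-grams that also occur in the window."""
    wanted = {window[i:i + q] for i in range(len(window) - q + 1)}
    counters = Counter()
    for pos in range(len(database) - q + 1):
        if database[pos:pos + q] in wanted:
            # Assumption: a q-gram is credited to the block where it starts.
            counters[pos // block_size] += 1
    n_blocks = -(-len(database) // block_size)   # ceiling division
    return [counters[b] for b in range(n_blocks)]


hits = count_block_hits(DATABASE, WINDOW, q=3, block_size=15)
t = 3
candidates = [b for b, c in enumerate(hits) if c >= t]
print(hits, candidates)
```

Under these assumptions the first block collects four matching q-grams and the second none, so only the first block would be passed on to the scan step.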
Precompute Searches for q-grams, O(1) Time Access
AAA : 0    AGA : 3    TGA : 26
AAC : 0    AGC : 3    TGC : 27
AAG : 0    AGG : 3    TGG : 27
AAT : 0    AGT : 3    TGT : 29
ACA : 1    ATA : 4    TTA : 29
ACC : 1    ATC : 4    TTC : 29
ACG : 1    ATG : 4    TTG : 30
ACT : 1    ATT : 5    TTT : 30
23 16 11 3
Sorted List of Pointers to Suffixes, O(log |D|) Access Time
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
Block counters: 4 0, 0 7, 3 0, 2 1
T C G A T T A C A G T G A A T
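The combination of a sorted pointer list and a precomputed q-gram table can be sketched as below. This is a simplified illustration, not the QUASAR implementation: the function name `build_index` is hypothetical, the suffixes are sorted naively, suffixes shorter than q are left out, and the table stores a (start, end) pair per q-gram where the slide stores only the start.

```python
from itertools import product

DATABASE = "GCATTCGATGGACTGGACTAGTGAATCAGT"


def build_index(database, q, alphabet="ACGT"):
    """Build a sorted pointer list and a q-gram lookup table.

    positions: start positions of all suffixes of length >= q, sorted by suffix.
    lookup:    q-gram -> (start, end) slice of positions holding its occurrences.
    """
    positions = sorted(range(len(database) - q + 1), key=lambda i: database[i:])
    lookup = {}
    start = 0
    # Enumerate every possible q-gram in lexicographic order.
    for letters in product(sorted(alphabet), repeat=q):
        gram = "".join(letters)
        end = start
        while end < len(positions) and database[positions[end]:positions[end] + q] == gram:
            end += 1
        lookup[gram] = (start, end)
        start = end
    return positions, lookup


positions, lookup = build_index(DATABASE, 3)
start, end = lookup["ACT"]        # one dictionary access, no binary search
print(positions[start:end])       # database positions where ACT occurs
```

Once the table exists, finding all occurrences of a q-gram needs one table access and one slice of the pointer list, which is the point of precomputing the searches.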
Scan Marked Blocks
q = 3, w = 8, e = 1, t = 3
Mark full Blocks for each Window
Move Window over Query
T C G A T T A C
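Window shifting can be sketched as an incremental update: when the window moves one position, the hits of the q-gram that enters are added to the block counters and the hits of the q-gram that leaves are subtracted. The sketch below is one possible reading of the slides. The name `mark_blocks`, the block size of 15, and the plain dictionary used in place of the index are assumptions.

```python
from collections import defaultdict

DATABASE = "GCATTCGATGGACTGGACTAGTGAATCAGT"
QUERY = "TCGATTACAGTGAAT"


def mark_blocks(database, query, q, w, e, block_size):
    """Return the sorted ids of blocks whose counter reaches t for some window."""
    t = w - q + 1 - q * e
    grams_per_window = w - q + 1

    # Stand-in for the index: q-gram -> block id of every occurrence.
    blocks_of = defaultdict(list)
    for pos in range(len(database) - q + 1):
        blocks_of[database[pos:pos + q]].append(pos // block_size)

    counters = defaultdict(int)
    marked = set()
    for i in range(len(query) - q + 1):
        for b in blocks_of.get(query[i:i + q], ()):      # q-gram entering the window
            counters[b] += 1
        j = i - grams_per_window                          # q-gram leaving the window
        if j >= 0:
            for b in blocks_of.get(query[j:j + q], ()):
                counters[b] -= 1
        if i >= grams_per_window - 1:                     # window is complete
            marked.update(b for b, c in counters.items() if c >= t)
    return sorted(marked)


print(mark_blocks(DATABASE, QUERY, q=3, w=8, e=1, block_size=15))
```

With q = 3, w = 8, e = 1 the first window marks the first block and later windows mark the second block, so both blocks would be scanned. Each shift costs only the occurrences of two q-grams, not a recount of the whole window.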
• Influence of the Block Size
• Sensitivity
• Running Times
• Overhead for loading the Index
Benchmark System: Ultra Sparc Processor, 333 MHz, 4 GB RAM
Influence of Block Size
[Three plots: Filter Time, Total Time and Scan Time, each showing time in seconds (axis 0 to 0.8) against block size (512, 1024, 2048, 4096, 8192)]
Sensitivity
1000 Queries
BLAST Cutoff E = 0.00001
Number of identical hitlists:
• Mouse EST DB: 91.4 %
• Human EST DB: 97.1 %
QUASAR finds many Hits below selected Error Level
Running Times
Test Parameters:
• 6% Error
• w = 50
• q = 11
• block size 2048
• scan with BLAST
• time averaged for 1000 queries
~30 times faster than BLAST
Running times in seconds:
            QUASAR   BLAST
Mouse EST   0.123    3.371
Human EST   0.380    13.275
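The "~30 times faster" figure can be reproduced from the four bar labels. The assignment of values to bars is an assumption based on their order in the chart (QUASAR 0.123 s and BLAST 3.371 s for Mouse EST, QUASAR 0.380 s and BLAST 13.275 s for Human EST).

```python
# Average running time per query in seconds: (QUASAR, BLAST).
# Assumption: values are paired with databases in the order they appear in the chart.
times = {
    "Mouse EST": (0.123, 3.371),
    "Human EST": (0.380, 13.275),
}

speedup = {db: blast / quasar for db, (quasar, blast) in times.items()}
for db, factor in speedup.items():
    print(f"{db}: {factor:.1f}x")
```

Under that assumption both ratios fall close to 30, which agrees with the stated speedup.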
Overhead for Loading the Index
1000 queries
Human EST DB, 280 Mbps
BLAST Test Run:
• 5 seconds Load Time
• 13,270 seconds Search Time
QUASAR Test Run:
• 90 seconds Load Time
• 380 seconds Search Time
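A rough way to read the load overhead is to ask after how many queries the longer index load pays for itself. The sketch below assumes a simple cost model, total time = load time + number of queries × time per query. It takes the per-query times for the Human EST DB from the Running Times chart (0.380 s for QUASAR, 13.275 s for BLAST). The model and the function name `total_time` are illustrative assumptions.

```python
import math

LOAD_QUASAR, LOAD_BLAST = 90.0, 5.0            # seconds, from the test runs
PER_QUERY_QUASAR, PER_QUERY_BLAST = 0.380, 13.275  # seconds per query, Human EST DB


def total_time(load, per_query, n_queries):
    """Assumed cost model: one-time load plus a constant cost per query."""
    return load + per_query * n_queries


# Smallest number of queries for which QUASAR's total time is lower.
extra_load = LOAD_QUASAR - LOAD_BLAST
saving_per_query = PER_QUERY_BLAST - PER_QUERY_QUASAR
break_even = math.ceil(extra_load / saving_per_query)

print(break_even)
print(total_time(LOAD_QUASAR, PER_QUERY_QUASAR, 1000))
print(total_time(LOAD_BLAST, PER_QUERY_BLAST, 1000))
```

Under this model the extra load time is recovered after a handful of queries, so the overhead matters mainly for single or very small query sets.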