q-gram Based Database Searching Using A Suffix Array (QUASAR)


Page 1

q-gram Based Database Searching Using A Suffix Array (QUASAR)

S. Burkhardt, A. Crauser, H-P. Lenhof, E. Rivals, P. Ferragina, M. Vingron

Max-Planck Institut f. Informatik, Saarbrücken
Deutsches Krebsforschungszentrum, Heidelberg

Page 2

Outline

• Existing Work
• Motivation
• Problem
• Algorithm
• Results

Page 3

Examples:
• BLAST
• FASTA

• Linear Scan (No Index)
• Good Sensitivity

Page 4

Today: New Applications

Examples:
• EST-Clustering
• Large Scale Shotgun Assembly

• Low Sensitivity
• Multiple Searches
• Specialized Algorithms Needed

Page 5

Pattern P:
T C G A T T A C A G T G A A T

Database D:
G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T

• Local Alignment, minimum Length w (w = 8)
• Low Error Rate (<10% Edit Distance)
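To make the problem statement concrete, here is a minimal brute-force sketch. It is a baseline for comparison, not QUASAR's method: for one window of P it computes the smallest edit distance to any substring of D with the standard approximate-matching dynamic program (Sellers' variant, where the first row is zero). The function name `best_match_distance` is hypothetical.

```python
def best_match_distance(window: str, db: str) -> int:
    """Smallest edit distance between `window` and any substring of `db`.

    The row for the empty window prefix is all zeros, so a match may
    start at any position of `db`; the minimum over the last row lets
    it end anywhere.
    """
    prev = [0] * (len(db) + 1)
    for i, wc in enumerate(window, start=1):
        cur = [i]  # i window characters against an empty substring
        for j, dc in enumerate(db, start=1):
            cur.append(min(prev[j - 1] + (wc != dc),  # match or substitution
                           prev[j] + 1,               # window character unmatched
                           cur[j - 1] + 1))           # database character unmatched
        prev = cur
    return min(prev)


P = "TCGATTACAGTGAAT"
D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
w = 8
for start in range(len(P) - w + 1):
    print(P[start:start + w], best_match_distance(P[start:start + w], D))
```

This costs time proportional to w · |D| per window, which is the kind of linear scan the filtering approach is meant to avoid.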

Page 6

• Filter Step: Identify Hotspots

• Scan Step: Scan Hotspots with BLAST

Page 7

• q-gram Filtration • Block Addressing • Suffix Array • Window Shifting

Pattern P: T C G A T T A C A G T G A A T

Window (w = 8): T C G A T T A C

q-grams of the window (q = 3): T C G, C G A, G A T, A T T, T T A, T A C

Database D: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T

# of q-grams: |P| - q + 1

Edit Distance e: at least t = |P| - q + 1 - (qe) common q-grams
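The q-gram count and the threshold t can be checked with a few lines of code. This is a small sketch with hypothetical helper names (`qgrams`, `qgram_threshold`); it applies the formula to a window, so the window length w takes the place of |P|.

```python
def qgrams(s: str, q: int) -> list[str]:
    """All overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(length: int, q: int, e: int) -> int:
    """Lower bound t = length - q + 1 - q*e on the number of common q-grams."""
    return length - q + 1 - q * e


window = "TCGATTAC"                         # window of length w = 8
print(qgrams(window, 3))                    # ['TCG', 'CGA', 'GAT', 'ATT', 'TTA', 'TAC']
print(qgram_threshold(len(window), 3, 1))   # 3
```

Each edit operation can destroy at most q of the overlapping q-grams, which is where the subtracted term q·e comes from.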

Page 8

1 02 03 0

G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T

T C G A T T A C

• q-gram Filtration • Block Addressing • Suffix Array • Window Shifting

• Divide D into Blocks
• Count matching q-grams per Block
• Scan Blocks with counter ≥ t

How to find the matching q-grams?

4 0
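Block addressing can be sketched as follows. The block size of 10 and the rule that a q-gram occurrence belongs to the block containing its start position are assumptions made for this toy example, and `block_counts` is a hypothetical name. Occurrences are found by a plain scan of D here, which is exactly the step an index is meant to replace.

```python
from collections import Counter


def block_counts(window: str, db: str, q: int, block_size: int) -> Counter:
    """Count, per block of `db`, the q-gram occurrences shared with `window`."""
    # Multiset of the window's q-grams; a repeated q-gram counts repeatedly.
    wanted = Counter(window[i:i + q] for i in range(len(window) - q + 1))
    counts = Counter()
    for pos in range(len(db) - q + 1):
        # Looking up a missing key in a Counter yields 0.
        counts[pos // block_size] += wanted[db[pos:pos + q]]
    return counts


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
counts = block_counts("TCGATTAC", D, q=3, block_size=10)
print(sorted(counts.items()))   # [(0, 4), (1, 0), (2, 0)]
```

With q = 3, w = 8 and e = 1 the threshold is t = 3, so only block 0 would be handed to the scan step for this window.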

Page 9

G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T

T C G A T T A C

• q-gram Filtration • Block Addressing • Suffix Array • Window Shifting

Precompute Searches for q-grams, O(1) Time Access

AAA : 0    AAC : 0    AAG : 0    AAT : 0
ACA : 1    ACC : 1    ACG : 1    ACT : 1
AGA : 3    AGC : 3    AGG : 3    AGT : 3
ATA : 4    ATC : 4    ATG : 4    ATT : 5

TGA : 26   TGC : 27   TGG : 27   TGT : 29
TTA : 29   TTC : 29   TTG : 30   TTT : 30

23 16 11 3

Sorted List of Pointers to Suffixes, O(log |D|) Access Time

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
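The two structures on this slide can be sketched in a few lines: a suffix array (sorted pointers to suffixes) and a table that stores, for every possible q-gram, the precomputed result of the binary search into that array. This is a toy illustration under stated assumptions: the suffix array is built by naive sorting, the function names are hypothetical, and the indices the code computes are its own and are not meant to reproduce every number shown on the slide.

```python
from bisect import bisect_left
from itertools import product


def build_suffix_array(db: str) -> list[int]:
    """Start positions of all suffixes of `db` in lexicographic order (naive sort)."""
    return sorted(range(len(db)), key=lambda i: db[i:])


def build_lookup(db: str, sa: list[int], q: int, alphabet: str = "ACGT") -> dict[str, int]:
    """For every q-gram, the first suffix-array index whose suffix is >= the q-gram."""
    suffixes = [db[i:] for i in sa]  # already in sorted order
    grams = ("".join(t) for t in product(alphabet, repeat=q))
    return {g: bisect_left(suffixes, g) for g in grams}


def occurrences(gram: str, db: str, sa: list[int], lookup: dict[str, int]) -> list[int]:
    """Positions of `gram` in `db`: a contiguous run in the suffix array."""
    hits = []
    i = lookup[gram]  # O(1) jump to the start of the run
    while i < len(sa) and db[sa[i]:sa[i] + len(gram)] == gram:
        hits.append(sa[i])
        i += 1
    return hits


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
sa = build_suffix_array(D)
lookup = build_lookup(D, sa, q=3)
print(sa[:3])                                    # [23, 16, 11]
print(sorted(occurrences("AGT", D, sa, lookup)))  # [19, 27]
```

The binary search done once per q-gram at build time is what costs O(log |D|); at query time the table turns it into a single array access.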

Page 10

G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T

4 00 73 02 1

T C G A T T A C A G T G A A T

• q-gram Filtration • Block Addressing • Suffix Array • Window Shifting

q = 3, w = 8, e = 1, t = 3

• Move Window over Query
• Mark full Blocks for each Window
• Scan Marked Blocks

T C G A T T A C
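Window shifting can be sketched as an incremental update of the block counters: when the window moves one position, one q-gram leaves and one enters. The code below is a minimal illustration, assuming non-overlapping blocks of size 10 and using a plain dictionary in place of the suffix-array index; `filter_blocks` is a hypothetical name.

```python
from collections import Counter, defaultdict


def filter_blocks(query: str, db: str, q: int, w: int, e: int, block_size: int) -> set[int]:
    """Blocks of `db` whose counter reaches t for at least one window of `query`."""
    t = w - q + 1 - q * e

    # q-gram -> start positions in db (stand-in for the precomputed index).
    index = defaultdict(list)
    for pos in range(len(db) - q + 1):
        index[db[pos:pos + q]].append(pos)

    counts = Counter()
    marked = set()

    def update(gram: str, delta: int) -> None:
        for pos in index.get(gram, ()):
            block = pos // block_size
            counts[block] += delta
            if counts[block] >= t:
                marked.add(block)

    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    per_window = w - q + 1                     # number of q-grams in one window
    for i, gram in enumerate(grams):
        if i >= per_window:
            update(grams[i - per_window], -1)  # q-gram leaving the window
        update(gram, +1)                       # q-gram entering the window
    return marked


P = "TCGATTACAGTGAAT"
D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
print(sorted(filter_blocks(P, D, q=3, w=8, e=1, block_size=10)))   # [0, 2]
```

A marked block is only a candidate: the counters ignore the order of the q-grams, so the scan step still has to confirm or reject each one.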

Page 11

• Influence of the Block Size
• Sensitivity
• Running Times
• Overhead for Loading the Index

Benchmark System: UltraSPARC Processor, 333 MHz, 4 GB RAM

Page 12

Influence of Block Size

[Three plots: Filter Time, Total Time, Scan Time. x-axis: Block Size (512, 1024, 2048, 4096, 8192); y-axis: Time in Seconds (0 to 0.8).]

Page 13

Sensitivity

• 1000 Queries
• BLAST Cutoff E = 0.00001

Number of identical hitlists:
• Mouse EST DB: 91.4 %
• Human EST DB: 97.1 %

QUASAR finds many Hits below selected Error Level

Page 14

Running Times

Test Parameters:
• 6% Error
• w = 50
• q = 11
• block size 2048
• scan with BLAST
• time averaged for 1000 queries

~30 times faster than BLAST

Running times in seconds:

             QUASAR    BLAST
Mouse EST    0.123     3.371
Human EST    0.380     13.275
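The "~30 times faster" figure can be checked against the four chart values. The assignment of values to bars (QUASAR 0.123 and BLAST 3.371 on Mouse EST, QUASAR 0.380 and BLAST 13.275 on Human EST) is an assumption based on the order in which the numbers appear on the slide.

```python
# Average seconds per query, as read from the chart on this slide.
times = {
    "Mouse EST": {"QUASAR": 0.123, "BLAST": 3.371},
    "Human EST": {"QUASAR": 0.380, "BLAST": 13.275},
}


def speedup(db: str) -> float:
    """How many times faster QUASAR is than BLAST on `db`."""
    return times[db]["BLAST"] / times[db]["QUASAR"]


for db in times:
    print(f"{db}: {speedup(db):.1f}x")
```

Under that reading the ratios are roughly 27 for the mouse database and 35 for the human database, which brackets the stated factor of about 30.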

Page 15

Overhead for Loading the Index

1000 queries, Human EST DB, 280 Mbp

BLAST Test Run:
• 5 seconds Load Time
• 13,270 seconds Search Time

QUASAR Test Run:
• 90 seconds Load Time
• 380 seconds Search Time
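A rough way to see when the longer index load time pays off is to model total time as load time plus a constant cost per query. The sketch below does that with the figures from this slide. Two assumptions are made: the BLAST search time is read as 13,270 seconds for the 1000 queries (consistent with the per-query average of about 13.3 s on the previous slide), and search cost is treated as linear in the number of queries. The names `total_time` and `break_even` are hypothetical.

```python
import math

# Figures from this slide, for a run of 1000 queries (times in seconds).
RUNS = {
    "BLAST": {"load": 5.0, "search": 13270.0},   # search time read as 13,270 s
    "QUASAR": {"load": 90.0, "search": 380.0},
}
N_QUERIES = 1000


def total_time(tool: str, n: int) -> float:
    """Load time plus n queries at the tool's average per-query search time."""
    run = RUNS[tool]
    return run["load"] + n * run["search"] / N_QUERIES


def break_even() -> int:
    """Smallest number of queries for which QUASAR's total time beats BLAST's."""
    extra_load = RUNS["QUASAR"]["load"] - RUNS["BLAST"]["load"]
    saved_per_query = (RUNS["BLAST"]["search"] - RUNS["QUASAR"]["search"]) / N_QUERIES
    return math.floor(extra_load / saved_per_query) + 1


print(total_time("BLAST", N_QUERIES), total_time("QUASAR", N_QUERIES))
print(break_even())
```

Under this simple model the index load overhead is recovered after only a handful of queries.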

Page 16