FastBit for Allele Data

Dave Matthews
USDA-ARS, Cornell University
Ithaca, NY

10 April 2012


A Lightning-Fast Index Drives Massive Data Analysis

http://www.scidacreview.org/0904/html/fastbit.html

FastBit significantly improves the speed of a searching operation on both high- and low-cardinality values with a number of techniques, including a vertical data organization, an innovative bitmap compression technique, and several new bitmap encoding methods... The ability to index high-cardinality data is unique to FastBit and is not supported by other bitmap indexing methods.

Allele Data Variables

Allele = f(Marker, Line, Experiment)

              Allele  Marker  Line  Experiment
Size:         10^9    10^4    10^4  10^1
Cardinality:  2       =       =     =

Bitmap Indexing

The FastBit Technologies

1. vertical data organization

= 'vertical partitioning'. Only a few of the (hundreds of) variables in each partition.

2. bitmap compression: Word-Aligned Hybrid Compression

3. two-level bitmap encoding

Word-aligned Hybrid Compression

• run-length encoding

• 31-bit groups
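The two bullets above can be illustrated with a simplified encoder. This is a minimal sketch, not FastBit's implementation: the word layout (top bit marks a fill word, next bit holds the fill value, low 30 bits count the 31-bit groups) follows the commonly described WAH scheme, and the names `wah_encode` and `wah_decode` are hypothetical. Trailing bits are simply zero-padded to a full group, which real implementations handle separately.

```python
GROUP = 31                    # payload bits per 32-bit word
MAX_COUNT = (1 << 30) - 1     # largest run length a fill word can hold

def wah_encode(bits):
    """Encode a list of 0/1 ints into 32-bit words (simplified WAH sketch)."""
    # Assumption: pad the tail with zeros up to a multiple of 31 bits.
    padded = list(bits) + [0] * (-len(bits) % GROUP)
    words = []
    for i in range(0, len(padded), GROUP):
        value = int("".join(map(str, padded[i:i + GROUP])), 2)
        if value == 0 or value == (1 << GROUP) - 1:
            fill_bit = 1 if value else 0
            prev = words[-1] if words else 0
            same_fill = prev >> 31 == 1 and (prev >> 30) & 1 == fill_bit
            if same_fill and (prev & MAX_COUNT) < MAX_COUNT:
                words[-1] += 1            # extend the current run by one group
            else:
                words.append((1 << 31) | (fill_bit << 30) | 1)
        else:
            words.append(value)           # literal word: top bit stays 0
    return words

def wah_decode(words):
    """Expand the words back into a list of 0/1 ints (including any padding)."""
    bits = []
    for w in words:
        if w >> 31:                       # fill word: repeat one bit value
            bits.extend([(w >> 30) & 1] * (GROUP * (w & MAX_COUNT)))
        else:                             # literal word: 31 raw bits
            bits.extend(int(c) for c in format(w, "031b"))
    return bits

# 62 zeros, one mixed group, then 31 ones: four groups become three words.
sample = [0] * 62 + [1] + [0] * 30 + [1] * 31
encoded = wah_encode(sample)
```

Long runs of identical bits, which dominate sparse bitmaps, collapse into a single fill word, while mixed groups are stored verbatim.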

Two-level Bitmap Encoding

• Approximate solution, then refine.

• Bin the values into groups, e.g. A to G, H to P, Q to Z.

• Encode the bin identifiers as bitmap.

• Encodings: equality, range, interval.

– Interval has half the number of bitmap indexes.

• Multicomponent encoding: Bin the bins to reduce number of bitmap indexes.

• Multi-level encoding: hierarchy of bins, coarse to fine. Use interval encoding for coarse, equality for fine.
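The "approximate, then refine" idea can be sketched with the A to G, H to P, Q to Z bins from the example above. This is an illustration under stated assumptions, not FastBit's code: the function names are hypothetical, values are assumed to start with a capital letter A to Z, and only equality encoding of the bin identifiers is shown.

```python
import bisect

# First letter of each bin, from the example: A to G, H to P, Q to Z.
BIN_STARTS = ["A", "H", "Q"]

def bin_id(value):
    """Index of the bin holding value, decided by its first letter."""
    return bisect.bisect_right(BIN_STARTS, value[0].upper()) - 1

def build_equality_bitmaps(values):
    """One bitmap (list of 0/1) per bin; bit i is set if row i is in that bin."""
    bitmaps = [[0] * len(values) for _ in BIN_STARTS]
    for row, v in enumerate(values):
        bitmaps[bin_id(v)][row] = 1
    return bitmaps

def rows_less_than(values, bitmaps, bound):
    """Rows with value < bound: coarse answer from the bins, then refine."""
    edge = bin_id(bound)
    hits = []
    for row in range(len(values)):
        if any(bitmaps[b][row] for b in range(edge)):
            hits.append(row)      # bin lies wholly below the bound: sure hit
        elif bitmaps[edge][row] and values[row] < bound:
            hits.append(row)      # edge bin: check the raw value to refine
    return hits

values = ["Grape", "Hazel", "Kiwi", "Quince", "Apple", "Lime"]
bitmaps = build_equality_bitmaps(values)
result = rows_less_than(values, bitmaps, "K")
```

Only rows in the edge bin (H to P here) need their raw values read; every other row is decided from the bitmaps alone.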

Indexing Bin Identifiers

Querying on more than one variable

FastBit performs extremely well on multi-variable queries because the intersection between the search results on each variable is a simple AND operation over the resulting bitmaps.
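A minimal sketch of such a query, using Python integers as uncompressed bitmaps. The tiny allele table, the column values, and the helper names are hypothetical, and FastBit itself operates on compressed bitmaps rather than plain integers.

```python
from collections import defaultdict

def build_index(column):
    """Equality-encoded bitmap index: value -> int whose bit i marks row i."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return index

def rows_of(bitmap):
    """List the row numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical allele table, stored one column at a time.
marker = ["m1", "m1", "m2", "m2", "m1"]
line = ["L1", "L2", "L1", "L2", "L2"]

marker_index = build_index(marker)
line_index = build_index(line)

# marker == "m1" AND line == "L2" is a single bitwise AND of two bitmaps.
matches = rows_of(marker_index["m1"] & line_index["L2"])
```

Adding a third condition, for example on Experiment, would cost one more AND rather than another pass over the rows.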

Performance

Instructions

http://crd-legacy.lbl.gov/~kewu/fastbit/doc/quickstart.html
