Upload
octavia-bridges
View
219
Download
2
Embed Size (px)
Citation preview
CS 345:Topics in Data Warehousing
Thursday, October 28, 2004
Review of Tuesday’s Class
• Bitmap compression with BBC codes– Gaps and Tails– Variable byte-length encoding of lengths– Special handling of lone bits
• Speeding up star joins– Cartesian product of dimensions– Semi-join reduction– Early aggregation
Outline of Today’s Class
• Join Indexes
• Projection Indexes– Horizontal vs. Vertical decomposition
• Bit-Sliced Indexes– Fast bitmap counts and sums– Range queries
• Bit Vector Filtering
*O’Neil and Quass, “Improved Query Performance with Variant Indexes”, 1997
Join Indexes
• Generally, indexes are built on the tuples of a single relation
• Join indexes include attributes from more than one relation– Or else index entries consist of attributes from one
relation and RIDs from another
• Example:– Standard index on Customer.State
• Index entry consists of <State, Customer RID> pair
– Join index on Customer.State & Sales fact• Index entry consists of <State, Sales RID> pair
Defining Join Indexes
• Join indexes are like indexes on a materialized view– Except that the view isn’t actually materialized…
• Oracle syntax:– CREATE BITMAP INDEX ON Sales(Customer.State)FROM Sales, CustomerWHERE Sales.Customer_key = Customer.Customer_key
• Creates an index with <State, Sales RID> entries • Join condition used in joining the tables is specified at
index creation time
Why Join Indexes?
• Consider a star query plan using semi-join reductions– For each dimension:
• Determine keys of dimension rows that satisfy filter conditions• Join list of dimension keys to fact table (using index) • Result = list of fact RIDs
– Merge lists of fact RIDs from all dimensions– Retrieve matching fact rows– Join back to grouping dimensions– Perform grouping and aggregation
• Join indexes allow the join step to be skipped– Join is precomputed when join index created– Fact RID list stored directly in join index
• Bitmap join indexes are particularly common– Other types are possible (B-Tree, etc.)– Bitmap join indexes facilitate RID list merging / index intersection
Limitations of Join Indexes• Index creation cost is high
– Need to compute join result to create index• Index maintenance cost is high
– Usually not an issue in data warehouses– Data warehouse is “read-mostly”– Drop index before warehouse load begins– Re-create after load completes
• Can’t access non-indexed columns of dimension table– SELECT Customer.Age, SUM(Quantity)
FROM Sales, CustomerWHERE Sales.Customer_key = Customer.Customer_keyAND Customer.State = 'CA'
– Using ordinary index on State:• Get dimension table RIDs using index• Look up dimension rows to learn Age for each• Join to fact table to learn Quantity
– Using join index on State:• Get fact RIDs using index• Look up fact rows to learn Quantity• Join back to dimension table to learn Age
Horizontal vs. Vertical Decomposition
• Relations have rows and columns– “Two-dimensional” structure
• Disk access model is one-dimensional stream– Disk spins fast, disk arm moves slowly– Sequential I/O = leave the disk arm in one place– Essentially, disk provides one-dimensional locality
• Need to store the bytes in some particular order– Embedding 2D structure in 1D space
• Two options:– Horizontal decomposition (“row-major” order)– Vertical decomposition (“column-major” order)
Horizontal vs. Vertical
• Horizontal decomposition– All attributes of a record
are stored together– Values for the same
attribute but different records are separated
– Standard DBMS storage model
• Vertical decomposition– All values of an attribute
are stored together– Values for the same record
but different attributes are separated
Col1 Col2 Col3
1 4 7
2 5 8
3 6 9
Col1 Col2 Col3
1 2 3
4 5 6
7 8 9
Horizontal
Vertical
Horizontal vs. Vertical
• Horizontal decomposition– Good locality when:
• Multiple attributes from same record are co-accessed• Not too many records are co-accessed
– For example: Inserts and deletes• Entire record is in one place• Perform insert or delete with a single I/O
• Vertical decomposition– Good locality when:
• Multiple values of the same attribute are co-accessed• Not too many attributes are co-accessed
– Frequently the case for OLAP queries
Vertical Decomposition Example
• SELECT customer_key, incomeFROM CustomerWHERE state = 'CA'AND age > 50– Subplan for OLAP query that filters on state, age and groups on income– Suppose Customer has 100 attributes
• Horizontally decomposed data:– Scan table, one record at a time– Ignore 96 attributes of each record– Use 4 attributes (state, age, income, customer_key)– Lots of wasted I/O bandwidth
• Vertically decomposed data:– Parallel scans of customer_key, income, state, and age columns– If ith state = ‘CA’ and ith age > 50, then output ith customer_key and ith
income– No redundant attributes are read from disk– I/O is kept to a minimum
Projection Index
• The idea behind projection indexes– Databases usually store data in horizontal format– Vertical format is more efficient for many analysis queries– Why not do both?
• Projection index– Logically: Index entries are <Value, RID> pairs– Stored in same order as records in relation
• i.e. sorted by RID instead of sorted by Value– In practice: Storing RID is unnecessary
• Array storage format• Array index determined from RID
• Incremental approach to vertical decomposition– Vertical decomposition = complete set of projection indexes– A few projection indexes = partial vertical decomposition
Using Projection Indexes
• Consider a star query via semi-join decomposition– For each dimension:
• Determine keys of dimension rows that satisfy filter conditions
• Join list of dimension keys to fact table (using index) • Result = list of fact RIDs
– Merge lists of fact RIDs from all dimensions– Retrieve matching fact rows– Join back to grouping dimensions– Perform grouping and aggregation
Using Projection Indexes
• “Retrieve matching fact rows” step can be expensive– All we care about are values of aggregated columns– Other attributes are irrelevant (e.g. dimension foreign
keys)– Poor packing of relevant information into disk pages
• Relevant and irrelevant data is clumped together
• Using a projection index is often faster– List of fact RIDs tells us which index entries to read– Many index entries packed into same disk page
• No “clutter” from irrelevant fields
– Fewer I/Os required
Bit-Sliced Indexes• Bit-sliced indexes
– Generally used for measurement columns– Allow for:
• Efficient aggregation• Efficient range filtering (particularly for large ranges)
– Most suitable for bitmap-based plans– Requires positive integer-valued column
• Fixed-precision decimals OK• Example: Interpret $5.67 as 567 cents
• Bit-sliced index on attribute A– Treat A as multiple logical binary-valued columns
• Column A1 = Least significant bit of A• Column A2 = 2nd least significant bit of A• Etc.• Number of logical columns determined by max value
– Store each column as a separate bitmap
Bit-Sliced Index Example
Amount
5
13
2
6
7
Binary
0101
1101
0010
0110
0111
B4: 01000B3: 11011B2: 00111B1: 11001
Bit-Sliced Index
Fast Bitmap Aggregation
• Take advantage of word-level parallelism– Implicit parallelism arising from SIMD
operations in modern computer architectures– SIMD: Single Instruction, Multiple Data– Processor can compute bitwise operations on
all bits in a word at the same time– Index intersection via bitmap merge takes
advantage of this fact– It’s one reason for byte-aligned compression
Fast Bitmap Count
• Count the number of 1’s in a bitmap– Treat the bitmap as a byte array– Pre-compute lookup table with number of 1’s in each byte– Cycle through bitmap one byte at a time, accumulating count
using lookup table
• Pseudocode– count = 0;for (int i = 0; i < n/8; i++)
count += numSetBits[bitmap[i]];– numSetBits[0] = 0, numSetBits[7] = 3, etc.
• Treating bitmap as short int array → even faster– Lookup table has 65536 entries instead of 256– Bitmap of n bits → only add n/16 numbers
SUM using Bit-Sliced Index
• Suppose Bf represents the foundset
– foundset = List of fact RIDs that pass all filters
• For each bit slice Bi:
– Compute Bf AND Bi
– Do fast bitmap count of resulting bitmap– Multiply count by 2i
• Total = sum of weighted counts for all slices
SUM Example
Amount
5
13
2
6
7
B4: 01000B3: 11011B2: 00111B1: 11001
Bit-Sliced Index
Count of B4: 1Count of B3: 4Count of B2: 3Count of B1: 3
1*24 + 4*23 + 3*22 + 3*21
= 8 + 16 + 6 + 3 = 33
5 + 13 + 2 + 6 + 7 = 33
Range Filtering with Bit-Sliced Indexes
• Bit-sliced indexes allow range filtering
• Cost of applying range predicate independent of size of range– Not true for bitmap indexes, B-Trees
• We’ll give the algorithm for “A < c”– A is the attribute that is indexed– c is some constant– Other operations (>, =, etc.) are similar
Pseudocode for “A < c”• Set BLT = all zeros. Set BEQ = all ones.• For each bit slice Bi, from most to least significant:
– If bit i of constant c is 1:• BLT = BLT OR (BEQ AND NOT(Bi))• BEQ = BEQ AND Bi
– If bit i of constant c is 0:• BEQ = BEQ AND NOT(Bi)
• Return BLT.
• Why does it work?• Invariant: BEQ = 1 for all rows that match c on the
most significant bits (and only those rows)• A value x is less than c iff for some bit i:
– x and c agree on all bits more significant than i– The ith bit of x is 0, and the ith bit of c is 1
“Amount < 7” ExampleAmount
5
13
2
6
7
B4: 01000B3: 11011B2: 00111B1: 11001
Bit-Sliced Index
7 = 0111
BLT BEQ
00000 11111000000010010100
10110
101111001100011
00001
Bit-Sliced vs. Projection Index
• Both benefit from vertical decomposition– Values for 1 column are packed onto as few pages as
possible
• Both allow for fast aggregation• Both allow range filtering• Bit-sliced indexes can be faster when:
– Attribute values don’t use full data type precision– Query is CPU-bound
• Fewer machine instructions due to SIMD parallelism
• Bit-sliced indexes are more complicated
Bit Vector Filtering
• Sometimes called “Bloom Join”– From “Bloom Filters” [Bloom 1970]
• Bloom filter = cheap, approximate semi-join– Store bitmap with 1 bit per hash bucket– Initialize bitmap to all zeros– For each record in relation A:
• Compute hash value of join key• Set the bit for the appropriate hash bucket to 1
• In distributed database, send bitmap to Server 2.• For each record in relation B:
– Compute hash value of join key– Check whether bit for the appropriate bucket is set
• Yes → Record might join to something in A• No → Record definitely doesn’t join to anything in A
• Send qualifying B records to Server 1 & compute join
Bit Vector Filtering• Also useful in non-distributed database• Applies to multi-pass hash or merge join• Hash join
– Partition relation A– Partition relation B– Join each A partition with matching B partition
• Merge join– Generate sorted runs for relation A– Generate sorted runs for relation B– Merge A’s runs, Merge B’s runs, Merge A & B
• Bit vector filtering– While pre-processing A, generate Bloom filter– While pre-processing B, discard records that don’t match filter– Significantly reduce size of B at low cost– Since A is already being scanned, generating the Bloom filter is “free”– Bloom filter can be made as small or large as memory permits
• Fewer buckets → more collisions → less effective filtering• Fewer buckets → less memory to store bitmap
Next week: Physical DB Design
• This concludes the query processing topic• Next week, we’ll begin physical database design
– Selection of indexes and materialized views– Partitioning and data layout– RAID and hardware considerations– Physical design trade-offs– Database tuning
• Note on course project:– Start thinking about topics– We’ll discuss the project in more detail next class.