thomas-kejser
A presentation about column stores, how they work and how you can optimise compression with them
How do Column Stores Work?
Turning Rows into Columns
Sales
Product   Customer    Date         Sale
Beer      Thomas      2011-11-25   2 GBP
Beer      Thomas      2011-11-25   2 GBP
Vodka     Thomas      2011-11-25   10 GBP
Whiskey   Christian   2011-11-25   5 GBP
Whiskey   Christian   2011-11-25   5 GBP
Vodka     Alexei      2011-11-25   10 GBP
Vodka     Alexei      2011-11-25   10 GBP
Product
ID   Value
1    Beer
2    Beer
3    Vodka
4    Whiskey
5    Whiskey
6    Vodka
7    Vodka

Customer
ID   Customer
1    Thomas
2    Thomas
3    Thomas
4    Christian
5    Christian
6    Alexei
7    Alexei
And so on… until…
And we get…
Product
ID   Value
1    Beer
2    Beer
3    Vodka
4    Whiskey
5    Whiskey
6    Vodka
7    Vodka

Customer
ID   Customer
1    Thomas
2    Thomas
3    Thomas
4    Christian
5    Christian
6    Alexei
7    Alexei

Date
ID   Date
1    2011-11-25
2    2011-11-25
3    2011-11-25
4    2011-11-25
5    2011-11-25
6    2011-11-25
7    2011-11-25

Sale
ID   Sale
1    2 GBP
2    2 GBP
3    10 GBP
4    5 GBP
5    5 GBP
6    10 GBP
7    10 GBP
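
The row-to-column split above can be sketched in a few lines of Python. This is only an illustration: the names `ROWS`, `COLUMNS` and `to_columns` are made up for this sketch, and a real column store does not keep Python lists of tuples.

```python
# The Sales table from the slides, stored row by row.
ROWS = [
    ("Beer", "Thomas", "2011-11-25", "2 GBP"),
    ("Beer", "Thomas", "2011-11-25", "2 GBP"),
    ("Vodka", "Thomas", "2011-11-25", "10 GBP"),
    ("Whiskey", "Christian", "2011-11-25", "5 GBP"),
    ("Whiskey", "Christian", "2011-11-25", "5 GBP"),
    ("Vodka", "Alexei", "2011-11-25", "10 GBP"),
    ("Vodka", "Alexei", "2011-11-25", "10 GBP"),
]
COLUMNS = ("Product", "Customer", "Date", "Sale")


def to_columns(rows, names):
    """Split rows into one (ID, Value) list per column; IDs start at 1."""
    return {
        name: [(row_id, row[i]) for row_id, row in enumerate(rows, start=1)]
        for i, name in enumerate(names)
    }


store = to_columns(ROWS, COLUMNS)
print(store["Product"][:3])  # [(1, 'Beer'), (2, 'Beer'), (3, 'Vodka')]
```

The ID is what ties the columns back together: row 4 is whatever sits at ID 4 in every column.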
And what now?
Product
ID   Value
1    Beer
2    Beer
3    Vodka
4    Whiskey
5    Whiskey
6    Vodka
7    Vodka

Run-length encode

Product’
ID    Value
1-2   Beer
3     Vodka
4-5   Whiskey
6-7   Vodka
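
A minimal sketch of the run-length encoding step, assuming the column is held as a plain Python list. The function name `run_length_encode` and the (first ID, last ID, value) layout are choices made for this example only.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (first_id, last_id, value) runs."""
    runs = []
    next_id = 1
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        runs.append((next_id, next_id + length - 1, value))
        next_id += length
    return runs


product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
for first, last, value in run_length_encode(product):
    label = str(first) if first == last else f"{first}-{last}"
    print(label, value)
# 1-2 Beer
# 3 Vodka
# 4-5 Whiskey
# 6-7 Vodka
```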
Applying Compression
Product’
ID    Value
1-2   Beer
3     Vodka
4-5   Whiskey
6-7   Vodka

Customer’
ID    Customer
1-3   Thomas
4-5   Christian
6-7   Alexei

Date’
ID    Date
1-7   2011-11-25

Sale’
ID    Sale
1-2   2 GBP
3     10 GBP
4-5   5 GBP
6-7   10 GBP
Insights
• With a dictionary, every value can be assumed to fit in a machine word (64 bits)
• Compressed size is proportional to the total number of run lengths (RL) across all columns
• The number of RL depends on the ordering of the rows
Product’
ID    Value
1-2   Beer      ← One RL
3     Vodka
4-5   Whiskey
6-7   Vodka
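
The dictionary insight can be sketched like this: each distinct value gets a small integer code, so the column itself only stores fixed-width numbers. Assigning codes in order of first appearance is an assumption of this sketch, not a statement about how any particular engine does it.

```python
def dictionary_encode(values):
    """Replace each value by a small integer code; return (dictionary, codes)."""
    dictionary = {}
    codes = []
    for value in values:
        # Codes are handed out in order of first appearance.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


customer = ["Thomas", "Thomas", "Thomas", "Christian", "Christian", "Alexei", "Alexei"]
dictionary, codes = dictionary_encode(customer)
print(dictionary)  # {'Thomas': 0, 'Christian': 1, 'Alexei': 2}
print(codes)       # [0, 0, 0, 1, 1, 2, 2]
```

Run-length encoding then works on the codes exactly as it does on the original values.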
Ordering Example
Product   Customer
Beer      Thomas
Beer      Thomas
Vodka     Thomas
Whiskey   Christian
Whiskey   Christian
Vodka     Alexei
Vodka     Alexei

Product   Customer
Beer      Thomas
Whiskey   Christian
Vodka     Thomas
Whiskey   Christian
Beer      Thomas
Vodka     Alexei
Vodka     Alexei

Product   Customer
Beer      Thomas
Whiskey   Christian
Vodka     Thomas
Whiskey   Christian
Beer      Thomas
Vodka     Alexei

Product   Customer
Beer      Thomas
Vodka
Whiskey   Christian
Vodka     Alexei

VS.
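
To put numbers on the example, here is a small sketch that counts the runs per column for the two row orders shown. The names `count_runs`, `total_runs`, `clustered` and `shuffled` are made up for this illustration.

```python
from itertools import groupby


def count_runs(values):
    """Number of runs of consecutive equal values."""
    return sum(1 for _ in groupby(values))


def total_runs(rows):
    """Total number of runs over all columns of a list of row tuples."""
    return sum(count_runs(column) for column in zip(*rows))


clustered = [
    ("Beer", "Thomas"), ("Beer", "Thomas"), ("Vodka", "Thomas"),
    ("Whiskey", "Christian"), ("Whiskey", "Christian"),
    ("Vodka", "Alexei"), ("Vodka", "Alexei"),
]
shuffled = [
    ("Beer", "Thomas"), ("Whiskey", "Christian"), ("Vodka", "Thomas"),
    ("Whiskey", "Christian"), ("Beer", "Thomas"),
    ("Vodka", "Alexei"), ("Vodka", "Alexei"),
]

print(total_runs(clustered))  # 7  (4 Product runs + 3 Customer runs)
print(total_runs(shuffled))   # 12 (6 Product runs + 6 Customer runs)
```

Same rows, same columns, but the second order needs almost twice as many runs.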
There is some overhead…
                    Cluster on ID   Heap
Data Size           327MB           327MB
Column Index Size   59MB            142MB
Manipulating the Rules
Rule of Thumb?
“Sort by lowest cardinality column first”
Rationale: Low cardinality columns have potential for long RL
(C1, C2): 68MB
(C2, C1): 61MB
Lowest first is worse!
x N
OK, so what about highest first?
Loose correlation
(C1, C2): 64MB
(C2, C1): 68MB
Highest first is worse!
What are we looking for?
1) Values that are skewed or have low cardinality
2) Columns that correlate/cluster with other columns
Just Read the Magic Code?
• Values with low cardinality are easy (COUNT DISTINCT)
• Is there a more general way to classify the notion of “predictable content of a column”?
• Yes, Entropy: H(X) = -Σ p(x) * log2 p(x), summed over the distinct values x in the column
Coming to Terms with Entropy
• Intuition: A single number expressing the amount of “surprise” at seeing a value in a column
• Consider an example:

                   SKEW       SPLAT      ID
Histogram          (figure)   (figure)   (figure)
COUNT DISTINCT     10001      10001      1000000
DISTINCT / COUNT   0.01       0.01       1
Calculate and Evaluate
Entropy:
SKEW    ≈ 0.21
SPLAT   ≈ 13
ID      ≈ 20
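
A minimal sketch of the entropy calculation, assuming Shannon entropy in bits (log base 2). That assumption fits the numbers above: log2(1000000) ≈ 19.93 for a column of unique IDs, and log2(10001) ≈ 13.29 if the 10001 SPLAT values are spread roughly evenly. The function name and the tiny sample columns are made up for this example.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of the values in one column."""
    counts = Counter(values)
    total = sum(counts.values())
    # p * log2(1/p) summed over the distinct values.
    return sum((c / total) * log2(total / c) for c in counts.values())


constant = ["2011-11-25"] * 8           # no surprise at all
skewed = ["Beer"] * 7 + ["Vodka"]       # one value dominates
uniform = list(range(8))                # every value equally likely

print(entropy(constant))  # 0.0
print(entropy(skewed))
print(entropy(uniform))   # 3.0
```

The skewed column lands between the two extremes: the more one value dominates, the closer the entropy gets to zero.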
New theory: Lower Entropy First
You will NEVER win
Take best of these
Columns that “cluster” with other columns?
• Is there a way to calculate this?
• Yes indeed, information theory to the rescue again
• Mutual information I(X;Y): the information that X and Y share
• It is H(Y) minus “the information left in Y, given that I know X”, which is H(Y | X)
Mutual WHAT?
H(X | Y)   H(Y | X)   I(X;Y)
From I(X;Y) we can find the distance
[Figure: columns C1, C2 and C3 as points in the information plane (axes X and Y), joined by the distances d(c1,c2) and d(c2,c3)]
“Find the minimal distance that visits all columns in the information plane”
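
The distance formula itself is not in the slide text. The sketch below assumes one common choice, the variation of information d(X,Y) = H(X | Y) + H(Y | X) = H(X,Y) - I(X;Y). Treat it as an illustration of the idea rather than the exact formula used in the talk; all function names are made up for this example.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of a sequence of hashable values."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def information_distance(xs, ys):
    """Assumed distance: H(X|Y) + H(Y|X) = 2*H(X,Y) - H(X) - H(Y)."""
    joint = entropy(list(zip(xs, ys)))
    return 2 * joint - entropy(xs) - entropy(ys)


product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
customer = ["Thomas", "Thomas", "Thomas", "Christian", "Christian", "Alexei", "Alexei"]
sale = ["2 GBP", "2 GBP", "10 GBP", "5 GBP", "5 GBP", "10 GBP", "10 GBP"]

# Product and Sale determine each other, so the distance is (close to) zero.
print(information_distance(product, sale))
# Product and Customer only partly predict each other, so the distance is positive.
print(information_distance(product, customer))
```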
So, how does THAT work?
Better… not impressive… more consistent
Take best of these
Medicine is compared against placebo
What else does d(X,Y) tell us?
• Consider this fact table:
d(A, B) is zero!
What is our expected estimate of rows?
Dodge this!
Fix:
Why is this so Hard?
Reflecting on Information Distance
“Find the shortest path that visits all cities on a map”
Picture Credits: RUC.dk
How many routes are there?
n! = n * (n-1) * (n-2) * … * 1
Travelling Salesman Problem
(TSP)
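
To make the n! concrete, here is a brute-force sketch that tries every lexicographic column order on three columns of the Sales table and keeps the one with the fewest runs. It is only workable for a handful of columns. The function names are made up, and plain string sorting is an assumption of this example.

```python
from itertools import groupby, permutations


def total_runs(rows):
    """Total number of runs over all columns of a list of row tuples."""
    return sum(sum(1 for _ in groupby(col)) for col in zip(*rows))


def best_lexicographic_order(rows):
    """Try all n! column orders; return (runs, order) with the fewest runs."""
    n = len(rows[0])
    best = None
    for order in permutations(range(n)):
        ordered = sorted(rows, key=lambda r: tuple(r[i] for i in order))
        runs = total_runs(ordered)
        if best is None or runs < best[0]:
            best = (runs, order)
    return best


# Columns: 0 = Product, 1 = Customer, 2 = Sale
rows = [
    ("Beer", "Thomas", "2 GBP"), ("Whiskey", "Christian", "5 GBP"),
    ("Vodka", "Thomas", "10 GBP"), ("Whiskey", "Christian", "5 GBP"),
    ("Beer", "Thomas", "2 GBP"), ("Vodka", "Alexei", "10 GBP"),
    ("Vodka", "Alexei", "10 GBP"),
]
runs, order = best_lexicographic_order(rows)
print(runs, order)
```

With 3 columns that is 6 sorts; with 20 columns it is already about 2.4 * 10^18.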
There are MORE than n! routes
• What if lexicographical ordering of the columns isn’t best?
• Daniel Lemire et al.: “Reordering Rows for Better Compression: Beyond the Lexicographic Order” ( http://arxiv.org/pdf/1207.2189.pdf )
• Some may be ruled out immediately (ex: don’t go to Skagen from Copenhagen and then to Roskilde)
• Local optima are an issue
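
A small sketch of how a local optimum bites. It uses a nearest-neighbour walk, which is a standard TSP heuristic picked for this illustration and not necessarily the one used in the talk. The distances between the columns C1..C4 are made-up numbers.

```python
def greedy_column_order(columns, distance):
    """Nearest-neighbour path: start at the first column, always hop to the closest unvisited one."""
    remaining = list(columns)
    path = [remaining.pop(0)]
    while remaining:
        nearest = min(remaining, key=lambda c: distance(path[-1], c))
        remaining.remove(nearest)
        path.append(nearest)
    return path


def path_length(path, distance):
    """Sum of the distances between consecutive columns on the path."""
    return sum(distance(a, b) for a, b in zip(path, path[1:]))


# Hypothetical pairwise distances, for illustration only.
DIST = {
    frozenset(("C1", "C2")): 1, frozenset(("C2", "C3")): 2,
    frozenset(("C3", "C4")): 10, frozenset(("C1", "C3")): 3,
    frozenset(("C2", "C4")): 3, frozenset(("C1", "C4")): 9,
}


def d(a, b):
    return DIST[frozenset((a, b))]


greedy = greedy_column_order(["C1", "C2", "C3", "C4"], d)
print(greedy, path_length(greedy, d))                        # ['C1', 'C2', 'C3', 'C4'] 13
print(path_length(["C1", "C3", "C2", "C4"], d))              # 8
```

Every hop of the greedy walk is the cheapest one available at that moment, yet the whole path is longer than one that starts with a more expensive hop.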
Heuristics are your Best Bet
• “Find Minimum RLE” can be shown to be NP-complete
• There is no known fast algorithm that finds the optimum
• I have shown you one heuristic
  • Moderate gain for a small effort
  • Shown that 2x gains are possible
  • Any ordering is (typically) better than random, often by a lot
• I wrote a tool to help analyse: TableStat.exe
  • Interested? Come up to talk after
  • I need more real life datasets to test on
P = NP ? We just don’t know