How do Column Stores Work?

Optimising Column stores with statistical analysis

DESCRIPTION

A presentation about column stores, how they work and how you can optimise compression with them

Page 1: Optimising Column stores with statistical analysis

How do Column Stores Work?

Page 2: Optimising Column stores with statistical analysis

Turning Rows into Columns

Sales

Product   Customer    Date         Sale
Beer      Thomas      2011-11-25   2 GBP
Beer      Thomas      2011-11-25   2 GBP
Vodka     Thomas      2011-11-25   10 GBP
Whiskey   Christian   2011-11-25   5 GBP
Whiskey   Christian   2011-11-25   5 GBP
Vodka     Alexei      2011-11-25   10 GBP
Vodka     Alexei      2011-11-25   10 GBP

Product

ID   Value
1    Beer
2    Beer
3    Vodka
4    Whiskey
5    Whiskey
6    Vodka
7    Vodka

Customer

ID   Customer
1    Thomas
2    Thomas
3    Thomas
4    Christian
5    Christian
6    Alexei
7    Alexei

And so on… until…

Page 3: Optimising Column stores with statistical analysis

And we get…

Product

ID   Value
1    Beer
2    Beer
3    Vodka
4    Whiskey
5    Whiskey
6    Vodka
7    Vodka

Customer

ID   Customer
1    Thomas
2    Thomas
3    Thomas
4    Christian
5    Christian
6    Alexei
7    Alexei

Date

ID   Date
1    2011-11-25
2    2011-11-25
3    2011-11-25
4    2011-11-25
5    2011-11-25
6    2011-11-25
7    2011-11-25

Sale

ID   Sale
1    2 GBP
2    2 GBP
3    10 GBP
4    5 GBP
5    5 GBP
6    10 GBP
7    10 GBP
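The row-to-column split shown on these two slides can be sketched in a few lines of Python. This is only an illustration: `rows_to_columns` and the variable names are hypothetical, and a real column store would keep each column in its own storage segment rather than in a Python list. The position of a value in its list plays the role of the ID.

```python
# Row-oriented Sales table from the slides.
column_names = ("Product", "Customer", "Date", "Sale")
rows = [
    ("Beer", "Thomas", "2011-11-25", "2 GBP"),
    ("Beer", "Thomas", "2011-11-25", "2 GBP"),
    ("Vodka", "Thomas", "2011-11-25", "10 GBP"),
    ("Whiskey", "Christian", "2011-11-25", "5 GBP"),
    ("Whiskey", "Christian", "2011-11-25", "5 GBP"),
    ("Vodka", "Alexei", "2011-11-25", "10 GBP"),
    ("Vodka", "Alexei", "2011-11-25", "10 GBP"),
]


def rows_to_columns(rows, column_names):
    """Split row tuples into one list per column (hypothetical helper).

    The list index (plus one) corresponds to the ID on the slides.
    """
    columns = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns


columns = rows_to_columns(rows, column_names)
for name, values in columns.items():
    print(name, values)
```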

Page 4: Optimising Column stores with statistical analysis

And what now?

Product

ID   Value
1    Beer
2    Beer
3    Vodka
4    Whiskey
5    Whiskey
6    Vodka
7    Vodka

Run-length encode

Product’

ID    Value
1-2   Beer
3     Vodka
4-5   Whiskey
6-7   Vodka
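A minimal sketch of the run-length encoding step, assuming 1-based row IDs as on the slide. The function name `run_length_encode` and the `(first_id, last_id, value)` tuple layout are illustrative choices, not the format any particular engine uses.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (first_id, last_id, value) runs."""
    runs = []
    row_id = 1  # IDs on the slides start at 1
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        runs.append((row_id, row_id + length - 1, value))
        row_id += length
    return runs


product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
encoded = run_length_encode(product)
for first_id, last_id, value in encoded:
    label = str(first_id) if first_id == last_id else f"{first_id}-{last_id}"
    print(label, value)
```

Seven values collapse into four runs here, matching the Product’ table.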

Page 5: Optimising Column stores with statistical analysis

Applying Compression

Product’

ID    Value
1-2   Beer
3     Vodka
4-5   Whiskey
6-7   Vodka

Customer’

ID    Customer
1-3   Thomas
4-5   Christian
6-7   Alexei

Date’

ID    Date
1-7   2011-11-25

Sale’

ID    Sale
1-2   2 GBP
3     10 GBP
4-5   5 GBP
6-7   10 GBP

Page 6: Optimising Column stores with statistical analysis

Insights

• With a dictionary, every value can be assumed to fit in a machine word (64 bits)
• Compressed size is proportional to the total number of run lengths (RL) across all columns
• The number of RLs depends on the ordering of the rows

Product’

ID    Value
1-2   Beer
3     Vodka
4-5   Whiskey
6-7   Vodka

Each line of Product’ is one RL
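The dictionary point can be illustrated with a small sketch: each distinct value is replaced by an integer code, so the column holds only fixed-width numbers whatever the original strings looked like. `dictionary_encode` is a hypothetical helper, and assigning codes in order of first appearance is an assumption made for the example.

```python
def dictionary_encode(values):
    """Replace each value by a small integer code (hypothetical helper).

    Codes are assigned in order of first appearance.
    """
    dictionary = {}
    codes = []
    for value in values:
        # len(dictionary) is evaluated before the insert, so new values
        # receive the next unused code.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
dictionary, codes = dictionary_encode(product)
print(dictionary)
print(codes)
```

Run-length encoding can then be applied to the integer codes instead of the strings; the number of runs is the same either way.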

Page 7: Optimising Column stores with statistical analysis

Ordering Example

Row order A

Product   Customer
Beer      Thomas
Beer      Thomas
Vodka     Thomas
Whiskey   Christian
Whiskey   Christian
Vodka     Alexei
Vodka     Alexei

Row order A, run-length encoded (4 Product runs, 3 Customer runs)

Product   Customer
Beer      Thomas
Vodka
Whiskey   Christian
Vodka     Alexei

VS.

Row order B

Product   Customer
Beer      Thomas
Whiskey   Christian
Vodka     Thomas
Whiskey   Christian
Beer      Thomas
Vodka     Alexei
Vodka     Alexei

Row order B, run-length encoded (6 Product runs, 6 Customer runs)

Product   Customer
Beer      Thomas
Whiskey   Christian
Vodka     Thomas
Whiskey   Christian
Beer      Thomas
Vodka     Alexei
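The effect of row order on the number of runs can be checked directly. The sketch below counts runs per column for the two row orders from this slide; `count_runs` and `total_runs` are illustrative names, and the total is only a proxy for compressed size.

```python
from itertools import groupby


def count_runs(values):
    """Number of runs of consecutive equal values in one column."""
    return sum(1 for _ in groupby(values))


def total_runs(rows):
    """Sum of run counts over all columns of a list of row tuples."""
    return sum(count_runs(column) for column in zip(*rows))


# The same seven (Product, Customer) rows in two different orders.
order_a = [
    ("Beer", "Thomas"), ("Beer", "Thomas"), ("Vodka", "Thomas"),
    ("Whiskey", "Christian"), ("Whiskey", "Christian"),
    ("Vodka", "Alexei"), ("Vodka", "Alexei"),
]
order_b = [
    ("Beer", "Thomas"), ("Whiskey", "Christian"), ("Vodka", "Thomas"),
    ("Whiskey", "Christian"), ("Beer", "Thomas"),
    ("Vodka", "Alexei"), ("Vodka", "Alexei"),
]

print(total_runs(order_a))  # 4 Product runs + 3 Customer runs
print(total_runs(order_b))  # 6 Product runs + 6 Customer runs
```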

Page 8: Optimising Column stores with statistical analysis

There is some overhead…

                    Cluster on ID   Heap
Data Size           327MB           327MB
Column Index Size   59MB            142MB

Page 9: Optimising Column stores with statistical analysis

Manipulating the Rules

Page 10: Optimising Column stores with statistical analysis

Rule of Thumb?

“Sort by lowest cardinality column first”

Rationale: Low cardinality columns have potential for long RL

(C1, C2): 68MB
(C2, C1): 61MB

Lowest first is worse!

x N

Page 11: Optimising Column stores with statistical analysis

OK, so what about highest first?

Loose correlation

(C1, C2): 64MB
(C2, C1): 68MB

Highest first is worse!

Page 12: Optimising Column stores with statistical analysis

What are we looking for?

1) Values that are skewed or have low cardinality

2) Columns that correlate/cluster with other columns

Page 13: Optimising Column stores with statistical analysis

Just Read the Magic Code?

• Values with low cardinality are easy (COUNT DISTINCT)
• Is there a more general way to classify the notion of “predictable content of a column”?
• Yes, Entropy: H(X) = −Σ p(x) · log2 p(x), summed over the distinct values x of the column

Page 14: Optimising Column stores with statistical analysis

Coming to Terms with Entropy

• Intuition: A single number expressing the amount of “surprise” at seeing a value in a column
• Consider an example:

                   SKEW    SPLAT   ID
Histogram          [histogram plots]
COUNT DISTINCT     10001   10001   1000000
DISTINCT / COUNT   0.01    0.01    1

Page 15: Optimising Column stores with statistical analysis

Calculate and Evaluate

Entropy (bits)

SKEW    ≈ 0.21
SPLAT   ≈ 13
ID      ≈ 20
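A minimal sketch of how such numbers can be computed, assuming the standard Shannon entropy with base-2 logarithms (this matches the ≈ 20 bits for a unique ID over 1,000,000 rows, since log2(1,000,000) ≈ 19.9). The three columns below are small made-up stand-ins with 1,024 rows each, not the data behind the slide.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of the value distribution in a column."""
    values = list(values)
    total = len(values)
    counts = Counter(values)
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Hypothetical 1,024-row columns, for illustration only.
id_column = list(range(1024))                  # every value unique
splat_column = [i % 16 for i in range(1024)]   # 16 values, evenly spread
skew_column = [0] * 1000 + list(range(1, 25))  # one value dominates

for name, column in [("SKEW", skew_column), ("SPLAT", splat_column), ("ID", id_column)]:
    print(name, round(entropy(column), 2))
```

The unique column reaches log2(1024) = 10 bits, the evenly spread column log2(16) = 4 bits, and the skewed column stays well below 1 bit even though it has more distinct values than the evenly spread one. That gap is what COUNT DISTINCT alone cannot show.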

Page 16: Optimising Column stores with statistical analysis

New theory: Lower Entropy First

You will NEVER win

Take best of these

Page 17: Optimising Column stores with statistical analysis

Columns that “cluster” with other columns?

• Is there a way to calculate this?
• Yes indeed, information theory to the rescue again
• Mutual information I(X;Y): “How much knowing X tells me about Y”
• The information left in Y, given that I know X, is the conditional entropy H(Y | X)

Page 18: Optimising Column stores with statistical analysis

Mutual WHAT?

H(X | Y)   H(Y | X)   I(X;Y)
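The three quantities can be computed from plain value counts using the standard identities I(X;Y) = H(X) + H(Y) − H(X,Y) and H(Y | X) = H(X,Y) − H(X). The sketch below is an illustration with hypothetical function names, applied to the Product and Customer columns of the Sales example.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of the value distribution."""
    values = list(values)
    total = len(values)
    counts = Counter(values)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(zip(xs, ys))


def conditional_entropy(ys, xs):
    """H(Y | X) = H(X,Y) - H(X): what is left in Y once X is known."""
    return entropy(zip(xs, ys)) - entropy(xs)


product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
customer = ["Thomas", "Thomas", "Thomas", "Christian", "Christian", "Alexei", "Alexei"]

print("I(Product;Customer) =", round(mutual_information(product, customer), 3))
print("H(Customer | Product) =", round(conditional_entropy(customer, product), 3))
print("H(Product | Customer) =", round(conditional_entropy(product, customer), 3))
```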

Page 19: Optimising Column stores with statistical analysis

From I(X;Y) we can find the distance

“Find the minimal distance that visits all columns in the information plane”

[Figure: columns C1, C2 and C3 plotted in the X/Y information plane, joined by the distances d(c1,c2) and d(c2,c3)]
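The slides do not spell out how d is derived from I(X;Y). One common choice, assumed here, is the variation of information d(X,Y) = H(X | Y) + H(Y | X) = H(X,Y) − I(X;Y), which is zero exactly when each column determines the other. The greedy nearest-neighbour walk below is likewise only one hypothetical way to build a path through the columns; it is not guaranteed to find the minimal one.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of the value distribution."""
    values = list(values)
    total = len(values)
    counts = Counter(values)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def distance(xs, ys):
    """Variation of information (assumed metric): 2*H(X,Y) - H(X) - H(Y)."""
    return 2 * entropy(zip(xs, ys)) - entropy(xs) - entropy(ys)


def greedy_column_path(columns, start):
    """Visit every column once, always moving to the nearest unvisited one."""
    path = [start]
    remaining = set(columns) - {start}
    while remaining:
        last = columns[path[-1]]
        # sorted() makes tie-breaking deterministic
        nearest = min(sorted(remaining), key=lambda name: distance(last, columns[name]))
        path.append(nearest)
        remaining.remove(nearest)
    return path


# Hypothetical columns: C2 is a relabelling of C1, C3 is independent of both.
columns = {
    "C1": [0, 0, 1, 1],
    "C2": ["x", "x", "y", "y"],
    "C3": [0, 1, 0, 1],
}
print(greedy_column_path(columns, "C1"))
```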

Page 20: Optimising Column stores with statistical analysis

So, how does THAT work?

Better… not impressive… more consistent

Take best of these

Page 21: Optimising Column stores with statistical analysis

Medicine is compared against placebo

Page 22: Optimising Column stores with statistical analysis

What else does d(X,Y) tell us?

• Consider this fact table:

d(A, B) is zero!

What is our expected estimate of rows?

Page 23: Optimising Column stores with statistical analysis

Dodge this!

Fix:

Page 24: Optimising Column stores with statistical analysis

Why is this so Hard?

Page 25: Optimising Column stores with statistical analysis

Reflecting on Information Distance

“Find the shortest path that visits all cities on a map”

Picture Credits: RUC.dk

How many routes are there?

n! = n * (n-1) * (n-2) * … * 1

Travelling Salesman Problem

(TSP)
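For a handful of columns the n! column orders can simply be enumerated. The sketch below sorts the rows lexicographically under every column permutation and keeps the one with the fewest total runs; the function names and the four-row table are made up for illustration, and the approach stops being practical quickly as n grows (10 columns already give 3,628,800 orders).

```python
from itertools import groupby, permutations


def count_runs(values):
    """Number of runs of consecutive equal values in one column."""
    return sum(1 for _ in groupby(values))


def runs_for_column_order(rows, order):
    """Sort rows lexicographically by the given column indexes, then total the runs."""
    sorted_rows = sorted(rows, key=lambda row: tuple(row[i] for i in order))
    return sum(count_runs(column) for column in zip(*sorted_rows))


def best_column_order(rows):
    """Brute force over all n! column orders (only feasible for small n)."""
    n = len(rows[0])
    return min(permutations(range(n)), key=lambda order: runs_for_column_order(rows, order))


# Hypothetical table: column 0 is unique, column 1 has two values.
rows = [("a", 2), ("b", 1), ("c", 2), ("d", 1)]

for order in permutations(range(2)):
    print(order, runs_for_column_order(rows, order))
print("best:", best_column_order(rows))
```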

Page 26: Optimising Column stores with statistical analysis

There are MORE than n! routes

• What if lexicographical ordering of the columns isn’t best?
• Daniel Lemire et al: “Reordering Rows for Better Compression: Beyond the Lexicographic Order” ( http://arxiv.org/pdf/1207.2189.pdf )
• Some may be ruled out immediately (ex: don’t go to Skagen from Copenhagen and then to Roskilde)
• Local optima are also an issue

Page 27: Optimising Column stores with statistical analysis

Heuristics are your Best Bet

• “Find Minimum RLE” can be shown to be NP-complete
• There is no known fast algorithm that finds the optimal ordering
• I have shown you one heuristic
• Moderate gain for a small effort
• Shown that 2x gains are possible
• Any ordering is (typically) better than random, often by a lot
• I wrote a tool to help analyse: TableStat.exe
• Interested? Come up to talk after
• I need more real life datasets to test on

Page 28: Optimising Column stores with statistical analysis
Page 29: Optimising Column stores with statistical analysis
Page 30: Optimising Column stores with statistical analysis

P = NP? We just don’t know