BITMAPS & Starjoins

BITMAPS & Starjoins. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of


BITMAPS & Starjoins


Indexing datacubes

Objective: speed queries up.

Traditional databases (OLTP): B-Trees

• Lookup time is logarithmic in the number of indexed keys (space is linear).

• Dynamic, stable, and perform well under updates. (But OLAP is not about updates…)

Bitmaps:

• Space efficient

• Difficult to update (but we don’t care in DW).

• Can effectively prune searches before looking at data.


Bitmaps

R = (…, A, …, M)

R(A)  B8 B7 B6 B5 B4 B3 B2 B1 B0
 3     0  0  0  0  0  1  0  0  0
 2     0  0  0  0  0  0  1  0  0
 1     0  0  0  0  0  0  0  1  0
 2     0  0  0  0  0  0  1  0  0
 8     1  0  0  0  0  0  0  0  0
 2     0  0  0  0  0  0  1  0  0
 2     0  0  0  0  0  0  1  0  0
 0     0  0  0  0  0  0  0  0  1
 7     0  1  0  0  0  0  0  0  0
 5     0  0  0  1  0  0  0  0  0
 6     0  0  1  0  0  0  0  0  0
 4     0  0  0  0  1  0  0  0  0


Query optimization

Consider a query with a high selectivity factor and predicates on two attributes.

The query optimizer builds candidate plans:

(P1) Full relation scan (filter as you go).

(P2) Index scan on the predicate with the lower selectivity factor, followed by a scan of the temporary relation to filter out non-qualifying tuples using the other predicate. (Works well if the data is clustered on the first index key.)

(P3) Index scan for each predicate (separately), followed by a merge of the RID lists.


Query optimization (continued)

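[Figure (P2): the index on Pred1 yields tuples t1 … tn in the blocks of data; these are filtered with Pred. 2 to produce the answer.]

[Figure (P3): the index on Pred1 yields tuple list 1 and the index on Pred2 yields tuple list 2 (tuples t1 … tn); the two lists are combined into a merged list.]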


Query optimization (continued)

When using bitmap indexes, (P3) can be an easy winner!

CPU operations on bitmaps (AND, OR, XOR, etc.) are more efficient than regular RID merges: just apply the binary operations to the bitmaps.

(With B-trees, you would have to scan the two RID lists and select the tuples that appear in both: a merge operation.)

Of course, you can build B-trees on the compound key, but you would need one for every compound predicate (an exponential number of trees…).


Bitmaps and predicates

A = a1 AND B = b2

(Bitmap for a1) AND (Bitmap for b2) = Bitmap for (a1 and b2)
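A minimal sketch of the idea above, assuming each bitmap is stored as a Python integer whose bit i stands for row i. The columns `A` and `B` and the helper names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def build_value_list_index(column):
    """Map each distinct value to a bitmap; bit i is set iff row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return index

def rows_of(bitmap):
    """Row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical columns of one relation: row i of A pairs with row i of B.
A = ["a1", "a2", "a1", "a3", "a1"]
B = ["b2", "b2", "b1", "b2", "b2"]

index_a = build_value_list_index(A)
index_b = build_value_list_index(B)

# A = a1 AND B = b2 becomes a single bitwise AND of two bitmaps.
matching = index_a["a1"] & index_b["b2"]
print(rows_of(matching))
```

No RID lists are merged: the conjunction is evaluated on the bitmaps alone, and only the surviving rows need to be fetched.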


Tradeoffs

Dimension cardinality small → dense bitmaps

Dimension cardinality large → sparse bitmaps → compression (and decompression)


Query strategy for Star joins

Maintain join indexes between the fact table and the dimension tables.

[Figure: a fact table with columns Product, Type, Location, linked to its dimension tables. Each dimension value has a bitmap over the fact table rows: a bitmap per product, a bitmap per type (type a … type k), and a bitmap per location.]


Strategy example

Aggregate all sales for products of location …, … or …

(Bitmap for …) OR (Bitmap for …) OR (Bitmap for …) = Bitmap for predicate


Star-Joins

    Select F.S, D1.A1, D2.A2, …, Dn.An
    from F, D1, D2, …, Dn
    where F.A1 = D1.A1 and F.A2 = D2.A2 … and F.An = Dn.An
      and D1.B1 = ‘c1’ and D2.B2 = ‘p2’ …

Likely strategy:

For each Di, find the values of Ai such that Di.Bi = ‘xi’ (unless you have a bitmap index for Bi). Use the bitmap index on those Ai values to form a bitmap for the related rows of F (OR-ing the bitmaps).

At this stage you have n such bitmaps; the result is found by AND-ing them.
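The strategy above can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular system: the fact table, the dimension tables `d1` and `d2`, and every function name are hypothetical, and bitmaps are plain Python integers with one bit per fact row.

```python
# Hypothetical fact table F: foreign keys A1, A2 and a measure S.
fact = [
    {"A1": 1, "A2": 10, "S": 5.0},
    {"A1": 2, "A2": 10, "S": 7.0},
    {"A1": 1, "A2": 20, "S": 1.0},
    {"A1": 3, "A2": 20, "S": 4.0},
]
# Hypothetical dimension tables: key Ai -> attribute Bi.
d1 = {1: "c1", 2: "c2", 3: "c1"}
d2 = {10: "p1", 20: "p2"}

def join_index(fact_rows, key):
    """Bitmap join index: dimension key -> bitmap of fact rows referencing it."""
    index = {}
    for row, f in enumerate(fact_rows):
        index[f[key]] = index.get(f[key], 0) | (1 << row)
    return index

def dimension_bitmap(dim, wanted, index):
    """OR together the bitmaps of every key whose attribute equals `wanted`."""
    bitmap = 0
    for key, attribute in dim.items():
        if attribute == wanted:
            bitmap |= index.get(key, 0)
    return bitmap

def star_join(fact_rows, predicates):
    """AND the per-dimension bitmaps and return the qualifying fact row numbers."""
    result = (1 << len(fact_rows)) - 1          # start with all rows
    for key, dim, wanted in predicates:
        result &= dimension_bitmap(dim, wanted, join_index(fact_rows, key))
    return [r for r in range(len(fact_rows)) if (result >> r) & 1]

rows = star_join(fact, [("A1", d1, "c1"), ("A2", d2, "p2")])
total = sum(fact[r]["S"] for r in rows)
print(rows, total)
```

In practice the join indexes would be built once and kept; they are rebuilt inside `star_join` here only to keep the sketch short.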


Example

Selectivity per predicate = 0.01 (predicates on the dimension tables)

n predicates (statistically independent)

Total selectivity = $10^{-2n}$

Fact table = $10^8$ rows, n = 3: tuples in the answer = $10^8 / 10^6$ = 100 rows. In the worst case these lie in 100 different blocks. Still better than reading all the blocks in the relation (e.g., assuming 100 tuples/block, that would be $10^6$ blocks!).


Design Space of Bitmap Indexes

The basic bitmap design is called Value-list index. The focus there is on the columns. If we change the focus to the rows, the index becomes a set of attribute values (integers) in each tuple (row), that can be represented in a particular way.

5 0 0 0 1 0 0 0 0 0

We can encode this row in many ways...


Attribute value decomposition

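C = attribute cardinality.

Consider a value of the attribute, v, and a sequence of numbers $\langle b_{n-1}, b_{n-2}, \ldots, b_1 \rangle$. Also define $b_n = \lceil C / \prod_{i=1}^{n-1} b_i \rceil$. Then v can be decomposed into a sequence of n digits $\langle v_n, v_{n-1}, v_{n-2}, \ldots, v_1 \rangle$ as follows:

$$
\begin{aligned}
v &= V_1 \\
  &= V_2\, b_1 + v_1 \\
  &= V_3\,(b_2 b_1) + v_2\, b_1 + v_1 \\
  &\;\;\vdots \\
  &= v_n \Big(\prod_{j=1}^{n-1} b_j\Big) + \cdots + v_i \Big(\prod_{j=1}^{i-1} b_j\Big) + \cdots + v_2\, b_1 + v_1
\end{aligned}
$$

where $v_i = V_i \bmod b_i$ and $V_i = \lfloor V_{i-1} / b_{i-1} \rfloor$.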


Number systems

<10,10,10> (decimal system!)

576 = 5 x 10 x 10 + 7 x 10 + 6

576/100 = 5 | 76

76/10 = 7 | 6

6

How do you write 576 in:

<2,2,2,2,2,2,2,2,2>

576 = 1 x 2^9 + 0 x 2^8 + 0 x 2^7 + 1 x 2^6 + 0 x 2^5 + 0 x 2^4 + 0 x 2^3 + 0 x 2^2 + 0 x 2^1 + 0 x 2^0

576/2^9 = 1 | 64, 64/2^8 = 0 | 64, 64/2^7 = 0 | 64, 64/2^6 = 1 | 0, 0/2^5 = 0 | 0, 0/2^4 = 0 | 0, 0/2^3 = 0 | 0, 0/2^2 = 0 | 0, 0/2^1 = 0 | 0, 0/2^0 = 0 | 0

<7,7,5,3>

576/(7x7x5x3) = 576/735 = 0 | 576, 576/(7x5x3) = 576/105 = 5 | 51

576 = 5 x (7x5x3) + 51

51/(5x3) = 51/15 = 3 | 6

576 = 5 x (7x5x3) + 3 x (5x3) + 6

6/3 = 2 | 0

576 = 5 x (7x5x3) + 3 x (5x3) + 2 x (3)
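The decomposition can be checked mechanically. The sketch below assumes the bases are passed most significant first, i.e. as <b_n, …, b_1>; the function names `decompose` and `compose` are illustrative, not taken from the slides.

```python
def decompose(v, bases):
    """Split v into digits <v_n, ..., v_1> for bases <b_n, ..., b_1>."""
    digits = []
    for b in reversed(bases):      # start with b_1, the least significant base
        digits.append(v % b)       # v_i = V_i mod b_i
        v //= b                    # V_{i+1} = floor(V_i / b_i)
    if v != 0:
        raise ValueError("value does not fit in the given bases")
    return digits[::-1]

def compose(digits, bases):
    """Inverse of decompose (Horner evaluation in a mixed-radix system)."""
    value = 0
    for d, b in zip(digits, bases):
        value = value * b + d
    return value

print(decompose(576, [10, 10, 10]))     # decimal digits
print(decompose(576, [2] * 10))         # binary digits
print(decompose(576, [7, 7, 5, 3]))     # mixed bases from the example
```

The last call yields the digits 5, 3, 2, 0, which matches 576 = 5 x (7x5x3) + 3 x (5x3) + 2 x 3 + 0.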


Bitmaps: value-list index

R = (…, A, …, M)

R(A)  B8 B7 B6 B5 B4 B3 B2 B1 B0
 3     0  0  0  0  0  1  0  0  0
 2     0  0  0  0  0  0  1  0  0
 1     0  0  0  0  0  0  0  1  0
 2     0  0  0  0  0  0  1  0  0
 8     1  0  0  0  0  0  0  0  0
 2     0  0  0  0  0  0  1  0  0
 2     0  0  0  0  0  0  1  0  0
 0     0  0  0  0  0  0  0  0  1
 7     0  1  0  0  0  0  0  0  0
 5     0  0  0  1  0  0  0  0  0
 6     0  0  1  0  0  0  0  0  0
 4     0  0  0  0  1  0  0  0  0


Example: sequence <3,3>, value-list index (equality)

Component 2 has bitmaps B2_2, B2_1, B2_0; component 1 has bitmaps B1_2, B1_1, B1_0.

R(A)        B2_2 B2_1 B2_0   B1_2 B1_1 B1_0
3 (1x3+0)    0    1    0      0    0    1
2            0    0    1      1    0    0
1            0    0    1      0    1    0
2            0    0    1      1    0    0
8            1    0    0      1    0    0
2            0    0    1      1    0    0
2            0    0    1      1    0    0
0            0    0    1      0    0    1
7            1    0    0      0    1    0
5            0    1    0      1    0    0
6            1    0    0      0    0    1
4            0    1    0      0    1    0


Encoding scheme

Equality encoding: all bits to 0 except the one that corresponds to the value

Range encoding: the vi rightmost bits are set to 0, the remaining bits to 1
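A small sketch of the two encodings for one component, assuming bits are listed from B_{b-1} down to B_0 as in the tables of these slides. The function names are hypothetical.

```python
def equality_encode(v, base):
    """All bits 0 except the one that corresponds to the value v."""
    return [1 if i == v else 0 for i in range(base - 1, -1, -1)]

def range_encode(v, base):
    """The v rightmost bits are 0 and the rest are 1, so bit i is 1 iff v <= i."""
    return [1 if i >= v else 0 for i in range(base - 1, -1, -1)]

print(equality_encode(3, 9))
print(range_encode(3, 9))
```

With range encoding, bitmap B_i directly answers the predicate "value <= i", which is why it suits range queries.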


Range encoding: single component, base-9

R(A)  B8 B7 B6 B5 B4 B3 B2 B1 B0
 3     1  1  1  1  1  1  0  0  0
 2     1  1  1  1  1  1  1  0  0
 1     1  1  1  1  1  1  1  1  0
 8     1  0  0  0  0  0  0  0  0
 0     1  1  1  1  1  1  1  1  1
 7     1  1  0  0  0  0  0  0  0
 5     1  1  1  1  0  0  0  0  0
 6     1  1  1  0  0  0  0  0  0
 4     1  1  1  1  1  0  0  0  0


Example (revisited): sequence <3,3>, value-list index (equality)

R(A)        B2_2 B2_1 B2_0   B1_2 B1_1 B1_0
3 (1x3+0)    0    1    0      0    0    1
2            0    0    1      1    0    0
1            0    0    1      0    1    0
2            0    0    1      1    0    0
8            1    0    0      1    0    0
2            0    0    1      1    0    0
2            0    0    1      1    0    0
0            0    0    1      0    0    1
7            1    0    0      0    1    0
5            0    1    0      1    0    0
6            1    0    0      0    0    1
4            0    1    0      0    1    0


Example: sequence <3,3>, range-encoded index

R(A)  B2_1 B2_0   B1_1 B1_0
 3     1    0      1    1
 2     1    1      0    0
 1     1    1      1    0
 2     1    1      0    0
 8     0    0      0    0
 2     1    1      0    0
 2     1    1      0    0
 0     1    1      1    1
 7     0    0      1    0
 5     1    0      0    0
 6     0    0      1    1
 4     1    0      1    0


Design Space

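[Figure: the design space of bitmap indexes. One axis is the decomposition: <b> (Value-list), <b2,b1>, …, <b,b,…,b> with log2 C components (Bit-Sliced). The other axis is the encoding: equality or range.]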


RangeEval

Evaluates each range predicate by computing two bitmaps: BEQ bitmap and either BGT or BLT

RangeEval-Opt uses only <=

A < v is the same as A <= v-1

A > v is the same as Not( A <= v)

A >= v is the same as Not (A <= v-1)
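The rewrites above can be sketched for the simplest case: a single-component, range-encoded index in which bitmap i marks the rows with value <= i. This is an illustration of the "<= only" idea, not the RangeEval-Opt algorithm itself (which also handles multi-component indexes). All function names are hypothetical; the column is the one used in the bitmap tables earlier.

```python
def build_range_index(column, cardinality):
    """bitmaps[i] has bit r set iff column[r] <= i (range encoding)."""
    bitmaps = [0] * cardinality
    for row, value in enumerate(column):
        for i in range(value, cardinality):
            bitmaps[i] |= 1 << row
    return bitmaps

def less_equal(bitmaps, v, all_rows):
    """Bitmap for A <= v, including the out-of-domain cases."""
    if v < 0:
        return 0
    if v >= len(bitmaps):
        return all_rows
    return bitmaps[v]

def evaluate(bitmaps, n_rows, op, v):
    """Reduce every comparison to A <= something, as in the identities above."""
    all_rows = (1 << n_rows) - 1
    if op == "<=":
        return less_equal(bitmaps, v, all_rows)
    if op == "<":
        return less_equal(bitmaps, v - 1, all_rows)
    if op == ">":
        return all_rows & ~less_equal(bitmaps, v, all_rows)
    if op == ">=":
        return all_rows & ~less_equal(bitmaps, v - 1, all_rows)
    raise ValueError(f"unsupported operator: {op}")

def rows_of(bitmap):
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

column = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
index = build_range_index(column, 9)
print(rows_of(evaluate(index, len(column), ">", 6)))
```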


RangeEval-OPT


Classification vs. Prediction

• Classification:
  – predicts categorical class labels
  – classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

• Prediction:
  – models continuous-valued functions, i.e., predicts unknown or missing values

• Typical applications
  – credit approval
  – target marketing
  – medical diagnosis
  – treatment effectiveness analysis


• Pros:
  – Fast.
  – Rules easy to interpret.
  – High dimensional data.

• Cons:
  – No correlations.
  – Axis-parallel cuts.

• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set

• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

• Decision tree
  – A flow-chart-like tree structure
  – An internal node denotes a test on an attribute
  – A branch represents an outcome of the test
  – Leaf nodes represent class labels or class distributions

• Decision tree generation consists of two phases
  – Tree construction
    • At start, all the training examples are at the root
    • Partition examples recursively based on selected attributes
  – Tree pruning
    • Identify and remove branches that reflect noise or outliers

• Use of a decision tree: classifying an unknown sample
  – Test the attribute values of the sample against the decision tree


Algorithm for Decision Tree Induction

• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At start, all the training examples are at the root
  – Attributes are categorical (if continuous-valued, they are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  – There are no samples left


Decision tree algorithms

• Building phase:
  – Recursively split nodes using the best splitting attribute and value for each node

• Pruning phase:
  – A smaller (yet imperfect) tree achieves better prediction accuracy.
  – Prune leaf nodes recursively to avoid over-fitting.

DATA TYPES

• Numerically ordered: values are ordered and can be represented on the real line. (E.g., salary.)
• Categorical: takes values from a finite set with no natural ordering. (E.g., color.)
• Ordinal: takes values from a finite set whose values possess a clear ordering, but the distances between them are unknown. (E.g., preference scale: good, fair, bad.)


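Some probability...

S = set of cases

freq(Ci, S) = number of cases in S that belong to class Ci

Gain: an entropic measure

Prob(“this case belongs to Ci”) = freq(Ci, S) / |S|

Information conveyed: −log(freq(Ci, S) / |S|)

Entropy = expected information:

$$\mathrm{info}(S) = -\sum_i \frac{\mathrm{freq}(C_i, S)}{|S|}\,\log\frac{\mathrm{freq}(C_i, S)}{|S|}$$

GAIN

Test X partitions T into subsets $T_i$:

$$\mathrm{info}_X(T) = \sum_i \frac{|T_i|}{|T|}\,\mathrm{info}(T_i)$$

$$\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T)$$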
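A minimal sketch of these definitions, assuming base-2 logarithms (the slides leave the base unspecified) and representing a test X by the list of label subsets it produces. The labels below are made up for illustration.

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy of a list of class labels: expected information in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_x(partitions):
    """Weighted entropy after a test splits the cases into `partitions`."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * info(p) for p in partitions)

def gain(labels, partitions):
    """Information gained by the test: info(T) - info_X(T)."""
    return info(labels) - info_x(partitions)

# Hypothetical cases: a test that separates the two classes perfectly.
labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(gain(labels, perfect_split), gain(labels, useless_split))
```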


PROBLEM: What is the best predictor to segment on: windy or outlook?


Problem with Gain

Strong bias towards test with many outcomes.

Example: Z = Name

|Ti| = 1 (each name unique)

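$$\mathrm{info}_Z(T) = \sum_i \frac{|T_i|}{|T|}\,\mathrm{info}(T_i) = \sum_i \frac{1}{|T|}\cdot 0 = 0$$

(each subset $T_i$ holds a single case, so $\mathrm{info}(T_i) = 0$)

Maximal gain!! (But a useless division: overfitting.)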


Split

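$$\text{split-info}(X) = -\sum_i \frac{|T_i|}{|T|}\,\log\frac{|T_i|}{|T|}$$

$$\text{gain-ratio}(X) = \frac{\mathrm{gain}(X)}{\text{split-info}(X)}$$

Gain <= log(k), with k classes

Split <= log(n), with n outcomes of the test

A test with many outcomes has a large split-info, so its ratio is small.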
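To see the effect, the sketch below compares a unique-valued test (like Name) with a two-way test on the same made-up cases. It assumes base-2 logarithms, and the data and function names are hypothetical.

```python
from collections import Counter
from math import log2

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(labels, partitions):
    total = len(labels)
    return info(labels) - sum(len(p) / total * info(p) for p in partitions)

def split_info(partitions):
    """Entropy of the partition sizes themselves, ignoring class labels."""
    total = sum(len(p) for p in partitions)
    return -sum(len(p) / total * log2(len(p) / total) for p in partitions)

def gain_ratio(labels, partitions):
    return gain(labels, partitions) / split_info(partitions)

labels = ["H", "H", "L", "L"]
by_name = [["H"], ["H"], ["L"], ["L"]]      # one subset per case
by_good_test = [["H", "H"], ["L", "L"]]     # a genuine two-way split

print(gain(labels, by_name), gain_ratio(labels, by_name))
print(gain(labels, by_good_test), gain_ratio(labels, by_good_test))
```

Both tests reach the maximal gain of 1 bit, but the gain ratio penalizes the unique-valued test (0.5 against 1.0 here), and the penalty grows with the number of outcomes.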


• The generated tree may overfit the training data
  – Too many branches, some of which may reflect anomalies due to noise or outliers
  – The result is poor accuracy for unseen samples

• Two approaches to avoid overfitting
  – Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    • Difficult to choose an appropriate threshold
  – Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the “best pruned tree”

• Approaches to determine the final tree size
  – Separate training (2/3) and testing (1/3) sets
  – Use cross validation, e.g., 10-fold cross validation
  – Use all the data for training
    • but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
  – Use the minimum description length (MDL) principle:
    • halt growth of the tree when the encoding is minimized


Gini Index (IBM IntelligentMiner)

• If a data set T contains examples from n classes, the gini index gini(T) is defined as

$$\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where $p_j$ is the relative frequency of class j in T.

• If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively (N = N1 + N2), the gini index of the split data is defined as

$$\mathrm{gini}_{split}(T) = \frac{N_1}{N}\,\mathrm{gini}(T_1) + \frac{N_2}{N}\,\mathrm{gini}(T_2)$$

• The attribute that provides the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
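The two formulas translate directly into code. The sketch below uses illustrative class labels ("H" and "L") and hypothetical function names; it evaluates one candidate binary split rather than searching over all splitting points.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(t1, t2):
    """Size-weighted gini index of a binary split into t1 and t2."""
    n = len(t1) + len(t2)
    return len(t1) / n * gini(t1) + len(t2) / n * gini(t2)

# Illustrative labels for the two sides of one candidate split.
left = ["H", "H", "H"]
right = ["L", "H", "L"]

print(gini(left), gini(right), gini_split(left, right))
```

A full split search would call `gini_split` once per candidate splitting point and keep the smallest value.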


Training set

Tuple  Age  Car Type  Risk
0      23   Family    H
1      17   Sports    H
2      43   Sports    H
3      68   Family    L
4      32   Truck     L
5      20   Family    H

Attribute lists

Age list (sorted by Age):

Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     2
68   L     3

Car Type list:

Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5

Problem: What is the best way to determine risk? Is it Age or Car Type?


Splits

Attribute lists before the split:

Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     2
68   L     3

Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5

Split: Age < 27.5

Group1 (Age < 27.5):

Age  Risk  Tuple
17   H     1
20   H     5
23   H     0

Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Family    H     5

Group2 (Age >= 27.5):

Age  Risk  Tuple
32   L     4
43   H     2
68   L     3

Car Type  Risk  Tuple
Sports    H     2
Family    L     3
Truck     L     4


Histograms

For continuous attributes

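Associated with each node: two histograms (Cabove, Cbelow)

Cabove: class distribution of the records still to process; Cbelow: class distribution of the records already processed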


ANSWER

The winner is Age <= 18.5

Attribute lists at the root:

Age  Risk  Tuple
17   H     1
20   H     5
23   H     0
32   L     4
43   H     2
68   L     3

Car Type  Risk  Tuple
Family    H     0
Sports    H     1
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5

Y branch (Age <= 18.5): leaf H

N branch (Age > 18.5): remaining attribute lists

Age  Risk  Tuple
20   H     5
23   H     0
32   L     4
43   H     2
68   L     3

Car Type  Risk  Tuple
Family    H     0
Sports    H     2
Family    L     3
Truck     L     4
Family    H     5


Summary

• Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)

• Classification is probably one of the most widely used data mining techniques, with a lot of extensions

• Scalability is still an important issue for database applications: thus combining classification with database techniques should be a promising topic

• Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.


Association rules: Apriori paper; “student plays basketball” example


Chapter 6: Mining Association Rules in Large Databases

• Association rule mining

• Mining single-dimensional Boolean association rules from transactional databases

• Mining multilevel association rules from transactional databases

• Mining multidimensional association rules from transactional databases and data warehouse

• From association mining to correlation analysis

• Summary


Association Rules

• Market basket data: your “supermarket” basket contains {bread, milk, beer, diapers, …}

• Find rules that correlate the presence of one set of items X with another set Y.
  – Ex: X = diapers, Y = beer, X ⇒ Y with confidence 98%
  – May be constrained: e.g., consider only female customers.


Applications

• Market basket analysis: tell me how I can improve my sales by attaching promotions to “best seller” itemsets.

• Marketing: “people who bought this book also bought…”

• Fraud detection: a claim for immunizations always comes with a claim for a doctor’s visit on the same day.

• Shelf planning: given the “best sellers,” how do I organize my shelves?


Association Rule: Basic Concepts

• Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)

• Find: all rules that correlate the presence of one set of items with that of another set of items
  – E.g., 98% of people who purchase tires and auto accessories also get automotive services done


Association Rule Mining: A Road Map

• Boolean vs. quantitative associations (Based on the types of values handled)

– buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”) [0.2%, 60%]

– age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]

• Single-dimensional vs. multidimensional associations (see the examples above)


Road-map (continuation)

• Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?

• Various extensions
  – Correlation, causality analysis
    • Association does not necessarily imply correlation or causality
    • Causality: does Beer ⇒ Diapers or Diapers ⇒ Beer? (I.e., did the customer buy the diapers because he bought the beer, or was it the other way around?)
    • Correlation: 90% buy coffee, 25% buy tea, 20% buy both. The support of both is less than the expected support = 0.9 × 0.25 = 0.225.
  – Maxpatterns and closed itemsets
  – Constraints enforced
    • E.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?


Mining Association Rules—An Example

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Min. support 50%, min. confidence 50%

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

For rule A ⇒ C:

support = support({A, C}) = 50%

confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.


Mining Frequent Itemsets: the Key Step

• Find the frequent itemsets: the sets of items that have minimum support
  – A subset of a frequent itemset must also be a frequent itemset
    • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)

• Use the frequent itemsets to generate association rules.


Problem decomposition

Two phases:

• Generate all itemsets whose support is above a threshold. Call them large (or hot) itemsets. (Any other itemset is small.)

How? Generate all combinations? (Exponential! HARD.)

• For a given large itemset

Y = I1 I2 … Ik, k >= 2

generate (at most k) rules X ⇒ Ij, where X = Y − {Ij}

confidence = support(Y) / support(X)

So, have a confidence threshold c and decide which rules you keep. (EASY.)


Examples

Tid  Items
1    {a, b, c}
2    {a, b, d}
3    {a, c}
4    {b, e, f}

Minimum support: 50% → large itemsets {a, b} and {a, c}

Rules:

a ⇒ b with support 50% and confidence 66.6%

a ⇒ c with support 50% and confidence 66.6%

c ⇒ a with support 50% and confidence 100%

b ⇒ a with support 50% and confidence 66.6% (b occurs in three of the four transactions)

Assume s = 50% and c = 80%: only c ⇒ a is kept.
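The rule-generation phase can be sketched on the four transactions of this example. The function names are hypothetical; each rule has a single item on the right-hand side, as in the decomposition X ⇒ Ij with X = Y − {Ij}.

```python
transactions = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "c"},
    {"b", "e", "f"},
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from(itemset, min_conf):
    """Rules X => item with X = itemset - {item} and confidence >= min_conf."""
    rules = []
    for item in sorted(itemset):
        x = itemset - {item}
        confidence = support(itemset) / support(x)
        if confidence >= min_conf:
            rules.append((tuple(sorted(x)), item, confidence))
    return rules

for large_itemset in ({"a", "b"}, {"a", "c"}):
    print(sorted(large_itemset), rules_from(large_itemset, 0.8))
```

With c = 80%, the itemset {a, b} yields no rule and {a, c} yields only c ⇒ a.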


The Apriori Algorithm

• Join Step: Ck is generated by joining Lk-1 with itself

• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

• Pseudo-code:

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;


The Apriori Algorithm — Example

Database D

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1

itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1

itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (generated from L1)

itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2 with counts

itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2

itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3

itemset
{2 3 5}

Scan D → L3

itemset  sup
{2 3 5}  2
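The walk-through above can be reproduced with a short level-wise sketch. It assumes an absolute minimum support count of 2 (the value implied by the tables) and uses a simple union-based join followed by subset pruning, which is a simplification of the ordered self-join; all names are illustrative.

```python
from itertools import combinations

def count_support(candidates, transactions):
    """One database scan: count the transactions containing each candidate."""
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}

def generate_candidates(frequent_k, k):
    """Join L_k with itself, then prune candidates having an infrequent k-subset."""
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in frequent_k for s in combinations(union, k)
            ):
                candidates.add(union)
    return candidates

def apriori(transactions, min_count):
    """Return every frequent itemset with its support count."""
    transactions = [frozenset(t) for t in transactions]
    singletons = {frozenset([i]) for t in transactions for i in t}
    counts = count_support(singletons, transactions)
    level = {c: n for c, n in counts.items() if n >= min_count}
    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        candidates = generate_candidates(set(level), k)
        counts = count_support(candidates, transactions)
        level = {c: n for c, n in counts.items() if n >= min_count}
        k += 1
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```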


How to Generate Candidates?

• Suppose the items in Lk-1 are listed in an order

• Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

• Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck


Candidate generation (example)

C2

itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2

itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

L2 joined with L2: {1 2 3}, {1 3 5}, {2 3 5}

C3

itemset
{2 3 5}

{1 2 3} and {1 3 5} are removed, since {1 2} and {1 5} do not have enough support.
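A sketch of the two steps, assuming itemsets are stored as sorted tuples; the name `apriori_gen` is illustrative. Note that the ordered self-join (equal prefix, last item of p smaller than last item of q) produces {2 3 5} directly from this L2, whereas the example lists {1 2 3} and {1 3 5} as joined candidates that pruning then removes; both routes end with the same C3.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build C_k from L_{k-1}, given as a set of sorted tuples of length k-1."""
    if not prev_frequent:
        return []
    prev = sorted(prev_frequent)
    k = len(prev[0]) + 1
    # Step 1: self-join on the first k-2 items, ordering on the last item.
    joined = [
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]
    # Step 2: prune candidates that have a (k-1)-subset outside L_{k-1}.
    return [
        c for c in joined
        if all(s in prev_frequent for s in combinations(c, k - 1))
    ]

L2 = {(1, 3), (2, 3), (2, 5), (3, 5)}
print(apriori_gen(L2))
```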


Is Apriori Fast Enough? — Performance Bottlenecks

• The core of the Apriori algorithm:
  – Use frequent (k − 1)-itemsets to generate candidate frequent k-itemsets
  – Use database scans and pattern matching to collect counts for the candidate itemsets

• The bottleneck of Apriori: candidate generation
  – Huge candidate sets:
    • $10^4$ frequent 1-itemsets will generate $10^7$ candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate $2^{100} \approx 10^{30}$ candidates.
  – Multiple scans of the database:
    • Needs (n + 1) scans, where n is the length of the longest pattern


Mining Frequent Patterns Without Candidate Generation

• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  – highly condensed, but complete for frequent pattern mining
  – avoids costly database scans

• Develop an efficient, FP-tree-based frequent pattern mining method
  – A divide-and-conquer methodology: decompose mining tasks into smaller ones
  – Avoid candidate generation: sub-database test only!


Construct FP-tree from a Transaction DB

min_support = 0.5

TID  Items bought                (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o}             {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Header Table (the head column points to the first tree node of each item)

Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3

FP-tree (each node is item:count; indentation shows parent and child):

{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1

Steps:

1. Scan DB once, find frequent 1-itemset (single item pattern)

2. Order frequent items in frequency descending order

3. Scan DB again, construct FP-tree
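The three steps above can be sketched as follows. This is a minimal illustration, not the original FP-growth implementation: the `FPNode` class, the `build_fp_tree` function and the `tie_order` parameter are hypothetical names, and `tie_order` exists only so that items with equal frequency are ordered as on the slide (f, c, a, b, m, p).

```python
from collections import Counter
from math import ceil


class FPNode:
    """One FP-tree node: an item, its count, and its children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_support, tie_order=()):
    min_count = ceil(min_support * len(transactions))

    # Step 1: first scan, count each item and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: n for item, n in counts.items() if n >= min_count}

    # Step 2: order frequent items by descending frequency.
    # Ties are broken by position in tie_order (an assumption for this sketch).
    rank = {item: i for i, item in enumerate(tie_order)}

    def sort_key(item):
        return (-frequent[item], rank.get(item, len(rank)), item)

    header = {item: [] for item in sorted(frequent, key=sort_key)}

    # Step 3: second scan, insert each ordered transaction as a path.
    root = FPNode()
    for t in transactions:
        node = root
        ordered = sorted((i for i in set(t) if i in frequent), key=sort_key)
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item)
                node.children[item] = child
                header[item].append(child)  # node-link from the header table
            child.count += 1
            node = child
    return root, header


transactions = [
    list("facdgimp"),
    list("abcflmo"),
    list("bfhjo"),
    list("bcksp"),
    list("afcelpmn"),
]
root, header = build_fp_tree(transactions, min_support=0.5, tie_order="fcabmp")
```

Summing the counts of the nodes linked from each header entry gives back the item frequencies in the header table.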


Chapter 6: Mining Association Rules in Large Databases

• Association rule mining

• Mining single-dimensional Boolean association rules from transactional databases

• Mining multilevel association rules from transactional databases

• Mining multidimensional association rules from transactional databases and data warehouse

• From association mining to correlation analysis

• Summary


Chapter 6: Mining Association Rules in Large Databases

• Association rule mining

• Mining single-dimensional Boolean association rules from transactional databases

• Mining multilevel association rules from transactional databases

• Mining multidimensional association rules from transactional databases and data warehouse

• From association mining to correlation analysis

• Constraint-based association mining

• Summary



Interestingness Measurements

• Objective measures
Two popular measurements: support; and confidence

• Subjective measures (Silberschatz & Tuzhilin, KDD95)
A rule (pattern) is interesting if
– it is unexpected (surprising to the user); and/or
– actionable (the user can do something with it)


Criticism of Support and Confidence

• Example 1: (Aggarwal & Yu, PODS98)
– Among 5000 students
• 3000 play basketball
• 3750 eat cereal
• 2000 both play basketball and eat cereal
– play basketball => eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
– play basketball => not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

             basketball   not basketball   sum(row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum(col.)    3000         2000             5000
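The bracketed [support, confidence] figures can be recomputed from the contingency table. The sketch below uses made-up helper names (`support`, `confidence`, `baseline`) and takes support as the fraction of all students matching both sides of the rule, and confidence as that count divided by the number matching the left-hand side.

```python
# Contingency table from the slide: (plays basketball?, eats cereal?) -> count
counts = {
    ("basketball", "cereal"): 2000,
    ("basketball", "not cereal"): 1000,
    ("not basketball", "cereal"): 1750,
    ("not basketball", "not cereal"): 250,
}
total = sum(counts.values())


def support(lhs, rhs):
    """Fraction of all students matching both lhs and rhs."""
    return counts[(lhs, rhs)] / total


def confidence(lhs, rhs):
    """Fraction of students matching lhs that also match rhs."""
    lhs_total = sum(n for (a, _), n in counts.items() if a == lhs)
    return counts[(lhs, rhs)] / lhs_total


def baseline(rhs):
    """Overall fraction of students matching rhs, ignoring lhs."""
    return sum(n for (_, b), n in counts.items() if b == rhs) / total


for rhs in ("cereal", "not cereal"):
    print(
        f"basketball => {rhs}: "
        f"support={support('basketball', rhs):.1%}, "
        f"confidence={confidence('basketball', rhs):.1%}, "
        f"baseline={baseline(rhs):.1%}"
    )
```

Comparing confidence with the baseline column shows why the first rule misleads: its confidence is below the share of cereal eaters overall.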


Criticism of Support and Confidence (Cont.)

• We need a measure of dependent or correlated events

$$\mathrm{corr}_{A,B} = \frac{P(AB)}{P(A)\,P(B)} = \frac{P(B \mid A)}{P(B)}$$

• If Corr < 1, A is negatively correlated with B (discourages B)
• If Corr > 1, A and B are positively correlated
• P(AB) = P(A)P(B) if the itemsets are independent (Corr = 1)
• P(B|A)/P(B) is also called the lift of rule A => B (we want lift greater than 1!)
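As a quick check of the measure, the sketch below applies it to the basketball and cereal counts from Example 1. The `lift` function name is an illustrative choice; the counts are the ones given on the previous slide.

```python
def lift(p_ab, p_a, p_b):
    """corr(A, B) = P(AB) / (P(A) * P(B)); equals 1 when A and B are independent."""
    return p_ab / (p_a * p_b)


n = 5000
basketball = 3000
cereal = 3750
both = 2000

# Rule: play basketball => eat cereal
lift_cereal = lift(both / n, basketball / n, cereal / n)

# Rule: play basketball => not eat cereal
not_cereal = n - cereal
basketball_not_cereal = basketball - both
lift_not_cereal = lift(basketball_not_cereal / n, basketball / n, not_cereal / n)

print(f"basketball => cereal:     corr = {lift_cereal:.3f}")
print(f"basketball => not cereal: corr = {lift_not_cereal:.3f}")
```

The first value falls below 1 and the second above 1, which matches the slide's claim that the second rule is the more informative one despite its lower support and confidence.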



Why Is the Big Pie Still There?

• More on constraint-based mining of associations
– Boolean vs. quantitative associations
• Association on discrete vs. continuous data
– From association to correlation and causal structure analysis.
• Association does not necessarily imply correlation or causal relationships
– From intra-transaction association to inter-transaction associations
• E.g., break the barriers of transactions (Lu, et al. TOIS’99).
– From association analysis to classification and clustering analysis
• E.g., clustering association rules


Summary

• Association rule mining
– probably the most significant contribution from the database community in KDD
– A large number of papers have been published

• Many interesting issues have been explored

• An interesting research direction
– Association analysis in other types of data: spatial data, multimedia data, time series data, etc.


Some products and free software available for association rule mining

Business Miner http://www.businessobjects.com
Clementine http://www.isl.co.uk/clem.html
Darwin http://www.oracle.com/ip/analyze/warehouse/datamining/
Data Surveyor http://www.ddi.nl/
DBMiner http://db.cs.sfu.ca/DBMiner
Delta Miner http://www.bissantz.de
Decision Series http://www.neovista.com
IDIS http://wwwdatamining.com
Intelligent Miner http://www.software.ibm.com/data/intelli-mine
MineSet http://www.sgi.com/software/mineset/
MLC++ http://www.sgi.com/Technology/mlc/
MSBN http://www.research.microsoft.com/research./dtg/msbn
SuperQuery http://www.azmy.com
Weka http://www.cs.waikato.ac.nz/ml/weka
Apriori: http://fuzzy.cs.uni-magdeburg.de/~borgelt/apriori/apriori.html


K-means clustering


Birch uses summary information – bonus question


STUDY QUESTIONS

Some sample questions on the data mining part. You may practice by yourself. No need to hand in.

1. Given transaction table:

TID   List of items
T1    1, 2, 5
T2    2, 4
T3    2, 3
T4    1, 2, 4
T5    1, 3
T6    2, 3
T7    1, 3
T8    1, 2, 3, 5
T9    1, 2, 3

1) If min_sup = 2/9, apply the Apriori algorithm to get all the frequent itemsets; show the steps.
2) If min_con = 50%, show all the association rules generated from L3 (the large itemsets that contain 3 items).
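One way to check a hand-worked answer to this question is a small level-wise Apriori run. The sketch below is an illustrative implementation, not the course's reference solution: `apriori` and `rules_from` are hypothetical names, min_sup = 2/9 is expressed as a minimum count of 2, and confidence of X => Y is taken as count(X ∪ Y) / count(X).

```python
from itertools import combinations

transactions = {
    "T1": {1, 2, 5}, "T2": {2, 4}, "T3": {2, 3},
    "T4": {1, 2, 4}, "T5": {1, 3}, "T6": {2, 3},
    "T7": {1, 3}, "T8": {1, 2, 3, 5}, "T9": {1, 2, 3},
}


def apriori(transactions, min_count):
    """Return {frequent itemset: support count}, built level by level."""
    baskets = [frozenset(t) for t in transactions]

    def count(candidates):
        # One database scan: count the baskets that contain each candidate.
        return {c: sum(1 for b in baskets if c <= b) for c in candidates}

    items = {frozenset([i]) for b in baskets for i in b}
    level = {c: n for c, n in count(items).items() if n >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent


def rules_from(itemset, frequent, min_conf):
    """All rules lhs => (itemset - lhs) whose confidence reaches min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = frequent[itemset] / frequent[lhs]
            if conf >= min_conf:
                rules.append((set(lhs), set(itemset - lhs), conf))
    return rules


frequent = apriori(transactions.values(), min_count=2)
L3 = {s for s in frequent if len(s) == 3}
for itemset in sorted(L3, key=sorted):
    for lhs, rhs, conf in rules_from(itemset, frequent, min_conf=0.5):
        print(f"{sorted(lhs)} => {sorted(rhs)}  (confidence {conf:.0%})")
```

Each pass of the `while` loop corresponds to one "step" of the hand calculation: generate C_k from L_(k-1), scan the table, keep L_k.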


STUDY QUESTIONS

2. Assume we have the following association rules with min_sup = s and min_con = c: A=>B (s1, c1), B=>C (s2, c2), C=>A (s3, c3)

Show the probabilities P(A), P(B), P(C), P(AB), P(BC), P(AC), P(B|A), P(C|B), P(C|A).

Show the conditions under which we can get A=>C.
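A possible way to approach this question, sketched under the usual definitions (support of X=>Y is P(XY), confidence of X=>Y is P(Y|X)): each rule fixes one joint probability and one marginal. The function names below are hypothetical and the numbers in the usage lines are arbitrary illustrative values, not part of the question.

```python
def derive_probabilities(s1, c1, s2, c2, s3, c3):
    """Probabilities implied by A=>B (s1, c1), B=>C (s2, c2), C=>A (s3, c3)."""
    p_a = s1 / c1  # P(A) = P(AB) / P(B|A)
    p_b = s2 / c2  # P(B) = P(BC) / P(C|B)
    p_c = s3 / c3  # P(C) = P(AC) / P(A|C)
    return {
        "P(A)": p_a, "P(B)": p_b, "P(C)": p_c,
        "P(AB)": s1, "P(BC)": s2, "P(AC)": s3,
        "P(B|A)": c1, "P(C|B)": c2,
        "P(C|A)": s3 / p_a,  # P(AC) / P(A) = s3 * c1 / s1
    }


def a_implies_c_holds(s1, c1, s3, min_sup, min_con):
    """A=>C qualifies when its support s3 and confidence s3*c1/s1 meet both thresholds."""
    return s3 >= min_sup and s3 * c1 / s1 >= min_con


# Arbitrary illustrative values (assumption, not from the question).
probs = derive_probabilities(s1=0.4, c1=0.8, s2=0.2, c2=0.5, s3=0.3, c3=0.6)
print(probs)
print(a_implies_c_holds(s1=0.4, c1=0.8, s3=0.3, min_sup=0.2, min_con=0.5))
```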


STUDY QUESTIONS

3. Given the following table

Outlook    Temp   Humidity   Windy   Class
sunny      75     70         Y       Play
sunny      80     90         Y       Don't
sunny      85     85         N       Don't
sunny      72     95         N       Don't
sunny      69     70         N       Play
overcast   72     90         Y       Play
overcast   83     78         N       Play
overcast   64     65         Y       Play
overcast   81     75         N       Play
rain       71     80         Y       Don't
rain       65     70         Y       Don't
rain       75     80         Y       Play
rain       68     80         N       Play
rain       70     96         N       Play

Apply the SPRINT algorithm to build a decision tree. (The measure is gini.)
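The gini measure used for choosing splits can be sketched as below. This covers only the split evaluation for the root node, not the full SPRINT algorithm with its attribute lists; the helper names are illustrative, and numeric candidate thresholds are taken as midpoints between consecutive distinct values.

```python
from collections import Counter

# (Outlook, Temp, Humidity, Windy, Class), copied from the table above.
rows = [
    ("sunny", 75, 70, "Y", "Play"), ("sunny", 80, 90, "Y", "Don't"),
    ("sunny", 85, 85, "N", "Don't"), ("sunny", 72, 95, "N", "Don't"),
    ("sunny", 69, 70, "N", "Play"), ("overcast", 72, 90, "Y", "Play"),
    ("overcast", 83, 78, "N", "Play"), ("overcast", 64, 65, "Y", "Play"),
    ("overcast", 81, 75, "N", "Play"), ("rain", 71, 80, "Y", "Don't"),
    ("rain", 65, 70, "Y", "Don't"), ("rain", 75, 80, "Y", "Play"),
    ("rain", 68, 80, "N", "Play"), ("rain", 70, 96, "N", "Play"),
]


def gini(labels):
    """gini(S) = 1 - sum of squared class frequencies in S."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(left, right):
    """Weighted gini of a binary split into two lists of class labels."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def split_by(rows, predicate):
    """Class labels of rows that satisfy the predicate, and of those that do not."""
    left = [r[-1] for r in rows if predicate(r)]
    right = [r[-1] for r in rows if not predicate(r)]
    return left, right


def best_numeric_split(rows, col):
    """Try midpoints between consecutive distinct values; return (threshold, gini)."""
    values = sorted({r[col] for r in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        score = gini_split(*split_by(rows, lambda r: r[col] <= threshold))
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


root_gini = gini([r[-1] for r in rows])
overcast_gini = gini_split(*split_by(rows, lambda r: r[0] == "overcast"))
humidity_threshold, humidity_gini = best_numeric_split(rows, col=2)

print(f"gini(root) = {root_gini:.4f}")
print(f"gini(Outlook in {{overcast}}) = {overcast_gini:.4f}")
print(f"best Humidity split: <= {humidity_threshold} with gini {humidity_gini:.4f}")
```

The split with the lowest weighted gini becomes the test at the node, and the same evaluation is repeated on each partition.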


STUDY QUESTIONS

4. Apply k-means to cluster the following 8 points into 3 clusters. The distance function is Euclidean distance. Assume initially we assign A1, B1, and C1 as the center of each cluster respectively. The 8 points are: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).

Show
- the three cluster centers after the first round of execution.
- the final three clusters.
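A hand calculation for this question can be checked with a plain k-means loop. The sketch below is a minimal version with illustrative function names; it assumes no cluster ever becomes empty (true for this data) and stops when the assignment no longer changes.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4),
    "B1": (5, 8), "B2": (7, 5), "B3": (6, 4),
    "C1": (1, 2), "C2": (4, 9),
}


def assign(points, centers):
    """Put each point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for name, p in points.items():
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(name)
    return clusters


def update(points, clusters):
    """New center of each cluster = mean of its member points."""
    return [
        tuple(sum(points[name][d] for name in c) / len(c) for d in range(2))
        for c in clusters
    ]


def k_means(points, centers):
    """Repeat assign/update until the clusters stop changing."""
    history = []  # centers after each round
    clusters = None
    while True:
        new_clusters = assign(points, centers)
        if new_clusters == clusters:
            return clusters, history
        clusters = new_clusters
        centers = update(points, clusters)
        history.append(centers)


initial = [points["A1"], points["B1"], points["C1"]]
final_clusters, history = k_means(points, initial)

print("centers after round 1:", history[0])
print("final clusters:", final_clusters)
```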