AR mining

Implementation and comparison of three AR mining algorithms

Xuehai Wang, Xiaobo Chen, Shen Chen

CSCI6405 class project

Outline

• Motivation

• Dataset

• Apriori based hash tree algorithm

• FP-tree algorithm

• Conclusion

• References

Motivation

• Make the time to generate rules as short as possible!

• Understand the three algorithms
– Apriori algorithm
– Apriori with hash tree algorithm
– FP-tree algorithm

• Learn how to improve an algorithm

Dataset

• IBM dataset generator
– Can set the number of items
– Can set the minimal support
– Can set the dataset size

Tid item
1 2 5 8 9
2 3 4 6 7 12

Apriori principle

• Apriori principle
– A candidate generation-and-test approach [4]
– Every subset of a frequent itemset must itself be frequent
– If a set is infrequent, its supersets are neither generated nor tested

• But some aspects can still be improved
– Counting the support
– Number of I/O scans
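The generate-and-test idea can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function name `generate_candidates`, the representation of itemsets as sorted tuples, and the sample frequent 2-itemsets are all assumptions made for the example.

```python
from itertools import combinations


def generate_candidates(frequent_k, k):
    """Join frequent k-itemsets into (k+1)-candidates, then prune them.

    Itemsets are assumed to be sorted tuples (an assumption of this sketch).
    """
    frequent = set(frequent_k)
    candidates = set()
    for a in frequent:
        for b in frequent:
            # Join step: two k-itemsets that share their first k-1 items.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune step (Apriori principle): every k-subset of the
                # candidate must itself be frequent.
                if all(sub in frequent for sub in combinations(cand, k)):
                    candidates.add(cand)
    return sorted(candidates)


# Hypothetical frequent 2-itemsets, used only to exercise the function.
frequent_2 = [(1, 2), (1, 3), (2, 3), (3, 6)]
print(generate_candidates(frequent_2, 2))
```

With these inputs only the candidate (1, 2, 3) survives, since all three of its 2-subsets are frequent; a set such as (1, 3, 6) is never generated because (1, 6) is not frequent.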

Apriori Hash Tree Alg

• There are l candidate k-itemsets
• There are n transactions
• The average transaction size is m
• Cost of calculating the support count:
– Original Apriori Alg: O( n.l.(mk) )
– With hash tree: O( n.log(l).(mk) )

Apriori Hash Tree Alg

• Candidates are stored in a hash tree structure

Tid  Items
1    1 2
2    1 3 6
3    1 2 3
4    2 4
5    2 3 6
6    5 6

[Figure: 1-itemset candidate hash tree under construction; node labels as extracted, in item(count) form: 1(1) 2(1) 1(2) 3(1) 1(2) 3(1) 2(1)]

Apriori Hash Tree Alg

1-itemsets, min support = 2 (transactions as in the table above)

Support counts stored in the hash tree, as item(count): 1(3), 2(4), 3(3), 4(1), 5(1), 6(3)

Apriori Hash Tree Alg

2-itemsets, min support = 2 (transactions as in the table above)

Support counts, as itemset(count): 1 2(2), 1 3(2), 1 6(1), 2 3(2), 2 6(1), 3 6(2)

3-itemsets, min support = 2

Support counts, as itemset(count): 1 2 3(1)
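The hash tree counting used in this example can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's code: the class names, the hash function `item % branching`, the branching factor of 3 and the leaf capacity of 2 are all choices made for the sketch, since the slides do not specify them.

```python
class HashTreeNode:
    def __init__(self):
        self.is_leaf = True
        self.itemsets = {}   # leaf: candidate tuple -> support count
        self.children = {}   # interior: bucket number -> HashTreeNode


class HashTree:
    """Stores candidate k-itemsets (sorted tuples) and counts their support."""

    def __init__(self, k, branching=3, max_leaf_size=2):
        self.k = k
        self.branching = branching          # assumed number of hash buckets
        self.max_leaf_size = max_leaf_size  # assumed leaf capacity
        self.root = HashTreeNode()

    def _bucket(self, item):
        return item % self.branching  # assumed hash function

    def insert(self, itemset):
        node, depth = self.root, 0
        while not node.is_leaf:
            node = node.children.setdefault(self._bucket(itemset[depth]), HashTreeNode())
            depth += 1
        node.itemsets[itemset] = 0
        self._split_if_full(node, depth)

    def _split_if_full(self, node, depth):
        # A leaf at depth k cannot split: no items are left to hash on.
        if len(node.itemsets) <= self.max_leaf_size or depth >= self.k:
            return
        old, node.itemsets, node.is_leaf = node.itemsets, {}, False
        for itemset, count in old.items():
            child = node.children.setdefault(self._bucket(itemset[depth]), HashTreeNode())
            child.itemsets[itemset] = count
        for child in node.children.values():
            self._split_if_full(child, depth + 1)

    def count(self, transaction):
        """Add 1 to the count of every stored candidate contained in the transaction."""
        self._count(self.root, sorted(transaction), 0, set())

    def _count(self, node, t, start, visited):
        if node.is_leaf:
            if id(node) in visited:   # a leaf may be reached by several paths
                return
            visited.add(id(node))
            items = set(t)
            for itemset in node.itemsets:
                if items.issuperset(itemset):
                    node.itemsets[itemset] += 1
            return
        # Hash on each remaining item of the transaction in turn.
        for i in range(start, len(t)):
            child = node.children.get(self._bucket(t[i]))
            if child is not None:
                self._count(child, t, i + 1, visited)

    def counts(self):
        result, stack = {}, [self.root]
        while stack:
            node = stack.pop()
            if node.is_leaf:
                result.update(node.itemsets)
            else:
                stack.extend(node.children.values())
        return result


transactions = [[1, 2], [1, 3, 6], [1, 2, 3], [2, 4], [2, 3, 6], [5, 6]]
tree = HashTree(k=2)
for cand in [(1, 2), (1, 3), (1, 6), (2, 3), (2, 6), (3, 6)]:
    tree.insert(cand)
for t in transactions:
    tree.count(t)
support = tree.counts()
print(sorted(support.items()))
```

Run on the six transactions of the example, the sketch reproduces the 2-itemset counts listed on the slide. Only the leaves reachable from a transaction's items are inspected, which is where the saving over comparing every transaction with every candidate comes from.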

FP-tree

• Since the dataset to be mined is usually very large, it is impossible to read all transactions into memory at once.

• But scanning the data from disk is very time-consuming.

• The FP-tree algorithm tries to fit all the information from the dataset into memory in a compact form, so it needs to scan the dataset only twice.

FP-tree

• FP-tree algorithm and implementation
– By Xiaobo Chen

FP-tree (Frequent Pattern Tree)

• Mining frequent patterns without candidate generation

• Divide and conquer methodology: decompose mining tasks into smaller ones

FP-tree (Merits of FP-tree algorithm)

• Makes the most of shared common prefixes

• Complete and compact
– All information of a transaction is stored in a path
– The size is constrained by the data set; consequently, the longest path corresponds to the longest pattern
– The compression ratio can exceed 100

FP-tree (Construction of FP-tree)

TID   Frequent items bought
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, b, p}
500   {f, c, a, m, p}

min_support = 3

Item frequency: f 4, c 4, a 3, b 3, m 3, p 3

FP-tree after inserting transaction 100:

root
  f:1
    c:1
      a:1
        m:1
          p:1

FP-tree (Construction (Cont’d))

FP-tree after inserting transaction 200, {f, c, a, b, m} (transactions as in the table above):

root
  f:2
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1

FP-tree (Construction (Cont’d))

FP-tree after inserting all five transactions (transactions as in the table above, min_support = 3):

Header table

Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3

The head column of each header entry links to the tree nodes that carry that item.

root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
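The two-scan construction can be sketched as follows. This is a minimal illustration, not the implementation presented in the talk: the names `FPNode`, `build_fp_tree` and `render` are assumptions, and the item order f, c, a, b, m, p is passed in explicitly from the header table because several items tie on frequency and the slides do not state a tie-breaking rule.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}   # item -> FPNode


def build_fp_tree(transactions, min_support, order=None):
    # Scan 1: count in how many transactions each item occurs.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, n in freq.items() if n >= min_support}
    if order is None:
        # Descending frequency; ties broken by item name (an assumption).
        order = sorted(frequent, key=lambda i: (-freq[i], i))
    order = [i for i in order if i in frequent]
    rank = {item: r for r, item in enumerate(order)}

    # Scan 2: insert each transaction's frequent items in the global order,
    # sharing existing prefixes and incrementing their counts.
    root = FPNode(None)
    header = {item: [] for item in order}   # item -> nodes carrying that item
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def render(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(render(child, depth + 1))
    return lines


transactions = [list("fcamp"), list("fcabm"), list("fb"), list("cpb"), list("fcamp")]
root, header = build_fp_tree(transactions, min_support=3, order=list("fcabmp"))
print("\n".join(render(root)))
```

With the header-table order supplied, the printed tree has the same nodes and counts as the final tree of the example, including the c:1, b:1, p:1 branch produced by transaction 400.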

FP-tree (Mining Frequent Patterns Using the FP-tree)

• General idea (divide-and-conquer)
– Recursively grow frequent pattern paths using the FP-tree

• Method
– For each item, construct its conditional pattern base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
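The recursive method can be sketched compactly in Python. This is a simplified illustration under stated assumptions, not the implementation from the talk: the names `Node`, `build_tree` and `fp_growth` are hypothetical, ties in item frequency are broken by item name, and the single-path shortcut is omitted, so the recursion simply continues until a conditional tree is empty.

```python
from collections import defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_tree(weighted_paths, min_support):
    """weighted_paths: list of (items, count). Returns (node_links, item_support)."""
    support = defaultdict(int)
    for items, n in weighted_paths:
        for item in set(items):
            support[item] += n
    support = {i: s for i, s in support.items() if s >= min_support}
    root, links = Node(None, None), defaultdict(list)
    for items, n in weighted_paths:
        node = root
        # Keep frequent items only, most frequent first (ties by item name).
        kept = sorted((i for i in set(items) if i in support),
                      key=lambda i: (-support[i], i))
        for item in kept:
            if item not in node.children:
                node.children[item] = Node(item, node)
                links[item].append(node.children[item])
            node = node.children[item]
            node.count += n
    return links, support


def fp_growth(weighted_paths, min_support, suffix=()):
    """Return {pattern (sorted tuple of items): support}."""
    links, support = build_tree(weighted_paths, min_support)
    patterns = {}
    for item, s in support.items():
        patterns[tuple(sorted((item,) + suffix))] = s
        # Conditional pattern base: the prefix path of every node of this item.
        base = []
        for node in links[item]:
            prefix, parent = [], node.parent
            while parent.item is not None:
                prefix.append(parent.item)
                parent = parent.parent
            if prefix:
                base.append((prefix, node.count))
        # Recurse on the conditional FP-tree built from that base.
        patterns.update(fp_growth(base, min_support, (item,) + suffix))
    return patterns


transactions = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
patterns = fp_growth([(t, 1) for t in transactions], min_support=3)
for pattern, s in sorted(patterns.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print("".join(pattern), s)
```

Because ties are broken differently here than in the slides, the tree built internally may differ in shape from the one in the example, but the set of frequent patterns and their supports does not depend on that choice.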

FP-tree (Mining Frequent Patterns Using the FP-tree)

• Start with the last item in the order (i.e., p).

• Follow the node pointers and traverse only the paths containing p.

• Accumulate all of the transformed prefix paths of that item to form a conditional pattern base.

Paths containing p:

root, f:4, c:3, a:3, m:2, p:2
root, c:1, b:1, p:1

Conditional pattern base for p

fcam:2, cb:1

Constructing a new FP-tree based on this pattern base leads to only one branch, c:3. Thus we derive only one frequent pattern containing p: the pattern cp.

FP-tree (Mining Frequent Patterns Using the FP-tree)

• Move to the next least frequent item in the order, i.e., m.

• Follow the node pointers and traverse only the paths containing m.

• Accumulate all of the transformed prefix paths of that item to form a conditional pattern base.

Paths containing m:

root, f:4, c:3, a:3, m:2
root, f:4, c:3, a:3, b:1, m:1

Conditional pattern base for m

fca:2, fcab:1

Constructing a new FP-tree based on this pattern base leads to the path fca:3. From this we derive the frequent patterns fcam, fcm, fam, cam, fm, cm, am.
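The single-path case can be illustrated with a short Python sketch. The function name `single_path_patterns` and the representation of a path as (item, count) pairs are assumptions made for this example; the input is the conditional FP-tree path for m from the slide.

```python
from itertools import combinations


def single_path_patterns(path, suffix):
    """Enumerate the patterns generated by a single-path conditional FP-tree.

    path:   list of (item, count) pairs from the root downwards
    suffix: tuple of items the conditional tree was built for
    """
    patterns = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = tuple(item for item, _ in combo) + suffix
            # On a single path, the support of a combination is the
            # smallest count among the nodes it uses.
            patterns[items] = min(count for _, count in combo)
    return patterns


patterns = single_path_patterns([("f", 3), ("c", 3), ("a", 3)], ("m",))
for items, support in patterns.items():
    print("".join(items), support)
```

Every non-empty combination of f, c and a is extended with m, and each resulting pattern has support 3.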

FP-tree (Conditional Pattern-Bases for the example)

Item  Conditional pattern-base      Conditional FP-tree
f     Empty                         Empty
c     {(f:3)}                       {(f:3)}|c
a     {(fc:3)}                      {(f:3, c:3)}|a
b     {(fca:1), (f:1), (c:1)}       Empty
m     {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)}|m
p     {(fcam:2), (cb:1)}            {(c:3)}|p
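The table can be reproduced with a small Python sketch that follows the node links of each item and walks up to the root. The names `Node`, `build`, `conditional_pattern_base` and `frequent_in_base` are assumptions made for this illustration, items are assumed to be single characters, and the item order f, c, a, b, m, p is taken from the header table.

```python
from collections import Counter, defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build(transactions, order):
    """Build an FP-tree for a given item order; return the node links per item."""
    rank = {item: r for r, item in enumerate(order)}
    root, links = Node(None, None), defaultdict(list)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                links[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return links


def conditional_pattern_base(item, links):
    """Map each prefix path of `item` (as a string) to its accumulated count."""
    base = {}
    for node in links[item]:
        prefix, parent = [], node.parent
        while parent.item is not None:      # stop at the root
            prefix.append(parent.item)
            parent = parent.parent
        if prefix:
            key = "".join(reversed(prefix))
            base[key] = base.get(key, 0) + node.count
    return base


def frequent_in_base(base, min_support):
    """Items of a pattern base that stay frequent, i.e. enter the conditional FP-tree."""
    counts = Counter()
    for prefix, n in base.items():
        for item in prefix:
            counts[item] += n
    return {i: n for i, n in counts.items() if n >= min_support}


transactions = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
links = build(transactions, order="fcabmp")
for item in "fcabmp":
    base = conditional_pattern_base(item, links)
    print(item, base, frequent_in_base(base, min_support=3))
```

For b, none of the prefix items reaches the minimum support of 3, which is why its conditional FP-tree is empty even though its pattern base is not.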

FP-tree (Why is Frequent pattern Growth fast?)

• Performance studies show that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection

• Reasoning:

– No candidate generation, no candidate test

– Use compact data structure

– Eliminate repeated database scan

– Basic operation is counting and FP-tree building

FP-tree: Expected result: FP-growth vs. Apriori: Scalability With the Support Threshold

[Figure: run time (sec.), axis from 0 to 100, plotted against support threshold (%), axis from 0 to 3, for two series: D1 FP-growth runtime and D1 Apriori runtime]

Conclusion

• FP-tree is faster than the other two algorithms.

• The Apriori and hash tree algorithms are easier to implement.
– We can easily combine them with other methods or tools (e.g., distributed parallel computing).

• The parameters of the dataset are very important too.
– Density, size, min support …

References

• [1] Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001

• [2] Jiawei Han, Jian Pei, Yiwen Yin: "Mining Frequent Patterns without Candidate Generation", ACM SIGMOD, 2000

• [3] N. Mamoulis: "Advanced Database Technologies" (slides)

• [4] Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001
