
Tutorial PM-3
High Performance Data Mining
Vipin Kumar (University of Minnesota)
Mohammed Zaki (Rensselaer Polytechnic)


High Performance Data Mining

Vipin Kumar, Computer Science Dept., University of Minnesota, Minneapolis, MN, USA. [email protected], www.cs.umn.edu/~kumar

Mohammed J. Zaki, Computer Science Dept., Rensselaer Polytechnic Institute, Troy, NY, USA. [email protected], www.cs.rpi.edu/~zaki

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 1

Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary


    Overview of data mining

- What is data mining/KDD?
- Why is KDD necessary?
- The KDD process
- Mining operations and methods
- Core issues in KDD


    What is data mining?

The iterative and interactive process of discovering valid, novel, useful, and understandable patterns or models in massive databases.


What is data mining?
- Valid: generalize to the future
- Novel: what we don't know
- Useful: be able to take some action
- Understandable: leading to insight
- Iterative: takes multiple passes
- Interactive: human in the loop


Data mining goals
- Prediction
  - What?
  - Opaque
- Description
  - Why?
  - Transparent


Data mining operations
- Verification driven
  - Validating hypotheses
  - Querying and reporting (spreadsheets, pivot tables)
  - Multidimensional analysis (dimensional summaries); On-Line Analytical Processing
  - Statistical analysis


Data mining operations
- Discovery driven
  - Exploratory data analysis
  - Predictive modeling
  - Database segmentation
  - Link analysis
  - Deviation detection


    KDD Process


Data mining process
- Understand application domain
  - Prior knowledge, user goals
- Create target dataset
  - Select data, focus on subsets
- Data cleaning and transformation
  - Remove noise, outliers, missing values
  - Select features, reduce dimensions


Data mining process
- Apply data mining algorithm
  - Associations, sequences, classification, clustering, etc.
- Interpret, evaluate and visualize patterns
  - What's new and interesting?
  - Iterate if needed
- Manage discovered knowledge
  - Close the loop

Why mine data? Commercial viewpoint
- Lots of data is being collected and warehoused
- Computing has become affordable
- Competitive pressure is strong
  - Provide better, customized services for an edge
  - Information is becoming a product in its own right


Why mine data? Scientific viewpoint
- Data collected and stored at enormous speeds (Gbyte/hour)
  - remote sensors on a satellite
  - telescopes scanning the skies
  - microarrays generating gene expression data
  - scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data
- Data mining for data reduction
  - cataloging, classifying, segmenting data
  - helps scientists in hypothesis formation


Origins of data mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, database systems, and data visualization
- Traditional techniques may be unsuitable
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data


Data mining methods
- Predictive modeling (classification, regression)
- Segmentation (clustering)
- Dependency modeling (graphical models, density estimation)
- Summarization (associations)
- Change and deviation detection


Data mining techniques
- Association rules: detect sets of attributes that frequently co-occur, and rules among them, e.g., 90% of the people who buy cookies also buy milk (60% of all grocery shoppers buy both).
- Sequence mining (categorical): discover sequences of events that commonly occur together, e.g., in a set of DNA sequences ACGTC is followed by GTCA after a gap of 9, with 30% probability.
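The support and confidence figures in the association-rule example are simple frequency ratios; a minimal sketch (with made-up baskets, not the tutorial's data) shows how both measures are computed:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs | rhs) / support(lhs): how often rhs co-occurs given lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# Toy baskets (made up): 10 shoppers
baskets = [set(b) for b in (
    ["cookies", "milk"], ["cookies", "milk"], ["cookies", "milk"],
    ["cookies", "milk"], ["cookies", "milk"], ["cookies", "milk"],
    ["cookies"], ["cookies", "bread"], ["milk"], ["bread"],
)]
both = support({"cookies", "milk"}, baskets)        # 6 of 10 baskets
rule = confidence({"cookies"}, {"milk"}, baskets)   # 6 of the 8 cookie baskets
```

Here the rule cookies => milk holds with 60% support and 75% confidence, mirroring the structure of the 90%/60% example above.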


Data mining techniques
- Classification and regression: assign a new data record to one of several predefined categories or classes. Regression deals with predicting real-valued fields. Also called supervised learning.
- Clustering: partition the dataset into subsets or groups such that elements of a group share a common set of properties, with high within-group similarity and low inter-group similarity. Also called unsupervised learning.

Data mining techniques
- Similarity search: given a database of objects and a "query" object, find the object(s) that are within a user-defined distance of the queried object, or find all pairs within some distance of each other.
- Deviation detection: find the record(s) that is (are) the most different from the other records, i.e., find all outliers. These may be thrown away as noise or may be the "interesting" ones.


Data mining techniques
- Many other methods, such as
  - Neural networks
  - Genetic algorithms
  - Hidden Markov models
  - Time series
  - Bayesian networks
  - Soft computing: rough and fuzzy sets

Main challenges for KDD
- Scalability
  - Efficient and sufficient sampling
  - In-memory vs. disk-based processing
  - High performance computing
- Automation
  - Ease of use
  - Using prior knowledge


Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary

Speeding up data mining
- Data oriented approach
  - Discretization
  - Feature selection
  - Feature construction (PCA)
  - Sampling
- Methods oriented approach
  - Efficient and scalable algorithms


Speeding up data mining (contd.)
- Methods oriented approach (contd.)
  - Parallel data mining
    - Task or control parallelism
    - Data parallelism
    - Hybrid parallelism
  - Distributed data mining
    - Voting
    - Meta-learning, etc.

Need for parallel formulations
- Need to handle very large datasets.
- Memory limitations of sequential computers cause sequential algorithms to make multiple expensive I/O passes over data.
- Need for scalable, efficient (fast) data mining computations
  - gain competitive advantage
  - handle larger data for greater accuracy in shorter times


Parallel design space
- Parallel architectures
  - Distributed memory
  - Shared disk
  - Shared memory
  - Hybrid cluster of SMPs
- Task and data parallelism
- Static and dynamic load balancing
- Horizontal and vertical data layout
- Data and concept partitioning

Parallel hardware
- Distributed-memory machines
  - Each processor has local memory and disk
  - Communication via message-passing
  - Hard to program: explicit data distribution
  - Goal: minimize communication
- Shared-memory machines
  - Shared global address space and disk
  - Communication via shared memory variables
  - Ease of programming
  - Goal: maximize locality, minimize false sharing
- Current trend: cluster of SMPs


Distributed memory architecture (shared nothing)
[Figure: processors with local memory and disks connected by an interconnection network.]

DMM: shared disk architecture
[Figure: processors connected through an interconnection network to shared disks.]


Shared memory architecture (shared everything)


Task vs. data parallelism
- Data parallelism
  - Data partitioned among P processors
  - Each processor performs the same work on its local partition
- Task parallelism
  - Each processor performs a different computation
  - Data may be (selectively) replicated or partitioned

Task vs. data parallelism
[Figure: a uniprocessor computation contrasted with task parallelism (P0, P1 perform different computations) and data parallelism (P0-P2 each process a data partition).]


Static vs. dynamic load balance
- Static load balancing
  - Work is initially divided (heuristically)
  - No subsequent computation or data movement
- Dynamic load balancing
  - Steal work from heavily loaded processors
  - Reassign to lightly loaded processors
  - Important for irregular computations, heterogeneous environments, etc.

Horizontal/vertical data format
[Figure: the same example records (tid, Refund, Marital Status, Taxable Income, Cheat) laid out horizontally (one record per row) and vertically (one attribute list per column).]
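The horizontal-to-vertical conversion the slide illustrates can be sketched directly; the record and attribute names below are illustrative, not the slide's exact table:

```python
# Horizontal layout: one record per row (field names illustrative)
horizontal = [
    {"tid": 1, "refund": "Yes", "marital": "Single", "income": 125},
    {"tid": 2, "refund": "No", "marital": "Married", "income": 100},
    {"tid": 3, "refund": "No", "marital": "Single", "income": 70},
]

def to_vertical(records, key="tid"):
    """Vertical layout: one (record id, value) list per attribute, the form
    used by attribute-list algorithms such as SPRINT."""
    vertical = {}
    for rec in records:
        for attr, val in rec.items():
            if attr != key:
                vertical.setdefault(attr, []).append((rec[key], val))
    return vertical

cols = to_vertical(horizontal)
# cols["income"] == [(1, 125), (2, 100), (3, 70)]
```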


Data and concept partitioning
- Shared
  - SMP or shared disk architectures
- Replicated
  - Partially or totally
- Partitioned
  - Round-robin partitioning
  - Hash partitioning
  - Range partitioning

Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary


    Classification learning

- Training set: a set of examples, where each example is a feature vector (i.e., a set of (attribute, value) pairs) with its associated class. The model is built on this set.

- Test set: a set of examples disjoint from the training set, used for testing the accuracy of a model.


Classification models
- Some models are better than others
  - Accuracy
  - Understandability
- Models range from easy to understand to incomprehensible
  - Decision trees (easier)
  - Rule induction
  - Regression models
  - Neural networks (harder)


Decision tree classification
[Figure: a decision tree with splitting attributes Refund (Yes/No), Marital Status, and Taxable Income (< 80K vs. >= 80K).]
The splitting attribute at a node is determined based on the Gini index.

From tree to rules
[Figure: the same decision tree on Refund, Marital Status (MarSt), and Taxable Income (TaxInc).]
1) Refund = Yes => No
2) Refund = No and MarSt in {Single, Divorced} and TaxInc < 80K => No
3) Refund = No and MarSt in {Single, Divorced} and TaxInc >= 80K => Yes
4) Refund = No and MarSt in {Married} => No


Classification algorithm
- Build tree
  - Start with data at the root node
  - Select an attribute and formulate a logical test on the attribute
  - Branch on each outcome of the test, and move the subset of examples satisfying that outcome to the corresponding child node

Classification algorithm
  - Recurse on each child node
  - Repeat until leaves are "pure", i.e., have examples from a single class, or "nearly pure", i.e., the majority of examples are from the same class
- Prune tree
  - Remove subtrees that do not improve classification accuracy
  - Avoid over-fitting, i.e., training-set-specific artifacts
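The build/recurse loop described above can be sketched as a minimal sequential tree builder (greedy, numeric attributes only, with a depth cap standing in for pruning; an illustration, not the tutorial's exact algorithm):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_i p_i^2 over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Exhaustively try (attribute, threshold) pairs; return the pair with
    the minimum size-weighted Gini (attributes assumed numeric here)."""
    n, best = len(labels), (None, None, float("inf"))
    for attr in range(len(rows[0])):
        for value in sorted({r[attr] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[attr] < value]
            right = [y for r, y in zip(rows, labels) if r[attr] >= value]
            if not left or not right:
                continue
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best[2]:
                best = (attr, value, score)
    return best[0], best[1]

def build_tree(rows, labels, depth=0, max_depth=3):
    # Leaf when pure, or when the depth cap (a crude stand-in for pruning) hits
    if len(set(labels)) == 1 or depth == max_depth:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    attr, value = best_split(rows, labels)
    if attr is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    lt = [(r, y) for r, y in zip(rows, labels) if r[attr] < value]
    ge = [(r, y) for r, y in zip(rows, labels) if r[attr] >= value]
    return {"test": (attr, value),
            "lt": build_tree([r for r, _ in lt], [y for _, y in lt], depth + 1, max_depth),
            "ge": build_tree([r for r, _ in ge], [y for _, y in ge], depth + 1, max_depth)}

tree = build_tree([(1,), (2,), (8,), (9,)], ["a", "a", "b", "b"])
# splits attribute 0 at threshold 8, giving two pure leaves
```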


Build tree
- Evaluate split points for all attributes
- Select the "best" point and the "winning" attribute
- Split the data into two
- Breadth/depth-first construction
- CRITICAL STEPS:
  - Formulation of good split tests
  - Selection measure for attributes

How to capture good splits?
- Occam's razor: prefer the simplest hypothesis that fits the data
- Minimum message/description length
  - Dataset D
  - Hypotheses H1, H2, ..., Hx describing D
  - MML(Hi) = Mlength(Hi) + Mlength(D|Hi)
  - Pick Hk with minimum MML
- Mlength given by Gini index, Gain, etc.


Tree pruning using MDL
- Data encoding: sum classification errors
- Model encoding:
  - Encode the tree structure
  - Encode the split points
- Pruning: choose the smallest-length option
  - Convert to leaf
  - Prune left or right child
  - Do nothing

Hunt's method
- Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income
- Class: Cheat, Don't Cheat


What's really happening?
[Figure: scatter plot of the examples over Marital Status and Income, showing how the tests "Married?", "Income < 80K", and the Cheat/Don't Cheat labels carve up the data.]

Finding good split points
- Use the Gini index for partition purity

  Gini(S) = 1 - sum_{i=1..c} p_i^2

  Gini(S1, S2) = (n1/n) Gini(S1) + (n2/n) Gini(S2)

- If S is pure, Gini(S) = 0
- Find the split point with minimum Gini
- Only need class distributions
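The "only need class distributions" remark can be made concrete: both Gini(S) and the weighted split Gini are computable from per-partition class counts alone, without revisiting the records. A small sketch with toy counts:

```python
def gini_from_counts(counts):
    """Gini(S) = 1 - sum_i p_i^2, computed from class counts alone."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(count_matrix):
    """Weighted Gini(S1, S2, ...): one row of class counts per partition."""
    n = sum(sum(row) for row in count_matrix)
    return sum(sum(row) / n * gini_from_counts(row) for row in count_matrix)

assert gini_from_counts([5, 0]) == 0.0   # pure partition
assert gini_from_counts([5, 5]) == 0.5   # 50/50 partition
# A split that separates the classes well scores lower (better):
assert gini_split([[9, 1], [1, 9]]) < gini_split([[5, 5], [5, 5]])
```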


Finding good split points
[Figure: two candidate splits of the same data on the Marital Status/Income scatter, with their class-count tables; the first gives Gini(split) = 0.31 and the second Gini(split) = 0.34, so the first is preferred.]

Categorical attributes: computing Gini index
- For each distinct value, gather counts for each class in the dataset
- Use the count matrix to make decisions
- Multi-way split, or two-way split (find the best partition of values)
[Figure: count matrices for a CarType attribute under multi-way and two-way splits.]


Continuous attributes: computing Gini index
- For efficient computation, for each attribute:
  - Sort the attribute on values
  - Linearly scan these values, each time updating the count matrix and computing the Gini index
  - Choose the split position that has the least Gini index
[Figure: sorted values with candidate split positions and the running count matrix.]
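The sort-then-scan procedure is efficient because the count matrix can be updated incrementally: as the scan passes each record, one class count moves from the right partition to the left, so every candidate threshold costs O(1) to evaluate after the initial sort. A sketch with hypothetical data:

```python
from collections import Counter

def best_continuous_split(values, labels):
    """Sort once, then scan linearly; each step moves one record's class count
    from the right partition to the left, so every candidate threshold is
    evaluated in O(1). Returns (threshold, weighted Gini)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    left, right = Counter(), Counter(labels)
    n = len(values)
    gini = lambda counts, m: 1.0 - sum((c / m) ** 2 for c in counts.values())
    best = (float("inf"), None)
    for k in range(1, n):
        i = order[k - 1]
        left[labels[i]] += 1
        right[labels[i]] -= 1
        if values[order[k]] == values[i]:
            continue  # cannot split between equal values
        score = k / n * gini(left, k) + (n - k) / n * gini(right, n - k)
        if score < best[0]:
            best = (score, (values[i] + values[order[k]]) / 2)
    return best[1], best[0]

# Hypothetical income/class data: a clean break between 75 and 85
threshold, score = best_continuous_split(
    [60, 70, 75, 85, 90, 95], ["No", "No", "No", "Yes", "Yes", "Yes"])
# threshold == 80.0 and score == 0.0 (a perfect split)
```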

C4.5
- Simple depth-first construction
- Sorts continuous attributes at each node
- Needs the entire data to fit in memory
- Unsuitable for large datasets
  - Needs out-of-core sorting
- Classification accuracy shown to improve when entire datasets are used!


SPRINT [Shafer, Agrawal, Mehta]
[Figure: attribute lists — one (value, rid, class) list per attribute.]

SPRINT (contd.)
- The arrays of the continuous attributes are pre-sorted.
  - The sorted order is maintained during each split.
- The classification tree is grown in a breadth-first fashion.
- Class information is clubbed with each attribute list.
- Attribute lists are physically split among nodes.
- The split determining phase is just a linear scan of the lists at each node.
- A hashing scheme is used in the splitting phase.
  - tids of the splitting attribute are hashed with the tree node as the key (lookup table).
  - remaining attribute arrays are split by querying this hash structure.


SPRINT disadvantages
- Size of the hash table is O(N) for the top levels of the tree.
- If the hash table does not fit in memory (mostly true for large datasets), then build it in parts so that each part fits.
  - Multiple expensive I/O passes over the entire dataset.

Constructing a decision tree in parallel
[Figure: the training data (m categorical attributes, Good/Bad classes) partitioned across processors.]
- Partitioning of data only
  - only a global reduction per node is required
  - a large number of classification tree nodes gives high communication cost


Constructing a decision tree in parallel
[Figure: 10,000 training records split across child tree nodes.]
- Partitioning of classification tree nodes
  - natural concurrency
  - load imbalance as the amount of work associated with each node varies
  - child nodes use the same data as used by the parent node
    - loss of locality
    - high data movement cost

Challenges in constructing a parallel classifier
- Partitioning of data only
  - a large number of classification tree nodes gives high communication cost
- Partitioning of classification tree nodes
  - natural concurrency
  - load imbalance as the amount of work associated with each node varies
  - child nodes use the same data as used by the parent node
    - loss of locality
    - high data movement cost
- How do we efficiently perform the computation in parallel?


Synchronous tree construction approach
[Figure: data partitioned across Proc 0-3; class distribution information is exchanged at every node.]
+ no data movement is required
- load imbalance
  - can be eliminated by breadth-first expansion
- high communication cost
  - becomes too high in the lower parts of the tree

Partitioned tree construction approach
[Figure: processors Proc 0-3 are recursively divided among subtrees, along with the data and nodes.]
+ highly concurrent
- high communication cost due to excessive data movements
- load imbalance


Hybrid parallel formulation
[Figure: synchronous tree construction is used for the upper tree; at the computation frontier (here at depth 3), processors and nodes are split into Partition 1 and Partition 2 and the partitioned tree construction approach takes over.]

Load balancing
[Figure: Step 1 — exchange data between processor partitions; Step 2 — load balance within processor partitions.]


Speedup comparison of the three parallel algorithms
[Figure: speedup vs. number of processors (up to 16) for 0.8 million and 1.6 million examples; partitioned, hybrid, and synchronous curves.]

Splitting criterion verification in the hybrid algorithm

  Splitting Criterion Ratio = Communication Cost / (Moving Cost + Load Balancing)

[Figure: number of splittings at different values of the splitting criterion ratio, for 0.8 million examples on 8 processors and 1.6 million examples on 16 processors.]


Speedup of the hybrid algorithm with different size data sets
[Figure: speedup curves vs. number of processors for 0.8, 1.6, 3.2, 6.4, 12.8, and 25.6 million examples.]

Scaleup of the hybrid algorithm
[Figure: run times of the algorithm for 50K examples at each processor, as the number of processors grows from 1 to 8.]


Summary of algorithms for categorical attributes
- Synchronous tree construction approach
  - no data movement required
  - high communication cost as the tree becomes bushy
- Partitioned tree construction approach
  - processors work independently once partitioned completely
  - load imbalance and high cost of data movement
- Hybrid algorithm
  - combines the good features of the two approaches
  - adapts dynamically according to the size and shape of the trees

Handling continuous attributes
- Sort continuous attributes at each node of the tree (as in C4.5). Expensive, hence undesirable!
- Discretize continuous attributes
  - CLOUDS (Alsabti, Ranka, and Singh, 1998)
  - SPEC (Srivastava, Han, Kumar, and Singh, 1997)
- Use a pre-sorted list for each continuous attribute
  - SPRINT (Shafer, Agrawal, and Mehta, VLDB'96)
  - ScalParC (Joshi, Karypis, and Kumar, IPPS'98)


Design approach
- Goal: scalability in both runtime and memory requirements.
- Parallelization overhead: To = p*Tp - Ts
- To must be O(Ts) for scalability. Per-processor overhead should not exceed O(Ts/p).
- Two components of To:
  - Load imbalance.
  - Communication time.
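The overhead relation To = p*Tp - Ts ties directly to parallel efficiency, E = Ts / (p*Tp) = Ts / (Ts + To); keeping To within O(Ts) therefore keeps efficiency bounded. A small numeric check with made-up timings:

```python
def overhead(p, t_par, t_seq):
    """Parallelization overhead To = p * Tp - Ts."""
    return p * t_par - t_seq

def efficiency(p, t_par, t_seq):
    """E = Ts / (p * Tp) = Ts / (Ts + To)."""
    return t_seq / (p * t_par)

# Hypothetical timings: Ts = 100 s sequential, Tp = 15 s on p = 8 processors
t_o = overhead(8, 15.0, 100.0)    # 20 s of load imbalance + communication
e = efficiency(8, 15.0, 100.0)    # 100 / 120
```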

Load balancing
[Figure: attribute lists B1 and B2 ((value, rid, cid) triples, where rid = record id, cid = class label) distributed across processors P0 and P1, before and after a split at split point SP.]


Parallel SPRINT
- Attribute lists are partitioned among processors
  - Each processor gets N/p records of each attribute list
- Attribute lists of continuous attributes are pre-sorted
- Split determining phase
  - Categorical attributes: local construction of count matrices, followed by a reduction to add them up
  - Continuous attributes: prefix-scan, followed by local Gini computations, followed by a Gini index reduction to find the best split point

Parallelizing the split determination phase
- Easy.
[Figure: categorical attribute (CarType) — local count matrices are combined by a parallel reduction into a global count matrix; continuous attribute (Age) — a parallel prefix-scan produces the running class counts needed for the global count matrices.]


Getting the required entries of the hash table
- The required entries are transferred in two steps
  - From splitting-attribute order to tid-sorted order
  - From tid-sorted order to attribute-list order

This design is inspired by the communication structure of parallel sparse matrix-vector algorithms.
[Figure: a 9x9 sparse matrix split across P0-P2; X = Salary entries, O = Age entries, diagonal = node table entries.]


Parallel hashing paradigm
- Distributed hash table.
- Hash function: h(k) = (Pi, l)
- Construction:
  - (k, v) -> (Pi, l) -> form send buffers {(l, v)} for each Pi -> all-to-all personalized communication.
- Enquiry:
  - (k) -> (Pi, l) -> form {(l)} buffers for each Pi -> all-to-all personalized communication -> each Pi replaces the received {(l)} with {(v)} -> all-to-all personalized communication.
- If each processor has m keys to hash, then the time is O(m) if m is Omega(p); i.e., Omega(p^2) overall.
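A serial sketch of the h(k) = (Pi, l) construction step: each key maps to an owning processor and a local slot, and (key, value) pairs are grouped into per-processor send buffers that an all-to-all exchange would deliver. The processor count and hash function here are illustrative:

```python
P = 4  # number of processors (illustrative)

def h(key, slots_per_proc=1024):
    """h(k) = (Pi, l): owning processor and local slot for a key."""
    return key % P, (key // P) % slots_per_proc

def build_send_buffers(pairs):
    """Construction step: route each (key, value) to its owner's buffer of
    (local slot, value) entries; in the parallel version these buffers are
    exchanged with one all-to-all personalized communication."""
    buffers = {p: [] for p in range(P)}
    for key, value in pairs:
        owner, slot = h(key)
        buffers[owner].append((slot, value))
    return buffers

bufs = build_send_buffers([(0, "a"), (5, "b"), (6, "c"), (13, "d")])
# keys 5 and 13 both belong to processor 1, at local slots 1 and 3
```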

Applying the paradigm to ScalParC: update
[Figure: Salary and Age attribute lists and the node table, distributed over P0-P2; after the split on Salary, hash buffers of (rid, Left/Right) entries are exchanged all-to-all to update the node table, which is then queried to split the Age list.]


    A Worst-Case Scenario for Updating the Hash Table

     One processor might need to send O(kN/p) data! Happens infrequently.

    [Figure: attribute lists B1 and B2 partitioned across P0 and P1, before and after the split, illustrating the skewed communication.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 87

    Categorical/Continuous and Continuous/Categorical

     Special cases of the Continuous/Continuous case
     - Categorical/Continuous does not require communication for loading the hash table
     - Continuous/Categorical does not require communication for inquiring the hash table

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 88


    Algorithm: Level-wise Communications

     Tree is built in a breadth-first manner.
     At each level of the decision tree
     - Count matrices for all attributes for all nodes are reduced in one single communication operation
     - Loading the hash table for all the nodes is combined into a single communication operation
     - Inquiring the hash table to split a particular attribute list is combined in a single communication operation
     - A total of k + 1 all-to-all personalized communication operations are performed for each level of the tree

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 89

    Structure of the Algorithm

     Sort continuous attributes (Pre-Sort)
     do level-wise while (at least one node requiring split)
     - Compute count matrices (Find-Split I)
     - Compute best gini index for nodes requiring split (Find-Split II)
     - Partition splitting attributes and update Node Table (Perform-Split I)
     - Partition non-splitting attributes by inquiring Node Table (Perform-Split II)
     end do

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 90
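The gini computation that the Find-Split phases perform from the count matrices can be sketched as follows. This is a minimal illustration of the standard gini split criterion, not ScalParC's actual code; the function names are assumptions.

```python
# For a candidate split, the gini of a node is 1 - sum(p_c^2) over
# class probabilities p_c, and the split score is the size-weighted
# average of the two children's gini values (lower is better).

def gini(counts):
    """Gini index of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(left_counts, right_counts):
    """Weighted gini of a binary split, from the two count-matrix rows."""
    nl, nr = sum(left_counts), sum(right_counts)
    n = nl + nr
    return (nl / n) * gini(left_counts) + (nr / n) * gini(right_counts)

print(split_gini([4, 0], [0, 4]))  # a pure split scores 0.0
print(round(gini([2, 2]), 2))      # an even two-class mix scores 0.5
```

Find-Split II simply evaluates this score for every candidate split point from the reduced (global) count matrices and keeps the minimum.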


    Example

    [Figure: the Salary and Age attribute lists with local and global count matrices, and the Node Table, traced through one level of the parallel split across P0, P1, P2.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 91

    Experimental Results

     Experiments were performed on a 128-processor Cray T3D
     Training sets were synthetically generated
     - Each contained only continuous attributes (5-9)
     - There were two possible class labels

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 92


    Parallel Runtime

    [Figure: parallel runtime plot.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 93

    Constant Size/Processor

    [Figure: scaled speedup plot with constant training-set size per processor.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 94


    Different Number of Attributes

    [Figure: runtimes for varying numbers of attributes.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 95

    SMP Parallel Design Space

     Data parallelism: within a tree node
     - Split attributes: vertical (BASIC, Fixed Window, Moving Window)
     - Split data (attribute lists): horizontal
     Task parallelism: between tree nodes
     - SUBTREE
     Static vs. dynamic load balancing

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 96


    SPRINT: Attribute Lists

    [Figure: training set (Tid, Age, CarType, Class) and its attribute lists — the continuous Age list kept sorted, the categorical CarType list in original order, each entry carrying its class and tid.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 97

    Splitting Attribute Lists

    [Figure: the attribute lists for node 0 are split by the test Age < 27.5 into lists for nodes 1 and 2; a hash table on tid records each record's child, and the non-splitting CarType list is partitioned by probing it.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 98


    SPRINT: File per Attribute

    [Figure: with one file per attribute per leaf, the total is 32 files per attribute.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 99

    Optimized File Usage

    [Figure: reusing files across leaves reduces the total to 4 files per attribute.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 100


    BASIC: Data Parallel

     Given current leaf frontier
     Atomically acquire an attribute and evaluate it for all leaves
     Master finds winning attribute and constructs hash table
     Atomically acquire an attribute and split its list for all leaves
     Barrier synchronization between phases

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 101

    BASIC (Tree View)

    [Figure: all processors P={0,1,2,3} cooperate on every tree node, with a barrier between levels.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 102


    BASIC (Level View, A=3, P=4)

    [Figure: with 3 attributes and 4 processors, P0, P1, P2 each acquire 4 attribute/leaf tasks while P3 gets none.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 103

    Fixed Window: Data Parallel

     Partition leaves into blocks/windows of K
     Dynamically acquire any attribute for any leaf within current block and evaluate it
     Last processor to work on a leaf notes the winning attribute and builds the hash table
     Split data as in BASIC
     Barrier synchronization between each block of K leaves

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 104


    FWK (Tree View)

    [Figure: P={0,1,2,3}; leaves are processed in windows of K with a barrier per window.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 105

    FW2 (Level View)

    [Figure: window of 2 leaves; P0 and P1 acquire 4 tasks each, P2 and P3 two each.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 106


    Moving Window: Data Parallel

     Partition leaves into blocks/windows of K
     Dynamically acquire any attribute for any leaf, say i, within current block
     Wait for last block's i-th leaf
     Last processor to work on a leaf notes the winning attribute and builds the hash table
     Split data as in BASIC
     No barriers, only conditional wait

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 107

    MWK (Tree Level)

    [Figure: P={0,1,2,3}; leaf-level conditional synchronization replaces the per-window barrier.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 108


    Experimental Datasets

    [Table: six synthetic datasets named by function, attribute count, and size — F2A8D1M, F2A32D250K, F2A64D125K and F7A8D1M, F7A32D250K, F7A64D125K — with 1000K/250K/125K records respectively, each 192MB; the F2 datasets yield small trees (4 levels), the F7 datasets much larger ones.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 113

    Setup and Sort Time

    [Table: per-dataset setup and sort times; setup and sort are each roughly 17-20% of total time on the F2 datasets, but only 3-4% on the F7 datasets.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 114


    Parallel Performance

    [Figure: run times of MW4 and SUBTREE on F2A64D125K and F7A64D125K for 1-4 processors.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 115

    Parallel Performance

    [Figure: speedups of MW4 and SUBTREE on the F2 and F7 A64D125K datasets for 1-4 processors.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 116


    Tutorial overview

     Overview of KDD and data mining
     Parallel design space
     Classification
     [Associations]
     Sequences
     Clustering
     Future directions and summary

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 117

    What is association mining?

     Given a set of items/attributes, and a set of objects containing a subset of the items
     Find rules: if I1 then I2 (sup, conf)
     - I1, I2 are sets of items
     - I1, I2 have sufficient support: P(I1+I2)
     - Rule has sufficient confidence: P(I2|I1)

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 118


    Association mining

     User specifies "interestingness"
     - Minimum support (minsup)
     - Minimum confidence (minconf)
     Find all frequent itemsets (> minsup)
     - Exponential search space
     - Computation and I/O intensive
     Generate strong rules (> minconf)
     - Relatively cheap

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 119

    Association Rule Discovery: Support and Confidence

    Association Rule: X => Y

    Example: {Diaper, Milk} => Beer

      support s = σ(Diaper, Milk, Beer) / Total Number of Transactions = 2/5 = 0.4
      confidence c = σ(Diaper, Milk, Beer) / σ(Diaper, Milk) = 2/3 ≈ 0.66

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 120
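The support/confidence arithmetic of the Diaper/Milk/Beer example can be checked directly. The five transactions below are an assumption — the slide shows only the counts — chosen so that σ(Diaper, Milk, Beer) = 2 and σ(Diaper, Milk) = 3, matching the example.

```python
# Hypothetical market-basket data consistent with the slide's counts.
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer', 'Eggs'},
    {'Milk', 'Diaper', 'Beer', 'Coke'},
    {'Bread', 'Milk', 'Diaper', 'Beer'},
    {'Bread', 'Milk', 'Diaper', 'Coke'},
]

def support(itemset, db):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(lhs, rhs, db):
    """P(rhs | lhs) = support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({'Diaper', 'Milk', 'Beer'}, transactions))                 # -> 0.4
print(round(confidence({'Diaper', 'Milk'}, {'Beer'}, transactions), 2))  # -> 0.67
```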


    Handling Exponential Complexity

     Given n transactions and m different items:
     - number of possible association rules: O(m·2^(m-1))
     - computation complexity: O(n·m·2^m)
     Systematic search for all patterns, based on the support constraint [Agrawal & Srikant]:
     - If {A,B} has support at least α, then both A and B have support at least α.
     - If either A or B has support less than α, then {A,B} has support less than α.
     - Use patterns of k-1 items to find patterns of k items.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 121

    Apriori Principle

     Collect single item counts. Find large items.
     Find candidate pairs, count them => large pairs of items.
     Find candidate triplets, count them => large triplets of items. And so on...
     Guiding Principle: Every subset of a frequent itemset has to be frequent.
     - Used for pruning many candidates.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 122
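The level-wise search with subset-based pruning can be written compactly. This is a teaching sketch, not an efficient implementation (no hash tree); the data is the five-author bookstore example used later in this tutorial, with minimum support 3 of 6 (50%).

```python
from itertools import combinations

def apriori(db, minsup):
    """Level-wise frequent itemset mining: size-k candidates come from
    frequent (k-1)-itemsets, and every (k-1)-subset must be frequent."""
    items = sorted({i for t in db for i in t})
    def count(c):
        return sum(1 for t in db if c <= t)
    frequent = {frozenset([i]) for i in items if count(frozenset([i])) >= minsup}
    all_freq, k = set(frequent), 2
    while frequent:
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if count(c) >= minsup}
        all_freq |= frequent
        k += 1
    return all_freq

db = [{'A','C','T','W'}, {'C','D','W'}, {'A','C','T','W'},
      {'A','C','D','W'}, {'A','C','D','T','W'}, {'C','D','T'}]
print(frozenset('ACTW') in apriori(db, 3))  # -> True
```

With this database and minsup = 3 the maximal frequent itemsets come out as ACTW and CDW, matching the example database slide.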


    Illustrating the Apriori Principle

     Minimum Support = 3

    If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates
    With support-based pruning: 6 + 6 + 2 = 14

    [Figure: the surviving 1-itemsets, 2-itemsets, and 3-itemsets.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 123

    Counting Candidates

     Frequent itemsets are found by counting candidates.
     Simple way:
     - Search for each candidate in each transaction. Expensive!!!

    [Figure: an N x M grid of transactions against candidates.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 124


    Association Rule Discovery: Hash Tree for Fast Access

    [Figure: candidate hash tree; a hash function on the next item directs each itemset to a subtree, with candidates stored at the leaves.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 125

    Association Rule Discovery: Subset Operation

    [Figure: a transaction is recursively hashed down the candidate hash tree so that only candidates in the reached leaves are checked.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 126

    Association Rule Discovery: Subset Operation (contd.)

    [Figure: continuation of the subset operation, reaching leaves containing candidates such as 1 2 4 and 4 5 7.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 127

    Parallel Formulation of Association Rules

     Need:
     - Huge transaction datasets (10s of TB)
     - Large number of candidates
     Data distribution:
     - Partition the transaction database, or
     - Partition the candidates, or
     - Both

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 128


    Parallel Association Rules: Intelligent Data Distribution (IDD)

     Data distribution using point-to-point communication.
     Intelligent partitioning of candidate sets.
     - Partitioning based on the first item of candidates.
     - Bitmap to keep track of local candidate items.
     Pruning at the root of candidate hash tree using the bitmap.
     Suitable for single data source such as a database server.
     With smaller candidate set, load balancing is difficult.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 133

    IDD: Illustration

    [Figure: each processor counts its own candidate partition while the transaction data shifts around the ring; an all-to-all broadcast distributes the candidates.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 134


    Filtering Transactions in IDD

    [Figure: a transaction is filtered through the bitmask at the root so only locally owned candidate subtrees are traversed.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 135

    Parallel Association Rules: Hybrid Distribution (HD)

     Candidate set is partitioned into G groups to just fit in main memory
     - Ensures good load balance with smaller candidate set.
     Logical processor mesh G x P/G is formed.
     Perform IDD along the column processors
     - Data movement among processors is minimized.
     Perform CD along the row processors
     - Global reduction runs over a smaller number of processors.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 136


    HD: Illustration

    [Figure: G groups of P/G processors each; IDD runs down the columns while CD runs along the rows.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 137

    Parallel Association Rules: Experimental Setup

     128-processor Cray T3D
     - 150 MHz DEC Alpha (EV4)
     - 64 MB of main memory per processor
     - 3-D torus interconnection network with peak unidirectional bandwidth of 150 MB/sec.
     MPI used for communications.
     Synthetic data set: avg transaction size 15 and 1000 distinct items.
     For larger data sets, multiple reads of transactions in blocks of 1000.
     HD switches to CD after 90.7% of the total computation is done.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 138


    Parallel Association Rules: Scaleup Results (100K, 0.25%)

    [Figure: scaleup of CD and IDD.]


    Parallel Association Rules: Response Time (np=16, 50K)

    [Figure: response times of CD, HD, IDD, and simple hybrid as minimum support drops from 0.5% to 0.06%; candidate counts grow from 211K to 2408K.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 141

    Parallel Association Rules: Response Time (np=64, 50K)

    [Figure: response times of CD, HD, IDD, and simple hybrid as minimum support drops from 0.5% to 0.04%; candidate counts grow from 211K to 5232K.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 142


    Parallel Association Rules: Minimum Support Reachable

    [Figure: smallest minimum support reachable (0.25% down to 0.02%) for each scheme.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 143

    Parallel Association Rules: Processor Configuration in HD

    [Figure: 64 processors at 0.04 minimum support; response time for different G x P/G mesh configurations.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 144


    Parallel Association Rules: Summary of Experiments

     HD shows the same linear speedup and sizeup behavior as that of CD.
     HD exploits total aggregate main memory, while CD does not.
     IDD has much better scaleup behavior than DD.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 145

    Eclat Approach

     Frequent itemset lattice
     Vertical or "inverted" tid-list format
     Support counting via intersections
     Lattice decomposition: break into subproblems
     Efficient search strategies
     Independent solution of subproblems

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 146


    Example Database

    Distinct database items: A = Jane Austen, C = Agatha Christie, D = Sir Arthur Conan Doyle, T = Mark Twain, W = P. G. Wodehouse

    Database (transaction : items):
     1 : A C T W
     2 : C D W
     3 : A C T W
     4 : A C D W
     5 : A C D T W
     6 : C D T

    All frequent itemsets (min support = 50%):
     100% (6) : C
     83% (5) : W, CW
     67% (4) : A, D, T, AC, AW, CD, CT, ACW
     50% (3) : AT, DW, TW, ACT, ATW, CDW, CTW, ACTW

    Maximal itemsets (min support = 50%): ACTW, CDW

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 147

    Frequent Itemset Lattice

    [Figure: the itemset lattice over ACDTW; support is downward closed on itemsets, and the maximal frequent itemsets ACTW and CDW bound the frequent region.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 148


    Eclat: Support Counting

    [Figure: transaction (tid) lists for the items; intersecting C & D gives the tid-list for CD, and intersecting CD & CW gives the tid-list for CDW.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 149

    Eclat: Lattice Decomposition

    Equivalence classes (by common prefix):
     [A] = { A, AC, AT, AW, ACT, ACW, ATW, ACTW }
     [C] = { C, CD, CT, CW, CDW, CTW }
     [D] = { D, DW }
     [T] = { T, TW }
     [W] = { W }

    Cross-class links are used for pruning.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 150


    Lattice Search Strategies

     Bottom-up
     - Level-wise like Apriori
     Top-down
     - Start with largest element. If frequent, done! Else look at subsets
     Hybrid
     - Find long frequent/maximal itemsets
     - Find remaining using bottom-up search

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 151

    Bottom-up Lattice Search

    [Figure: within class [A], new equivalence classes [AC] = { AC, ACT, ACW, ACTW }, [AT] = { AT, ATW }, [AW] = { AW }; known maximal frequent itemsets: CDW, CTW.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 152


    Count Distribution

    [Figure: processors 0, 1, 2 each hold a partition of the database and repeat per level:]
     Count items => get frequent items
     Form candidate pairs, count pairs => get frequent pairs
     Form candidate triples, count triples => ...

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 155
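Count Distribution is easy to simulate in miniature. This is a hedged sketch: lists stand in for the per-processor partitions, and the final sum stands in for the global reduction (an all-reduce in the real message-passing code).

```python
# Every "processor" counts all candidates over its own partition;
# a sum-reduction then yields exact global supports.

def local_counts(candidates, partition):
    """Support counts of each candidate within one database partition."""
    return [sum(1 for t in partition if c <= t) for c in candidates]

def count_distribution(candidates, partitions):
    per_proc = [local_counts(candidates, p) for p in partitions]
    # the reduction step: sum each candidate's count across processors
    return [sum(col) for col in zip(*per_proc)]

candidates = [frozenset('AC'), frozenset('CD')]
partitions = [[{'A','C','T','W'}, {'C','D','W'}],
              [{'A','C','T','W'}, {'A','C','D','W'}],
              [{'A','C','D','T','W'}, {'C','D','T'}]]
print(count_distribution(candidates, partitions))  # -> [4, 4]
```

Note that every processor holds the full candidate set — which is exactly the memory limitation that IDD and HD address.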

    Parallel Eclat

    [Figure: processors 0, 1, 2; the partitioned database is selectively replicated so each processor holds the tid-lists its assigned classes need.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 156


    Parallel Eclat Algorithm

     Itemset lattice class partitioning
     - Equivalence classes: [A] = { AC, AT, AW }, [C] = { CD, CT, CW }, [D] = { DW }, [T] = { TW }
     Class schedule
     - Processor 1 (P1) is assigned [C] = { CD, CT, CW } and [D] = { DW }; the remaining classes go to Processor 0 (P0)
     Tid-list communication

    [Figure: the original database, the partitioned database, and the lists on P0 and P1 after the tid-list exchange.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 157

    Eclat Experiments

     Database: T20.I6.D4550K
     - 1000 items
     - 4,550,000 transactions
     Machine: eight 4-way SMP nodes
     - Digital's Memory Channel (5 μs, 30MB/s)
     - 233 MHz, 256 MB, 1MB L2 cache
     Hierarchical parallelization
     - Message passing + SMP

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 158


    Parallel Performance

    [Figure: parallel runtime and speedup with 1 process per host, for 1, 2, 4, and 8 hosts.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 159

    Parallel Performance

    [Figure: parallel runtime and speedup with 2 processes per host, for 1, 2, 4, and 8 hosts.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 160


    Tutorial overview

     Overview of KDD and data mining
     Parallel design space
     Classification
     Associations
     [Sequences]
     Clustering
     Future directions and summary

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 161

    Discovering Sequential Associations

    Given: a set of objects with associated event occurrences.

    [Figure: event timelines for Object 1 and Object 2 over times 10-50.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 162


    Sequential Pattern Discovery: Definition

    Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

      { (A B) (C) } => { (D E) }

    Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

    [Figure: the pattern (A B) (C) (D E) annotated with its window-size, gap, and span constraints.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 163

    Sequential Patterns: Complexity

     Much higher computational complexity than association rule discovery.
     - O(m^k · 2^(k-1)) possible sequential patterns having k events, where m is the total number of possible events.
     Time constraints offer some pruning. Further use of support-based pruning contains complexity.
     - A subsequence of a sequence occurs at least as many times as the sequence.
     - A sequence has no more occurrences than any of its subsequences.
     - Build sequences in increasing number of events. [GSP algorithm by Agrawal & Srikant]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 164


    Sequential Apriori: Count Operation (contd.)

     Hash tree used for fast search of candidate occurrences. Similar to association rule discovery, except for the following differences.
     - Every event-timestamp pair in the timeline is hashed at the root.
     - Events eligible for hashing at the next level are determined by the maximum gap (xg), window size (ws), and span (ms) constraints.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 167

    Sequence Hash Tree for Fast Access

    [Figure: candidate sequence hash tree; hashing an object's timeline (events at times 0, 5, 10, 12, 15) yields four interesting paths: 1@0, 2@5, 3@5; 1@0, 3@5; 4@0, 5@5; 1@12, 1@15.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 168


    Counting Leaf-Level Candidates

    [Figure: part of the candidate sequence hash tree; the leaf's candidates are checked against Object 2's timeline, yielding Count = 12.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 169

    Parallel Sequential Associations

     Need:
     - Enormity of data.
     - Memory and disk limitations of serial algorithms running on a single processor.
     Can algorithms for non-sequential associations be extended easily?
     - Sequential nature gives rise to complex issues: longer timelines, large span values, large number of candidates.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 170


    Parallel Sequential Associations: Event Distribution (EVE)

    [Figure: timelines partitioned across P0, P1, P2; an all-to-all broadcast of support counts combines the local counts.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 171

    EVE Algorithm: Challenging Case

    [Figure: when an occurrence spans processor boundaries, hash tree states and partial counts are transferred between neighboring processors.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 172


    Event and Candidate Distribution (EVECAN)

    [Figure: both events and candidates are partitioned; candidates are rotated among the processors in round-robin manner.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 173

    EVECAN: Parallelizing Candidate Generation

     Candidates stored in a distributed hash table
     Hash function: candidate sequence S => h(S) = (Pi, I)

    [Figure: candidate sequences distributed across processors; maximal frequent sequences: AC->D, AC->TW, C->D->TW.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 176


    Frequent Sequence Lattice

    [Figure: lattice induced by the maximal frequent sequences AC->D, AC->TW, C->D->TW.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 177

    Sequence Lattice Decomposition

    [Figure: the lattice is decomposed into prefix-based classes that can be solved independently.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 178


    Temporal Joins

    [Figure: customer-transaction (cid, time) list intersection — joining the lists of P->X and P->Y yields the lists for P->X->Y, P->Y->X, and P->XY.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 179

    Dynamic Computation Tree

    [Figure: the tree of joins is expanded dynamically as frequent sequences are found.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 180
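The temporal join on (cid, time) lists can be sketched directly. This is an illustrative nested-loop version, not SPADE's optimized merge over sorted lists; the input lists and their times are hypothetical.

```python
# Joining the lists of P->X and P->Y produces P->X->Y (X before Y),
# P->Y->X (Y before X), and P->XY (equal times), in one pass over
# matching cids. Each entry is (cid, time of the last event).

def temporal_join(px, py):
    x_then_y, y_then_x, xy = [], [], []
    for cid, tx in px:
        for c2, ty in py:
            if c2 != cid:
                continue
            if tx < ty:
                x_then_y.append((cid, ty))   # result keeps the later time
            elif ty < tx:
                y_then_x.append((cid, tx))
            else:
                xy.append((cid, tx))          # simultaneous: event set XY
    return x_then_y, y_then_x, xy

px = [(1, 20), (4, 60)]
py = [(1, 70), (4, 10)]
print(temporal_join(px, py))  # -> ([(1, 70)], [(4, 60)], [])
```

The support of each result sequence is the number of distinct cids in its list, so one join simultaneously counts all three extensions.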


    Parallel Design Space

     Data parallelism: within a class
     - Idlist parallelism (single idlist join; level-wise idlist join)
     - Join parallelism
     Task parallelism: between classes
     - pSPADE
     Static vs. dynamic load balancing

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 181

    Idlist Data Parallelism

     Single idlist join
     - Split idlists on cid range
     - Each proc intersects over its cid range
     - Barrier synchronization for each join

    [Figure: parallel idlist intersection with P0 on cids 1-500 and P1 on cids 501-1000.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 182


    Idlist Data Parallelism

     Level-wise idlist join
     - Process all classes at current level
     - Each proc still intersects over its local DB
     - Barrier synchronization and sum-reduction for each level

    [Figure: parallel pair-wise intersections with P0 on cids 1-500 and P1 on cids 501-1000.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 183

    Join Data Parallelism

     Each processor performs different intersections within a class
     Barrier after self-joining within a class, before processing classes at next level
     # synchronizations is between Single and Level-wise parallelism
     Assign each idlist to a different processor

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 184


    Task Parallelism: Static Load Balancing

     Given level-1 classes C1, C2, C3, ...
     Assign weights W1, W2, W3, ... based on the class size (# sequences in the class)
     Greedy scheduling of entire classes
     - Sort classes on weight (descending)
     - Assign class to proc with least total weight
     Each proc solves assigned classes asynchronously
     Hard to get accurate work estimate

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 185
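The greedy scheduling step above can be sketched in a few lines (an illustrative version; the function name and sample weights are assumptions):

```python
# Sort classes by weight descending, then repeatedly hand the next
# class to the currently least-loaded processor.

def greedy_schedule(weights, nprocs):
    load = [0] * nprocs
    assign = [[] for _ in range(nprocs)]
    for cls, w in sorted(enumerate(weights), key=lambda x: -x[1]):
        p = load.index(min(load))   # least-loaded processor
        load[p] += w
        assign[p].append(cls)
    return assign, load

assign, load = greedy_schedule([7, 5, 4, 3, 1], 2)
print(load)  # -> [10, 10]
```

This is the classic longest-processing-time heuristic; its weakness here is exactly the one the slide names — the weights are only estimates of the real per-class work, which motivates the dynamic schemes that follow.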

    Static and Dynamic Load Balancing

    [Figure: comparison of static and dynamic load balancing.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 186


Task Parallelism: Dynamic Load Balancing
- Given level 1 classes C1, C2, C3, ... and their weights
- Sort classes on weight (descending) and insert in a logical task queue
- A proc atomically grabs the first available class and solves it completely
- Repeat until no more classes in queue
- Uses only inter-class parallelism
- Queue contention negligible since classes are coarse-grained
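A sketch of the shared task queue, using Python's thread-safe `queue.Queue` to model the atomic grab; the weights and the stand-in for "solving" a class are illustrative:

```python
# Dynamic load balancing sketch: classes sorted on weight go into a
# shared queue; each worker atomically grabs the next available class
# and solves it completely before grabbing another.
import queue
import threading

def run_dynamic(weights, num_procs):
    tasks = queue.Queue()
    for cls in sorted(weights, key=weights.get, reverse=True):
        tasks.put(cls)

    done = [[] for _ in range(num_procs)]

    def worker(pid):
        while True:
            try:
                cls = tasks.get_nowait()   # atomic grab from shared queue
            except queue.Empty:
                return                     # no more classes
            done[pid].append(cls)          # stand-in for mining the class

    threads = [threading.Thread(target=worker, args=(p,))
               for p in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

done = run_dynamic({"C1": 90, "C2": 50, "C3": 40}, 2)
all_classes = sorted(c for lst in done for c in lst)
```

Because classes are coarse-grained, workers touch the queue rarely, so contention stays negligible.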

Task Parallelism: Recursive DLB (pSPADE)
- Process available classes in parallel
- Worst case: P-1 procs free, 1 proc busy
- Classes sorted on weight, last class usually small (but it could be large)
- Provide mechanism for free procs to join busy group
- At each level, get free procs, insert the classes into a shared task queue; process available classes in parallel


pSPADE: Recursive Dynamic Load Balancing
(figure)

Experimental Setup
- SGI Origin 2000
  - 12 195MHz R10K MIPS processors
  - 4MB cache, 2GB memory
- Synthetic datasets
  - C: number of transactions/customer
  - T: average transaction size
  - S/I: average sequence/itemset size
  - D: number of customers


Data vs. Task Parallelism
(figure: execution times on 1, 2, and 4 processors for C10-T5-S4-I1.25-D1M and C20-T5-S8-I1.25-D1M, comparing Data and Task parallelism)

Static vs. Dynamic DLB
(figure: execution times for Static, Dynamic, and Recursive on C10-T5-S4-I1.25, C20-T5-S8-I1.25, and C20-T5-S8-I2)
- Dynamic is up to 38% better than Static
- Recursive is up to 12% better than Dynamic
- Overall, Recursive is 44% better than Static


Parallel Performance
(figure: execution time and speedup on 1, 2, 4, and 8 processors for C10-T5-S4-I1.25-D1M)

Parallel Performance
(figure: execution time and speedup on 1, 2, 4, and 8 processors for C20-T5-S8-I2-D1M)


Scaleup and Support
(figure: execution time as the number of customers grows from 1M to 10M for C5-T2.5-S4-I1.25, and as minimum support drops from 0.10% to 0.01% for C5-T2.5-S4-I1.25-D1M)

pSPADE Summary
- Task parallel better than data parallel
- Dynamic load balancing
- Asynchronous algorithms (independent classes)
- Good locality (uses only intersections)
- Good speedup and scaleup
- What next? Gigabyte databases


Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary

What is clustering?
- Given N k-dimensional feature vectors, find a "meaningful" partition of the N examples into c subsets or groups
- Discover the "labels" automatically
- c may be given, or "discovered"
- Much more difficult than classification, since in the latter the groups are given and we seek a compact description


Clustering schemes
- Distance-based
  - Numeric: Euclidean distance (root of sum of squared differences along each dimension); angle between two vectors
  - Categorical: number of common features
- Partition-based
  - Enumerate partitions and score each

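The distance and similarity measures above can be sketched as follows; the example vectors are illustrative:

```python
# Distance/similarity measures for clustering.
import math

def euclidean(x, y):
    # Root of sum of squared differences along each dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_angle(x, y):
    # Angle (in radians) between two vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(dot / norm)

def common_features(x, y):
    # Categorical similarity: number of matching features.
    return sum(1 for a, b in zip(x, y) if a == b)

d = euclidean([0, 0], [3, 4])                            # 3-4-5 triangle
theta = cosine_angle([1, 0], [0, 1])                     # orthogonal vectors
k = common_features(["red", "suv"], ["red", "sedan"])    # one shared feature
```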

Clustering schemes
- Model-based
  - Estimate a density (e.g., a mixture of gaussians)
  - Go bump-hunting
  - Compute P(FeatureVector_i | Cluster_j)
  - Finds overlapping clusters too
  - Example: bayesian clustering


Before clustering
- Normalization:
  - Given three attributes: A in microseconds, B in milliseconds, C in seconds
  - Can't treat differences as the same in all dimensions or attributes
  - Need to scale or normalize for comparison
  - Can assign weights for more importance
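One common way to do such scaling is min-max normalization; this sketch (the data and the optional per-attribute weights are made up) rescales each attribute to [0, 1] so that microseconds, milliseconds, and seconds become comparable:

```python
# Min-max normalization sketch: rescale each attribute (column) to
# [0, 1]; optional weights express per-attribute importance.
# Assumes each attribute is non-constant (hi != lo).

def normalize(rows, weights=None):
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    weights = weights or [1.0] * len(cols)
    return [[weights[j] * (v - lo[j]) / (hi[j] - lo[j])
             for j, v in enumerate(row)]
            for row in rows]

rows = [[100.0, 2.0, 0.5],    # A (us), B (ms), C (s)
        [300.0, 4.0, 1.5],
        [500.0, 6.0, 2.5]]
norm = normalize(rows)        # every attribute now spans [0, 1]
```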

The k-means algorithm
- Specify 'k', the number of clusters
- Guess 'k' seed cluster centers
- 1) Look at each example and assign it to the center that is closest
- 2) Recalculate the center
- Iterate on steps 1 and 2 till centers converge, or for a fixed number of times
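The two steps above, iterated to convergence, can be sketched as follows; the 1-D points and seeds are illustrative:

```python
# Minimal k-means sketch following the steps above.

def kmeans(points, seeds, max_iters=100):
    centers = list(seeds)
    for _ in range(max_iters):
        # 1) Assign each example to the closest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # 2) Recalculate each center as the mean of its cluster.
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:        # centers converged
            return centers, clusters
        centers = new
    return centers, clusters

centers, clusters = kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
                           seeds=[0.0, 5.0])
```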


K-means algorithm
(figure: data points with the initial seeds marked)

K-means algorithm
(figure: points assigned to the nearest seed and new centers computed)


Parallel k-means
- Divide N points among P processors
- Replicate the k centroids
- Each processor computes distance of each local point to the centroids
- Assign points to closest centroid and compute local MSE
- Perform reduction for global centroids and global MSE value
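The scheme above, simulated sequentially for one iteration: each partition plays the role of one processor's local points, and the sum-reduction combines local sums, counts, and squared error. The 1-D data and seed centroids are illustrative:

```python
# One iteration of parallel k-means, simulated. On a real machine the
# local steps run concurrently and the reduction is a collective op.

def local_step(points, centroids):
    k = len(centroids)
    sums, counts, sse = [0.0] * k, [0] * k, 0.0
    for p in points:
        i = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
        sums[i] += p               # local partial sum per centroid
        counts[i] += 1             # local count per centroid
        sse += (p - centroids[i]) ** 2
    return sums, counts, sse

def parallel_iteration(partitions, centroids):
    # Each partition = one processor's local data (run "in parallel").
    locals_ = [local_step(part, centroids) for part in partitions]
    k = len(centroids)
    # Sum-reduction across processors for global centroids and MSE.
    sums = [sum(l[0][i] for l in locals_) for i in range(k)]
    counts = [sum(l[1][i] for l in locals_) for i in range(k)]
    mse = sum(l[2] for l in locals_) / sum(counts)
    return [sums[i] / counts[i] for i in range(k)], mse

parts = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]   # P = 2 processors
centroids, mse = parallel_iteration(parts, [0.0, 9.0])
```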

Gaussian mixture models
- Drawbacks of k-means
  - Doesn't do well with overlapping clusters
  - Clusters easily pulled off center by outliers
  - Each record is either in or out of a cluster; no notion of some records being more or less likely than others to really belong to the cluster they have been assigned
- GMM: probabilistic variant of k-means


Expectation-maximization (EM)
- Choose K seeds: means of a gaussian distribution
- Estimation: calculate probability of belonging to a cluster based on distance
- Maximization: move mean of gaussian to centroid of data set, weighted by the contribution of each point
- Repeat till means don't move
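A sketch of these two steps for a 1-D mixture of two gaussians; the data, seeds, and fixed equal variances are assumptions for illustration (a full GMM would also update variances and mixing weights):

```python
# EM sketch for a 1-D gaussian mixture with fixed, equal variances.
import math

def em(points, means, var=1.0, iters=50):
    means = list(means)
    for _ in range(iters):
        # Estimation: soft membership of each point in each cluster.
        resp = []
        for p in points:
            dens = [math.exp(-(p - m) ** 2 / (2 * var)) for m in means]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # Maximization: move each mean to the contribution-weighted
        # centroid of the data.
        for j in range(len(means)):
            w = sum(r[j] for r in resp)
            means[j] = sum(r[j] * p for r, p in zip(resp, points)) / w
    return means

means = em([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], means=[0.0, 5.0])
# The means settle near the two groups (about 2 and 11), but every
# point keeps a (possibly tiny) probability of belonging to each cluster.
```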

Deviation/outlier detection
- Find points that are very different from the other points in the dataset
- Could be "noise" that causes problems for classification or clustering
- Could be the really "interesting" points; for example, in fraud detection we are mainly interested in finding the deviations from the norm


Deviation detection
(figure: data clusters with a single outlier point far from the rest)

K-nearest neighbors
- Classification technique to assign a class to a new example
- Find k-nearest neighbors, i.e., most similar points in the dataset (compare against all points!)
- Assign the new case to the same class to which most of its neighbors belong

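The procedure can be sketched as follows; the tiny 2-D dataset and query point are illustrative:

```python
# k-NN sketch: compare the new example against all points, take the
# k most similar, and let them vote on the class.
from collections import Counter

def knn_classify(data, query, k=3):
    # data: list of ((x, y), label); squared Euclidean distance
    # suffices since we only need the ordering.
    by_dist = sorted(data, key=lambda d: (d[0][0] - query[0]) ** 2 +
                                         (d[0][1] - query[1]) ** 2)
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

data = [((0, 0), "o"), ((1, 0), "o"), ((0, 1), "o"),
        ((5, 5), "+"), ((6, 5), "+")]
label = knn_classify(data, query=(0.5, 0.5), k=3)
```

Note the full scan over the dataset per query: this is what makes naive k-NN expensive on large data.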

K-nearest neighbors
(figure: the new point's neighborhood contains 5 examples of class o and 3 of class +, so it is assigned class o)

Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary


Large-scale Parallel KDD Systems
- Terabyte-sized datasets
- Centralized or distributed datasets
- Incremental changes
- Heterogeneous data sources
- Pre-processing, mining, post-processing
- Modular (rapid development)
- Benchmarking (algorithm comparison)


Research Directions
- Fast algorithms: different mining tasks
  - Classification, clustering, associations, etc.
  - Incorporating concept hierarchies
- Parallelism and scalability
  - Millions of records
  - Thousands of attributes/dimensions
  - Single pass algorithms
  - Sampling
  - Parallel I/O and file systems


Research Directions (contd.)
- Data locality and type
  - Distributed data sources (www)
  - Text and multimedia mining
  - Spatial data mining
- Incremental mining: refine knowledge as data changes
- Interactivity: anytime mining


Research Directions (contd.)
- Tight database integration
  - Push common primitives inside DBMS
  - Use multiple tables
  - Use efficient indexing techniques
  - Caching strategies for sequences of data mining operations
  - Data mining query language and parallel query optimization


Research Directions (contd.)
- Understandability: too many patterns
  - Incorporate background knowledge
  - Integrate constraints
  - Meta-level mining
  - Visualization
- Usability: build a complete system
  - Pre-processing, mining, post-processing, persistent management of mined results


Summary of the Tutorial
- Data mining is a rapidly growing field
  - Fueled by enormous data collection rates, and the need for intelligent analysis for business and scientific gains
- The large and high-dimensional nature of data requires new analysis techniques and algorithms
- Scalable, fast parallel algorithms are becoming indispensable
- Many research and commercial opportunities!!!


Resources
Workshops
- HiPC Special Session on Large-Scale Data Mining, 2000. http://www.cs.rpi.edu/~zaki/LSDM/
- ACM SIGKDD Workshop on Distributed Data Mining, 2000. http://www.eecs.wsu.edu/~hillol/DKD/dpkd2000.html
- 3rd IEEE IPDPS Workshop on High Performance Data Mining, 2000. http://www.cs.rpi.edu/~zaki/HPDM/
- ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems, 1999. http://www.cs.rpi.edu/~zaki/WKDD99/
- ACM SIGKDD Workshop on Distributed Data Mining, 1998. http://www.eecs.wsu.edu/~hillol/DDMWS/papers.html
- 1st IEEE IPPS Workshop on High Performance Data Mining, 1998. http://www.cise.ufl.edu/~ranka/
Books
- A. Freitas and S. Lavington. Mining very large databases with parallel processing. Kluwer Academic Pub., Boston, MA, 1998.
- M. J. Zaki and C.-T. Ho (eds). Large-Scale Parallel Data Mining. LNAI State-of-the-Art Survey, Volume 1759, Springer-Verlag, 2000.
- H. Kargupta and P. Chan (eds). Advances in Distributed and Parallel Knowledge Discovery. AAAI Press, Summer 2000.

Resources (contd.)
Journal Special Issues
- P. Stolorz and R. Musick (eds.). Scalable High-Performance Computing for KDD. Data Mining and Knowledge Discovery: An International Journal, Vol. 1, No. 4, December 1997.
- Y. Guo and R. Grossman (eds.). Scalable Parallel and Distributed Data Mining. Data Mining and Knowledge Discovery: An International Journal, Vol. 3, No. 3, September 1999.
- V. Kumar, S. Ranka and V. Singh. High Performance Data Mining. Journal of Parallel and Distributed Computing, forthcoming, 2000.
Survey Articles
- F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(2):131--169, 1999.
- A. Srivastava, E.-H. Han, V. Kumar and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237--262, 1999.
- M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency special issue on Parallel Data Mining, 7(4):14--25, Oct-Dec 1999.
- D. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, 7(4):26--35, Oct-Dec 1999.
- M. V. Joshi, E.-H. Han, G. Karypis and V. Kumar. Efficient parallel algorithms for mining associations. In Zaki and Ho (eds.), Large-Scale Parallel Data Mining, LNAI 1759, Springer-Verlag 2000.
- M. J. Zaki. Parallel and distributed data mining: An introduction. In Zaki and Ho (eds.), Large-Scale Parallel Data Mining, LNAI 1759, Springer-Verlag 2000.



References: Associations (contd.)
- A. Mueller. Fast sequential and parallel algorithms for association rule mining: A comparison. Technical Report CS-TR-3515, University of Maryland, College Park, August 1995.
- J. S. Park, M. Chen, and P. S. Yu. Efficient parallel data mining for association rules. In ACM Intl. Conf. Information and Knowledge Management, November 1995.
- T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for mining association rules. In 4th Intl. Conf. Parallel and Distributed Info. Systems, December 1996.
- T. Shintani and M. Kitsuregawa. Parallel algorithms for mining generalized association rules with classification hierarchy. In ACM SIGMOD International Conference on Management of Data, May 1998.
- M. Tamura and M. Kitsuregawa. Dynamic load balancing for parallel association rule mining on heterogeneous PC cluster systems. In 25th Int'l Conf. on Very Large Data Bases, September 1999.
- M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4):14--25, October-December 1999.
- M. J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li. Parallel data mining for association rules on shared-memory multi-processors. In Supercomputing'96, November 1996.
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for fast discovery of association rules. Data Mining and Knowledge Discovery: An International Journal, 1(4):343--373, December 1997.
- M. J. Zaki, S. Parthasarathy, and W. Li. A Localized Algorithm for Parallel Association Mining. 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), June 1997.


References: Sequences
- R. Agrawal and R. Srikant. Mining sequential patterns. In 11th Intl. Conf. on Data Engg., 1995.
- H. Mannila, H. Toivonen and I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery: An International Journal, 1(3):259--289, 1997.
- T. Oates, M. D. Schmill, and P. R. Cohen. Parallel and distributed search for structure in multivariate time series. In 9th European Conference on Machine Learning, 1997.
- T. Oates, M. D. Schmill, D. Jensen, and P. R. Cohen. A family of algorithms for finding temporal structure in data. In 6th Intl. Workshop on AI and Statistics, March 1997.
- T. Shintani and M. Kitsuregawa. Mining algorithms for sequential patterns in parallel: Hash based approach. In 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining, April 1998.
- R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In 5th Intl. Conf. Extending Database Technology, March 1998.
- M. J. Zaki. Efficient enumeration of frequent sequences. In 7th Intl. Conf. on Information and Knowledge Management, November 1998.
- M. J. Zaki. Parallel Sequence Mining on SMP Machines. ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems, 1999 (LNAI Vol. 1759).



References: Clustering
- K. Alsabti, S. Ranka, V. Singh. An Efficient K-Means Clustering Algorithm. 1st IPPS Workshop on High Performance Data Mining, March 1998.
- I. Dhillon and D. Modha. A data clustering algorithm on distributed memory machines. In Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
- L. Iyer and J. Aronson. A parallel branch-and-bound algorithm for cluster analysis. Annals of Operations Research, Vol. 90, pp. 65--86, 1999.
- E. Johnson and H. Kargupta. Collective hierarchical clustering from distributed heterogeneous data. In Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
- D. Judd, P. McKinley, and A. Jain. Large-scale parallel data clustering. In Int'l Conf. Pattern Recognition, August 1996.
- X. Li and Z. Fang. Parallel clustering algorithms. Parallel Computing, 11:270--290, 1989.
- C. F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21:1313--1325, 1995.
- S. Ranka and S. Sahni. Clustering on a hypercube multicomputer. IEEE Trans. on Parallel and Distributed Systems, 2(2):129--137, 1991.
- F. Rivera, M. Ismail, and E. Zapata. Parallel squared error clustering on hypercube arrays. Journal of Parallel and Distributed Computing, 8:292--299, 1990.
- G. Rudolph. Parallel clustering on a unidirectional ring. In R. Grebe et al., editor, Transputer Applications and Systems'93: Volume 1, pages 487--493. IOS Press, Amsterdam, 1993.
- H. Nagesh, S. Goil and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report 9906-010, Center for Parallel and Distributed Computing, Northwestern University, June 1999.
- X. Xu, J. Jäger and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery: An International Journal, 3(3):263--290, 1999.


References: Distributed DM
- J. Aronis, V. Kolluri, F. Provost, and B. Buchanan. The WORLD: Knowledge discovery from multiple distributed databases. In Florida Artificial Intelligence Research Symposium, May 1997.
- R. Bhatnagar and S. Srinivasan. Pattern discovery in distributed databases. In AAAI National Conference on Artificial Intelligence, July 1997.