
Tutorial PM-3
High Performance Data Mining
Vipin Kumar (University of Minnesota)
Mohammed Zaki (Rensselaer Polytechnic)


High Performance Data Mining

Vipin Kumar, Computer Science Dept., University of Minnesota, Minneapolis, MN, USA. [email protected], www.cs.umn.edu/~kumar

Mohammed J. Zaki, Computer Science Dept., Rensselaer Polytechnic Institute, Troy, NY, USA. [email protected], www.cs.rpi.edu/~zaki

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 1

Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary


    Overview of data mining

- What is data mining/KDD?
- Why is KDD necessary?
- The KDD process
- Mining operations and methods
- Core issues in KDD


    What is data mining?

The iterative and interactive process of discovering valid, novel, useful, and understandable patterns or models in massive databases.


What is data mining?
- Valid: generalize to the future
- Novel: what we don't know
- Useful: be able to take some action
- Understandable: leading to insight
- Iterative: takes multiple passes
- Interactive: human in the loop


Data mining goals
- Prediction
  - What?
  - Opaque
- Description
  - Why?
  - Transparent


Data mining operations
- Verification driven
  - Validating hypotheses
  - Querying and reporting (spreadsheets, pivot tables)
  - Multidimensional analysis (dimensional summaries); On-Line Analytical Processing
  - Statistical analysis


Data mining operations
- Discovery driven
  - Exploratory data analysis
  - Predictive modeling
  - Database segmentation
  - Link analysis
  - Deviation detection


    KDD Process


Data mining process
- Understand application domain
  - Prior knowledge, user goals
- Create target dataset
  - Select data, focus on subsets
- Data cleaning and transformation
  - Remove noise, outliers, missing values
  - Select features, reduce dimensions


Data mining process
- Apply data mining algorithm
  - Associations, sequences, classification, clustering, etc.
- Interpret, evaluate and visualize patterns
  - What's new and interesting?
  - Iterate if needed
- Manage discovered knowledge
  - Close the loop

Why mine data? Commercial viewpoint
- Lots of data is being collected and warehoused
- Computing has become affordable
- Competitive pressure is strong
  - Provide better, customized services for an edge
  - Information is becoming a product in its own right


Why mine data? Scientific viewpoint
- Data collected and stored at enormous speeds (Gbyte/hour)
  - remote sensors on a satellite
  - telescopes scanning the skies
  - microarrays generating gene expression data
  - scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data
- Data mining for data reduction
  - cataloging, classifying, segmenting data
  - helps scientists in hypothesis formation


Origins of data mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, database systems, and data visualization
- Traditional techniques may be unsuitable
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data


Data mining methods
- Predictive modeling (classification, regression)
- Segmentation (clustering)
- Dependency modeling (graphical models, density estimation)
- Summarization (associations)
- Change and deviation detection


Data mining techniques
- Association rules: detect sets of attributes that frequently co-occur, and rules among them, e.g., 90% of the people who buy cookies also buy milk (60% of all grocery shoppers buy both).
- Sequence mining (categorical): discover sequences of events that commonly occur together, e.g., in a set of DNA sequences ACGTC is followed by GTCA after a gap of 9, with 30% probability.
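The support and confidence figures in the association-rule example are simple frequency ratios; a minimal sketch (with made-up baskets, not the tutorial's data) shows how both measures are computed:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs | rhs) / support(lhs): how often rhs co-occurs given lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# Toy baskets (made up): 10 shoppers
baskets = [set(b) for b in (
    ["cookies", "milk"], ["cookies", "milk"], ["cookies", "milk"],
    ["cookies", "milk"], ["cookies", "milk"], ["cookies", "milk"],
    ["cookies"], ["cookies", "bread"], ["milk"], ["bread"],
)]
both = support({"cookies", "milk"}, baskets)        # 6 of 10 baskets
rule = confidence({"cookies"}, {"milk"}, baskets)   # 6 of the 8 cookie baskets
```

Here the rule cookies => milk holds with 60% support and 75% confidence, mirroring the structure of the 90%/60% example above.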


Data mining techniques
- Classification and regression: assign a new data record to one of several predefined categories or classes. Regression deals with predicting real-valued fields. Also called supervised learning.
- Clustering: partition the dataset into subsets or groups such that elements of a group share a common set of properties, with high within-group similarity and low inter-group similarity. Also called unsupervised learning.

Data mining techniques
- Similarity search: given a database of objects and a "query" object, find the object(s) that are within a user-defined distance of the queried object, or find all pairs within some distance of each other.
- Deviation detection: find the record(s) that is (are) the most different from the other records, i.e., find all outliers. These may be thrown away as noise or may be the "interesting" ones.


Data mining techniques
- Many other methods, such as
  - Neural networks
  - Genetic algorithms
  - Hidden Markov models
  - Time series
  - Bayesian networks
  - Soft computing: rough and fuzzy sets

Main challenges for KDD
- Scalability
  - Efficient and sufficient sampling
  - In-memory vs. disk-based processing
  - High performance computing
- Automation
  - Ease of use
  - Using prior knowledge


Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary

Speeding up data mining
- Data oriented approach
  - Discretization
  - Feature selection
  - Feature construction (PCA)
  - Sampling
- Methods oriented approach
  - Efficient and scalable algorithms


Speeding up data mining (contd.)
- Methods oriented approach (contd.)
  - Parallel data mining
    - Task or control parallelism
    - Data parallelism
    - Hybrid parallelism
  - Distributed data mining
    - Voting
    - Meta-learning, etc.

Need for parallel formulations
- Need to handle very large datasets.
- Memory limitations of sequential computers cause sequential algorithms to make multiple expensive I/O passes over data.
- Need for scalable, efficient (fast) data mining computations
  - gain competitive advantage
  - handle larger data for greater accuracy in shorter times


Parallel design space
- Parallel architectures
  - Distributed memory
  - Shared disk
  - Shared memory
  - Hybrid cluster of SMPs
- Task and data parallelism
- Static and dynamic load balancing
- Horizontal and vertical data layout
- Data and concept partitioning

Parallel hardware
- Distributed-memory machines
  - Each processor has local memory and disk
  - Communication via message-passing
  - Hard to program: explicit data distribution
  - Goal: minimize communication
- Shared-memory machines
  - Shared global address space and disk
  - Communication via shared memory variables
  - Ease of programming
  - Goal: maximize locality, minimize false sharing
- Current trend: cluster of SMPs


Distributed memory architecture (shared nothing)
[Figure: processors with local memory and disks connected by an interconnection network.]

DMM: shared disk architecture
[Figure: processors connected through an interconnection network to shared disks.]


Shared memory architecture (shared everything)


Task vs. data parallelism
- Data parallelism
  - Data partitioned among P processors
  - Each processor performs the same work on its local partition
- Task parallelism
  - Each processor performs a different computation
  - Data may be (selectively) replicated or partitioned

Task vs. data parallelism
[Figure: a uniprocessor computation contrasted with task parallelism (P0, P1 perform different computations) and data parallelism (P0-P2 each process a data partition).]


Static vs. dynamic load balance
- Static load balancing
  - Work is initially divided (heuristically)
  - No subsequent computation or data movement
- Dynamic load balancing
  - Steal work from heavily loaded processors
  - Reassign to lightly loaded processors
  - Important for irregular computations, heterogeneous environments, etc.

Horizontal/vertical data format
[Figure: the same example records (tid, Refund, Marital Status, Taxable Income, Cheat) laid out horizontally (one record per row) and vertically (one attribute list per column).]
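The horizontal-to-vertical conversion the slide illustrates can be sketched directly; the record and attribute names below are illustrative, not the slide's exact table:

```python
# Horizontal layout: one record per row (field names illustrative)
horizontal = [
    {"tid": 1, "refund": "Yes", "marital": "Single", "income": 125},
    {"tid": 2, "refund": "No", "marital": "Married", "income": 100},
    {"tid": 3, "refund": "No", "marital": "Single", "income": 70},
]

def to_vertical(records, key="tid"):
    """Vertical layout: one (record id, value) list per attribute, the form
    used by attribute-list algorithms such as SPRINT."""
    vertical = {}
    for rec in records:
        for attr, val in rec.items():
            if attr != key:
                vertical.setdefault(attr, []).append((rec[key], val))
    return vertical

cols = to_vertical(horizontal)
# cols["income"] == [(1, 125), (2, 100), (3, 70)]
```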


Data and concept partitioning
- Shared
  - SMP or shared disk architectures
- Replicated
  - Partially or totally
- Partitioned
  - Round-robin partitioning
  - Hash partitioning
  - Range partitioning

Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary


    Classification learning

- Training set: a set of examples, where each example is a feature vector (i.e., a set of (attribute, value) pairs) with its associated class. The model is built on this set.

- Test set: a set of examples disjoint from the training set, used for testing the accuracy of a model.


Classification models
- Some models are better than others
  - Accuracy
  - Understandability
- Models range from easy to understand to incomprehensible
  - Decision trees (easier)
  - Rule induction
  - Regression models
  - Neural networks (harder)


Decision tree classification
[Figure: a decision tree with splitting attributes Refund (Yes/No), Marital Status, and Taxable Income (< 80K vs. >= 80K).]
The splitting attribute at a node is determined based on the Gini index.

From tree to rules
[Figure: the same decision tree on Refund, Marital Status (MarSt), and Taxable Income (TaxInc).]
1) Refund = Yes => No
2) Refund = No and MarSt in {Single, Divorced} and TaxInc < 80K => No
3) Refund = No and MarSt in {Single, Divorced} and TaxInc >= 80K => Yes
4) Refund = No and MarSt in {Married} => No


Classification algorithm
- Build tree
  - Start with data at the root node
  - Select an attribute and formulate a logical test on the attribute
  - Branch on each outcome of the test, and move the subset of examples satisfying that outcome to the corresponding child node

Classification algorithm
  - Recurse on each child node
  - Repeat until leaves are "pure", i.e., have examples from a single class, or "nearly pure", i.e., the majority of examples are from the same class
- Prune tree
  - Remove subtrees that do not improve classification accuracy
  - Avoid over-fitting, i.e., training-set-specific artifacts
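The build/recurse loop described above can be sketched as a minimal sequential tree builder (greedy, numeric attributes only, with a depth cap standing in for pruning; an illustration, not the tutorial's exact algorithm):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_i p_i^2 over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Exhaustively try (attribute, threshold) pairs; return the pair with
    the minimum size-weighted Gini (attributes assumed numeric here)."""
    n, best = len(labels), (None, None, float("inf"))
    for attr in range(len(rows[0])):
        for value in sorted({r[attr] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[attr] < value]
            right = [y for r, y in zip(rows, labels) if r[attr] >= value]
            if not left or not right:
                continue
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best[2]:
                best = (attr, value, score)
    return best[0], best[1]

def build_tree(rows, labels, depth=0, max_depth=3):
    # Leaf when pure, or when the depth cap (a crude stand-in for pruning) hits
    if len(set(labels)) == 1 or depth == max_depth:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    attr, value = best_split(rows, labels)
    if attr is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    lt = [(r, y) for r, y in zip(rows, labels) if r[attr] < value]
    ge = [(r, y) for r, y in zip(rows, labels) if r[attr] >= value]
    return {"test": (attr, value),
            "lt": build_tree([r for r, _ in lt], [y for _, y in lt], depth + 1, max_depth),
            "ge": build_tree([r for r, _ in ge], [y for _, y in ge], depth + 1, max_depth)}

tree = build_tree([(1,), (2,), (8,), (9,)], ["a", "a", "b", "b"])
# splits attribute 0 at threshold 8, giving two pure leaves
```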


Build tree
- Evaluate split points for all attributes
- Select the "best" point and the "winning" attribute
- Split the data into two
- Breadth/depth-first construction
- CRITICAL STEPS:
  - Formulation of good split tests
  - Selection measure for attributes

How to capture good splits?
- Occam's razor: prefer the simplest hypothesis that fits the data
- Minimum message/description length
  - Dataset D
  - Hypotheses H1, H2, ..., Hx describing D
  - MML(Hi) = Mlength(Hi) + Mlength(D|Hi)
  - Pick Hk with minimum MML
- Mlength given by Gini index, Gain, etc.


Tree pruning using MDL
- Data encoding: sum classification errors
- Model encoding:
  - Encode the tree structure
  - Encode the split points
- Pruning: choose the smallest-length option
  - Convert to leaf
  - Prune left or right child
  - Do nothing

Hunt's method
- Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income
- Class: Cheat, Don't Cheat


What's really happening?
[Figure: scatter plot of the examples over Marital Status and Income, showing how the tests "Married?", "Income < 80K", and the Cheat/Don't Cheat labels carve up the data.]

Finding good split points
- Use the Gini index for partition purity

  Gini(S) = 1 - sum_{i=1..c} p_i^2

  Gini(S1, S2) = (n1/n) Gini(S1) + (n2/n) Gini(S2)

- If S is pure, Gini(S) = 0
- Find the split point with minimum Gini
- Only need class distributions
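The "only need class distributions" remark can be made concrete: both Gini(S) and the weighted split Gini are computable from per-partition class counts alone, without revisiting the records. A small sketch with toy counts:

```python
def gini_from_counts(counts):
    """Gini(S) = 1 - sum_i p_i^2, computed from class counts alone."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(count_matrix):
    """Weighted Gini(S1, S2, ...): one row of class counts per partition."""
    n = sum(sum(row) for row in count_matrix)
    return sum(sum(row) / n * gini_from_counts(row) for row in count_matrix)

assert gini_from_counts([5, 0]) == 0.0   # pure partition
assert gini_from_counts([5, 5]) == 0.5   # 50/50 partition
# A split that separates the classes well scores lower (better):
assert gini_split([[9, 1], [1, 9]]) < gini_split([[5, 5], [5, 5]])
```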


Finding good split points
[Figure: two candidate splits of the same data on the Marital Status/Income scatter, with their class-count tables; the first gives Gini(split) = 0.31 and the second Gini(split) = 0.34, so the first is preferred.]

Categorical attributes: computing Gini index
- For each distinct value, gather counts for each class in the dataset
- Use the count matrix to make decisions
- Multi-way split, or two-way split (find the best partition of values)
[Figure: count matrices for a CarType attribute under multi-way and two-way splits.]


Continuous attributes: computing Gini index
- For efficient computation, for each attribute:
  - Sort the attribute on values
  - Linearly scan these values, each time updating the count matrix and computing the Gini index
  - Choose the split position that has the least Gini index
[Figure: sorted values with candidate split positions and the running count matrix.]
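The sort-then-scan procedure is efficient because the count matrix can be updated incrementally: as the scan passes each record, one class count moves from the right partition to the left, so every candidate threshold costs O(1) to evaluate after the initial sort. A sketch with hypothetical data:

```python
from collections import Counter

def best_continuous_split(values, labels):
    """Sort once, then scan linearly; each step moves one record's class count
    from the right partition to the left, so every candidate threshold is
    evaluated in O(1). Returns (threshold, weighted Gini)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    left, right = Counter(), Counter(labels)
    n = len(values)
    gini = lambda counts, m: 1.0 - sum((c / m) ** 2 for c in counts.values())
    best = (float("inf"), None)
    for k in range(1, n):
        i = order[k - 1]
        left[labels[i]] += 1
        right[labels[i]] -= 1
        if values[order[k]] == values[i]:
            continue  # cannot split between equal values
        score = k / n * gini(left, k) + (n - k) / n * gini(right, n - k)
        if score < best[0]:
            best = (score, (values[i] + values[order[k]]) / 2)
    return best[1], best[0]

# Hypothetical income/class data: a clean break between 75 and 85
threshold, score = best_continuous_split(
    [60, 70, 75, 85, 90, 95], ["No", "No", "No", "Yes", "Yes", "Yes"])
# threshold == 80.0 and score == 0.0 (a perfect split)
```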

C4.5
- Simple depth-first construction
- Sorts continuous attributes at each node
- Needs the entire data to fit in memory
- Unsuitable for large datasets
  - Needs out-of-core sorting
- Classification accuracy shown to improve when entire datasets are used!


SPRINT [Shafer, Agrawal, Mehta]
[Figure: attribute lists — one (value, rid, class) list per attribute.]

SPRINT (contd.)
- The arrays of the continuous attributes are pre-sorted.
  - The sorted order is maintained during each split.
- The classification tree is grown in a breadth-first fashion.
- Class information is clubbed with each attribute list.
- Attribute lists are physically split among nodes.
- The split determining phase is just a linear scan of the lists at each node.
- A hashing scheme is used in the splitting phase.
  - tids of the splitting attribute are hashed with the tree node as the key (lookup table).
  - remaining attribute arrays are split by querying this hash structure.


SPRINT disadvantages
- Size of the hash table is O(N) for the top levels of the tree.
- If the hash table does not fit in memory (mostly true for large datasets), then build it in parts so that each part fits.
  - Multiple expensive I/O passes over the entire dataset.

Constructing a decision tree in parallel
[Figure: the training data (m categorical attributes, Good/Bad classes) partitioned across processors.]
- Partitioning of data only
  - only a global reduction per node is required
  - a large number of classification tree nodes gives high communication cost


Constructing a decision tree in parallel
[Figure: 10,000 training records split across child tree nodes.]
- Partitioning of classification tree nodes
  - natural concurrency
  - load imbalance as the amount of work associated with each node varies
  - child nodes use the same data as used by the parent node
    - loss of locality
    - high data movement cost

Challenges in constructing a parallel classifier
- Partitioning of data only
  - a large number of classification tree nodes gives high communication cost
- Partitioning of classification tree nodes
  - natural concurrency
  - load imbalance as the amount of work associated with each node varies
  - child nodes use the same data as used by the parent node
    - loss of locality
    - high data movement cost
- How do we efficiently perform the computation in parallel?


Synchronous tree construction approach
[Figure: data partitioned across Proc 0-3; class distribution information is exchanged at every node.]
+ no data movement is required
- load imbalance
  - can be eliminated by breadth-first expansion
- high communication cost
  - becomes too high in the lower parts of the tree

Partitioned tree construction approach
[Figure: processors Proc 0-3 are recursively divided among subtrees, along with the data and nodes.]
+ highly concurrent
- high communication cost due to excessive data movements
- load imbalance


Hybrid parallel formulation
[Figure: synchronous tree construction is used for the upper tree; at the computation frontier (here at depth 3), processors and nodes are split into Partition 1 and Partition 2 and the partitioned tree construction approach takes over.]

Load balancing
[Figure: Step 1 — exchange data between processor partitions; Step 2 — load balance within processor partitions.]


Speedup comparison of the three parallel algorithms
[Figure: speedup vs. number of processors (up to 16) for 0.8 million and 1.6 million examples; partitioned, hybrid, and synchronous curves.]

Splitting criterion verification in the hybrid algorithm

  Splitting Criterion Ratio = Communication Cost / (Moving Cost + Load Balancing)

[Figure: number of splittings at different values of the splitting criterion ratio, for 0.8 million examples on 8 processors and 1.6 million examples on 16 processors.]


Speedup of the hybrid algorithm with different size data sets
[Figure: speedup curves vs. number of processors for 0.8, 1.6, 3.2, 6.4, 12.8, and 25.6 million examples.]

Scaleup of the hybrid algorithm
[Figure: run times of the algorithm for 50K examples at each processor, as the number of processors grows from 1 to 8.]


Summary of algorithms for categorical attributes
- Synchronous tree construction approach
  - no data movement required
  - high communication cost as the tree becomes bushy
- Partitioned tree construction approach
  - processors work independently once partitioned completely
  - load imbalance and high cost of data movement
- Hybrid algorithm
  - combines the good features of the two approaches
  - adapts dynamically according to the size and shape of the trees

Handling continuous attributes
- Sort continuous attributes at each node of the tree (as in C4.5). Expensive, hence undesirable!
- Discretize continuous attributes
  - CLOUDS (Alsabti, Ranka, and Singh, 1998)
  - SPEC (Srivastava, Han, Kumar, and Singh, 1997)
- Use a pre-sorted list for each continuous attribute
  - SPRINT (Shafer, Agrawal, and Mehta, VLDB'96)
  - ScalParC (Joshi, Karypis, and Kumar, IPPS'98)


Design approach
- Goal: scalability in both runtime and memory requirements.
- Parallelization overhead: To = p*Tp - Ts
- To must be O(Ts) for scalability. Per-processor overhead should not exceed O(Ts/p).
- Two components of To:
  - Load imbalance.
  - Communication time.
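The overhead relation To = p*Tp - Ts ties directly to parallel efficiency, E = Ts / (p*Tp) = Ts / (Ts + To); keeping To within O(Ts) therefore keeps efficiency bounded. A small numeric check with made-up timings:

```python
def overhead(p, t_par, t_seq):
    """Parallelization overhead To = p * Tp - Ts."""
    return p * t_par - t_seq

def efficiency(p, t_par, t_seq):
    """E = Ts / (p * Tp) = Ts / (Ts + To)."""
    return t_seq / (p * t_par)

# Hypothetical timings: Ts = 100 s sequential, Tp = 15 s on p = 8 processors
t_o = overhead(8, 15.0, 100.0)    # 20 s of load imbalance + communication
e = efficiency(8, 15.0, 100.0)    # 100 / 120
```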

Load balancing
[Figure: attribute lists B1 and B2 ((value, rid, cid) triples, where rid = record id, cid = class label) distributed across processors P0 and P1, before and after a split at split point SP.]


Parallel SPRINT
- Attribute lists are partitioned among processors
  - Each processor gets N/p records of each attribute list
- Attribute lists of continuous attributes are pre-sorted
- Split determining phase
  - Categorical attributes: local construction of count matrices, followed by a reduction to add them up
  - Continuous attributes: prefix-scan, followed by local Gini computations, followed by a Gini index reduction to find the best split point

Parallelizing the split determination phase
- Easy.
[Figure: categorical attribute (CarType) — local count matrices are combined by a parallel reduction into a global count matrix; continuous attribute (Age) — a parallel prefix-scan produces the running class counts needed for the global count matrices.]


Getting the required entries of the hash table
- The required entries are transferred in two steps
  - From splitting-attribute order to tid-sorted order
  - From tid-sorted order to attribute-list order

This design is inspired by the communication structure of parallel sparse matrix-vector algorithms.
[Figure: a 9x9 sparse matrix split across P0-P2; X = Salary entries, O = Age entries, diagonal = node table entries.]


Parallel hashing paradigm
- Distributed hash table.
- Hash function: h(k) = (Pi, l)
- Construction:
  - (k, v) -> (Pi, l) -> form send buffers {(l, v)} for each Pi -> all-to-all personalized communication.
- Enquiry:
  - (k) -> (Pi, l) -> form {(l)} buffers for each Pi -> all-to-all personalized communication -> each Pi replaces the received {(l)} with {(v)} -> all-to-all personalized communication.
- If each processor has m keys to hash, then the time is O(m) if m is Omega(p); i.e., Omega(p^2) overall.
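A serial sketch of the h(k) = (Pi, l) construction step: each key maps to an owning processor and a local slot, and (key, value) pairs are grouped into per-processor send buffers that an all-to-all exchange would deliver. The processor count and hash function here are illustrative:

```python
P = 4  # number of processors (illustrative)

def h(key, slots_per_proc=1024):
    """h(k) = (Pi, l): owning processor and local slot for a key."""
    return key % P, (key // P) % slots_per_proc

def build_send_buffers(pairs):
    """Construction step: route each (key, value) to its owner's buffer of
    (local slot, value) entries; in the parallel version these buffers are
    exchanged with one all-to-all personalized communication."""
    buffers = {p: [] for p in range(P)}
    for key, value in pairs:
        owner, slot = h(key)
        buffers[owner].append((slot, value))
    return buffers

bufs = build_send_buffers([(0, "a"), (5, "b"), (6, "c"), (13, "d")])
# keys 5 and 13 both belong to processor 1, at local slots 1 and 3
```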

Applying the paradigm to ScalParC: update
[Figure: Salary and Age attribute lists and the node table, distributed over P0-P2; after the split on Salary, hash buffers of (rid, Left/Right) entries are exchanged all-to-all to update the node table, which is then queried to split the Age list.]


    A Worst-Case Scenario for Updating the Hash Table

     One processor might need to send O(kN/p) data! Happens infrequently.

    [Figure: attribute lists B1 and B2 partitioned across P0 and P1, before and after the split, illustrating the skewed communication.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 87

    Categorical/Continuous and Continuous/Categorical

     Special cases of the Continuous/Continuous case
     - Categorical/Continuous does not require communication for loading the hash table
     - Continuous/Categorical does not require communication for inquiring the hash table

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 88


    Algorithm: Level-wise Communications

     Tree is built in a breadth-first manner.
     At each level of the decision tree
     - Count matrices for all attributes for all nodes are reduced in one single communication operation
     - Loading the hash table for all the nodes is combined into a single communication operation
     - Inquiring the hash table to split a particular attribute list is combined in a single communication operation
     - A total of k + 1 all-to-all personalized communication operations are performed for each level of the tree

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 89

    Structure of the Algorithm

     Sort continuous attributes (Pre-Sort)
     do level-wise while (at least one node requiring split)
     - Compute count matrices (Find-Split I)
     - Compute best gini index for nodes requiring split (Find-Split II)
     - Partition splitting attributes and update Node Table (Perform-Split I)
     - Partition non-splitting attributes by inquiring Node Table (Perform-Split II)
     end do

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 90
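The gini computation that the Find-Split phases perform from the count matrices can be sketched as follows. This is a minimal illustration of the standard gini split criterion, not ScalParC's actual code; the function names are assumptions.

```python
# For a candidate split, the gini of a node is 1 - sum(p_c^2) over
# class probabilities p_c, and the split score is the size-weighted
# average of the two children's gini values (lower is better).

def gini(counts):
    """Gini index of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(left_counts, right_counts):
    """Weighted gini of a binary split, from the two count-matrix rows."""
    nl, nr = sum(left_counts), sum(right_counts)
    n = nl + nr
    return (nl / n) * gini(left_counts) + (nr / n) * gini(right_counts)

print(split_gini([4, 0], [0, 4]))  # a pure split scores 0.0
print(round(gini([2, 2]), 2))      # an even two-class mix scores 0.5
```

Find-Split II simply evaluates this score for every candidate split point from the reduced (global) count matrices and keeps the minimum.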


    Example

    [Figure: the Salary and Age attribute lists with local and global count matrices, and the Node Table, traced through one level of the parallel split across P0, P1, P2.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 91

    Experimental Results

     Experiments were performed on a 128-processor Cray T3D
     Training sets were synthetically generated
     - Each contained only continuous attributes (5-9)
     - There were two possible class labels

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 92


    Parallel Runtime

    [Figure: parallel runtime plot.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 93

    Constant Size/Processor

    [Figure: scaled speedup plot with constant training-set size per processor.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 94


    Different Number of Attributes

    [Figure: runtimes for varying numbers of attributes.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 95

    SMP Parallel Design Space

     Data parallelism: within a tree node
     - Split attributes: vertical (BASIC, Fixed Window, Moving Window)
     - Split data (attribute lists): horizontal
     Task parallelism: between tree nodes
     - SUBTREE
     Static vs. dynamic load balancing

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 96


    SPRINT: Attribute Lists

    [Figure: training set (Tid, Age, CarType, Class) and its attribute lists — the continuous Age list kept sorted, the categorical CarType list in original order, each entry carrying its class and tid.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 97

    Splitting Attribute Lists

    [Figure: the attribute lists for node 0 are split by the test Age < 27.5 into lists for nodes 1 and 2; a hash table on tid records each record's child, and the non-splitting CarType list is partitioned by probing it.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 98


    SPRINT: File per Attribute

    [Figure: with one file per attribute per leaf, the total is 32 files per attribute.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 99

    Optimized File Usage

    [Figure: reusing files across leaves reduces the total to 4 files per attribute.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 100


    BASIC: Data Parallel

     Given current leaf frontier
     Atomically acquire an attribute and evaluate it for all leaves
     Master finds winning attribute and constructs hash table
     Atomically acquire an attribute and split its list for all leaves
     Barrier synchronization between phases

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 101

    BASIC (Tree View)

    [Figure: all processors P={0,1,2,3} cooperate on every tree node, with a barrier between levels.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 102


    BASIC (Level View, A=3, P=4)

    [Figure: with 3 attributes and 4 processors, P0, P1, P2 each acquire 4 attribute/leaf tasks while P3 gets none.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 103

    Fixed Window: Data Parallel

     Partition leaves into blocks/windows of K
     Dynamically acquire any attribute for any leaf within current block and evaluate it
     Last processor to work on a leaf notes the winning attribute and builds the hash table
     Split data as in BASIC
     Barrier synchronization between each block of K leaves

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 104


    FWK (Tree View)

    [Figure: P={0,1,2,3}; leaves are processed in windows of K with a barrier per window.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 105

    FW2 (Level View)

    [Figure: window of 2 leaves; P0 and P1 acquire 4 tasks each, P2 and P3 two each.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 106


    Moving Window: Data Parallel

     Partition leaves into blocks/windows of K
     Dynamically acquire any attribute for any leaf, say i, within current block
     Wait for last block's i-th leaf
     Last processor to work on a leaf notes the winning attribute and builds the hash table
     Split data as in BASIC
     No barriers, only conditional wait

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 107

    MWK (Tree Level)

    [Figure: P={0,1,2,3}; leaf-level conditional synchronization replaces the per-window barrier.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 108


    Experimental Datasets

    [Table: six synthetic datasets named by function, attribute count, and size — F2A8D1M, F2A32D250K, F2A64D125K and F7A8D1M, F7A32D250K, F7A64D125K — with 1000K/250K/125K records respectively, each 192MB; the F2 datasets yield small trees (4 levels), the F7 datasets much larger ones.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 113

    Setup and Sort Time

    [Table: per-dataset setup and sort times; setup and sort are each roughly 17-20% of total time on the F2 datasets, but only 3-4% on the F7 datasets.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 114


    Parallel Performance

    [Figure: run times of MW4 and SUBTREE on F2A64D125K and F7A64D125K for 1-4 processors.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 115

    Parallel Performance

    [Figure: speedups of MW4 and SUBTREE on the F2 and F7 A64D125K datasets for 1-4 processors.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 116


    Tutorial overview

     Overview of KDD and data mining
     Parallel design space
     Classification
     [Associations]
     Sequences
     Clustering
     Future directions and summary

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 117

    What is association mining?

     Given a set of items/attributes, and a set of objects containing a subset of the items
     Find rules: if I1 then I2 (sup, conf)
     - I1, I2 are sets of items
     - I1, I2 have sufficient support: P(I1+I2)
     - Rule has sufficient confidence: P(I2|I1)

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 118


    Association mining

     User specifies "interestingness"
     - Minimum support (minsup)
     - Minimum confidence (minconf)
     Find all frequent itemsets (> minsup)
     - Exponential search space
     - Computation and I/O intensive
     Generate strong rules (> minconf)
     - Relatively cheap

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 119

    Association Rule Discovery: Support and Confidence

    Association Rule: X => Y

    Example: {Diaper, Milk} => Beer

      support s = σ(Diaper, Milk, Beer) / Total Number of Transactions = 2/5 = 0.4
      confidence c = σ(Diaper, Milk, Beer) / σ(Diaper, Milk) = 2/3 ≈ 0.66

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 120
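The support/confidence arithmetic of the Diaper/Milk/Beer example can be checked directly. The five transactions below are an assumption — the slide shows only the counts — chosen so that σ(Diaper, Milk, Beer) = 2 and σ(Diaper, Milk) = 3, matching the example.

```python
# Hypothetical market-basket data consistent with the slide's counts.
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diaper', 'Beer', 'Eggs'},
    {'Milk', 'Diaper', 'Beer', 'Coke'},
    {'Bread', 'Milk', 'Diaper', 'Beer'},
    {'Bread', 'Milk', 'Diaper', 'Coke'},
]

def support(itemset, db):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(lhs, rhs, db):
    """P(rhs | lhs) = support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({'Diaper', 'Milk', 'Beer'}, transactions))                 # -> 0.4
print(round(confidence({'Diaper', 'Milk'}, {'Beer'}, transactions), 2))  # -> 0.67
```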


    Handling Exponential Complexity

     Given n transactions and m different items:
     - number of possible association rules: O(m·2^(m-1))
     - computation complexity: O(n·m·2^m)
     Systematic search for all patterns, based on the support constraint [Agrawal & Srikant]:
     - If {A,B} has support at least α, then both A and B have support at least α.
     - If either A or B has support less than α, then {A,B} has support less than α.
     - Use patterns of k-1 items to find patterns of k items.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 121

    Apriori Principle

     Collect single item counts. Find large items.
     Find candidate pairs, count them => large pairs of items.
     Find candidate triplets, count them => large triplets of items. And so on...
     Guiding Principle: Every subset of a frequent itemset has to be frequent.
     - Used for pruning many candidates.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 122
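The level-wise search with subset-based pruning can be written compactly. This is a teaching sketch, not an efficient implementation (no hash tree); the data is the five-author bookstore example used later in this tutorial, with minimum support 3 of 6 (50%).

```python
from itertools import combinations

def apriori(db, minsup):
    """Level-wise frequent itemset mining: size-k candidates come from
    frequent (k-1)-itemsets, and every (k-1)-subset must be frequent."""
    items = sorted({i for t in db for i in t})
    def count(c):
        return sum(1 for t in db if c <= t)
    frequent = {frozenset([i]) for i in items if count(frozenset([i])) >= minsup}
    all_freq, k = set(frequent), 2
    while frequent:
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if count(c) >= minsup}
        all_freq |= frequent
        k += 1
    return all_freq

db = [{'A','C','T','W'}, {'C','D','W'}, {'A','C','T','W'},
      {'A','C','D','W'}, {'A','C','D','T','W'}, {'C','D','T'}]
print(frozenset('ACTW') in apriori(db, 3))  # -> True
```

With this database and minsup = 3 the maximal frequent itemsets come out as ACTW and CDW, matching the example database slide.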


    Illustrating the Apriori Principle

     Minimum Support = 3

    If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates
    With support-based pruning: 6 + 6 + 2 = 14

    [Figure: the surviving 1-itemsets, 2-itemsets, and 3-itemsets.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 123

    Counting Candidates

     Frequent itemsets are found by counting candidates.
     Simple way:
     - Search for each candidate in each transaction. Expensive!!!

    [Figure: an N x M grid of transactions against candidates.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 124


    Association Rule Discovery: Hash Tree for Fast Access

    [Figure: candidate hash tree; a hash function on the next item directs each itemset to a subtree, with candidates stored at the leaves.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 125

    Association Rule Discovery: Subset Operation

    [Figure: a transaction is recursively hashed down the candidate hash tree so that only candidates in the reached leaves are checked.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 126

    Association Rule Discovery: Subset Operation (contd.)

    [Figure: continuation of the subset operation, reaching leaves containing candidates such as 1 2 4 and 4 5 7.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 127

    Parallel Formulation of Association Rules

     Need:
     - Huge transaction datasets (10s of TB)
     - Large number of candidates
     Data distribution:
     - Partition the transaction database, or
     - Partition the candidates, or
     - Both

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 128


    Parallel Association Rules: Intelligent Data Distribution (IDD)

     Data distribution using point-to-point communication.
     Intelligent partitioning of candidate sets.
     - Partitioning based on the first item of candidates.
     - Bitmap to keep track of local candidate items.
     Pruning at the root of candidate hash tree using the bitmap.
     Suitable for single data source such as a database server.
     With smaller candidate set, load balancing is difficult.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 133

    IDD: Illustration

    [Figure: each processor counts its own candidate partition while the transaction data shifts around the ring; an all-to-all broadcast distributes the candidates.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 134


    Filtering Transactions in IDD

    [Figure: a transaction is filtered through the bitmask at the root so only locally owned candidate subtrees are traversed.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 135

    Parallel Association Rules: Hybrid Distribution (HD)

     Candidate set is partitioned into G groups to just fit in main memory
     - Ensures good load balance with smaller candidate set.
     Logical processor mesh G x P/G is formed.
     Perform IDD along the column processors
     - Data movement among processors is minimized.
     Perform CD along the row processors
     - Global reduction runs over a smaller number of processors.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 136


    HD: Illustration

    [Figure: G groups of P/G processors each; IDD runs down the columns while CD runs along the rows.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 137

    Parallel Association Rules: Experimental Setup

     128-processor Cray T3D
     - 150 MHz DEC Alpha (EV4)
     - 64 MB of main memory per processor
     - 3-D torus interconnection network with peak unidirectional bandwidth of 150 MB/sec.
     MPI used for communications.
     Synthetic data set: avg transaction size 15 and 1000 distinct items.
     For larger data sets, multiple reads of transactions in blocks of 1000.
     HD switches to CD after 90.7% of the total computation is done.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 138


    Parallel Association Rules: Scaleup Results (100K, 0.25%)

    [Figure: scaleup of CD and IDD.]


    Parallel Association Rules: Response Time (np=16, 50K)

    [Figure: response times of CD, HD, IDD, and simple hybrid as minimum support drops from 0.5% to 0.06%; candidate counts grow from 211K to 2408K.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 141

    Parallel Association Rules: Response Time (np=64, 50K)

    [Figure: response times of CD, HD, IDD, and simple hybrid as minimum support drops from 0.5% to 0.04%; candidate counts grow from 211K to 5232K.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 142


    Parallel Association Rules: Minimum Support Reachable

    [Figure: smallest minimum support reachable (0.25% down to 0.02%) for each scheme.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 143

    Parallel Association Rules: Processor Configuration in HD

    [Figure: 64 processors at 0.04 minimum support; response time for different G x P/G mesh configurations.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 144


    Parallel Association Rules: Summary of Experiments

     HD shows the same linear speedup and sizeup behavior as that of CD.
     HD exploits total aggregate main memory, while CD does not.
     IDD has much better scaleup behavior than DD.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 145

    Eclat Approach

     Frequent itemset lattice
     Vertical or "inverted" tid-list format
     Support counting via intersections
     Lattice decomposition: break into subproblems
     Efficient search strategies
     Independent solution of subproblems

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 146


    Example Database

    Distinct database items: A = Jane Austen, C = Agatha Christie, D = Sir Arthur Conan Doyle, T = Mark Twain, W = P. G. Wodehouse

    Database (transaction : items):
     1 : A C T W
     2 : C D W
     3 : A C T W
     4 : A C D W
     5 : A C D T W
     6 : C D T

    All frequent itemsets (min support = 50%):
     100% (6) : C
     83% (5) : W, CW
     67% (4) : A, D, T, AC, AW, CD, CT, ACW
     50% (3) : AT, DW, TW, ACT, ATW, CDW, CTW, ACTW

    Maximal itemsets (min support = 50%): ACTW, CDW

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 147

    Frequent Itemset Lattice

    [Figure: the itemset lattice over ACDTW; support is downward closed on itemsets, and the maximal frequent itemsets ACTW and CDW bound the frequent region.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 148


    Eclat: Support Counting

    [Figure: transaction (tid) lists for the items; intersecting C & D gives the tid-list for CD, and intersecting CD & CW gives the tid-list for CDW.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 149

    Eclat: Lattice Decomposition

    Equivalence classes (by common prefix):
     [A] = { A, AC, AT, AW, ACT, ACW, ATW, ACTW }
     [C] = { C, CD, CT, CW, CDW, CTW }
     [D] = { D, DW }
     [T] = { T, TW }
     [W] = { W }

    Cross-class links are used for pruning.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 150


    Lattice Search Strategies

     Bottom-up
     - Level-wise like Apriori
     Top-down
     - Start with largest element. If frequent, done! Else look at subsets
     Hybrid
     - Find long frequent/maximal itemsets
     - Find remaining using bottom-up search

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 151

    Bottom-up Lattice Search

    [Figure: within class [A], new equivalence classes [AC] = { AC, ACT, ACW, ACTW }, [AT] = { AT, ATW }, [AW] = { AW }; known maximal frequent itemsets: CDW, CTW.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 152


    Count Distribution

    [Figure: processors 0, 1, 2 each hold a partition of the database and repeat per level:]
     Count items => get frequent items
     Form candidate pairs, count pairs => get frequent pairs
     Form candidate triples, count triples => ...

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 155
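Count Distribution is easy to simulate in miniature. This is a hedged sketch: lists stand in for the per-processor partitions, and the final sum stands in for the global reduction (an all-reduce in the real message-passing code).

```python
# Every "processor" counts all candidates over its own partition;
# a sum-reduction then yields exact global supports.

def local_counts(candidates, partition):
    """Support counts of each candidate within one database partition."""
    return [sum(1 for t in partition if c <= t) for c in candidates]

def count_distribution(candidates, partitions):
    per_proc = [local_counts(candidates, p) for p in partitions]
    # the reduction step: sum each candidate's count across processors
    return [sum(col) for col in zip(*per_proc)]

candidates = [frozenset('AC'), frozenset('CD')]
partitions = [[{'A','C','T','W'}, {'C','D','W'}],
              [{'A','C','T','W'}, {'A','C','D','W'}],
              [{'A','C','D','T','W'}, {'C','D','T'}]]
print(count_distribution(candidates, partitions))  # -> [4, 4]
```

Note that every processor holds the full candidate set — which is exactly the memory limitation that IDD and HD address.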

    Parallel Eclat

    [Figure: processors 0, 1, 2; the partitioned database is selectively replicated so each processor holds the tid-lists its assigned classes need.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 156


    Parallel Eclat Algorithm

     Itemset lattice class partitioning
     - Equivalence classes: [A] = { AC, AT, AW }, [C] = { CD, CT, CW }, [D] = { DW }, [T] = { TW }
     Class schedule
     - Processor 1 (P1) is assigned [C] = { CD, CT, CW } and [D] = { DW }; the remaining classes go to Processor 0 (P0)
     Tid-list communication

    [Figure: the original database, the partitioned database, and the lists on P0 and P1 after the tid-list exchange.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 157

    Eclat Experiments

     Database: T20.I6.D4550K
     - 1000 items
     - 4,550,000 transactions
     Machine: eight 4-way SMP nodes
     - Digital's Memory Channel (5 μs, 30MB/s)
     - 233 MHz, 256 MB, 1MB L2 cache
     Hierarchical parallelization
     - Message passing + SMP

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 158


    Parallel Performance

    [Figure: parallel runtime and speedup with 1 process per host, for 1, 2, 4, and 8 hosts.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 159

    Parallel Performance

    [Figure: parallel runtime and speedup with 2 processes per host, for 1, 2, 4, and 8 hosts.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 160


    Tutorial overview

     Overview of KDD and data mining
     Parallel design space
     Classification
     Associations
     [Sequences]
     Clustering
     Future directions and summary

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 161

    Discovering Sequential Associations

    Given: a set of objects with associated event occurrences.

    [Figure: event timelines for Object 1 and Object 2 over times 10-50.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 162


    Sequential Pattern Discovery: Definition

    Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

      { (A B) (C) } => { (D E) }

    Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

    [Figure: the pattern (A B) (C) (D E) annotated with its window-size, gap, and span constraints.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 163

    Sequential Patterns: Complexity

     Much higher computational complexity than association rule discovery.
     - O(m^k · 2^(k-1)) possible sequential patterns having k events, where m is the total number of possible events.
     Time constraints offer some pruning. Further use of support-based pruning contains complexity.
     - A subsequence of a sequence occurs at least as many times as the sequence.
     - A sequence has no more occurrences than any of its subsequences.
     - Build sequences in increasing number of events. [GSP algorithm by Agrawal & Srikant]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 164


    Sequential Apriori: Count Operation (contd.)

     Hash tree used for fast search of candidate occurrences. Similar to association rule discovery, except for the following differences.
     - Every event-timestamp pair in the timeline is hashed at the root.
     - Events eligible for hashing at the next level are determined by the maximum gap (xg), window size (ws), and span (ms) constraints.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 167

    Sequence Hash Tree for Fast Access

    [Figure: candidate sequence hash tree; hashing an object's timeline (events at times 0, 5, 10, 12, 15) yields four interesting paths: 1@0, 2@5, 3@5; 1@0, 3@5; 4@0, 5@5; 1@12, 1@15.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 168


    Counting Leaf-Level Candidates

    [Figure: part of the candidate sequence hash tree; the leaf's candidates are checked against Object 2's timeline, yielding Count = 12.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 169

    Parallel Sequential Associations

     Need:
     - Enormity of data.
     - Memory and disk limitations of serial algorithms running on a single processor.
     Can algorithms for non-sequential associations be extended easily?
     - Sequential nature gives rise to complex issues: longer timelines, large span values, large number of candidates.

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 170


    Parallel Sequential Associations: Event Distribution (EVE)

    [Figure: timelines partitioned across P0, P1, P2; an all-to-all broadcast of support counts combines the local counts.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 171

    EVE Algorithm: Challenging Case

    [Figure: when an occurrence spans processor boundaries, hash tree states and partial counts are transferred between neighboring processors.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 172


    Event and Candidate Distribution (EVECAN)

    [Figure: both events and candidates are partitioned; candidates are rotated among the processors in round-robin manner.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 173

    EVECAN: Parallelizing Candidate Generation

     Candidates stored in a distributed hash table
     Hash function: candidate sequence S => h(S) = (Pi, I)

    [Figure: candidate sequences distributed across processors; maximal frequent sequences: AC->D, AC->TW, C->D->TW.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 176


    Frequent Sequence Lattice

    [Figure: lattice induced by the maximal frequent sequences AC->D, AC->TW, C->D->TW.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 177

    Sequence Lattice Decomposition

    [Figure: the lattice is decomposed into prefix-based classes that can be solved independently.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 178


    Temporal Joins

    [Figure: customer-transaction (cid, time) list intersection — joining the lists of P->X and P->Y yields the lists for P->X->Y, P->Y->X, and P->XY.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 179

    Dynamic Computation Tree

    [Figure: the tree of joins is expanded dynamically as frequent sequences are found.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 180
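The temporal join on (cid, time) lists can be sketched directly. This is an illustrative nested-loop version, not SPADE's optimized merge over sorted lists; the input lists and their times are hypothetical.

```python
# Joining the lists of P->X and P->Y produces P->X->Y (X before Y),
# P->Y->X (Y before X), and P->XY (equal times), in one pass over
# matching cids. Each entry is (cid, time of the last event).

def temporal_join(px, py):
    x_then_y, y_then_x, xy = [], [], []
    for cid, tx in px:
        for c2, ty in py:
            if c2 != cid:
                continue
            if tx < ty:
                x_then_y.append((cid, ty))   # result keeps the later time
            elif ty < tx:
                y_then_x.append((cid, tx))
            else:
                xy.append((cid, tx))          # simultaneous: event set XY
    return x_then_y, y_then_x, xy

px = [(1, 20), (4, 60)]
py = [(1, 70), (4, 10)]
print(temporal_join(px, py))  # -> ([(1, 70)], [(4, 60)], [])
```

The support of each result sequence is the number of distinct cids in its list, so one join simultaneously counts all three extensions.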


    Parallel Design Space

     Data parallelism: within a class
     - Idlist parallelism (single idlist join; level-wise idlist join)
     - Join parallelism
     Task parallelism: between classes
     - pSPADE
     Static vs. dynamic load balancing

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 181

    Idlist Data Parallelism

     Single idlist join
     - Split idlists on cid range
     - Each proc intersects over its cid range
     - Barrier synchronization for each join

    [Figure: parallel idlist intersection with P0 on cids 1-500 and P1 on cids 501-1000.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 182


    Idlist Data Parallelism

     Level-wise idlist join
     - Process all classes at current level
     - Each proc still intersects over its local DB
     - Barrier synchronization and sum-reduction for each level

    [Figure: parallel pair-wise intersections with P0 on cids 1-500 and P1 on cids 501-1000.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 183

    Join Data Parallelism

     Each processor performs different intersections within a class
     Barrier after self-joining within a class, before processing classes at next level
     # synchronizations is between Single and Level-wise parallelism
     Assign each idlist to a different processor

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 184


    Task Parallelism: Static Load Balancing

     Given level-1 classes C1, C2, C3, ...
     Assign weights W1, W2, W3, ... based on the class size (# sequences in the class)
     Greedy scheduling of entire classes
     - Sort classes on weight (descending)
     - Assign class to proc with least total weight
     Each proc solves assigned classes asynchronously
     Hard to get accurate work estimate

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 185
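The greedy scheduling step above can be sketched in a few lines (an illustrative version; the function name and sample weights are assumptions):

```python
# Sort classes by weight descending, then repeatedly hand the next
# class to the currently least-loaded processor.

def greedy_schedule(weights, nprocs):
    load = [0] * nprocs
    assign = [[] for _ in range(nprocs)]
    for cls, w in sorted(enumerate(weights), key=lambda x: -x[1]):
        p = load.index(min(load))   # least-loaded processor
        load[p] += w
        assign[p].append(cls)
    return assign, load

assign, load = greedy_schedule([7, 5, 4, 3, 1], 2)
print(load)  # -> [10, 10]
```

This is the classic longest-processing-time heuristic; its weakness here is exactly the one the slide names — the weights are only estimates of the real per-class work, which motivates the dynamic schemes that follow.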

    Static and Dynamic Load Balancing

    [Figure: comparison of static and dynamic load balancing.]

    High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 186


Task Parallelism: Dynamic Load Balancing
- Given level 1 classes C1, C2, C3, ... and their weights
- Sort classes on weight (descending) and insert in a logical task queue
- A proc atomically grabs the first available class and solves it completely
- Repeat until no more classes in queue
- Uses only inter-class parallelism
- Queue contention negligible since classes are coarse-grained
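A sketch of the shared task queue, using Python's thread-safe `queue.Queue` to model the atomic grab; the weights and the stand-in for "solving" a class are illustrative:

```python
# Dynamic load balancing sketch: classes sorted on weight go into a
# shared queue; each worker atomically grabs the next available class
# and solves it completely before grabbing another.
import queue
import threading

def run_dynamic(weights, num_procs):
    tasks = queue.Queue()
    for cls in sorted(weights, key=weights.get, reverse=True):
        tasks.put(cls)

    done = [[] for _ in range(num_procs)]

    def worker(pid):
        while True:
            try:
                cls = tasks.get_nowait()   # atomic grab from shared queue
            except queue.Empty:
                return                     # no more classes
            done[pid].append(cls)          # stand-in for mining the class

    threads = [threading.Thread(target=worker, args=(p,))
               for p in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

done = run_dynamic({"C1": 90, "C2": 50, "C3": 40}, 2)
all_classes = sorted(c for lst in done for c in lst)
```

Because classes are coarse-grained, workers touch the queue rarely, so contention stays negligible.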

Task Parallelism: Recursive DLB (pSPADE)
- Process available classes in parallel
- Worst case: P-1 procs free, 1 proc busy
- Classes sorted on weight, last class usually small (but it could be large)
- Provide mechanism for free procs to join busy group
- At each level, get free procs, insert the classes into a shared task queue; process available classes in parallel


pSPADE: Recursive Dynamic Load Balancing
(figure)

Experimental Setup
- SGI Origin 2000
  - 12 195MHz R10K MIPS processors
  - 4MB cache, 2GB memory
- Synthetic datasets
  - C: number of transactions/customer
  - T: average transaction size
  - S/I: average sequence/itemset size
  - D: number of customers


Data vs. Task Parallelism
(figure: execution times on 1, 2, and 4 processors for C10-T5-S4-I1.25-D1M and C20-T5-S8-I1.25-D1M, comparing Data and Task parallelism)

Static vs. Dynamic DLB
(figure: execution times for Static, Dynamic, and Recursive on C10-T5-S4-I1.25, C20-T5-S8-I1.25, and C20-T5-S8-I2)
- Dynamic is up to 38% better than Static
- Recursive is up to 12% better than Dynamic
- Overall, Recursive is 44% better than Static


Parallel Performance
(figure: execution time and speedup on 1, 2, 4, and 8 processors for C10-T5-S4-I1.25-D1M)

Parallel Performance
(figure: execution time and speedup on 1, 2, 4, and 8 processors for C20-T5-S8-I2-D1M)


Scaleup and Support
(figure: execution time as the number of customers grows from 1M to 10M for C5-T2.5-S4-I1.25, and as minimum support drops from 0.10% to 0.01% for C5-T2.5-S4-I1.25-D1M)

pSPADE Summary
- Task parallel better than data parallel
- Dynamic load balancing
- Asynchronous algorithms (independent classes)
- Good locality (uses only intersections)
- Good speedup and scaleup
- What next? Gigabyte databases


Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary

What is clustering?
- Given N k-dimensional feature vectors, find a "meaningful" partition of the N examples into c subsets or groups
- Discover the "labels" automatically
- c may be given, or "discovered"
- Much more difficult than classification, since in the latter the groups are given and we seek a compact description


Clustering schemes
- Distance-based
  - Numeric: Euclidean distance (root of sum of squared differences along each dimension); angle between two vectors
  - Categorical: number of common features
- Partition-based
  - Enumerate partitions and score each

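The distance and similarity measures above can be sketched as follows; the example vectors are illustrative:

```python
# Distance/similarity measures for clustering.
import math

def euclidean(x, y):
    # Root of sum of squared differences along each dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_angle(x, y):
    # Angle (in radians) between two vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(dot / norm)

def common_features(x, y):
    # Categorical similarity: number of matching features.
    return sum(1 for a, b in zip(x, y) if a == b)

d = euclidean([0, 0], [3, 4])                            # 3-4-5 triangle
theta = cosine_angle([1, 0], [0, 1])                     # orthogonal vectors
k = common_features(["red", "suv"], ["red", "sedan"])    # one shared feature
```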

Clustering schemes
- Model-based
  - Estimate a density (e.g., a mixture of gaussians)
  - Go bump-hunting
  - Compute P(FeatureVector_i | Cluster_j)
  - Finds overlapping clusters too
  - Example: bayesian clustering


Before clustering
- Normalization:
  - Given three attributes: A in microseconds, B in milliseconds, C in seconds
  - Can't treat differences as the same in all dimensions or attributes
  - Need to scale or normalize for comparison
  - Can assign weights for more importance
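One common way to do such scaling is min-max normalization; this sketch (the data and the optional per-attribute weights are made up) rescales each attribute to [0, 1] so that microseconds, milliseconds, and seconds become comparable:

```python
# Min-max normalization sketch: rescale each attribute (column) to
# [0, 1]; optional weights express per-attribute importance.
# Assumes each attribute is non-constant (hi != lo).

def normalize(rows, weights=None):
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    weights = weights or [1.0] * len(cols)
    return [[weights[j] * (v - lo[j]) / (hi[j] - lo[j])
             for j, v in enumerate(row)]
            for row in rows]

rows = [[100.0, 2.0, 0.5],    # A (us), B (ms), C (s)
        [300.0, 4.0, 1.5],
        [500.0, 6.0, 2.5]]
norm = normalize(rows)        # every attribute now spans [0, 1]
```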

The k-means algorithm
- Specify 'k', the number of clusters
- Guess 'k' seed cluster centers
- 1) Look at each example and assign it to the center that is closest
- 2) Recalculate the center
- Iterate on steps 1 and 2 till centers converge, or for a fixed number of times
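The two steps above, iterated to convergence, can be sketched as follows; the 1-D points and seeds are illustrative:

```python
# Minimal k-means sketch following the steps above.

def kmeans(points, seeds, max_iters=100):
    centers = list(seeds)
    for _ in range(max_iters):
        # 1) Assign each example to the closest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # 2) Recalculate each center as the mean of its cluster.
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:        # centers converged
            return centers, clusters
        centers = new
    return centers, clusters

centers, clusters = kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
                           seeds=[0.0, 5.0])
```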


K-means algorithm
(figure: data points with the initial seeds marked)

K-means algorithm
(figure: points assigned to the nearest seed and new centers computed)


Parallel k-means
- Divide N points among P processors
- Replicate the k centroids
- Each processor computes distance of each local point to the centroids
- Assign points to closest centroid and compute local MSE
- Perform reduction for global centroids and global MSE value
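The scheme above, simulated sequentially for one iteration: each partition plays the role of one processor's local points, and the sum-reduction combines local sums, counts, and squared error. The 1-D data and seed centroids are illustrative:

```python
# One iteration of parallel k-means, simulated. On a real machine the
# local steps run concurrently and the reduction is a collective op.

def local_step(points, centroids):
    k = len(centroids)
    sums, counts, sse = [0.0] * k, [0] * k, 0.0
    for p in points:
        i = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
        sums[i] += p               # local partial sum per centroid
        counts[i] += 1             # local count per centroid
        sse += (p - centroids[i]) ** 2
    return sums, counts, sse

def parallel_iteration(partitions, centroids):
    # Each partition = one processor's local data (run "in parallel").
    locals_ = [local_step(part, centroids) for part in partitions]
    k = len(centroids)
    # Sum-reduction across processors for global centroids and MSE.
    sums = [sum(l[0][i] for l in locals_) for i in range(k)]
    counts = [sum(l[1][i] for l in locals_) for i in range(k)]
    mse = sum(l[2] for l in locals_) / sum(counts)
    return [sums[i] / counts[i] for i in range(k)], mse

parts = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]   # P = 2 processors
centroids, mse = parallel_iteration(parts, [0.0, 9.0])
```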

Gaussian mixture models
- Drawbacks of k-means
  - Doesn't do well with overlapping clusters
  - Clusters easily pulled off center by outliers
  - Each record is either in or out of a cluster; no notion of some records being more or less likely than others to really belong to the cluster they have been assigned
- GMM: probabilistic variant of k-means


Expectation-maximization (EM)
- Choose K seeds: means of a gaussian distribution
- Estimation: calculate probability of belonging to a cluster based on distance
- Maximization: move mean of gaussian to centroid of data set, weighted by the contribution of each point
- Repeat till means don't move
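A sketch of these two steps for a 1-D mixture of two gaussians; the data, seeds, and fixed equal variances are assumptions for illustration (a full GMM would also update variances and mixing weights):

```python
# EM sketch for a 1-D gaussian mixture with fixed, equal variances.
import math

def em(points, means, var=1.0, iters=50):
    means = list(means)
    for _ in range(iters):
        # Estimation: soft membership of each point in each cluster.
        resp = []
        for p in points:
            dens = [math.exp(-(p - m) ** 2 / (2 * var)) for m in means]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # Maximization: move each mean to the contribution-weighted
        # centroid of the data.
        for j in range(len(means)):
            w = sum(r[j] for r in resp)
            means[j] = sum(r[j] * p for r, p in zip(resp, points)) / w
    return means

means = em([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], means=[0.0, 5.0])
# The means settle near the two groups (about 2 and 11), but every
# point keeps a (possibly tiny) probability of belonging to each cluster.
```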

Deviation/outlier detection
- Find points that are very different from the other points in the dataset
- Could be "noise" that causes problems for classification or clustering
- Could be the really "interesting" points; for example, in fraud detection we are mainly interested in finding the deviations from the norm


Deviation detection
(figure: data clusters with a single outlier point far from the rest)

K-nearest neighbors
- Classification technique to assign a class to a new example
- Find k-nearest neighbors, i.e., most similar points in the dataset (compare against all points!)
- Assign the new case to the same class to which most of its neighbors belong

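The procedure can be sketched as follows; the tiny 2-D dataset and query point are illustrative:

```python
# k-NN sketch: compare the new example against all points, take the
# k most similar, and let them vote on the class.
from collections import Counter

def knn_classify(data, query, k=3):
    # data: list of ((x, y), label); squared Euclidean distance
    # suffices since we only need the ordering.
    by_dist = sorted(data, key=lambda d: (d[0][0] - query[0]) ** 2 +
                                         (d[0][1] - query[1]) ** 2)
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

data = [((0, 0), "o"), ((1, 0), "o"), ((0, 1), "o"),
        ((5, 5), "+"), ((6, 5), "+")]
label = knn_classify(data, query=(0.5, 0.5), k=3)
```

Note the full scan over the dataset per query: this is what makes naive k-NN expensive on large data.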

K-nearest neighbors
(figure: the new point's neighborhood contains 5 examples of class o and 3 of class +, so it is assigned class o)

Tutorial overview
- Overview of KDD and data mining
- Parallel design space
- Classification
- Associations
- Sequences
- Clustering
- Future directions and summary


Large-scale Parallel KDD Systems
- Terabyte-sized datasets
- Centralized or distributed datasets
- Incremental changes
- Heterogeneous data sources
- Pre-processing, mining, post-processing
- Modular (rapid development)
- Benchmarking (algorithm comparison)


Research Directions
- Fast algorithms: different mining tasks
  - Classification, clustering, associations, etc.
  - Incorporating concept hierarchies
- Parallelism and scalability
  - Millions of records
  - Thousands of attributes/dimensions
  - Single pass algorithms
  - Sampling
  - Parallel I/O and file systems


Research Directions (contd.)
- Data locality and type
  - Distributed data sources (www)
  - Text and multimedia mining
  - Spatial data mining
- Incremental mining: refine knowledge as data changes
- Interactivity: anytime mining


Research Directions (contd.)
- Tight database integration
  - Push common primitives inside DBMS
  - Use multiple tables
  - Use efficient indexing techniques
  - Caching strategies for sequences of data mining operations
  - Data mining query language and parallel query optimization


Research Directions (contd.)
- Understandability: too many patterns
  - Incorporate background knowledge
  - Integrate constraints
  - Meta-level mining
  - Visualization
- Usability: build a complete system
  - Pre-processing, mining, post-processing, persistent management of mined results


Summary of the Tutorial
- Data mining is a rapidly growing field
  - Fueled by enormous data collection rates, and the need for intelligent analysis for business and scientific gains
- The large and high-dimensional nature of data requires new analysis techniques and algorithms
- Scalable, fast parallel algorithms are becoming indispensable
- Many research and commercial opportunities!!!


Resources
Workshops
- HiPC Special Session on Large-Scale Data Mining, 2000. http://www.cs.rpi.edu/~zaki/LSDM/
- ACM SIGKDD Workshop on Distributed Data Mining, 2000. http://www.eecs.wsu.edu/~hillol/DKD/dpkd2000.html
- 3rd IEEE IPDPS Workshop on High Performance Data Mining, 2000. http://www.cs.rpi.edu/~zaki/HPDM/
- ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems, 1999. http://www.cs.rpi.edu/~zaki/WKDD99/
- ACM SIGKDD Workshop on Distributed Data Mining, 1998. http://www.eecs.wsu.edu/~hillol/DDMWS/papers.html
- 1st IEEE IPPS Workshop on High Performance Data Mining, 1998. http://www.cise.ufl.edu/~ranka/
Books
- A. Freitas and S. Lavington. Mining very large databases with parallel processing. Kluwer Academic Pub., Boston, MA, 1998.
- M. J. Zaki and C.-T. Ho (eds). Large-Scale Parallel Data Mining. LNAI State-of-the-Art Survey, Volume 1759, Springer-Verlag, 2000.
- H. Kargupta and P. Chan (eds). Advances in Distributed and Parallel Knowledge Discovery. AAAI Press, Summer 2000.

Resources (contd.)
Journal Special Issues
- P. Stolorz and R. Musick (eds.). Scalable High-Performance Computing for KDD. Data Mining and Knowledge Discovery: An International Journal, Vol. 1, No. 4, December 1997.
- Y. Guo and R. Grossman (eds.). Scalable Parallel and Distributed Data Mining. Data Mining and Knowledge Discovery: An International Journal, Vol. 3, No. 3, September 1999.
- V. Kumar, S. Ranka and V. Singh. High Performance Data Mining. Journal of Parallel and Distributed Computing, forthcoming, 2000.
Survey Articles
- F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(2):131--169, 1999.
- A. Srivastava, E.-H. Han, V. Kumar and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237--262, 1999.
- M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency special issue on Parallel Data Mining, 7(4):14--25, Oct-Dec 1999.
- D. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, 7(4):26--35, Oct-Dec 1999.
- M. V. Joshi, E.-H. Han, G. Karypis and V. Kumar. Efficient parallel algorithms for mining associations. In Zaki and Ho (eds.), Large-Scale Parallel Data Mining, LNAI 1759, Springer-Verlag 2000.
- M. J. Zaki. Parallel and distributed data mining: An introduction. In Zaki and Ho (eds.), Large-Scale Parallel Data Mining, LNAI 1759, Springer-Verlag 2000.



References: Associations (contd.)
- A. Mueller. Fast sequential and parallel algorithms for association rule mining: A comparison. Technical Report CS-TR-3515, University of Maryland, College Park, August 1995.
- J. S. Park, M. Chen, and P. S. Yu. Efficient parallel data mining for association rules. In ACM Intl. Conf. Information and Knowledge Management, November 1995.
- T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for mining association rules. In 4th Intl. Conf. Parallel and Distributed Info. Systems, December 1996.
- T. Shintani and M. Kitsuregawa. Parallel algorithms for mining generalized association rules with classification hierarchy. In ACM SIGMOD International Conference on Management of Data, May 1998.
- M. Tamura and M. Kitsuregawa. Dynamic load balancing for parallel association rule mining on heterogeneous PC cluster systems. In 25th Int'l Conf. on Very Large Data Bases, September 1999.
- M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4):14--25, October-December 1999.
- M. J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li. Parallel data mining for association rules on shared-memory multi-processors. In Supercomputing'96, November 1996.
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for fast discovery of association rules. Data Mining and Knowledge Discovery: An International Journal, 1(4):343--373, December 1997.
- M. J. Zaki, S. Parthasarathy, and W. Li. A Localized Algorithm for Parallel Association Mining. 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), June 1997.


References: Sequences
- R. Agrawal and R. Srikant. Mining sequential patterns. In 11th Intl. Conf. on Data Engg., 1995.
- H. Mannila, H. Toivonen and I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery: An International Journal, 1(3):259--289, 1997.
- T. Oates, M. D. Schmill, and P. R. Cohen. Parallel and distributed search for structure in multivariate time series. In 9th European Conference on Machine Learning, 1997.
- T. Oates, M. D. Schmill, D. Jensen, and P. R. Cohen. A family of algorithms for finding temporal structure in data. In 6th Intl. Workshop on AI and Statistics, March 1997.
- T. Shintani and M. Kitsuregawa. Mining algorithms for sequential patterns in parallel: Hash based approach. In 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining, April 1998.
- R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In 5th Intl. Conf. Extending Database Technology, March 1998.
- M. J. Zaki. Efficient enumeration of frequent sequences. In 7th Intl. Conf. on Information and Knowledge Management, November 1998.
- M. J. Zaki. Parallel Sequence Mining on SMP Machines. ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems, 1999 (LNAI Vol. 1759).



References: Clustering
- K. Alsabti, S. Ranka, V. Singh. An Efficient K-Means Clustering Algorithm. 1st IPPS Workshop on High Performance Data Mining, March 1998.
- I. Dhillon and D. Modha. A data clustering algorithm on distributed memory machines. In Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
- L. Iyer and J. Aronson. A parallel branch-and-bound algorithm for cluster analysis. Annals of Operations Research, Vol. 90, pp. 65--86, 1999.
- E. Johnson and H. Kargupta. Collective hierarchical clustering from distributed heterogeneous data. In Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
- D. Judd, P. McKinley, and A. Jain. Large-scale parallel data clustering. In Int'l Conf. Pattern Recognition, August 1996.
- X. Li and Z. Fang. Parallel clustering algorithms. Parallel Computing, 11:270--290, 1989.
- C. F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21:1313--1325, 1995.
- S. Ranka and S. Sahni. Clustering on a hypercube multicomputer. IEEE Trans. on Parallel and Distributed Systems, 2(2):129--137, 1991.
- F. Rivera, M. Ismail, and E. Zapata. Parallel squared error clustering on hypercube arrays. Journal of Parallel and Distributed Computing, 8:292--299, 1990.
- G. Rudolph. Parallel clustering on a unidirectional ring. In R. Grebe et al., editor, Transputer Applications and Systems'93: Volume 1, pages 487--493. IOS Press, Amsterdam, 1993.
- H. Nagesh, S. Goil and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report 9906-010, Center for Parallel and Distributed Computing, Northwestern University, June 1999.
- X. Xu, J. Jäger and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery: An International Journal, 3(3):263--290, 1999.


References: Distributed DM
- J. Aronis, V. Kolluri, F. Provost, and B. Buchanan. The WORLD: Knowledge discovery from multiple distributed databases. In Florida Artificial Intelligence Research Symposium, May 1997.
- R. Bhatnagar and S. Srinivasan. Pattern discovery in distributed databases. In AAAI National Conference on Artificial Intelligence, July 1997.