Algorithms for Distributed Supervised and Unsupervised Learning

Haimonti Dutta, The Center for Computational Learning Systems (CCLS)

Columbia University, New York.

About Me
• BCSE from Jadavpur University, Kolkata
• MS from Temple University, PA
• Ph.D. from University of Maryland, Baltimore County
• Research Interests
  – Machine Learning and Data Mining
  – Data Intensive Computing
  – Distributed Data Mining and Optimization
• Website: www1.ccls.columbia.edu/~dutta

The Data Avalanche
• High Energy Physics (CERN)
  – Subatomic particles are accelerated to nearly the speed of light and then collided
  – Measured at time intervals of 1 nanosecond
  – 1 petabyte of data
• Astronomy (SDSS, 2MASS)
  – Telescopes observe galaxies, stars, quasars
  – On the order of 215 million objects, with approximately 100–200 attributes per object
• Genome Sequences (Human Genome Project)
• Advanced Imaging (fMRI, CT scans)
• Simulation Data (climate modeling, earth observation)
• Time Series Data (EEG, ECG, ECoG)

Astronomy Sky Surveys
• Example: the Sloan Digital Sky Survey
  – The telescope observes galaxies, stars, quasars
  – A few hundred attributes for each observed object
  – Data Release 5: 8,000 square degrees, 215 million objects

Sky Surveys
• Surveys and archives: ROSAT, XMM, GALEX, HDFN, FIRST, IRSA, NVSS, UDF, …

Graph, Graph, Everywhere
• Graphs are everywhere: the aspirin molecule, the yeast protein interaction network (H. Jeong et al., Nature 411, 41 (2001)), the Internet, co-author networks
• Slide borrowed from Dr. Jiawei Han’s tutorial on graph algorithms

Centralized versus Distributed Data Mining
• Centralized: databases DB 1 … DB N are copied into a centralized data repository, and the data mining model is built there
• Distributed: data/compute nodes 1 … N carry out a distributed computation that produces the data mining model without centralizing the data
• Problems of centralizing data: (1) communication cost, (2) privacy loss

Issues Unique to DDM
• Communication
  – Machine learning on a central server: no communication cost incurred
  – Distributed mining: communication is not free and is treated as a ‘resource’
• Incomplete knowledge (input data, surrounding network structure, etc.)
• Coping with failures: many things can go wrong!
• Timing and synchrony

Road Map
• Problem Motivation
• Potential Applications
• DDM Basics
  – Data Distribution
  – Communication Protocols
    • Gossip-based communication
    • Randomized Gossip
    • Convergecast, Upcast and Downcast
• Distributed Classification
• Distributed Clustering
• Mining on Large-Scale Systems: Challenges and Open Problems

Data Distribution
• Horizontal Partitioning: Each site has exactly the same set of attributes
  – Example: A departmental store using a standard database schema for its customer base; the same database is maintained at different geographic locations
• Vertical Partitioning: Different attributes are observed at different sites
  – Example: The astronomy scenario described earlier (see the sketch below)
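A minimal NumPy sketch of the two partitioning schemes on a toy matrix; the site counts and split points are arbitrary choices for illustration.

```python
import numpy as np

# Toy data: 6 observations (rows) x 4 attributes (columns).
X = np.arange(24).reshape(6, 4)

# Horizontal partitioning: every site holds ALL attributes,
# but only a subset of the observations (e.g., one store per region).
site1_h, site2_h = X[:3, :], X[3:, :]

# Vertical partitioning: every site holds ALL observations,
# but only a subset of the attributes (e.g., optical vs. radio
# measurements of the same six sky objects).
site1_v, site2_v = X[:, :2], X[:, 2:]

print(site1_h.shape, site2_h.shape)  # (3, 4) (3, 4)
print(site1_v.shape, site2_v.shape)  # (6, 2) (6, 2)
```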

Timing and Synchrony
• Synchronous Model
  – A message sent by node v at time P must reach neighbor u by time P+1 at the latest
  – The system is driven by a global clock
  – In each round: send messages to neighbors, receive messages from neighbors, perform computation
[Figure: nodes v and u with local clocks at times P and P+1]

Timing and Synchrony (contd.)
• Asynchronous Model
  – Algorithms are event driven
  – No access to a global clock
  – Messages from one processor to a neighbor arrive within a finite but unpredictable time
  – Question: How do you know whether a message was sent by a neighbor or not?
  – Non-deterministic in nature
  – Arbitrary ordering of delivered messages

An Example: Asynchronous Messages
• Nodes v (input X = 0) and u (input X = 1) are both connected to node P
• v’s and u’s protocols: send X to P
• P’s protocol: upon getting a message, print it
• Because delivery delays are unpredictable, P may print 0 and 1 in either order (simulated in the sketch below)
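A minimal single-process simulation of this scenario, assuming random (but finite) delays drawn uniformly; the node names and the delay distribution are illustrative only.

```python
import heapq
import random

def simulate_async(seed):
    """Nodes v (X = 0) and u (X = 1) each send X to P; P records values in
    arrival order. With unpredictable delays, either order is possible."""
    random.seed(seed)
    events = []  # event queue keyed by arrival time
    for sender, x in [("v", 0), ("u", 1)]:
        delay = random.uniform(0, 1)          # finite but unpredictable delay
        heapq.heappush(events, (delay, sender, x))
    arrivals = []
    while events:
        _, sender, x = heapq.heappop(events)
        arrivals.append((sender, x))          # P's protocol: print/record on receipt
    return arrivals

print(simulate_async(1))
print(simulate_async(2))  # may arrive in the opposite order
```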

Communication Protocols: Part 1
• Broadcast
  – Disseminate a message M from source s to all vertices in the network
  – Common strategy: use a spanning tree T rooted at the source s
  – Tree-cast: an internal vertex gets the message from its parent and forwards it to its children

Communication Protocols: Part 1 (contd.)
• Convergecast
  – The source can detect that the broadcast operation has terminated (termination detection)
  – Acknowledgement echoes flow back up the tree from the leaves to the source (see the sketch below)
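A simplified single-process sketch of both operations on a small tree: the message flows down from the root (tree-cast), and acknowledgement echoes flow back up so the source can detect termination. The tree and the message are made up for the example.

```python
# Spanning tree rooted at the source "s": parent -> list of children.
TREE = {"s": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}

def broadcast(node, message):
    """Tree-cast: a vertex receives the message from its parent
    and forwards it to its children."""
    print(f"{node} received {message!r}")
    for child in TREE[node]:
        broadcast(child, message)

def convergecast(node):
    """Echo phase: a vertex acknowledges only after all of its children
    have acknowledged; when the call returns at the source, the broadcast
    has terminated."""
    for child in TREE[node]:
        convergecast(child)
    print(f"{node} sends ACK towards the source")

broadcast("s", "M")
convergecast("s")
print("source s: broadcast terminated")
```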

Gossip-based Communication
• Based on the spread of an epidemic in a large population
• Susceptible, infected and dead nodes
• The “epidemic” spreads exponentially fast
[Figure: five nodes exchanging gossip messages]

Randomized Gossip
• Nodes contact any one neighbor chosen at random (sketch below)
• Models can be asynchronous or synchronous
• Asynchronous: a single clock ticks according to a rate-n Poisson process at times Z_k, k ≥ 1, and the inter-tick times |Z_{k+1} − Z_k| are exponential with rate n
• Synchronous: time is slotted uniformly across all nodes
• Reference: S. Boyd et al., “Randomized Gossip Algorithms”, IEEE Transactions on Information Theory, 2006.
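A common use of randomized gossip is distributed averaging, the setting analyzed by Boyd et al. The sketch below is a simplified slotted simulation in which, at each tick, one random node averages its value with one randomly chosen neighbor; the ring topology and initial values are made up.

```python
import random

def gossip_average(values, neighbors, steps=2000, seed=0):
    """Pairwise randomized gossip: at each tick a random node contacts one
    neighbor chosen at random, and both replace their values with the pair
    average. All values converge to the global mean."""
    random.seed(seed)
    x = list(values)
    for _ in range(steps):
        i = random.randrange(len(x))
        j = random.choice(neighbors[i])
        x[i] = x[j] = (x[i] + x[j]) / 2.0
    return x

# Ring of 5 nodes, as in the slide's figure.
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
print(gossip_average([10.0, 0.0, 0.0, 0.0, 0.0], ring))  # all values near 2.0
```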

Road Map
• Problem Motivation
• Potential Applications
• DDM Basics
  – Data Distribution
  – Communication Protocols
    • Gossip-based communication
    • Randomized Gossip
    • Convergecast, Upcast and Downcast
• Distributed Classification
• Distributed Clustering
• Mining on Large-Scale Systems: Challenges and Open Problems

Decision Tree Induction

• Example of Quinlan’s ID3 (Play / No Play)

Decision Tree Built on the Data
• Outlook = sunny
  – Humidity <= 75: Play
  – Humidity > 75: No Play
• Outlook = overcast: Play
• Outlook = rain
  – Windy = true: No Play
  – Windy = false: Play

Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (continuous-valued attributes are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  – There are no samples left

Attribute Selection: Information Gain
• Select the attribute with the highest information gain (computed in the sketch below)
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
• Expected information (entropy) needed to classify a tuple in D:
  Info(D) = − Σ_{i=1..m} p_i log2(p_i)
• Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
• Information gained by branching on attribute A:
  Gain(A) = Info(D) − Info_A(D)
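These quantities are straightforward to compute directly. The sketch below evaluates Info(D), Info_A(D) and Gain(A) on a made-up play/no-play list, only to exercise the formulas above.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum_i p_i * log2(p_i): entropy of the class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(attribute_values, labels):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j)."""
    n = len(labels)
    partitions = {}
    for a, y in zip(attribute_values, labels):
        partitions.setdefault(a, []).append(y)
    info_a = sum(len(d_j) / n * info(d_j) for d_j in partitions.values())
    return info(labels) - info_a

# Toy data: does "outlook" help predict play / no play?
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(round(info(play), 3), round(gain(outlook, play), 3))  # 1.0 and about 0.667
```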

Distributed Decision Tree Construction
• Adam sends Betty “Outlook = Rainy”
• Betty constructs the indicator vectors “Humidity = High & Play = Yes” and “Humidity = Normal & Play = Yes”
• The dot products count the tuples satisfying “Outlook = Rainy & Humidity = Normal & Play = Yes” and “Outlook = Rainy & Humidity = High & Play = Yes” (see the sketch below)
• Example obtained from: C. Giannella, K. Liu, T. Olsen and H. Kargupta, “Communication efficient construction of decision trees over heterogeneously distributed data”, ICDM 2004
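In this scheme the only cross-site arithmetic is a dot product of 0/1 indicator vectors over a shared row ordering. A minimal sketch, assuming the two parties already agree on that ordering; the vectors below are invented to match the slide's example.

```python
import numpy as np

# Six tuples, vertically partitioned: Adam's site observes Outlook,
# Betty's site observes Humidity and Play. Rows are in a shared order.
adam_rainy     = np.array([1, 0, 1, 1, 0, 0])  # Outlook = Rainy
betty_high_yes = np.array([1, 0, 0, 1, 1, 0])  # Humidity = High   & Play = Yes
betty_norm_yes = np.array([0, 0, 1, 0, 0, 1])  # Humidity = Normal & Play = Yes

# Each dot product counts the tuples satisfying the conjunction of the two
# sites' predicates, which is exactly the statistic a split evaluation needs.
print(int(adam_rainy @ betty_high_yes))  # Rainy & High   & Yes
print(int(adam_rainy @ betty_norm_yes))  # Rainy & Normal & Yes
```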

A Technique from Random Projection
• A simple technique that has been useful in developing approximation algorithms
• Given n points in a Euclidean space such as R^n, project down to a random k-dimensional subspace for k << n
• If k is “medium-size”, e.g. O(ε^-2 log n), then the projection approximately preserves many interesting quantities
• If k is small, e.g. 1, then one can often still get something useful

Johnson and Lindenstrauss Lemma
• Given n points in R^n, if they are projected randomly to R^k, for k = O(ε^-2 log n), then with high probability all pairwise distances are preserved up to a factor of (1 ± ε) (after scaling by (n/k)^1/2).

Distributed Dot Product Estimation Using Random Projection
• Data matrices: Site A holds A (n × p), Site B holds B (n × q)
• Normalize the data
• A and B share a random number generation seed
• Each generates the same l × n random matrix R (l << n)
• A sends RA and B sends RB to a third site S
• S computes D = (RA)^T (RB) / l
• E[D] = E[A^T (R^T R) B / l] = A^T E[R^T R] B / l ≈ A^T B (Johnson and Lindenstrauss lemma; see the sketch below)
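A minimal NumPy sketch of the protocol, assuming a shared seed and i.i.d. N(0, 1) projection entries (so E[R^T R] = l·I and the division by l recovers the dot products); the dimensions and the correlation injected between A and B are arbitrary. Large entries of A^T B are approximated well at this l, while small ones stay noisy.

```python
import numpy as np

n, l = 1000, 200                    # data rows, projected dimension (l << n)
rng = np.random.default_rng(7)
A = rng.standard_normal((n, 3))                       # site A: n x p
B = A[:, :2] + 0.1 * rng.standard_normal((n, 2))      # site B: n x q, correlated with A

# Both sites build the SAME l x n random matrix from a shared seed,
# so the projection matrix itself is never transmitted.
shared = np.random.default_rng(12345)
R = shared.standard_normal((l, n))

RA, RB = R @ A, R @ B               # each site ships only its small projected matrix
D = RA.T @ RB / l                   # estimate formed at the third site S

print(np.round(D, 1))               # approximate A^T B
print(np.round(A.T @ B, 1))         # exact A^T B, for comparison
```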

Distributed Decision Tree Construction (contd.)
• At each site, locally determine which attribute has the largest information gain
• Keep track of the global attribute (A_G) with the largest information gain
• For each distinct value of A_G, a branch leading from the root node is constructed
• The site holding A_G sends its projection to the other sites
• Leaf node determination: (1) all instances have the same class, (2) the minimum allowable number of objects in a class is reached, (3) a child of a node is empty

PLANET: Parallel Learning for Assembling Numerous Ensemble Trees
• Reference: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo, “PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce”, VLDB 2009
• Components
  – Controller (maintains a ModelFile)
  – MapReduceQueue and InMemoryQueue

Classifier Design by Linear Programming
• Classification can be posed as an LP problem (sketch below)
• For the k-th instance x_k and weight vector W: x_k W ≥ d
• e_k is the error associated with an instance
• The LP is written as XW + E = D + S, where S contains the slack variables
• Assume that each node in a network has a data set; how can the classification problem be solved?
• H. Dutta and H. Kargupta, “Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments”, ICDM 2008.
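A minimal SciPy sketch of posing a linear classifier as an LP, assuming ±1 labels and the standard "minimize total slack subject to margin constraints" formulation; this is only illustrative and not necessarily the exact formulation of the cited paper.

```python
import numpy as np
from scipy.optimize import linprog

def lp_classifier(X, y):
    """Find (w, b) minimizing sum of slacks s.t. y_i (w.x_i + b) >= 1 - s_i, s_i >= 0."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d + 1), np.ones(n)])        # variables: [w, b, s]
    # y_i (w.x_i + b) + s_i >= 1  rewritten as  A_ub @ z <= b_ub.
    A_ub = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(None, None)] * (d + 1) + [(0, None)] * n       # slacks are non-negative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d], res.x[d]

# Two separable point clouds with labels +1 / -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
w, b = lp_classifier(X, y)
print(np.mean(np.sign(X @ w + b) == y))   # training accuracy (1.0 here)
```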

Ensemble Learning in Distributed Environments
[Figure: trees T0, T1, T2, … each produce a classification f_i(x); a weighted sum gives the ensemble output]
• Classification function of the ensemble classifier (sketch below): f(x) = Σ_i a_i f_i(x)
  – a_i: weight for tree i
  – f_i(x): classification of tree i
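A tiny sketch of the combining rule f(x) = Σ_i a_i f_i(x) for ±1 base classifiers; the base classifiers and weights are toy stand-ins for trained trees.

```python
import numpy as np

def ensemble_predict(x, classifiers, weights):
    """f(x) = sign( sum_i a_i * f_i(x) ) for +/-1 base classifiers f_i."""
    score = sum(a * f(x) for f, a in zip(classifiers, weights))
    return 1 if score >= 0 else -1

# Toy base classifiers (stand-ins for trees reduced to simple rules).
classifiers = [lambda x: 1 if x[0] > 0 else -1,
               lambda x: 1 if x[1] > 0 else -1,
               lambda x: 1 if x[0] + x[1] > 0 else -1]
weights = [0.5, 0.3, 0.2]
print(ensemble_predict(np.array([0.4, -0.1]), classifiers, weights))  # -> 1
```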

Ensemble Approaches
• Bagging (Breiman, 1996)
• Boosting (Freund and Schapire, 1999)
• Arcing (Breiman, 1997)
• Stacking (Wolpert, 1992)
• Rotation Forest (Kuncheva et al.)

The Distributed Boosting Algorithm
• k distributed sites store homogeneously partitioned data
• At each local site, initialize the local distribution Δ_j
• Keep track of the global initial distribution by broadcasting Δ_j
• For each iteration, across all sites:
  – Draw indices from the local data set based on the global distribution
  – Train a weak learner and distribute it to all sites
  – Create an ensemble by combining the weak learners; use the ensemble to compute the weak hypothesis
  – Compute weights and redistribute them to all sites
  – Update the distribution and repeat until termination (a simplified simulation follows below)
• Reference: A. Lazarevic and Z. Obradovic, “The Distributed Boosting Algorithm”, KDD 2001.
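A much-simplified single-process simulation in that spirit, assuming ±1 labels, scikit-learn decision stumps as weak learners and AdaBoost-style updates; the broadcast steps are mimicked with ordinary Python sums, so this is a sketch of the flavor of the algorithm rather than the exact procedure from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def distributed_boosting(sites, rounds=10, seed=0):
    """sites: list of (X_j, y_j) with labels in {-1, +1}, homogeneously partitioned."""
    rng = np.random.default_rng(seed)
    total_n = sum(len(y) for _, y in sites)
    w = [np.full(len(y), 1.0 / total_n) for _, y in sites]   # globally normalized weights
    ensemble = []                                            # (alpha, weak learners) pairs
    for _ in range(rounds):
        # Each site resamples its local data according to its share of the
        # global distribution and trains a weak learner (a stump here).
        learners = []
        for (X, y), wj in zip(sites, w):
            idx = rng.choice(len(y), size=len(y), p=wj / wj.sum())
            learners.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))

        def h(X):  # composite weak hypothesis: vote of this round's learners
            return np.where(sum(c.predict(X) for c in learners) >= 0, 1, -1)

        # Weighted error aggregated across sites (as if each site broadcast its sum).
        err = sum(float(wj[h(X) != y].sum()) for (X, y), wj in zip(sites, w))
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        ensemble.append((alpha, learners))

        # AdaBoost-style update at each site, then global renormalization.
        w = [wj * np.exp(-alpha * y * h(X)) for (X, y), wj in zip(sites, w)]
        total = sum(wj.sum() for wj in w)
        w = [wj / total for wj in w]
    return ensemble
```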

Road Map
• Problem Motivation
• An Astronomy Application
• DDM Basics
  – Data Distribution
  – Synchronous vs Asynchronous algorithms
• Communication Protocols
  – Gossip-based communication
  – Randomized Gossip
  – Convergecast, Upcast and Downcast
• Distributed Classification
• Distributed Clustering / Outlier Detection
• Mining on Large-Scale Systems: Challenges and Open Problems

CPCA: Collective Principal Component Analysis-based Clustering
• Kargupta et al. (KAIS, 2001)
• At each local site: perform PCA, project the data onto the principal components, and apply clustering in the lower dimension (sketch below)
• The local sites communicate with a central coordinator, which sends back the global PCs
• Perform local clustering at the sites with the global PCs
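A rough sketch of the per-site step using scikit-learn: each site runs PCA locally, projects its data onto its local PCs and clusters in the reduced space; the exchange with the central coordinator is only indicated by a comment, and the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def local_site_step(X_local, n_components=2, n_clusters=3, seed=0):
    """One site's work in a CPCA-style pipeline: local PCA, projection onto
    the PCs, clustering in the lower-dimensional space. The PCs (and cluster
    representatives) would then be shipped to the central coordinator."""
    pca = PCA(n_components=n_components).fit(X_local)
    Z = pca.transform(X_local)                            # project onto local PCs
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(Z)
    return pca.components_, labels

rng = np.random.default_rng(1)
X_site = np.vstack([rng.normal(m, 0.3, (50, 5)) for m in (0, 3, 6)])  # one site's data
pcs, labels = local_site_step(X_site)
print(pcs.shape, np.bincount(labels))   # (2, 5) and roughly equal cluster sizes
```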

KDEC: Distributed Density-Based Clustering
• Matthias Klusch et al. (IJCAI 2003)
• Homogeneous data partitioning across nodes
• Local sites and a helper agent
• A global kernel function and bandwidth are assumed to be agreed upon
• Local density estimates are made at each site
• The global KDE is obtained by summing the local estimates (see the sketch below)
• The value is sent back to the local sites, which cluster their data
• Points that can be connected by a continuous uphill path to a local maximum are in the same cluster
• A privacy-preserving variation also exists
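The aggregation step is simple to illustrate: each site evaluates a kernel density estimate on a shared grid, and the helper sums the local estimates to obtain the global one. A minimal NumPy sketch with a Gaussian kernel and an agreed bandwidth; the data, grid and bandwidth are made up.

```python
import numpy as np

def local_kde(points, grid, bandwidth):
    """Gaussian kernel density contribution of one site, evaluated on a shared
    grid (left unnormalized by the global point count, which is all the
    uphill-path clustering step needs)."""
    z = (grid[:, None] - points[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

grid = np.linspace(-4.0, 8.0, 200)     # agreed-upon evaluation grid
bandwidth = 0.5                        # agreed-upon bandwidth
sites = [np.random.default_rng(s).normal(loc, 1.0, 100) for s, loc in enumerate((0.0, 4.0))]

# Helper agent: the global density estimate is the sum of the local estimates.
global_kde = sum(local_kde(pts, grid, bandwidth) for pts in sites)
print(round(float(grid[np.argmax(global_kde)]), 2))   # one of the two modes (near 0 or 4)
```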

Parallel K-means
• Dhillon and Modha
• Chunk the data (homogeneous partitioning)
• A random node selects the initial cluster centroids
• Distances between the centroids and the local data are computed at each node
• After each iteration, the independent partial results are reduced (sketch below)
• MPI is used to implement the procedure
• Parallelization is different from the distributed setting!
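The key point is that each iteration decomposes into per-chunk partial sums that are then reduced, with MPI in the original implementation and plain Python below. A single-process sketch of one such iteration on made-up data:

```python
import numpy as np

def kmeans_iteration(chunks, centroids):
    """One parallel k-means step: each chunk computes local assignments and
    partial sums; a reduce step combines them into the new centroids."""
    k, d = centroids.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for X in chunks:                                    # each chunk lives on one node
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):                              # local partial results
            sums[j] += X[assign == j].sum(axis=0)
            counts[j] += (assign == j).sum()
    # Reduce step: with MPI this would be an Allreduce over (sums, counts).
    return sums / np.maximum(counts, 1)[:, None]

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(c, 0.2, (60, 2)) for c in ((0, 0), (3, 3), (0, 3))])
chunks = np.array_split(data, 4)                        # homogeneous partitioning
centroids = data[rng.choice(len(data), 3, replace=False)]  # a random node picks seeds
for _ in range(5):
    centroids = kmeans_iteration(chunks, centroids)
print(np.round(centroids, 1))
```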

Summary
• Data avalanche in scientific disciplines
• Distributed Data Mining: a relatively new field over the past 15 years
• Data distribution and communication protocols
• How does the distributed data affect mining?
• Algorithms for decision tree construction and boosting in distributed settings
• Unsupervised learning in distributed environments
• A lot more to be done theoretically and empirically!
• Interested in collaborating? Send email to [email protected]