
Scaling Up Classifiers to Cloud Computers

Christopher Moretti, Karsten Steinhaeuser, Douglas Thain, Nitesh V. Chawla

University of Notre Dame

Outline

Distributed Data Mining
Data Mining on Clouds
Abstraction for Distributed Data Mining
Implementing the Abstraction
Evaluating the Abstraction
Take-aways

Distributed Data Mining

For training data D, test set T, and classifier F:

Divide D into N partitions with partitioner P

Run N copies of F, one on each partition, generating a set of votes on T for each partition

Collect votes from all copies of F and combine into a final result R
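To make the abstraction concrete, here is a minimal Python sketch of the partition/classify/vote flow. The round-robin partitioner is a hypothetical stand-in for P, and the sequential loop stands in for what the real system runs as N parallel jobs:

```python
from collections import Counter

def round_robin(D, N):
    """Hypothetical partitioner standing in for P: deal instances round-robin."""
    return [D[i::N] for i in range(N)]

def distributed_classify(D, T, F, P, N):
    """The abstraction: partition D with P, run N copies of F, combine votes."""
    partitions = P(D, N)                         # divide D into N partitions
    votes = [F(part, T) for part in partitions]  # in reality, N parallel jobs
    # Combine: majority vote over the N predicted labels per test instance.
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*votes)]
```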


Challenges in Distributed DM

When dealing with large amounts of data (MB to GB to TB), there are systems problems in addition to data mining problems.

Why should data miners have to be distributed systems experts too?

Scalable (in terms of data size and number of resources) distributed data mining architectures tend to be finely tailored to an application and algorithm.

Proposed Solution

An abstraction framework for distributed data mining.

An abstraction allows users to declare a distributed workload based only on what they know (sequential programs, data).

Why an abstraction? Abstractions hide many complexities from users. Unlike a specially-tailored implementation, a conceptual abstraction provides a general-purpose solution to a problem, one which may be implemented in any of several ways depending on requirements.

Clusters versus Cloud Computers

Clusters:
Small (4-16 nodes) to very large
Use a shared filesystem, often centralized
Assign dedicated resources, often in large blocks
Often static and generally homogeneous
Managed by a batch or grid engine

Cloud computers:
Large (~500 CPUs, ~300 disks at ND)
Use individual disks rather than a central filesystem
Assign resources dynamically, without a guarantee of dedicated access
Commodity, dynamic, and heterogeneous
Managed by a batch or grid engine

Implementing the Abstraction

There are several factors to consider:

How many nodes to use for computation?
How many nodes to use for data?
How to connect the data and computation nodes?

Streaming

Each process is connected via a data stream.

Data exists only in buffers in memory, and stream writers block until stream readers have consumed the buffer.

Requires full parallelism to complete: every process in the chain must be running simultaneously.

Not robust to failure.
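A minimal, Unix-only sketch of the streaming semantics (an illustration, not the system's code): the two ends of an OS pipe behave exactly as described, with the writer blocking once the in-memory buffer fills until the reader drains it.

```python
import os

# Connect a "partitioner" to a "classifier" with an OS pipe (Unix-only).
read_fd, write_fd = os.pipe()

pid = os.fork()
if pid == 0:
    # Child: the classifier end consumes instances from the stream.
    os.close(write_fd)
    with os.fdopen(read_fd) as stream:
        for line in stream:
            pass  # classify(line) would consume the instance here
    os._exit(0)
else:
    # Parent: the partitioner end writes instances into the stream.
    os.close(read_fd)
    with os.fdopen(write_fd, "w") as stream:
        for i in range(100000):
            stream.write(f"instance {i}\n")  # blocks whenever the reader lags
    os.waitpid(pid, 0)
```

If either end dies, the whole pipeline fails, which is the fragility noted above.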


Pull

Partitioning is done ahead of computation and partitions are stored on the source node.

Computation jobs pull in the proper partition from the source node.

Flexible and robust to failure, but not scalable to a large number of computation nodes.

[Figure: Pull. Partitions P1-P4 of .data stay on the source node; jobs matched by the Condor Matchmaker pull their partitions at start.]
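A hedged sketch of what a pull-style job body might look like. The source host, URL layout, and partition naming are hypothetical, and HTTP stands in for the system's actual transfer mechanism:

```python
import urllib.request

# Hypothetical central source node holding all the partition files.
SOURCE = "http://source-node.example.edu/partitions"

def run_pull_job(job_id):
    # Each job fetches its own partition when it starts, so "any" host
    # can run any job -- the robustness property of Pull.
    url = f"{SOURCE}/partition.{job_id}"
    with urllib.request.urlopen(url) as resp:
        partition = resp.read()
    # classify(partition, test_set) would run here.
    return partition
```

The cost is that every transfer hits the single source node, which is why Pull does not scale to a large number of computation nodes.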

Push

Work assignments are made ahead of partitioning, and the partitioner distributes data to the nodes where it will be used.

Data are accessed locally where possible, or accessed in place remotely.

This improves scalability to larger numbers of computation nodes, but can decrease flexibility and increase reliance on unreliable nodes.

[Figure: Push. Partitions P1-P4 of .data are distributed to the compute nodes ahead of time; the Condor Matchmaker sends jobs to the nodes that already hold their data.]
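For contrast, a sketch of push-style staging (node names and paths are hypothetical, and scp stands in for the system's transfer tool): partitions are copied out before any job starts, so jobs must then be matched to the nodes that already hold their data.

```python
import subprocess

# Hypothetical compute nodes that will each receive one partition.
NODES = ["c00.example.edu", "c01.example.edu", "c02.example.edu"]

def push_partitions(partition_paths):
    # Copy partition i to node i ahead of scheduling; the scheduler must
    # later place job i on NODES[i] (or access the data remotely in place).
    for node, path in zip(NODES, partition_paths):
        subprocess.run(["scp", path, f"{node}:/tmp/partition"], check=True)
```

If a node that holds a partition disappears, its job cannot run locally anywhere else, which is the fragility noted above.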

Hybrid

Push to a well-known set of intermediate nodes.

Pull from those nodes.

This combines the advantages of Pull (flexibility, reliability) and Push (I/O performance).

[Figure: Hybrid. Partitions P1-P4 of .data are pushed to a set of intermediate data servers; jobs matched by the Condor Matchmaker pull from those servers.]
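A sketch combining the two stages (server names and layout hypothetical): partitions are pushed to a small set of intermediate data servers, and any compute node can then pull from them.

```python
import subprocess
import urllib.request

# Hypothetical well-known intermediate data servers.
SERVERS = ["data0.example.edu", "data1.example.edu"]

def stage_partitions(partition_paths):
    # Push: spread the partitions across the intermediate servers.
    for i, path in enumerate(partition_paths):
        server = SERVERS[i % len(SERVERS)]
        subprocess.run(["scp", path, f"{server}:/data/partition.{i}"],
                       check=True)

def run_hybrid_job(job_id):
    # Pull: the job fetches its partition from the server that holds it,
    # so any compute node can still run any job.
    server = SERVERS[job_id % len(SERVERS)]
    url = f"http://{server}/data/partition.{job_id}"
    with urllib.request.urlopen(url) as resp:
        return resp.read()  # classify(...) would run on this data
```

Multiple servers spread the transfer load that Pull concentrates on one node, while keeping the small, reliable set of data holders that Push lacks.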


Implementing the Abstraction

The effectiveness of these possibilities hinges on the flexibility, reliability, and performance of their components.

An example of such a component is the partitioning algorithm.


Partitioning Algorithms

Shuffle: one instance at a time is taken from the training data and copied into a partition.

Chop: one partition at a time is created, copying all of its instances from the training data.

[Figure: Shuffle. Instances A-L are read sequentially from the source and dealt one at a time into the open partitions.]

[Figure: Chop. Partitions are filled one at a time, each receiving every Nth instance from the source (A, D, G, J; B, E, H, K; ...).]
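A minimal local-filesystem sketch of the two access patterns (file naming hypothetical; the real partitioner also writes to remote nodes). In this sketch, Shuffle makes one pass over the source with all N output files open at once, while Chop makes N passes, holding only one output file open at a time:

```python
def shuffle(src_path, N):
    # One sequential pass; all N partition files open simultaneously,
    # each instance dealt round-robin to the next partition.
    outs = [open(f"partition.{i}", "w") for i in range(N)]
    with open(src_path) as src:
        for i, line in enumerate(src):
            outs[i % N].write(line)
    for f in outs:
        f.close()

def chop(src_path, N):
    # N sequential passes; only one partition file open at a time,
    # pass k keeping every Nth instance starting at instance k.
    for k in range(N):
        with open(src_path) as src, open(f"partition.{k}", "w") as out:
            for i, line in enumerate(src):
                if i % N == k:
                    out.write(line)
```

The single pass is what lets Shuffle scale better to many remote hosts, while holding only one output open at a time is what makes Chop more robust.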

[Figure: Partitioning performance on a 5.4 GB data set. Local partitioning uses fgets and fprintf; remote partitioning to 16 hosts (R16) uses fgets and chirp_stream_write within the sc0 cluster.]


Partitioning Conclusions

Remote partitioning is faster, but less reliable, than local partitioning

Shuffle is slower locally and to a small number of remote hosts but scales better to a large number of remote hosts

Shuffle is less robust than Chop for large data sets

Evaluating the Architectures

Evaluation is based on performance and scalability.

Classifier algorithms were decision trees, K-nearest neighbors, and support vector machines.

[Figure: Results on the Protein data set (3.3M instances, 170 MB), using decision trees.]

[Figure: Results on the KDDCup data set (4.9M instances, 700 MB), using decision trees.]

[Figure: Results on the Alpha data set (400K instances, 1.8 GB), using K-nearest neighbors.]

System Architectures

Push: fastest (remote partitioning, mainly local access). Requires 1-to-1 matching of jobs to data nodes, or a heavy preference for it; pure 1-to-1 matching is possible but more fragile.

Pull: slowest (local partitioning, transfer at job start). Most robust (central data; "any" host can run jobs).

Hybrid: a combination: push to a subset of nodes, then pull. Faster than Pull (remote partitioning, multiple servers) and more robust than Push (small set of servers).

Future Work

Performance vs. accuracy for long-tail jobs: is there a viable tradeoff between turnaround time and degraded classification accuracy?

Efficient data management on multicores.

Hierarchical abstraction framework: submit jobs to clouds of subnets of multicores.

Conclusions

The Hybrid method is amenable both to cluster-like environments and to larger, more diverse clouds, and its use of intermediate data servers mitigates some of Shuffle's problems.

A fundamental limit on scalability is the available memory on each workstation. For our largest data sets, even 16 nodes were not sufficient to run effectively.


Questions?

Data Analysis and Inference Laboratory
Karsten Steinhaeuser (ksteinha@cse.nd.edu)
Nitesh V. Chawla (nchawla@cse.nd.edu)

Cooperative Computing Laboratory
Christopher Moretti (cmoretti@cse.nd.edu)
Douglas Thain (dthain@cse.nd.edu)

Acknowledgements: NSF CNS-06-43229, CCF-06-21434, CNS-07-20813