Map-Reduce for Machine Learning on Multicore
C. Chu, S.K. Kim, Y. Lin, Y.Y. Yu, G. Bradski, A.Y. Ng, K. Olukotun (NIPS 2006)
Shimin Chen, Big Data Reading Group
Motivations
Industry-wide shift to multicore
No good framework for parallelizing ML algorithms
Goal: develop a general and exact technique for parallel programming of a large class of ML algorithms for multicore processors
Idea
Statistical Query Model
Summation Form
Map-Reduce
Outline
Introduction
Statistical Query Model and Summation Form
Architecture (inspired by Map-Reduce)
Adopted ML Algorithms
Experiments
Conclusion
Valiant Model [Valiant’84]
x is the input; y is a function of x that we want to learn
In the Valiant model, the learning algorithm uses randomly drawn examples <x, y> to learn the target function
Statistical Query Model [Kearns’98]
A restriction of the Valiant model: the learning algorithm uses aggregates over the examples, not the individual examples
More precisely, the learning algorithm interacts with a statistical query oracle: it asks about a function f(x, y), and the oracle returns an estimate of the expectation of f(x, y) over the data distribution
Summation Form
Aggregate over the data: divide the data set into pieces, compute partial aggregates on each core, combine all the partial results at the end
Example: Linear Regression via Least Squares
Model: y = θ^T x
Goal: minimize Σ_i (θ^T x_i − y_i)^2
Solution: given m examples (x1, y1), (x2, y2), …, (xm, ym), write the matrix X with x1, …, xm as rows, and the column vector Y = (y1, y2, …, ym)^T. Then the solution is θ* = (X^T X)^(-1) X^T Y
Parallel computation:
• A = X^T X = Σ_i x_i x_i^T and b = X^T Y = Σ_i x_i y_i are both sums over the examples
• Cut the data into num_processor pieces of m/num_processor examples each; each core computes partial sums over its piece, and the partial sums are added at the end
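As a sketch of this decomposition (my own illustration, not code from the paper), consider single-feature least squares without an intercept, where the solution collapses to θ = Σ_i x_i y_i / Σ_i x_i^2: each worker computes partial sums over its piece of the data, and the partial sums are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sums(chunk):
    """Mapper: compute (sum of x*y, sum of x*x) over one piece of the data."""
    sxy = sum(x * y for x, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    return sxy, sxx

def parallel_least_squares(pairs, num_workers=4):
    # Cut the data into num_workers pieces of roughly m/num_workers examples.
    chunks = [pairs[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sums, chunks))
    # Reducer: add the partial sums, then finish with one division.
    sxy = sum(p[0] for p in partials)
    sxx = sum(p[1] for p in partials)
    return sxy / sxx

# y = 2x exactly, so the fitted slope is 2.0.
data = [(1, 2), (2, 4), (3, 6), (4, 8)]
print(parallel_least_squares(data))  # 2.0
```

The same structure carries over to the full multivariate case: the partial sums become partial matrices x_i x_i^T and vectors x_i y_i, and the final division becomes a matrix solve.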
Lighter Weight Map-Reduce for Multicore
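The engine itself is not reproduced in these slides; a minimal sketch of the idea (all names here are mine, not the paper's) is: split the input among worker threads, run the mapper over each piece, group the emitted key/value pairs, and reduce per key.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_reduce(data, mapper, reducer, num_workers=4):
    """Run mapper over pieces of data in parallel, then reduce per key."""
    pieces = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = pool.map(mapper, pieces)
    # Group the intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    # Reduce each key's values to a final result.
    return {key: reducer(values) for key, values in groups.items()}

# Word count as a smoke test.
words = "the quick brown fox jumps over the lazy dog".split()
counts = map_reduce(words, lambda piece: [(w, 1) for w in piece], sum)
print(counts["the"])  # 2
```

A single-machine engine like this avoids the distributed file system and fault tolerance of Google's Map-Reduce, which is why the paper calls it lighter weight.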
Locally Weighted Linear Regression (LWLR)
Two sets of mappers: one set computes A = Σ_i w_i x_i x_i^T, the other set computes b = Σ_i w_i x_i y_i
Two reducers, one aggregating A and one aggregating b
Finally solve Aθ = b for the solution
When all w_i = 1, this reduces to ordinary least squares
Naïve Bayes (NB)
Goal: estimate P(xj = k | y = 1) and P(xj = k | y = 0)
Computation: count the occurrences of (xj = k, y = 1) and (xj = k, y = 0), count the occurrences of (y = 1) and (y = 0), then compute the divisions
Mappers: each counts over a subgroup of the training samples
Reducer: aggregates the intermediate counts and calculates the final result
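A sketch of this counting step (my illustration, for a single binary feature and binary labels), using a Counter as both the intermediate and final aggregate:

```python
from collections import Counter

def nb_mapper(samples):
    """Count (feature value, label) pairs and label occurrences in one subgroup."""
    counts = Counter()
    for x, y in samples:          # x is a single binary feature here
        counts[(x, y)] += 1       # occurrences of (x = k, y)
        counts[y] += 1            # occurrences of y
    return counts

def nb_reducer(partial_counts):
    """Aggregate the intermediate counts, then compute the divisions."""
    total = Counter()
    for c in partial_counts:
        total += c
    # P(x = k | y) = count(x = k, y) / count(y)
    return {(k, y): total[(k, y)] / total[y]
            for (k, y) in [(0, 0), (0, 1), (1, 0), (1, 1)]}

samples = [(1, 1), (1, 1), (0, 1), (1, 0), (0, 0), (0, 0)]
# Two subgroups of training samples, one per mapper.
partials = [nb_mapper(samples[:3]), nb_mapper(samples[3:])]
probs = nb_reducer(partials)
print(probs[(1, 1)])  # P(x=1 | y=1) = 2/3
```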
Gaussian Discriminative Analysis (GDA)
Goal: classification of x into classes of y, assuming the examples of each class are drawn from a Gaussian with its own mean but a shared covariance
Computation: the class priors, means, and covariance are all sums over the training examples
Mappers: compute the partial sums for a subset of the training samples
Reducer: aggregates the intermediate results
K-means
Compute the Euclidean distance between sample vectors and centroids
Recalculate the centroids
Divide the computation into subgroups to be handled by map-reduce
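One iteration of this split can be sketched as follows (my illustration, using 1-D points so the Euclidean distance is just an absolute difference): each mapper assigns its points to the nearest centroid and emits per-centroid partial sums and counts, and the reducer recalculates the centroids.

```python
def kmeans_mapper(points, centroids):
    """Assign each point to its nearest centroid; emit per-centroid (sum, count)."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
        sums[nearest] += p
        counts[nearest] += 1
    return sums, counts

def kmeans_reducer(partials, k):
    """Combine the partial sums/counts and recalculate the centroids."""
    sums = [0.0] * k
    counts = [0] * k
    for s, c in partials:
        for j in range(k):
            sums[j] += s[j]
            counts[j] += c[j]
    return [sums[j] / counts[j] for j in range(k)]

points = [0.0, 1.0, 9.0, 10.0]
centroids = [0.0, 10.0]
# Two subgroups of points, one per mapper.
partials = [kmeans_mapper(points[:2], centroids), kmeans_mapper(points[2:], centroids)]
print(kmeans_reducer(partials, 2))  # [0.5, 9.5]
```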
Expectation Maximization (EM)
E-step: computes some probabilities or counts per training example
M-step: combines these values to update the parameters
Both steps can be parallelized using map-reduce
Neural Network (NN)
Back-propagation on a 3-layer network: input layer, middle (hidden) layer, and 2 output nodes
Goal: compute the weights in the NN by back-propagation
Mapper: propagates its set of training data through the network and back-propagates the errors to calculate partial gradients for the weights
Reducer: sums the partial gradients and does a batch gradient descent step to update the weights
Principal Components Analysis (PCA)
Compute the principal eigenvectors of the covariance matrix
The covariance matrix is built from Σ_i x_i x_i^T and the mean Σ_i x_i, so clearly we can compute it in summation form using map-reduce
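A sketch of that covariance computation (my illustration, for 2-D data): each mapper emits partial Σ_i x_i and Σ_i x_i x_i^T over its subset, and the reducer combines them into cov = (1/m) Σ_i x_i x_i^T − μ μ^T.

```python
def pca_mapper(points):
    """Partial sums over one subset: count, sum of x, sum of outer products x x^T."""
    n = 0
    sx = [0.0, 0.0]
    sxx = [[0.0, 0.0], [0.0, 0.0]]
    for x in points:
        n += 1
        for a in range(2):
            sx[a] += x[a]
            for b in range(2):
                sxx[a][b] += x[a] * x[b]
    return n, sx, sxx

def pca_reducer(partials):
    """Combine the partial sums, then form cov = E[x x^T] - mu mu^T."""
    m = sum(p[0] for p in partials)
    mu = [sum(p[1][a] for p in partials) / m for a in range(2)]
    return [[sum(p[2][a][b] for p in partials) / m - mu[a] * mu[b]
             for b in range(2)] for a in range(2)]

points = [(1.0, 2.0), (3.0, 4.0)]
# One mapper per subset of the data.
partials = [pca_mapper(points[:1]), pca_mapper(points[1:])]
print(pca_reducer(partials))  # [[1.0, 1.0], [1.0, 1.0]]
```

The eigenvector computation on the resulting small covariance matrix is then done sequentially, since its cost does not grow with the number of examples.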
Other Algorithms
Logistic Regression
Independent Component Analysis
Support Vector Machine
Time Complexity
Setup
Compare the map-reduce version against the sequential version
10 data sets
Machines:
Dual-processor Pentium-III 700MHz, 1GB RAM
16-way Sun Enterprise 6000
(these are SMPs, not multicore)
Dual-Processor Speedups
2-16 processor speedups
More data in the paper
Multicore Simulator Results
The paper has a paragraph on this: basically, the results are better than on the multiprocessor machines, possibly because of lower communication cost
Conclusion
Parallelize summation forms
Use map-reduce on a single machine