Map-Reduce for Machine Learning on Multicore
C. Chu, S.K. Kim, Y. Lin, Y.Y. Yu, G. Bradski, A.Y. Ng, K. Olukotun (NIPS 2006)
Shimin Chen, Big Data Reading Group
Motivations
Industry-wide shift to multicore
No good framework for parallelizing ML algorithms
Goal: develop a general and exact technique for parallel programming of a large class of ML algorithms for multicore processors
Idea
Statistical Query Model
Summation Form
Map-Reduce
Outline
Introduction
Statistical Query Model and Summation Form
Architecture (inspired by Map-Reduce)
Adopted ML Algorithms
Experiments
Conclusion
Valiant Model [Valiant’84]
x is the input; y is a function of x that we want to learn
In the Valiant model, the learning algorithm uses randomly drawn examples <x, y> to learn the target function
Statistical Query Model [Kearns’98]
A restriction of the Valiant model: the learning algorithm uses aggregates over the examples, not the individual examples
More precisely, the learning algorithm interacts with a statistical query oracle: it asks about a function f(x, y), and the oracle returns an estimate of the expectation of f(x, y) over the data distribution
Summation Form
Aggregate over the data: divide the data set into pieces, compute partial aggregates on each core, combine all the partial results at the end
Example: Linear Regression via Least Squares
Model: y = θ^T x
Goal: minimize Σ_i (θ^T x_i − y_i)^2
Solution: given m examples (x1, y1), (x2, y2), …, (xm, ym), write the matrix X with x1, …, xm as rows, and the column vector Y = (y1, y2, …, ym)^T. Then the solution is θ* = (X^T X)^(-1) X^T Y
Parallel computation:
• A = X^T X = Σ_i x_i x_i^T and b = X^T Y = Σ_i x_i y_i are both sums over the examples
• Cut the data into num_processor pieces of m/num_processor examples each; each core computes partial sums over its piece, and the partial sums are added at the end
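As a sketch of this decomposition (my own illustration, not code from the paper), consider single-feature least squares without an intercept, where the solution collapses to θ = Σ_i x_i y_i / Σ_i x_i^2: each worker computes partial sums over its piece of the data, and the partial sums are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sums(chunk):
    """Mapper: compute (sum of x*y, sum of x*x) over one piece of the data."""
    sxy = sum(x * y for x, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    return sxy, sxx

def parallel_least_squares(pairs, num_workers=4):
    # Cut the data into num_workers pieces of roughly m/num_workers examples.
    chunks = [pairs[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sums, chunks))
    # Reducer: add the partial sums, then finish with one division.
    sxy = sum(p[0] for p in partials)
    sxx = sum(p[1] for p in partials)
    return sxy / sxx

# y = 2x exactly, so the fitted slope is 2.0.
data = [(1, 2), (2, 4), (3, 6), (4, 8)]
print(parallel_least_squares(data))  # 2.0
```

The same structure carries over to the full multivariate case: the partial sums become partial matrices x_i x_i^T and vectors x_i y_i, and the final division becomes a matrix solve.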
Lighter Weight Map-Reduce for Multicore
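The engine itself is not reproduced in these slides; a minimal sketch of the idea (all names here are mine, not the paper's) is: split the input among worker threads, run the mapper over each piece, group the emitted key/value pairs, and reduce per key.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_reduce(data, mapper, reducer, num_workers=4):
    """Run mapper over pieces of data in parallel, then reduce per key."""
    pieces = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = pool.map(mapper, pieces)
    # Group the intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    # Reduce each key's values to a final result.
    return {key: reducer(values) for key, values in groups.items()}

# Word count as a smoke test.
words = "the quick brown fox jumps over the lazy dog".split()
counts = map_reduce(words, lambda piece: [(w, 1) for w in piece], sum)
print(counts["the"])  # 2
```

A single-machine engine like this avoids the distributed file system and fault tolerance of Google's Map-Reduce, which is why the paper calls it lighter weight.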
Locally Weighted Linear Regression (LWLR)
Two sets of mappers: one set computes A = Σ_i w_i x_i x_i^T, the other set computes b = Σ_i w_i x_i y_i
Two reducers, one aggregating A and one aggregating b
Finally solve Aθ = b for the solution
When all w_i = 1, this reduces to ordinary least squares
Naïve Bayes (NB)
Goal: estimate P(xj = k | y = 1) and P(xj = k | y = 0)
Computation: count the occurrences of (xj = k, y = 1) and (xj = k, y = 0), count the occurrences of (y = 1) and (y = 0), then compute the divisions
Mappers: each counts over a subgroup of the training samples
Reducer: aggregates the intermediate counts and calculates the final result
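A sketch of this counting step (my illustration, for a single binary feature and binary labels), using a Counter as both the intermediate and final aggregate:

```python
from collections import Counter

def nb_mapper(samples):
    """Count (feature value, label) pairs and label occurrences in one subgroup."""
    counts = Counter()
    for x, y in samples:          # x is a single binary feature here
        counts[(x, y)] += 1       # occurrences of (x = k, y)
        counts[y] += 1            # occurrences of y
    return counts

def nb_reducer(partial_counts):
    """Aggregate the intermediate counts, then compute the divisions."""
    total = Counter()
    for c in partial_counts:
        total += c
    # P(x = k | y) = count(x = k, y) / count(y)
    return {(k, y): total[(k, y)] / total[y]
            for (k, y) in [(0, 0), (0, 1), (1, 0), (1, 1)]}

samples = [(1, 1), (1, 1), (0, 1), (1, 0), (0, 0), (0, 0)]
# Two subgroups of training samples, one per mapper.
partials = [nb_mapper(samples[:3]), nb_mapper(samples[3:])]
probs = nb_reducer(partials)
print(probs[(1, 1)])  # P(x=1 | y=1) = 2/3
```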
Gaussian Discriminative Analysis (GDA)
Goal: classification of x into classes of y, assuming the examples of each class are drawn from a Gaussian with its own mean but a shared covariance
Computation: the class priors, means, and covariance are all sums over the training examples
Mappers: compute the partial sums for a subset of the training samples
Reducer: aggregates the intermediate results
K-means
Compute the Euclidean distance between sample vectors and centroids
Recalculate the centroids
Divide the computation into subgroups to be handled by map-reduce
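One iteration of this split can be sketched as follows (my illustration, using 1-D points so the Euclidean distance is just an absolute difference): each mapper assigns its points to the nearest centroid and emits per-centroid partial sums and counts, and the reducer recalculates the centroids.

```python
def kmeans_mapper(points, centroids):
    """Assign each point to its nearest centroid; emit per-centroid (sum, count)."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
        sums[nearest] += p
        counts[nearest] += 1
    return sums, counts

def kmeans_reducer(partials, k):
    """Combine the partial sums/counts and recalculate the centroids."""
    sums = [0.0] * k
    counts = [0] * k
    for s, c in partials:
        for j in range(k):
            sums[j] += s[j]
            counts[j] += c[j]
    return [sums[j] / counts[j] for j in range(k)]

points = [0.0, 1.0, 9.0, 10.0]
centroids = [0.0, 10.0]
# Two subgroups of points, one per mapper.
partials = [kmeans_mapper(points[:2], centroids), kmeans_mapper(points[2:], centroids)]
print(kmeans_reducer(partials, 2))  # [0.5, 9.5]
```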
Expectation Maximization (EM)
E-step: computes some probabilities or counts per training example
M-step: combines these values to update the parameters
Both steps can be parallelized using map-reduce
Neural Network (NN)
Back-propagation on a 3-layer network: input layer, middle (hidden) layer, and 2 output nodes
Goal: compute the weights in the NN by back-propagation
Mapper: propagates its set of training data through the network and back-propagates the errors to calculate partial gradients for the weights
Reducer: sums the partial gradients and does a batch gradient descent step to update the weights
Principal Components Analysis (PCA)
Compute the principal eigenvectors of the covariance matrix
The covariance matrix is built from Σ_i x_i x_i^T and the mean Σ_i x_i, so clearly we can compute it in summation form using map-reduce
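A sketch of that covariance computation (my illustration, for 2-D data): each mapper emits partial Σ_i x_i and Σ_i x_i x_i^T over its subset, and the reducer combines them into cov = (1/m) Σ_i x_i x_i^T − μ μ^T.

```python
def pca_mapper(points):
    """Partial sums over one subset: count, sum of x, sum of outer products x x^T."""
    n = 0
    sx = [0.0, 0.0]
    sxx = [[0.0, 0.0], [0.0, 0.0]]
    for x in points:
        n += 1
        for a in range(2):
            sx[a] += x[a]
            for b in range(2):
                sxx[a][b] += x[a] * x[b]
    return n, sx, sxx

def pca_reducer(partials):
    """Combine the partial sums, then form cov = E[x x^T] - mu mu^T."""
    m = sum(p[0] for p in partials)
    mu = [sum(p[1][a] for p in partials) / m for a in range(2)]
    return [[sum(p[2][a][b] for p in partials) / m - mu[a] * mu[b]
             for b in range(2)] for a in range(2)]

points = [(1.0, 2.0), (3.0, 4.0)]
# One mapper per subset of the data.
partials = [pca_mapper(points[:1]), pca_mapper(points[1:])]
print(pca_reducer(partials))  # [[1.0, 1.0], [1.0, 1.0]]
```

The eigenvector computation on the resulting small covariance matrix is then done sequentially, since its cost does not grow with the number of examples.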
Other Algorithms
Logistic Regression
Independent Component Analysis
Support Vector Machine
Time Complexity
Setup
Compare the map-reduce version against the sequential version
10 data sets
Machines:
Dual-processor Pentium-III 700MHz, 1GB RAM
16-way Sun Enterprise 6000
(these are SMPs, not multicore)
Dual-Processor Speedups
2-16 processor speedups
More data in the paper
Multicore Simulator Results
The paper has a paragraph on this: basically, the results are better than on the multiprocessor machines, possibly because of lower communication cost
Conclusion
Parallelize summation forms
Use map-reduce on a single machine