8/10/2019 Octopus MLDM
[email protected]@nju.edu.cn
2014.10.23
Research Motivations
Challenges that Big Data MLDM brings
computation upon large-scale dataset in acceptable time
Serial machine learning algorithms do not fit and work on any of the existing parallel computing platforms
on different parallel computing platforms
Challenges that Big Data MLDM brings
Existing machine learning and data mining algorithms
need to be rewritten in parallel for big data
need to be rewritten for different big data processing platforms
Frequent Itemset Mining Algorithm
An often used algorithm for data mining
The Apriori algorithm is the most established algorithm for finding frequent itemsets from a transactional dataset
Tao Xiao, Shuai Wang, Chunfeng Yuan, Yihua Huang. PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets. The Fourth International Symposium on Parallel Architectures, Algorithms and Programming, PAAP 2011, pp. 252-257, 2011
Hongjian Qiu, Rong Gu, Chunfeng Yuan and Yihua Huang. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, in conjunction with IPDPS 2014, May 23, 2014. Phoenix, USA
Frequent Itemset Mining Algorithm: Apriori algorithm
Needs multiple passes over the database
In the first pass, all frequent 1-itemsets are discovered
In each subsequent pass, frequent (k+1)-itemsets are discovered, with the frequent k-itemsets found in the previous pass used as the seed to generate the new candidates (referred to as candidate itemsets)
Repeat until no more frequent itemsets can be found
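The passes described above can be sketched in a few lines of serial Python. This is a minimal illustration under stated assumptions, not the code from the slides: the function name `apriori`, the absolute `min_support` count, and the in-memory list of transactions are all choices made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the database: count the transactions containing each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # Pass 1: frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count(items)
    result = dict(frequent)
    k = 1
    while frequent:
        # Join frequent k-itemsets into (k+1)-candidates, then prune every
        # candidate that has an infrequent k-subset.
        prev = set(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result
```

Each call to `count` corresponds to one full pass over the database, which is the cost the parallel versions later in the deck try to reduce.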
Frequent Itemset Mining Algorithm
Apriori Algorithm [1]:
[1] Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499
Frequent Itemset Mining Algorithm
Apriori in MapReduce:
Frequent Itemset Mining Algorithm
Tao Xiao, Shuai Wang, Chunfeng Yuan, Yihua Huang. PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets. The Fourth International Symposium on Parallel Architectures, Algorithms and Programming, PAAP 2011, pp. 252-257, 2011
Frequent Itemset Mining Algorithm: MapReduce
The parallel Apriori algorithm with MapReduce needs to run the MapReduce job iteratively
It needs to scan the dataset iteratively and store all the intermediate data in HDFS
As a result, the parallel Apriori algorithm with MapReduce is not efficient enough
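To make the cost of the iterative job concrete, here is a hedged sketch of a single counting pass written in map/reduce style. It runs in one Python process; the names `map_partition`, `reduce_counts` and `one_pass` are hypothetical and only mimic the roles of a Mapper and a Reducer, they are not Hadoop APIs.

```python
from collections import Counter
from itertools import chain

def map_partition(partition, candidates):
    """Mapper: emit (candidate, 1) for every candidate contained in a transaction."""
    for transaction in partition:
        items = set(transaction)
        for cand in candidates:
            if cand <= items:
                yield cand, 1

def reduce_counts(pairs, min_support):
    """Reducer: sum the counts per candidate and keep those meeting min_support."""
    totals = Counter()
    for cand, one in pairs:
        totals[cand] += one
    return {cand: n for cand, n in totals.items() if n >= min_support}

def one_pass(partitions, candidates, min_support):
    # Every pass re-reads all partitions. On Hadoop each pass would be a separate
    # job whose output goes to HDFS before the next pass can start.
    pairs = chain.from_iterable(map_partition(p, candidates) for p in partitions)
    return reduce_counts(pairs, min_support)
```

Finding all frequent itemsets means calling `one_pass` once per itemset size, so the dataset is scanned as many times as there are passes.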
Frequent Itemset Mining Algorithm: Spark
YAFIM: the Apriori algorithm implemented in the Spark model
Our YAFIM contains two phases to find all frequent itemsets
Phase I: Load transaction datasets as a Spark RDD object and generate 1-frequent itemsets;
Phase II: Iteratively generate (k+1)-frequent itemsets from k-frequent itemsets.
Hongjian Qiu, Rong Gu, Chunfeng Yuan and Yihua Huang. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, in conjunction with IPDPS 2014, May 23, 2014.
Frequent Itemset Mining Algorithm: Spark
Load transaction data into an RDD
All transaction data reside in the RDD
Frequent Itemset Mining Algorithm: Spark
Phase
Frequent Itemset Mining Algorithm: Spark
Phase I
Frequent Itemset Mining Algorithm: Spark
Benchmarks [3] with different characteristics:
MushRoom
T10I4D100K
Chess
Pumsb_star
18/69
requen emse n ng gor mS ark
8/10/2019 Octopus MLDM
19/69
requen emse n ng gor mS ark
8/10/2019 Octopus MLDM
20/69
requen emse n ng gor mS ark
8/10/2019 Octopus MLDM
21/69
K-Means
K-Means Clustering Algorithm
Input: a dataset of N data points that need to be clustered into K clusters
Output: K clusters
Choose K cluster centers Centers[K] as the initial cluster centers
Loop:
  for each data point P from the dataset
    Calculate the distance between P and each of Centers[i]
    Save P to the nearest cluster center
  Recalculate the new Centers[K]
Go to Loop until the cluster centers converge
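The loop above translates almost line by line into serial Python. The sketch below is illustrative only: the function name `kmeans`, the `max_iter` and `tol` parameters, and the choice to leave an empty cluster's center unchanged are assumptions of this example, not part of the slides.

```python
import math

def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Serial K-Means on tuples of floats; returns (centers, assignment)."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        assignment = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
                      for p in points]
        # Recalculate each center as the mean of its assigned points.
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                new_centers.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= tol:  # centers have converged
            break
    return centers, assignment
```

The assignment step is independent per point, which is what makes the algorithm a natural fit for the map phase shown on the following slides.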
K-Means
K-Means Clustering Algorithm
MapReduce K-Means
class Mapper
  setup(...) { read k cluster centers Centers[K]; }
  map(key, p) // p is a data point
  {
    minDis = Double.MAX_VALUE;
    index = -1;
    for i = 0 to Centers.length - 1
    {
      dis = ComputeDist(p, Centers[i]);
      if dis < minDis
      { minDis = dis; index = i; } // remember the nearest center so far
    }
    emit(Centers[index].ClusterID, (p, 1));
  }
K-Means
K-Means Clustering Algorithm
MapReduce K-Means
To optimize the data I/O and network transfer, we can use a Combiner:
class Combiner
  reduce(ClusterID, [(p1,1), (p2,1), ...])
  {
    pm = 0.0;
    n = number of points in the list [(p1,1), (p2,1), ...];
    for i = 1 to n
      pm += pi;
    pm = pm / n; // calculate the average of the points in the cluster
    emit(ClusterID, (pm, n)); // partial center and the number of points behind it
  }
K-Means
MapReduce K-Means
class Reducer
  reduce(ClusterID, valueList = [(pm1,n1), (pm2,n2), ...])
  {
    pm = 0.0; n = 0;
    k = length of valueList belonging to a ClusterID;
    for i = 1 to k
    { pm += pmi * ni; n += ni; } // weight each partial center by its point count
    pm = pm / n; // calculate new center of the cluster
    emit(ClusterID, (pm, n)); // output new center of the cluster
  }
In the main() function of the MapReduce job, set a loop to run the job iteratively until the cluster centers converge
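The Mapper, Combiner and Reducer roles can be simulated in a single Python process to check that the weighted merge gives the right centers. This is a sketch under assumptions: `mapper`, `combiner`, `reducer` and `kmeans_iteration` are hypothetical names, the input `splits` stand in for HDFS blocks, and nothing here uses the Hadoop API.

```python
import math
from collections import defaultdict

def mapper(point, centers):
    """Emit (cluster_id, (point, 1)) for the nearest center."""
    cid = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return cid, (point, 1)

def combiner(values):
    """Merge (vector, count) pairs into one (mean vector, count) pair."""
    n = sum(count for _, count in values)
    dims = len(values[0][0])
    mean = tuple(sum(vec[d] * count for vec, count in values) / n for d in range(dims))
    return mean, n

# The reducer performs the same count-weighted merge on the combiners' partial means.
reducer = combiner

def kmeans_iteration(splits, centers):
    """One job: map and combine per split, shuffle by cluster id, then reduce."""
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for p in split:
            cid, value = mapper(p, centers)
            local[cid].append(value)
        for cid, values in local.items():
            shuffled[cid].append(combiner(values))
    new_centers = list(centers)
    for cid, partials in shuffled.items():
        new_centers[cid] = reducer(partials)[0]
    return new_centers
```

A driver would call `kmeans_iteration` repeatedly, feeding each result back in as `centers`, until the centers stop moving. Only one `(mean, count)` pair per cluster leaves each split, which is the I/O saving the Combiner is meant to provide.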
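K-Means
K-Means Clustering Algorithm
Spark K-Means
while (tempDist > convergeDist && tempIter < MaxIter) {
  // determine the nearest center for each point p
  var closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))
  // calculate the average of all points in a cluster as the new center
  var pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
  var newPoints = pointStats.map { pair => (pair._1, pair._2._1 / pair._2._2) }.collectAsMap()
  tempDist = 0.0
  for (i <- 0 until K) tempDist += kPoints(i).squaredDist(newPoints(i))
  // the loop tail is cut off in the source; the two lines below are assumed from the standard Spark KMeans example
  for ((id, center) <- newPoints) kPoints(id) = center
  tempIter += 1
}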
K-Means
Spark K-Means
Spark achieves a speedup of about 4-5 times compared to MapReduce
[Figure: execution time (s) versus number of nodes, for the 1st iteration and the next iteration]
Peng Liu, Jiayu Teng, Yihua Huang. Study of k-means algorithm parallelization performance based on Spark. CCF Big Data 2014, Beijing, under review
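NaiveBayes Classification Algorithm
Given m classes from the training dataset: $\{C_1, C_2, \ldots, C_m\}$
Predict the class of a sample $X$:
$$c_{MAP} = \arg\max_{C_i} P(C_i \mid X), \quad 1 \le i \le m$$
$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$$
=> Only need to calculate $P(X \mid C_i)\, P(C_i)$
Suppose the $x_k$ are independent of each other =>
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$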
NaiveBayes Classification Algorithm
Training Map Pseudo Code to calculate P(xj|Ci) and P(Ci)
class Mapper
  map(key, tr) // tr is a training sample
  { parse tr into (trid, X, Ci)
    emit(Ci, 1)
    for (j = 0 to X.length)
    { parse X[j] into xnj and xvj // xnj: name of xj, xvj: value of xj
      emit(<Ci, xnj, xvj>, 1)
    }
  }
NaiveBayes Classification Algorithm
Training Reduce Pseudo Code to calculate P(xj|Ci) and P(Ci)
class Reducer
  reduce(key, value_list) // key: either Ci or <Ci, xnj, xvj>
  {
    sum = 0; // count for P(xj|Ci) and P(Ci)
    while (value_list.hasNext())
      sum += value_list.next().get();
    emit(key, sum)
  }
// Trim and save output as P(xj|Ci) and P(Ci) tables in HDFS
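The training stage is essentially a counting job, which the following Python sketch imitates in one process. The names `train_map`, `train_reduce` and `to_tables` are hypothetical, samples are assumed to be `(feature dict, label)` pairs with non-tuple labels, and the final normalization uses plain relative frequencies without smoothing, which is an assumption of this example rather than something the slides state.

```python
from collections import Counter

def train_map(sample):
    """Mapper: emit (Ci, 1) once, and ((Ci, name, value), 1) per attribute."""
    features, label = sample
    yield label, 1
    for name, value in features.items():
        yield (label, name, value), 1

def train_reduce(pairs):
    """Reducer: sum the counts for every key."""
    sums = Counter()
    for key, one in pairs:
        sums[key] += one
    return sums

def to_tables(sums):
    """Turn raw counts into P(Ci) and P(xj|Ci) tables (no smoothing)."""
    # Assumes class labels are not tuples, so tuple keys are attribute counts.
    class_counts = {k: n for k, n in sums.items() if not isinstance(k, tuple)}
    total = sum(class_counts.values())
    prior = {c: n / total for c, n in class_counts.items()}
    cond = {k: n / class_counts[k[0]] for k, n in sums.items() if isinstance(k, tuple)}
    return prior, cond
```

Because both kinds of keys are summed the same way, one reducer implementation serves for the class counts and the attribute counts alike.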
NaiveBayes Classification Algorithm
Predict Map Pseudo Code to Predict Test Sample
class Mapper
  setup()
  { load P(xj|Ci) and P(Ci) data from training stage
    FC = { (Ci, P(Ci)) }, FxC = { (<Ci, xnj, xvj>, P(xj|Ci)) } }
  map(key, ts) // ts is a test sample
  { parse ts into (tsid, X)
    MaxF = MIN_VALUE; idx = -1;
    for (i = 0 to FC.length)
    { FXCi = 1.0; Ci = FC[i].Ci; FCi = FC[i].P(Ci)
      for (j = 0 to X.length)
      { xnj = X[j].xnj; xvj = X[j].xvj
        look up P(xj|Ci) in FxC with key <Ci, xnj, xvj>
        FXCi = FXCi * P(xj|Ci); }
      if (FXCi * FCi > MaxF) { MaxF = FXCi * FCi; idx = i; } }
    emit(tsid, FC[idx].Ci)
  }
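The prediction step can be sketched as a plain Python function that scores every class and keeps the best one. The function name `predict`, the table layout, and the probability values in `PRIOR` and `COND` are all hypothetical and chosen only for illustration; treating an unseen attribute value as probability 0 is likewise an assumption of this sketch.

```python
def predict(sample, prior, cond):
    """Return the class Ci that maximizes P(Ci) * prod_j P(xj|Ci)."""
    best_class, best_score = None, float("-inf")
    for ci, p_ci in prior.items():
        score = p_ci
        for name, value in sample.items():
            # An unseen (class, attribute, value) combination gets probability 0.
            score *= cond.get((ci, name, value), 0.0)
        if score > best_score:
            best_class, best_score = ci, score
    return best_class

# Hypothetical tables, shaped like the output of a training stage.
PRIOR = {"yes": 0.6, "no": 0.4}
COND = {
    ("yes", "outlook", "rain"): 2 / 3,
    ("yes", "outlook", "sunny"): 1 / 3,
    ("no", "outlook", "sunny"): 1.0,
}
```

With many attributes the product of probabilities can underflow, so a practical version would usually sum logarithms instead; the product form is kept here to match the pseudocode.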
NaiveBayes Classification Algorithm
Training SparkR Code to calculate P(xj|Ci) and P(Ci)
NaiveBayes Classification Algorithm
Predict SparkR Code
NaiveBayes Classification Algorithm
Training dataset (thousand)
250     35 s    13 s    2.69
500     40 s    14 s    2.85
1000    49 s    16 s    3.06
2000    66 s    18 s    3.67
Zhiqiang Liu, Rong Gu, Yihua Huang. The Parallelization of Classification Algorithms Based on SparkR. CCF Big Data 2014, Beijing, Accepted
More Parallel Algorithms We Do
Large Scale Deep Learning on Intel Xeon Phi Manycore Coprocessor with OpenMP
                                                   60 cores    30 cores
BaseLine                                           16024s      15960s
OpenMP                                             892s        2122s
OpenMP+MKL                                         97s         120s
Improved OpenMP+MKL                                53s         81s
Speedup (fully optimized compared with baseline)   302         197
Lei Jin, Rong Gu, Chunfeng Yuan and Yihua Huang. Large Scale Deep Learning On Xeon Phi Many-core Coprocessor. The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, in conjunction with IPDPS 2014, May 23, 2014. Phoenix, USA
More Parallel Algorithms We Do
Large Scale Learning to Rank based on Gradient Boosting Decision Tree with MPI
Research Grant from Baidu
More Parallel Algorithms We Do
Large Scale Learning to Rank based on Gradient Boosting Decision Tree with MPI
The parallel algorithm implemented with MPI achieves a 1.5x speedup compared with the existing algorithm from Baidu
More Parallel Algorithms We Do
Customized Light-weighted Parallel Computing Platform for Large Scale Neural Network Training
Rong Gu, Furao Shen, and Yihua Huang. A Parallel Computing Platform for Training Large Scale Neural Networks. Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2013), pp. 376-384, Santa Clara, CA, USA, Oct. 6-9, 2013
Summary
MLDM
MLDM
Part 2
Unified Programming Model and Platform for Machine Learning and Data Mining
Research Motivations and Goals
Two fundamental goals of developing computing technology
Fast: continuously improve the performance
Easy to use: continuously improve the usability
Research Motivations and Goals
2007: Hadoop
2013: Spark
Research Motivations and Goals
OpenMP
MPI
Research Motivations and Goals
vs.
SQL
Hive, Impala, Spark, Transwarp Inceptor
Spark MLlib
Octopus
Research Motivations and Goals
What we do for this?
We provide a unified programming model and platform to bridge the gap between data analysts and parallel computing
[Diagram: unified and easy-to-use programming on top of MPI, Spark, and MapReduce]
Research Motivations and Goals
Problem for professional parallel programmers:
A number of parallel computing platforms multiplied by hundreds of machine learning algorithms generates a lot of duplicated work and the burden of rewriting all algorithms across different platforms
[Diagram: hundreds of MLDM algorithms on MPI, Spark, and MapReduce; lots of duplicated work to rewrite all MLDM algorithms]
What we do for this?
We provide a unified programming model and platform for parallel programmers to write their MLDM algorithms once but run anywhere!
Recent Research Status
RHadoop (Revolution Analytics): R, Hadoop, Java
RHadoop, SparkR, pbdR
SparkR: R API, Spark RDD API
MapReduce, Spark RDD, MPI
Spark MLlib: MLDM
R, Hadoop/Spark
pbdR: R, MPI
R, HPC/MPI
R, MLDM
MLDM
Basic Ideas
MLDM algorithms can be represented as matrix computations
Adopt matrix as the unified abstraction to represent a variety of machine
learning and data mining (MLDM) algorithms
Provide a high-level matrix model-based MLDM parallel programming and computing model
Provide a matrix model-based, easy-to-use, and unified programming language and software framework to support the model
MLDM
Basic Ideas
Implement plug-ins for each of the underlying parallel computing platforms, mapping the high-level MLDM programs along with matrix computation to the underlying platforms
Implement and provide optimized large-scale matrix computation and [...] of underlying platforms to programmers, and write once, run anywhere
Design and provide a parallel MLDM algorithm library
MLDM
Architectural Overview: Octopus Project
We have initiated a research project, Octopus, to develop a cross-platform and unified MLDM programming model and platform
MLDM
Architectural Overview
Spark
Spark
Distributed Matrix Computation Lib with Spark
Marlin: an Octopus sub-project, a Spark distributed matrix library
https://code.csdn.net/u014252240/sparkmatrixlib
Currently neither R nor Spark provides any ability to operate on large-scale matrices
A distributed matrix is a critical and fundamental component of the matrix model on top of Spark
Spark
Distributed Matrix Computation Lib with Spark
Spark MLlib Overview
Spark
BLAS/LAPACK
Spark
Distributed Matrix Computation Lib with Spark
Spark MLlib
Marlin
API
Spark
Distributed Matrix Computation Lib with Spark
Automated Large Scale Matrix Partition and Parallel Execution Manager
Schedule and Dispatch
Spark Cluster
Server Nodes
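The idea of partitioning a large matrix into sub-matrices and executing the pieces in parallel can be illustrated with a block-wise multiplication. The sketch below is not Marlin's implementation: all function names are hypothetical, matrices are plain Python lists of rows, and a local thread pool stands in for the cluster's server nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def split_blocks(m, size):
    """Partition matrix m (list of rows) into a grid of size x size sub-matrices."""
    return [[[row[c:c + size] for row in m[r:r + size]]
             for c in range(0, len(m[0]), size)]
            for r in range(0, len(m), size)]

def matmul(a, b):
    """Plain local multiplication of two small sub-matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped sub-matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block_task(a_blocks, b_blocks, i, j):
    """Compute output block (i, j) = sum over k of A[i][k] * B[k][j]."""
    acc = matmul(a_blocks[i][0], b_blocks[0][j])
    for k in range(1, len(b_blocks)):
        acc = add(acc, matmul(a_blocks[i][k], b_blocks[k][j]))
    return acc

def distributed_matmul(a, b, size=2, workers=4):
    a_blocks, b_blocks = split_blocks(a, size), split_blocks(b, size)
    rows, cols = len(a_blocks), len(b_blocks[0])
    # Schedule one task per output block; the pool plays the role of the workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {(i, j): pool.submit(block_task, a_blocks, b_blocks, i, j)
                   for i in range(rows) for j in range(cols)}
        grid = [[futures[i, j].result() for j in range(cols)] for i in range(rows)]
    # Stitch the block grid back into one matrix.
    return [sum((blk[r] for blk in block_row), []) for block_row in grid
            for r in range(len(block_row[0]))]
```

Each output block depends only on one block row of A and one block column of B, so the tasks are independent and can be dispatched to different nodes.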
Distributed Matrix Computation Lib with Spark
Spark
Octopus, Hadoop, MPI
Distributed Matrix Computation Lib with Spark
Spark-Matrix Lib Performance
Marlin
Spark
Distributed Matrix Computation Lib with Spark
Spark-Matrix
Integrate Spark with Unified Platform
Allow the Spark-Matrix lib to be called from the R language
Loading and managing matrix data, and partitioning and scheduling sub-matrices for distributed execution;
or calling the R-Matrix lib for small matrices that can be executed on a single machine.
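The choice between the distributed library and the single-machine library can be pictured as a small routing function. This is only a sketch of the idea: `make_multiply`, `local_matmul`, the `distributed_matmul` callable and the `local_limit` threshold are hypothetical names and values, not part of Octopus.

```python
def local_matmul(a, b):
    """Stand-in for a single-machine matrix library."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def make_multiply(distributed_matmul, local_limit=1_000_000):
    """Build a multiply() that routes by operand size.

    distributed_matmul: hypothetical callable backed by a cluster library.
    local_limit: hypothetical threshold on the number of matrix cells.
    """
    def multiply(a, b):
        cells = max(len(a) * len(a[0]), len(b) * len(b[0]))
        backend = local_matmul if cells <= local_limit else distributed_matmul
        return backend(a, b)
    return multiply
```

The caller writes `multiply(a, b)` in both cases, so the same program text runs whether the data fits on one machine or has to be spread over a cluster.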
Integrate Spark with Unified Platform
Octopus-R User Interface
User interface from RStudio to work with Octopus
Panel to write MLDM or any other algorithm code with matrix
Command and result window
Integrate Spark with Unified Platform
Octopus-R Demo Algorithm
Logistic Regression algorithm coded with Matrix
Underneath, this program will be executed on top of our Octopus engine
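To show what "coded with matrix" can look like, here is a small logistic regression written only in terms of matrix products, transposes and an element-wise sigmoid. It is a pure-Python sketch, not the Octopus-R demo code (which is not reproduced in the text): the function names, the learning `rate` and the fixed number of `iterations` are assumptions.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def sigmoid(a):
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in a]

def logistic_regression(x, y, rate=0.5, iterations=2000):
    """Batch gradient descent: w <- w - rate/n * X^T (sigmoid(X w) - y)."""
    n = len(x)
    w = [[0.0] for _ in x[0]]          # one weight per column of x
    xt = transpose(x)
    for _ in range(iterations):
        pred = sigmoid(matmul(x, w))   # n x 1 predicted probabilities
        err = [[p[0] - t[0]] for p, t in zip(pred, y)]
        grad = matmul(xt, err)         # d x 1 gradient
        w = [[wi[0] - rate * g[0] / n] for wi, g in zip(w, grad)]
    return w
```

Since every step is a matrix operation, swapping `matmul` for a distributed implementation would move the whole algorithm to a cluster without changing the training loop.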
Project Research Progress
Octopus
SparkR, MLDM
Hadoop MapReduce
RHadoop
MPI
Project Research Progress
R, MLDM
Spark, Hadoop, MPI
PASA
What we do at our NJU PASA Big Data Lab
PASA: our lab studies Parallel Algorithms, Systems, and Applications for Big Data
Now we are contributors to Tachyon
What we do at our NJU PASA Big Data Lab
Hadoop, Intel
Tachyon, UC Berkeley AMP; HBase
HBase, RDF, Intel; Intel MIC, Intel
GBDT
Web, Intel
Contact Information
Dr. Yihua Huang, Professor
NJU PASA Big Data Lab
http://pasa-bigdata.nju.edu.cn
Department of Computer Science and Technology
Nanjing University, Nanjing, P.R. China
[email protected]@nju.edu.cn
Tel: 189 5167 9127