Yihua Huang: Octopus, a Cross-Platform Unified MLDM Programming Model and Platform

  • 8/10/2019 Octopus MLDM

    [email protected]@nju.edu.cn

    2014.10.23

    Research Motivations

    Challenges that Big Data MLDM brings

    computation upon large-scale dataset in acceptable time

    Serial machine learning algorithms do not fit and work on any of the existing parallel computing platforms

    on different parallel computing platforms

    Challenges that Big Data MLDM brings

    Existing machine learning and data mining algorithms

    need to be rewritten in parallel for big data

    need to be rewritten for different big data processing platforms

    Frequent Itemset Mining Algorithm

    and often used algorithm for data mining

    Apriori algorithm is the most established algorithm for finding frequent itemsets from a transactional dataset

    Tao Xiao, Shuai Wang, Chunfeng Yuan, Yihua Huang. PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets. The Fourth International Symposium on Parallel Architectures, Algorithms and Programming, PAAP 2011, pp. 252-257, 2011

    Hongjian Qiu, Rong Gu, Chunfeng Yuan and Yihua Huang. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, in conjunction with IPDPS 2014, May 23, 2014. Phoenix, USA

    Frequent Itemset Mining Algorithm

    Apriori algorithm

    Needs multiple passes over the database

    In the first pass, all frequent 1-itemsets are discovered

    In each subsequent pass, frequent (k+1)-itemsets are discovered, with the frequent k-itemsets found in the previous pass used as the seed to generate candidate itemsets

    Repeat until no more frequent itemsets can be found
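
The multi-pass procedure above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the code from the cited papers; the function name `apriori`, the `min_support` parameter (an absolute count), and the toy `baskets` data are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items and keep the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 1
    while frequent:
        # Join step: build (k+1)-item candidates from the frequent k-itemsets.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k + 1:
                # Prune step: every k-subset of a candidate must be frequent.
                if all(frozenset(sub) in frequent for sub in combinations(union, k)):
                    candidates.add(union)
        # One more pass over the database to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

# Hypothetical toy data: five transactions over items a, b, c.
baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
freq = apriori(baskets, min_support=3)
```

Each iteration of the `while` loop corresponds to one pass over the database, which is why the MapReduce version needs one job per pass.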

    Frequent Itemset Mining Algorithm

    Apriori Algorithm [1]:

    [1] Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499

    Frequent Itemset Mining Algorithm

    Apriori in MapReduce:

    Frequent Itemset Mining Algorithm

    Tao Xiao, Shuai Wang, Chunfeng Yuan, Yihua Huang. PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets. The Fourth International Symposium on Parallel Architectures, Algorithms and Programming, PAAP 2011, pp. 252-257, 2011

    Frequent Itemset Mining Algorithm (MapReduce)

    Parallel Apriori algorithm with MapReduce needs to run the MapReduce job iteratively

    It needs to scan the dataset iteratively and store all the intermediate data in HDFS

    As a result, the parallel Apriori algorithm with MapReduce is not efficient enough

    Frequent Itemset Mining Algorithm (Spark)

    YAFIM, Apriori algorithm implemented in Spark Model

    Our YAFIM contains two phases to find all frequent itemsets

    Phase I: Load transaction datasets as a Spark RDD object and generate 1-frequent itemsets;

    Phase II: Iteratively generate (k+1)-frequent itemsets from k-frequent itemsets.

    Hongjian Qiu, Rong Gu, Chunfeng Yuan and Yihua Huang. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, in conjunction with IPDPS 2014, May 23, 2014. Phoenix, USA
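
The two phases can be illustrated without a cluster by emulating the map and reduce-by-key steps over in-memory partitions. The sketch below is plain Python, not Spark and not the YAFIM source; `count_by_key`, `yafim_sketch`, and the sample `partitions` are hypothetical names introduced only for illustration. The point it tries to show is that the partitioned dataset stays in memory and is rescanned in every pass, instead of being written out and reloaded between jobs.

```python
from collections import Counter
from itertools import combinations

def count_by_key(partitions, mapper):
    """Emulate flatMap + reduceByKey: map each partition, then merge the counts."""
    total = Counter()
    for part in partitions:              # each partition would be one parallel task
        local = Counter()
        for transaction in part:
            local.update(mapper(transaction))
        total.update(local)              # the reduce side merges partial counts
    return total

def yafim_sketch(partitions, min_support):
    # Phase I: emit every item of every transaction and keep frequent 1-itemsets.
    counts = count_by_key(partitions, lambda t: [frozenset([i]) for i in set(t)])
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent = {s: counts[s] for s in frequent}
    k = 1
    # Phase II: derive (k+1)-item candidates from frequent k-itemsets and recount.
    while frequent:
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        counts = count_by_key(
            partitions, lambda t, cs=candidates: [c for c in cs if c <= set(t)])
        frequent = {s for s, c in counts.items() if c >= min_support}
        all_frequent.update({s: counts[s] for s in frequent})
        k += 1
    return all_frequent

# Hypothetical data: the same five transactions split into two partitions.
partitions = [[["a", "b", "c"], ["a", "b"]], [["a", "c"], ["b", "c"], ["a", "b", "c"]]]
result = yafim_sketch(partitions, min_support=3)
```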

    Frequent Itemset Mining Algorithm (Spark)

    data into an RDD

    All transaction data reside in RDD

    Frequent Itemset Mining Algorithm (Spark)

    Phase

    Frequent Itemset Mining Algorithm (Spark)

    Phase I

    Frequent Itemset Mining Algorithm (Spark)

    Benchmarks [3] with different characteristics:

    Mushroom

    T10I4D100K

    Chess

    Pumsb_star

    K-Means

    K-Means Clustering Algorithm

    Input: A dataset of N data points that need to be clustered into K clusters

    Output: K clusters

    Choose K cluster centers Centers[K] as the initial cluster centers

    Loop:

        for each data point P in the dataset

            Calculate the distance between P and each of Centers[i]

            Assign P to the nearest cluster center

        Recalculate the new Centers[K]

    Go to Loop until the cluster centers converge
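
As a concrete reference for the loop above, here is a minimal serial version in Python. It is only a sketch under simple assumptions: points are tuples of floats, distance is Euclidean, and an empty cluster keeps its previous center. The names `kmeans`, `max_iter`, and `tol` are chosen for this example.

```python
import math

def kmeans(points, centers, max_iter=100, tol=1e-9):
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: each new center is the mean of its cluster.
        new_centers = []
        for c, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(c)  # keep an empty cluster's center unchanged
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:               # converged: centers stopped moving
            break
    return centers

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
final_centers = kmeans(data, centers=[(0.0, 0.0), (10.0, 0.0)])
```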

    K-Means

    K-Means Clustering Algorithm

    MapReduce-based K-Means

    class Mapper

        setup() { read k cluster centers Centers[K]; }

        map(key, p) // p is a data point
        {
            minDis = Double.MAX_VALUE;
            index = -1;
            for i = 0 to Centers.length - 1
            {
                dis = ComputeDist(p, Centers[i]);
                if (dis < minDis)
                { minDis = dis; index = i; } // remember the nearest center
            }
            emit(Centers[index].ClusterID, (p, 1));
        }

    K-Means

    K-Means Clustering Algorithm

    MapReduce-based K-Means

    To optimize the data I/O and network transfer, we can use a Combiner to aggregate the points of each cluster locally on the map side

    class Combiner

        reduce(ClusterID, [(p1,1), (p2,1), ...])
        {
            pm = 0.0;
            n = number of points in the list [(p1,1), (p2,1), ...];
            for i = 0 to n - 1
                pm += p[i];
            pm = pm / n; // calculate the average of the points in the cluster
            emit(ClusterID, (pm, n)); // use it as the new center
        }

    K-Means

    MapReduce-based K-Means

    class Reducer

        reduce(ClusterID, valueList = [(pm1,n1), (pm2,n2), ...])
        {
            pm = 0.0; n = 0;
            k = length of valueList belonging to a ClusterID;
            for i = 0 to k - 1
            { pm += pm[i] * n[i]; n += n[i]; }
            pm = pm / n; // calculate the new center of the cluster
            emit(ClusterID, (pm, n)); // output the new center of the cluster
        }

    In the main() function of the MapReduce job, set a loop to run the MapReduce job until the cluster centers converge
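
To make the Mapper, Combiner, Reducer, and driver loop concrete, the following Python sketch simulates them on in-memory input splits. It is an illustration only and does not use Hadoop; the function names (`kmeans_map`, `merge`, `kmeans_iteration`, `run`) and the sample `splits` are assumptions. Note that the combiner and the reducer share the same weighted-mean logic, which is what makes the local pre-aggregation safe.

```python
import math
from collections import defaultdict

def kmeans_map(point, centers):
    """Mapper: emit (cluster_id, (point, 1)) for the nearest center."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, (point, 1)

def merge(values):
    """Combiner/Reducer body: weighted mean of (mean_point, count) pairs."""
    n = sum(count for _, count in values)
    dim = len(values[0][0])
    mean = tuple(sum(p[d] * count for p, count in values) / n for d in range(dim))
    return mean, n

def kmeans_iteration(splits, centers):
    combined = defaultdict(list)
    for split in splits:                      # one map task per input split
        local = defaultdict(list)
        for point in split:
            cid, value = kmeans_map(point, centers)
            local[cid].append(value)
        for cid, values in local.items():     # combiner runs on the map side
            combined[cid].append(merge(values))
    new_centers = list(centers)
    for cid, values in combined.items():      # reducer merges the partial means
        new_centers[cid] = merge(values)[0]
    return new_centers

def run(splits, centers, max_iter=20, tol=1e-9):
    """Driver loop, playing the role of main() resubmitting the job."""
    for _ in range(max_iter):
        updated = kmeans_iteration(splits, centers)
        if max(math.dist(a, b) for a, b in zip(centers, updated)) <= tol:
            return updated
        centers = updated
    return centers

splits = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0), (0.0, 4.0)]]
centers = run(splits, [(0.0, 0.0), (10.0, 0.0)])
```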

    K-Means

    K-Means Clustering Algorithm

    Spark-based K-Means

    while (tempDist > convergeDist && tempIter < MaxIter)

        // determine the nearest center for each point p
        var closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))

        // calculate the average of all points in a cluster as the new center
        var pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
        var newPoints = pointStats.map { pair => (pair._1, pair._2._1 / pair._2._2) }.collectAsMap()

    = .

    for (i

    K-Means

    Spark-based K-Means

    Spark achieves a speedup of about 4-5 times compared to MapReduce

    [Figure: execution time (s) versus number of nodes, for the 1st iteration and the next iteration]

    Peng Liu, Jiayu Teng, Yihua Huang. Study of k-means algorithm parallelization performance based on Spark. CCF Big Data 2014, Beijing, on review

    NaiveBayes Classification Algorithm

    Given m classes from the training dataset: $\{C_1, C_2, \ldots, C_m\}$

    $c_{map} = \arg\max_{C_i \in C} P(C_i \mid X), \quad 1 \le i \le m$

    $P(C_i \mid X) = \dfrac{P(X \mid C_i)\, P(C_i)}{P(X)}$

    => Only need to calculate $P(X \mid C_i)\, P(C_i)$

    Suppose the $x_k$ are independent of each other =>

    $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$

    NaiveBayes Classification Algorithm

    Training Map Pseudo Code to calculate P(X|Ci) and P(Ci)

    class Mapper

        map(key, tr) // tr is a training sample
        {
            tr → (trid, X, Ci)
            emit(Ci, 1)
            for (j = 0 to X.length - 1)
            {
                X[j] → (xnj, xvj) // xnj: name of xj, xvj: value of xj
                emit(<Ci, xnj, xvj>, 1)
            }
        }

    NaiveBayes Classification Algorithm

    Training Reduce Pseudo Code to calculate P(xj|Ci) and P(Ci)

    class Reducer

        reduce(key, value_list) // key: either Ci or <Ci, xnj, xvj>
        {
            sum = 0; // count for P(xj|Ci) and P(Ci)
            while (value_list.hasNext())
                sum += value_list.next().get();
            emit(key, sum)
        }
        // Trim and save output as P(xj|Ci) and P(Ci) tables in HDFS
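
The training stage boils down to counting classes and (class, attribute name, attribute value) combinations, then normalizing the counts. A minimal single-machine sketch in Python follows; the function name `train_naive_bayes`, the dictionary-based sample format, and the toy `training` data are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter

def train_naive_bayes(samples):
    """samples: list of (features_dict, label). Returns (P(Ci), P(xj|Ci)) tables."""
    class_counts = Counter()     # plays the role of emit(Ci, 1)
    feature_counts = Counter()   # plays the role of emitting a (Ci, name, value) key
    for features, label in samples:
        class_counts[label] += 1
        for name, value in features.items():
            feature_counts[(label, name, value)] += 1
    total = sum(class_counts.values())
    prior = {c: n / total for c, n in class_counts.items()}
    # Normalize each (class, name, value) count by the size of its class.
    likelihood = {key: n / class_counts[key[0]] for key, n in feature_counts.items()}
    return prior, likelihood

# Hypothetical toy training set.
training = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "sunny", "windy": "no"}, "play"),
]
prior, likelihood = train_naive_bayes(training)
```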

    NaiveBayes Classification Algorithm

    Predict Map Pseudo Code to Predict Test Sample

    class Mapper

        setup()
        {
            load P(xj|Ci) and P(Ci) data from training stage
            FC = { (Ci, P(Ci)) }, FxC = { (<Ci, xnj, xvj>, P(xj|Ci)) }
        }

        map(key, ts) // ts is a test sample
        {
            ts → (tsid, X)
            MaxF = MIN_VALUE; idx = -1;
            for (i = 0 to FC.length - 1)
            {
                FXCi = 1.0; Ci = FC[i].Ci; FCi = FC[i].P(Ci)
                for (j = 0 to X.length - 1)
                {
                    xnj = X[j].xnj; xvj = X[j].xvj
                    use <Ci, xnj, xvj> to look up P(xj|Ci) in FxC
                    FXCi = FXCi * P(xj|Ci);
                }
                if (FXCi * FCi > MaxF) { MaxF = FXCi * FCi; idx = i; }
            }
            emit(tsid, FC[idx].Ci)
        }
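
The prediction step can be sketched as a lookup-and-multiply loop over the two probability tables. The Python below is a minimal illustration, not the authors' code: `predict` and the hard-coded `prior` and `likelihood` tables are hypothetical, and an unseen (class, attribute, value) combination is simply given probability 0, which a practical implementation would usually replace with some form of smoothing.

```python
def predict(features, prior, likelihood):
    """Return the class Ci that maximizes P(Ci) * prod_j P(xj|Ci)."""
    best_label, best_score = None, float("-inf")
    for label, p_class in prior.items():
        score = p_class
        for name, value in features.items():
            # Unseen (class, attribute, value) combinations get probability 0 here.
            score *= likelihood.get((label, name, value), 0.0)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical probability tables, as a training stage might produce them.
prior = {"play": 0.5, "stay": 0.5}
likelihood = {
    ("play", "outlook", "sunny"): 1.0, ("play", "windy", "no"): 1.0,
    ("stay", "outlook", "sunny"): 0.5, ("stay", "outlook", "rainy"): 0.5,
    ("stay", "windy", "yes"): 1.0,
}
label_a = predict({"outlook": "sunny", "windy": "no"}, prior, likelihood)
label_b = predict({"outlook": "rainy", "windy": "yes"}, prior, likelihood)
```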

    NaiveBayes Classification Algorithm

    Training SparkR Code to calculate P(xj|Ci) and P(Ci)

    NaiveBayes Classification Algorithm

    Predict SparkR Code

    NaiveBayes Classification Algorithm

    Training Dataset (thousand)

    250     35 s    13 s    2.69
    500     40 s    14 s    2.85
    1000    49 s    16 s    3.06
    2000    66 s    18 s    3.67

    q ang u, Rong Gu, Yihua Huang. The Parallelization of Classification Algorithms Based on SparkR. CCF Big Data 2014, Beijing, Accepted

    More Parallel Algorithms We Do

    Large Scale Deep Learning on Intel Xeon Phi Manycore Coprocessor with OpenMP

                                                      60 cores    30 cores
    BaseLine                                          16024s      15960s
    OpenMP                                            892s        2122s
    OpenMP+MKL                                        97s         120s
    Improved OpenMP+MKL                               53s         81s
    Speedup (fully optimized compared to baseline)    302         197
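
The speedup row follows from dividing the baseline time by the fully optimized time in each column. A quick check in Python, using only the numbers from the table (the variable names are chosen for this example):

```python
baseline = {"60 cores": 16024, "30 cores": 15960}   # seconds, BaseLine row
improved = {"60 cores": 53, "30 cores": 81}         # seconds, Improved OpenMP+MKL row

# Speedup = baseline time / optimized time, rounded to the nearest integer.
speedup = {k: baseline[k] / improved[k] for k in baseline}
rounded = {k: round(v) for k, v in speedup.items()}
```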

    Lei Jin, Rong Gu, Chunfeng Yuan and Yihua Huang. Large Scale Deep Learning On Xeon Phi Many-core Coprocessor. The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, in conjunction with IPDPS 2014, May 23, 2014. Phoenix, USA

    More Parallel Algorithms We Do

    Large Scale Learning to Rank based on Gradient Boosting Decision Tree with MPI

    Research Grant from Baidu

    The parallel algorithm implemented with MPI achieves a 1.5x speedup compared with the existing algorithm from Baidu

    More Parallel Algorithms We Do

    Customized Light-weighted Parallel Computing Platform for Large Scale Neural Network Training

    Rong Gu, Furao Shen, and Yihua Huang. A Parallel Computing Platform for Training Large Scale Neural Networks. Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2013), pp. 376-384, Santa Clara, CA, USA, Oct. 6-9, 2013

    Summary

    MLDM

    MLDM

    Part 2

    Unified Programming Model and Platform for Machine Learning and Data Mining

    Research Motivations and Goals

    Two fundamental goals of developing computing technology

    Fast: continuously improve the performance

    Easy to use: continuously improve the usability

    Research Motivations and Goals

    2007: Hadoop

    2013: Spark

    Research Motivations and Goals

    OpenMP

    MPI

    Research Motivations and Goals

    vs.

    SQL

    Hive, Impala, Spark, Transwarp Inceptor

    Spark MLlib

    Octopus

    Research Motivations and Goals

    What we do for this?

    We provide a unified programming model and platform to bridge the gap between data analysts and parallel computing

    [Diagram: unified and easy-to-use programming on top of MPI, Spark, and MapReduce]

    Research Motivations and Goals

    Problem for professional parallel programmers:

    A number of parallel computing platforms multiplied by hundreds of machine learning algorithms generates a lot of duplicated work and the burden of rewriting all algorithms across different platforms

    [Diagram: hundreds of MLDM algorithms across MPI, Spark, and MapReduce; lots of duplicated work to rewrite all MLDM algorithms]

    What we do for this?

    We provide a unified programming model and platform for parallel programmers to write their MLDM algorithms once but run anywhere!

    Recent Research Status

    RHadoop, Revolution Analytics, RHadoop, R, Java

    RHadoop, SparkR, pbdR

    SparkR, SparkR, R API, R, R, Spark RDD API

    MapReduce, Spark RDD, MPI

    Spark MLlib, MLDM

    R

    Hadoop/Spark

    pbdR, R, MPI

    R, HPC

    /MPI

    R

    R

    MLDM

    MLDM

    Basic Ideas

    MLDM algorithms can be represented as matrix computations

    Adopt matrix as the unified abstraction to represent a variety of machine learning and data mining (MLDM) algorithms

    Provide a high-level matrix model-based MLDM parallel programming and computing model

    Provide a matrix model-based, easy-to-use, and unified programming language and software framework to support the model

    Basic Ideas

    MLDM plug-in

    Implement plug-ins for each of the underlying parallel computing platforms, mapping the high-level MLDM programs along with matrix computation to underlying platforms

    Implement and provide optimized large-scale matrix computation and ... of underlying platforms to programmers and write once, run anywhere

    Design and provide a parallel MLDM algorithm library

    Architectural Overview: Octopus Project

    We have initiated a research project to develop a cross-platform and unified MLDM programming model and platform

    Architectural Overview

    Spark

    Distributed Matrix Computation Lib with Spark

    Marlin: Octopus sub-project, Spark Distributed Matrix Lib

    https://code.csdn.net/u014252240/sparkmatrixlib

    Currently neither R nor Spark provides the ability to operate on large-scale matrices

    Distributed Matrix is a critical and fundamental component

    model on top of Spark

    Distributed Matrix Computation Lib with Spark

    Spark MLlib Overview

    Spark

    BLAS/LAPACK

    Distributed Matrix Computation Lib with Spark

    Spark MLlib

    Marlin

    API

    Distributed Matrix Computation Lib with Spark

    [Diagram: Automated Large Scale Matrix Partition and Parallel Execution Manager; Schedule and Dispatch; Spark Cluster; Server Nodes]

    Distributed Matrix Computation Lib with Spark

    Spark

    Octopus, Hadoop, MPI

    Distributed Matrix Computation Lib with Spark

    Spark-Matrix Lib Performance

    Marlin

    Integrate Spark with Unified Platform

    platform

    allow the Spark-Matrix Lib to be called from the R language

    loading and managing matrix data

    and partitioning and scheduling sub-matrices for distributed execution;

    or calling the R-Matrix lib for small matrices that can be executed on a single machine.
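
The idea of partitioning a matrix into sub-matrices, computing block products independently, and falling back to a local routine for small inputs can be sketched in plain Python. This is an illustration only and does not reflect the Marlin or Octopus APIs; `matmul`, `split_blocks`, `blocked_matmul`, `multiply`, and the `local_limit` and `block_size` thresholds are hypothetical names and values.

```python
def matmul(a, b):
    """Plain single-machine product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_blocks(m, size):
    """Cut m into a grid of size x size sub-matrices (edge blocks may be smaller)."""
    return [[[row[c:c + size] for row in m[r:r + size]]
             for c in range(0, len(m[0]), size)]
            for r in range(0, len(m), size)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def blocked_matmul(a, b, size):
    """Multiply block by block; each block product is an independent task."""
    ab, bb = split_blocks(a, size), split_blocks(b, size)
    out_rows = []
    for i in range(len(ab)):
        block_row = []
        for j in range(len(bb[0])):
            acc = matmul(ab[i][0], bb[0][j])
            for k in range(1, len(bb)):
                acc = add(acc, matmul(ab[i][k], bb[k][j]))
            block_row.append(acc)
        # Stitch the blocks of this block-row back into full rows.
        for r in range(len(block_row[0])):
            out_rows.append([v for blk in block_row for v in blk[r]])
    return out_rows

def multiply(a, b, local_limit=4, block_size=2):
    """Hypothetical dispatcher: small results are computed by the local routine."""
    if len(a) * len(b[0]) <= local_limit:
        return matmul(a, b)
    return blocked_matmul(a, b, block_size)

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
product = multiply(a, b)
```

In a distributed setting, each `matmul` call on a pair of blocks is the unit of work that a scheduler could dispatch to a different node.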

    Integrate Spark with Unified Platform

    Octopus-R User Interface

    User Interface from R Studio to work with Octopus

    Panel to write MLDM or any other algorithm code with matrix

    Command and result window

    Integrate Spark with Unified Platform

    Octopus-R Demo Algorithm

    Logistic Regression algorithm coded with Matrix

    Under the hood, this program is executed on top of our Octopus engine
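
To show what "coded with matrix" can look like, here is a minimal logistic regression written purely in terms of matrix-vector products. It is a plain-Python sketch, not the Octopus-R demo code (which is not reproduced in the slides); the helper names, the learning rate `lr`, the iteration count, and the toy data are assumptions.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(x, y, lr=0.5, iters=2000):
    """Batch gradient descent: w <- w - lr * X^T (sigmoid(X w) - y) / n."""
    n, d = len(x), len(x[0])
    w = [0.0] * d
    xt = transpose(x)
    for _ in range(iters):
        preds = [sigmoid(z) for z in matvec(x, w)]     # sigmoid(X w)
        errors = [p - t for p, t in zip(preds, y)]     # sigmoid(X w) - y
        grad = matvec(xt, errors)                      # X^T (sigmoid(X w) - y)
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w

# The first column is a constant 1 so that w[0] acts as the intercept.
x = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
w = train_logistic(x, y)
labels = [1 if sigmoid(z) >= 0.5 else 0 for z in matvec(x, w)]
```

Every step of the training loop is a matrix-vector product or an element-wise operation, so the same program could in principle run on any backend that supplies those matrix primitives.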

    Project Research Progress

    Octopus

    SparkR, SparkR, MLDM

    Hadoop MapReduce

    RHadoop

    MPI

    Project Research Progress

    R, MLDM

    R

    MLDM

    Spark, Hadoop, MPI

    PASA

    What we do at our NJU PASA Big Data Lab

    Our lab studies Parallel Algorithms, Systems, and Applications for Big Data

    Now we are a contributor to Tachyon

    What we do at our NJU PASA Big Data Lab

    Hadoop, Intel

    Tachyon, UC Berkeley AMP, HBase

    HBase, RDF, Intel, Intel MIC, Intel

    GBDT

    Web, Web, Intel

    Contact Information

    Dr. Yihua Huang, Professor

    NJU PASA Big Data Lab

    http://pasa-bigdata.nju.edu.cn

    Department of Computer Science and Technology

    Nanjing University, Nanjing, P.R. China

    [email protected]@nju.edu.cn

    Tel: 18951679127