Hivemall Meets XGBoost in DataFrame/Spark
2016/9/8 Takeshi Yamamuro (maropu) @ NTT
Copyright©2016 NTT corp. All Rights Reserved.


Page 1: 20160908 hivemall meetup


Hivemall Meets XGBoost in DataFrame/Spark

2016/9/8 Takeshi Yamamuro (maropu) @ NTT

Page 2: 20160908 hivemall meetup


Who  am  I?

Page 3: 20160908 hivemall meetup


XGBoost is...

• Short for eXtreme Gradient Boosting
• https://github.com/dmlc/xgboost

• It is...
• a variant of the gradient boosting machine
• a tree-based model
• an open-source tool (Apache2 license)
• written in C++
• R/Python/Julia/Java/Scala interfaces provided

• Widely used in Kaggle competitions
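To make the "gradient boosting machine" bullet concrete, here is a minimal, illustrative Python sketch of gradient boosting with depth-1 regression stumps on 1-D data. It is not how XGBoost is implemented (XGBoost adds a regularized second-order objective, sparsity handling, and much more), and all function names here are made up for this example.

```python
# Gradient boosting for squared loss: each round fits a weak learner
# (here a depth-1 regression stump) to the current residuals, which are
# the negative gradient of the squared loss, and adds it with rate eta.

def fit_stump(xs, residuals):
    """Find the split threshold minimizing the squared error of a stump."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left) if left else 0.0
        rv = sum(right) / len(right) if right else 0.0
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def gbm_fit(xs, ys, n_rounds=10, eta=0.5):
    """Boost n_rounds stumps; return the additive ensemble as a function."""
    trees = []
    preds = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + eta * tree(x) for p, x in zip(preds, xs)]
    return lambda x: sum(eta * t(x) for t in trees)

model = gbm_fit([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 3.0, 3.0], n_rounds=20)
```

With enough rounds the ensemble fits this toy step function almost exactly, which is the same additive-trees idea XGBoost scales up.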

Page 4: 20160908 hivemall meetup


Hivemall in DataFrame/Spark

• Most of the Hivemall functions are supported in Spark v1.6 and v2.0
• the v2.0 support is not released yet

• XGBoost integration is under development
• distributed/parallel predictions
• native libraries bundled for major platforms
• Mac/Linux on x86_64
• how-to-use: https://gist.github.com/maropu/33794b293ee937e99b8fb0788843fa3f

Page 5: 20160908 hivemall meetup


Spark  Quick  Examples

• Fetch  a  binary  Spark  v2.0.0•  http://spark.apache.org/downloads.html

$ <SPARK_HOME>/bin/spark-shell

scala> :paste
val textFile = sc.textFile("hoge.txt")
val counts = textFile.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

Page 6: 20160908 hivemall meetup


Fetch  training  and  test  data

• E2006 tfidf regression dataset
• http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2

Page 7: 20160908 hivemall meetup


XGBoost in spark-shell

• Scala interface bundled in the Hivemall jar

$ bunzip2 E2006.train.bz2
$ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar

scala> import ml.dmlc.xgboost4j.scala._
scala> :paste
// Read training data
val trainData = new DMatrix("E2006.train")
// Define parameters
val paramMap = List(
  "eta" -> 0.1,
  "max_depth" -> 2,
  "objective" -> "reg:logistic"
).toMap
// Train the model
val model = XGBoost.train(trainData, paramMap, 2)
// Save the model to a file
model.saveModel("xgboost_models_dir/xgb_0001.model")

Page 8: 20160908 hivemall meetup


Load  test  data  in  parallel

$ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar

// Create a DataFrame for the test data
scala> val testDf = sqlContext.sparkSession.read.format("libsvm")
  .load("E2006.test.bz2")

scala> testDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
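As a side note, the libsvm text format read above is simple enough to parse by hand. A minimal, hypothetical Python parser for one line, mirroring the (label, features) schema that `read.format("libsvm")` produces (the function name is made up for this sketch):

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val}).

    Indices in libsvm files are 1-based; values are floats.
    """
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)
    return label, features

label, features = parse_libsvm_line("0.392 1:0.3 5:0.1")
```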

Page 9: 20160908 hivemall meetup


Load  test  data  in  parallel

Sample line from the libsvm test data:

0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…

testDf is split into Partition1 … PartitionN and loaded in parallel because bzip2 is splittable.

• #partitions depends on three parameters
• spark.default.parallelism: #cores by default
• spark.sql.files.maxPartitionBytes: 128MB by default
• spark.sql.files.openCostInBytes: 4MB by default
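As a rough illustration of how those three parameters interact, here is a simplified Python sketch of the split-size computation Spark performs when planning file scans. Treat it as an approximation of Spark's internals rather than an exact contract; the function name is invented for this example.

```python
import math

def max_split_bytes(total_bytes, num_files,
                    default_parallelism=8,
                    max_partition_bytes=128 * 1024 * 1024,
                    open_cost_in_bytes=4 * 1024 * 1024):
    # Each opened file is charged open_cost_in_bytes of scheduling cost,
    # then the total work is spread over the default parallelism; the
    # result is clamped between openCostInBytes and maxPartitionBytes.
    bytes_per_core = (total_bytes + num_files * open_cost_in_bytes) // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# e.g. a single 1 GiB splittable file on an 8-core cluster:
split = max_split_bytes(total_bytes=1 << 30, num_files=1)
num_partitions = math.ceil((1 << 30) / split)
```

On this toy input the split size is capped at the 128MB maxPartitionBytes default, yielding 8 partitions.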

Page 10: 20160908 hivemall meetup


• XGBoost in DataFrame
• Load built models and do cross-joins for predictions

Do  predictions  in  parallel

scala> import org.apache.spark.hive.HivemallOps._
scala> :paste
// Load built models from persistent storage
val modelsDf = sqlContext.sparkSession.read.format(xgboost)
  .load("xgboost_models_dir")
// Do predictions in parallel via cross-joins
val predict = modelsDf.join(testDf)
  .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
  .groupBy("rowid")
  .avg()
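The cross-join-and-average pattern above can be sketched in plain Python. The models, rows, and the scoring function are toy stand-ins (the real per-pair scoring happens inside XGBoost via xgboost_predict); the point is only the shape of the computation: score every (model, row) pair, then group by rowid and average.

```python
from collections import defaultdict

# toy stand-in for xgboost_predict: a sparse dot product
def predict(model, features):
    return sum(model.get(i, 0.0) * v for i, v in features.items())

# two "models" and two test rows keyed by rowid (all toy data)
models = {"xgb_0001": {1: 0.5, 2: 1.0}, "xgb_0002": {1: 1.5, 2: 2.0}}
rows = {1: {1: 2.0}, 2: {2: 1.0}}

# cross join: every model scores every row
scored = [(rowid, predict(m, feats))
          for m in models.values()
          for rowid, feats in rows.items()]

# group by rowid and average, like groupBy("rowid").avg()
grouped = defaultdict(list)
for rowid, p in scored:
    grouped[rowid].append(p)
avg_pred = {rowid: sum(ps) / len(ps) for rowid, ps in grouped.items()}
```

Because every pair is independent, Spark can evaluate the cross join fully in parallel across partitions of the test data.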

Page 11: 20160908 hivemall meetup


• XGBoost in DataFrame
• Load built models and do cross-joins for predictions

• Broadcast cross-joins expected
• Size of `modelsDf` must be less than or equal to spark.sql.autoBroadcastJoinThreshold (10MB by default)

Do  predictions  in  parallel

testDf

rowid | label  | features
1     | 0.392  | 1:0.3 5:0.1…
2     | 0.929  | 3:0.2…
3     | 0.132  | 2:0.9…
4     | 0.3923 | 5:0.4…

modelsDf

model_id       | pred_model
xgb_0001.model | <binary data>
xgb_0002.model | <binary data>

(cross-joins in parallel)

Page 12: 20160908 hivemall meetup


• Structured Streaming in Spark 2.0
• Scalable and fault-tolerant stream processing engine built on the Spark SQL engine
• an alpha component in v2.0

Do  predictions  for  streaming  data

scala> :paste
// Initialize a streaming DataFrame
val testStreamingDf = spark.readStream
  .format("libsvm") // Not supported in v2.0
  …
// Do predictions for streaming data
val predict = modelsDf.join(testStreamingDf)
  .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
  .groupBy("rowid")
  .avg()

Page 13: 20160908 hivemall meetup


• One model for a partition
• WIP: Build models with different parameters

Build  models  in  parallel

scala> :paste
// Set options for XGBoost
val xgbOptions = XGBoostOptions()
  .set("num_round", "10000")
  .set("max_depth", "32,48,64") // Randomly selected by workers
// Set # of models to output
val numModels = 4
// Build models and save them in persistent storage
trainDf.repartition(numModels)
  .train_xgboost_regr($"features", $"label", s"${xgbOptions}")
  .write
  .format(xgboost)
  .save("xgboost_models_dir")
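The "randomly selected by workers" idea can be sketched in Python: each of the numModels partitions independently picks a max_depth from the candidate list "32,48,64" before training its own model. Everything here (the function name, the per-partition seeding) is a hypothetical illustration of the scheme, not Hivemall's actual code.

```python
import random

def pick_params(partition_id, candidates=(32, 48, 64)):
    """Each partition draws its own max_depth from the candidate list."""
    rng = random.Random(partition_id)  # seeded per partition for reproducibility
    return {"num_round": 10000, "max_depth": rng.choice(candidates)}

num_models = 4
params = [pick_params(i) for i in range(num_models)]
```

Training one model per partition with varied parameters yields a cheap ensemble whose predictions are then averaged by the cross-join pattern shown earlier.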

Page 14: 20160908 hivemall meetup


• If you hit an UnsatisfiedLinkError, you need to compile a binary yourself

Compile  a  binary  on  your  platform  

$ mvn validate && mvn package -Pcompile-xgboost -Pspark-2.0 -DskipTests
$ ls target
hivemall-core-0.4.2-rc.2-with-dependencies.jar
hivemall-core-0.4.2-rc.2.jar
hivemall-mixserv-0.4.2-rc.2-fat.jar
hivemall-nlp-0.4.2-rc.2-with-dependencies.jar
hivemall-nlp-0.4.2-rc.2.jar
hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2-with-dependencies.jar
hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2.jar
hivemall-xgboost-0.4.2-rc.2.jar
hivemall-xgboost_0.60-0.4.2-rc.2-with-dependencies.jar
hivemall-xgboost_0.60-0.4.2-rc.2.jar

Page 15: 20160908 hivemall meetup


Future Work

• Rabit integration for parallel learning
• http://dmlc.cs.washington.edu/rabit.html

• Python support
• spark.ml interface support
• Bundle more binaries for portability
• Windows and x86 platforms
• Others?