Scalable Machine Learning
me:
Sam Bessalah
Software Engineer, Freelance
Big Data, Distributed Computing, Machine Learning
Paris Data Geek Co-organizer
@samklr / @DataParis
Machine Learning Land
Vowpal Wabbit
Some Observations in Big Data Land
● New use cases push towards faster execution platforms and real-time prediction engines.
● Traditional MapReduce on Hadoop is fading away, especially for Machine Learning.
● Apache Spark has become the darling of the Big Data world, thanks to its high-level API and its performance.
● Rise of public Machine Learning APIs that make it easy to integrate models into applications and other data processing workflows.
Apache Mahout
● Used to be the only Machine Learning framework on Hadoop MapReduce.
● Has moved away from MapReduce towards modern and faster backends, namely Apache Spark and H2O.
● Now provides a fluent DSL that integrates with Scala and Spark.
Mahout Example
Simple co-occurrence analysis in Mahout:

// Read a distributed row matrix (DRM) from HDFS
val A = drmFromHDFS("hdfs://nivdul/babygirl.txt")

// Compute the co-occurrence matrix
val cooccurrenceMatrix = A.t %*% A

// Broadcast the number of interactions per item
val numInteractions = drmBroadcast(A.colSums)

// Turn raw co-occurrence counts into indicators via a log-likelihood ratio test
val I = cooccurrenceMatrix.mapBlock() { case (keys, block) =>
  val indicatorBlock = sparse(block.nrow, block.ncol)
  for (r <- 0 until block.nrow)
    indicatorBlock(r, ::) = computeLLR(block(r, ::), numInteractions)
  keys -> indicatorBlock
}
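Here A.t %*% A yields the item co-occurrence matrix, and the log-likelihood ratio (LLR) test filters those counts down to statistically significant interactions; Mahout's DSL optimizes and distributes the underlying matrix algebra on the chosen backend.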
Apache Spark
Dataflow system, materialized as immutable, lazy, in-memory distributed collections, suited for iterative and complex transformations like those found in most Machine Learning algorithms.
These in-memory collections are called Resilient Distributed Datasets (RDDs).
They provide:
● Partitioned data
● High level operations (map, filter, collect, reduce, zip, join, sample, etc …)
● No side effects
● Fault recovery via lineage
Some operations on RDDs
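A minimal sketch of a few such operations, assuming a running SparkContext sc and a hypothetical log file:

val lines = sc.textFile("hdfs://.../access.log")   // partitioned, lazily evaluated RDD
val errors = lines.filter(_.contains("ERROR"))     // transformation: nothing computed yet
val counts = errors
  .map(line => (line.split(" ")(0), 1))            // key each error line by its first field
  .reduceByKey(_ + _)                              // aggregate counts per key
counts.take(5)                                     // action: triggers the actual computation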
Spark Ecosystem
MLlib
Machine Learning library within Spark:
● Provides an integrated predictive and data analysis workflow
● Broad collection of algorithms and applications
● Integrates with the whole Spark ecosystem
Three APIs: Scala, Java, and Python.
Algorithms in MLlib
Example: Clustering via K-means

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("hdfs://bbgrl/dataset.txt")
val parsedData = data.map { x =>
  Vectors.dense(x.split(" ").map(_.toDouble))
}.cache()

// Cluster the data into 5 classes using K-means
val clusters = KMeans.train(parsedData, k = 5, maxIterations = 20)

// Evaluate the model error
val cost = clusters.computeCost(parsedData)
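computeCost returns the within-set sum of squared errors (WSSSE), i.e. the sum of squared distances from each point to its nearest cluster center; the lower the cost, the tighter the clusters.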
Coming in Spark 1.2:
● Ensembles of decision trees: Random Forests
● Boosting
● Topic modeling
● Streaming K-means
● A pipeline interface for machine learning workflows (sketched below)
A lot of contributions from the community
Machine Learning Pipeline
Typical machine learning workflows are complex!
Coming in the next iterations of MLlib.
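A minimal sketch of what such a pipeline could look like with the spark.ml API (the DataFrames training and testData and the column names are hypothetical):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Chain feature extraction and model training into a single workflow
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fit the whole pipeline on the training data, then reuse it for scoring
val model = pipeline.fit(training)
val scored = model.transform(testData)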
H2O
● H2O is a fast (really fast) statistics, Machine Learning and math engine on the JVM.
● Developed by 0xdata (a commercial entity), with a focus on bringing robust, high-performance machine learning algorithms to popular Big Data workloads.
● Has APIs in R, Java, Scala and Python, and integrates with third-party tools like Tableau and Excel.
Example in R
library(h2o)
localH2O = h2o.init(ip = 'localhost', port = 54321)
irisPath = system.file("extdata", "iris.csv", package = "h2o")
iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex")
iris.data.frame <- as.data.frame(iris.hex)
> colnames(iris.hex)
[1] "C1" "C2" "C3" "C4" "C5"
Simple logistic regression to predict prostate cancer outcomes:
> prostate.hex = h2o.importFile(localH2O, path = "https://raw.github.com/0xdata/h2o/../prostate.csv", key = "prostate.hex")
> prostate.glm = h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"), data = prostate.hex, family = "binomial", nfolds = 10, alpha = 0.5)
> prostate.fit = h2o.predict(object = prostate.glm, newdata = prostate.hex)
> (prostate.fit)
IP Address: 127.0.0.1
Port      : 54321
Parsed Data Key: GLM2Predict_8b6890653fa743be9eb3ab1668c5a6e9

  predict        X0        X1
1       0 0.7452267 0.2547732
2       1 0.3969807 0.6030193
3       1 0.4120950 0.5879050
4       1 0.3726134 0.6273866
5       1 0.6465137 0.3534863
6       1 0.4331880 0.5668120
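In the output above, predict is the predicted class, while X0 and X1 are the estimated probabilities of each class for that observation.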
Sparkling Water
Transparent use of H2O data and algorithms with the Spark API.
Provides a custom RDD: H2ORDD.
val sqlContext = new SQLContext(sc)
import sqlContext._
airlinesTable.registerTempTable("airlinesTable") // H2O methods
val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
val result = sql(query)
result.count
The same query, but with the Spark RDD API:
// The H2OContext provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._

// Create an RDD wrapper around the DataFrame
val airlinesTable: RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count

// And use the Spark RDD API directly
val flightsOnlyToSF = airlinesTable.filter(f =>
  f.Dest == Some("SFO") || f.Dest == Some("SJC") || f.Dest == Some("OAK"))
flightsOnlyToSF.count
Build a model

import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters

val dlParams = new DeepLearningParameters()
dlParams._training_frame = result('Year, 'Month, 'DayofMonth, 'DayOfWeek,
  'CRSDepTime, 'CRSArrTime, 'UniqueCarrier, 'FlightNum, 'TailNum,
  'CRSElapsedTime, 'Origin, 'Dest, 'Distance, 'IsDepDelayed)
dlParams.response_column = 'IsDepDelayed.name

// Create a new model builder and train
val dl = new DeepLearning(dlParams)
val dlModel = dl.train.get
Predict
// Use the model to score the data
val prediction = dlModel.score(result)('predict)

// Collect the predicted values via the RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map(_.result.getOrElse("NaN"))
Slides: http://speakerdeck.com/samklr/