Introduction to Spark ML Pipelines Workshop

Introduction to Spark ML: Machine learning at scale

ApacheCon 2016

Hella-Legit

Who am I?

Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Software Engineer at IBM’s Spark Technology Center
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau


What we are going to explore together!

● Who I think you all are
● Spark’s two different ML APIs
● Running through a simple example with one
● A brief detour into some codegen funtimes
● Exercises!
● Model save/load
● Discussion of “serving” options

The different pieces of Spark

Apache Spark (core)
● SQL & DataFrames
● Streaming
● Language APIs: Scala, Java, Python, & R
● Graph Tools: Bagel & GraphX
● Spark ML
● MLlib
● Community Packages

Who do I think you all are?

● Nice people*
● Some knowledge of Apache Spark core & maybe SQL
● Interested in using Spark for Machine Learning
● Familiar-ish with Scala or Java or Python


Skipping intro & set-up time :)

But maybe time to upgrade...

● Spark 1.5+ (Spark 1.6 would be best!)
○ (built with Hive support if building from source)


Some pages to keep open:

● http://bit.ly/sparkDocs
● http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc
● http://bit.ly/sparkMLGuide
● https://github.com/holdenk/spark-intro-ml-pipeline-workshop
● http://www.slideshare.net/hkarau
● Download census data: https://archive.ics.uci.edu/ml/datasets/Adult

Getting some data for working with:

● Census data: https://archive.ics.uci.edu/ml/datasets/Adult
● Goal: predict income > 50k
● Also included in the GitHub repo
● Download that now if you haven’t already
● We will add a header to the data
○ http://pastebin.ca/3318687


So what are the two APIs?

● Traditional and Pipeline
○ Pipeline is the new shiny future which will fix all problems*
● Traditional API works on RDDs
○ Data preparation work is generally done in traditional Spark transformations
● Pipeline API works on DataFrames
○ Often we want to apply some transformations to our data before feeding it to the machine learning algorithm
○ Makes it easy to chain these together
(*until replaced by a newer shinier future)


So what are DataFrames?

● Spark SQL’s version of RDDs (it’s for more than just SQL)
● Restricted data types, schema information, untyped at compile time
● Restricted operations (more relational style)
● Allow lots of fun extra optimizations
○ Tungsten etc.
● We’ll talk more about them (& Datasets) when we do the Spark SQL component of this workshop

Transformers, Estimators and Pipelines

● Transformers transform a DataFrame into another
● Estimators can be trained on a DataFrame to produce a transformer
● Pipelines chain together multiple transformers and estimators
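To make the three roles concrete, here is a minimal plain-Python sketch. It is not Spark code: rows are plain dicts standing in for a DataFrame, and every class name (AddLength, MeanCenterer, SimplePipeline, ...) is hypothetical and chosen only for illustration.

```python
# Plain-Python sketch of transformers, estimators and pipelines.
# Rows are dicts standing in for a DataFrame; none of these are Spark classes.

class AddLength:
    """Transformer: needs no training, it just maps rows to new rows."""
    def transform(self, rows):
        return [dict(r, length=len(r["text"])) for r in rows]


class MeanCenterer:
    """Estimator: fit() learns from the data and returns a transformer."""
    def fit(self, rows):
        mean = sum(r["length"] for r in rows) / len(rows)
        return MeanCentererModel(mean)


class MeanCentererModel:
    """The transformer produced by fitting MeanCenterer."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [dict(r, centered=r["length"] - self.mean) for r in rows]


class SimplePipeline:
    """Chains stages; fitting it fits each estimator in order."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: train it first
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the output to the next stage
            fitted.append(stage)
        return SimplePipelineModel(fitted)


class SimplePipelineModel:
    """A fitted pipeline: only transformers are left inside."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"text": "a"}, {"text": "abc"}]
model = SimplePipeline([AddLength(), MeanCenterer()]).fit(train)
result = model.transform([{"text": "abcd"}])
print(result)
```

The point of the split is that the fitted pipeline can be applied to rows it never saw during training, using the mean learned from the training rows.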

Let’s start with loading some data

● We’ve got some CSV data; we could use textFile and parse it by hand
● spark-packages can save us that work by providing the spark-csv package by Hossein Falaki
○ If we were building a Java project we could include the Maven coordinates
○ For the Spark shell, add this when launching:
--packages com.databricks:spark-csv_2.10:1.3.0
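For comparison, this is roughly what "parse by hand" means. It is a plain-Python sketch, not spark-csv: the two sample rows and the five column names are illustrative assumptions in the style of the census file (truncated to five columns), and the infer helper is a hypothetical, very rough stand-in for schema inference.

```python
import csv
import io

# Illustrative header and rows (assumed, truncated to five columns).
HEADER = ["age", "workclass", "fnlwgt", "education", "education-num"]
SAMPLE = (
    "39, State-gov, 77516, Bachelors, 13\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13\n"
)


def infer(value):
    """Very rough type inference: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse(text, header):
    """Turn CSV text into a list of dicts keyed by the header names."""
    # skipinitialspace drops the blank that follows each comma in the file
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(header, (infer(v) for v in row))) for row in reader if row]


rows = parse(SAMPLE, HEADER)
print(rows[0])
```

Doing this by hand means owning the header handling, the type guessing and every edge case, which is the work the package takes off our plate.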


http://spark-packages.org/

https://github.com/databricks/spark-csv

https://github.com/falaki

Loading with sparkSQL & spark-csv

sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option(“key”, “value”)
○ spark-csv ones we will use are header & inferSchema
● format(“formatName”)
○ built in formats include parquet, jdbc, etc. Today we will use com.databricks.spark.csv
● load(“path”)


http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader

Loading with sparkSQL & spark-csv
# Parentheses let the chained calls span several lines in Python
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("resources/adult.data"))


Let’s explore training a Decision Tree

● Step 1: Data loading (done!)
● Step 2: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict

Data prep / cleaning

● We need to predict a double (can be 0.0, 1.0, but type must be double)
● We need to train with a vector of features
Imports:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import (Bucketizer, VectorAssembler,
                                StringIndexer)
from pyspark.ml import Pipeline


Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
                            outputCol="features")
# String indexer converts a set of strings into doubles
indexer = (StringIndexer(inputCol="category")
           .setOutputCol("category-index"))
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
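What the two components do to the data can be sketched in plain Python. This is not the Spark implementation: the function names are hypothetical, and the label strings are illustrative. The one Spark behaviour mirrored here is that the most frequent string gets index 0.0.

```python
from collections import Counter


def assemble(row, input_cols):
    """Sketch of the assembler: pick columns and pack them as doubles."""
    return [float(row[c]) for c in input_cols]


def fit_string_indexer(values):
    """Sketch of fitting the indexer: order labels by frequency, most common first."""
    return [label for label, _ in Counter(values).most_common()]


def index_values(labels, values):
    """Sketch of transforming: replace each string with its position as a double."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    return [lookup[v] for v in values]


features = assemble({"age": 39, "education-num": 13}, ["age", "education-num"])

categories = ["<=50K", "<=50K", ">50K", "<=50K"]   # illustrative label strings
labels = fit_string_indexer(categories)
indexed = index_values(labels, categories)
print(features, labels, indexed)
```

The assembler needs nothing from the data beyond the row it is given, while the indexer has to see every value before it can decide which string gets which index.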


So a bit more about that pipeline

● Each of our previous components has a “transform” stage, and some also need a “fit” stage first
● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
model = pipeline.fit(df)
prepared = model.transform(df)


What does our pipeline look like so far?

Input Data -> Assembler -> Input Data + Vectors -> StringIndexer -> Input Data + Cat ID + Vectors
● Assembler: this is a regular transformer, no fitting required.
● StringIndexer: while not an ML algorithm, this still needs to be fit.

Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol="category-index",
                            featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)

And predict the results on the same data:
pipeline_model.transform(df).select("prediction",
                                    "category-index").take(20)

Exercise 1: Go from the index to something useful

● We could manually look up the labels and then write a select statement
● Or we could look at the labels on the StringIndexerModel and use IndexToString
● Our pipeline has an array of stages we can use for this

Solution:
from pyspark.ml.feature import IndexToString
# stages[1] is the fitted StringIndexerModel; labels is a property in PySpark
labels = list(pipeline_model.stages[1].labels)
inverter = IndexToString(inputCol="prediction",
                         outputCol="prediction-label", labels=labels)
inverter.transform(pipeline_model.transform(df)).select(
    "prediction-label", "category").take(20)
# Pre Spark 1.6 use SQL if/else or similar
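The inversion itself is just a list lookup. A plain-Python sketch of the idea, with a hypothetical function name and illustrative label strings (not the Spark implementation):

```python
def index_to_string(labels, predictions):
    """Map each predicted index (a double) back to the label at that position."""
    return [labels[int(p)] for p in predictions]


labels = ["<=50K", ">50K"]          # illustrative; in Spark these come from the fitted indexer
predictions = [0.0, 1.0, 0.0]
readable = index_to_string(labels, predictions)
print(readable)
```

This only works if the labels list is the one the indexer was fitted with, which is why the solution pulls it out of the fitted pipeline rather than typing it in by hand.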

So what could we do for other types of data?

● org.apache.spark.ml.feature has a lot of options
○ HashingTF
○ Tokenizer
○ Word2Vec
○ etc.

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.package


Exercise 2: Add more features to your tree

● Finished quickly? Help others!
● Or tell me if adding these features helped or not…
○ We can download a reserved “test” dataset, but how would we know if we couldn’t do that?
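One common answer is to hold back part of the data ourselves. A plain-Python sketch of a holdout split and an accuracy check follows; the function names, the 80/20 split and the seed are assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut off a test portion."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, l in zip(predictions, labels) if p == l)
    return correct / len(labels)


rows = list(range(10))
train, test = train_test_split(rows)
score = accuracy([1.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0])
print(len(train), len(test), score)
```

On a DataFrame the same idea is available as randomSplit, which takes a list of weights and an optional seed; fit on one part and compare predictions against labels on the other.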


And not just for getting data into doubles...

● Maybe a customer’s cat food preference only matters if the owns_cats boolean is true
● Maybe the scale is _way_ off
● Maybe we’ve got stop words
● Maybe we know one component has a non-linear relation
● etc.
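A few of these fixes, sketched in plain Python. The function names are hypothetical and this is not how Spark's feature transformers are implemented; in particular, the population standard deviation used here is an assumption of the sketch.

```python
import math


def standardize(values):
    """Rescale to mean 0 and unit (population) standard deviation.
    Assumes the values are not all identical, otherwise std would be 0."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


def remove_stop_words(tokens, stop_words):
    """Drop tokens that carry little signal."""
    return [t for t in tokens if t.lower() not in stop_words]


def gate(value, flag):
    """Only keep a feature when its controlling boolean is true."""
    return value if flag else 0.0


scaled = standardize([2.0, 4.0, 6.0])
tokens = remove_stop_words(["The", "cat", "sat"], {"the", "a"})
cat_food_pref = gate(3.0, False)   # owns_cats is false, so the preference is zeroed
print(scaled, tokens, cat_food_pref)
```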

Cross-validation: because saving a test set is effort

● Automagically* fit your model params
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
○ (not in Python yet so skipping for now)
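The idea behind cross-validated parameter fitting can be sketched in plain Python. This is not the org.apache.spark.ml.tuning API: the function names, the toy threshold "model" and the parameter grid are all assumptions made for illustration.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def cross_validate(data, k, fit, score, param_grid):
    """Return (best_params, best_average_score) over the grid."""
    best = None
    for params in param_grid:
        scores = []
        for train_idx, test_idx in k_fold_indices(len(data), k):
            model = fit([data[i] for i in train_idx], params)
            scores.append(score(model, [data[i] for i in test_idx]))
        average = sum(scores) / len(scores)
        if best is None or average > best[1]:
            best = (params, average)
    return best


# Toy problem: the label is true when x >= 5; the "model" is just a threshold.
data = [(x, x >= 5) for x in range(10)]


def fit(train_rows, threshold):
    return threshold   # nothing is learned here; the hyperparameter is the model


def score(threshold, test_rows):
    hits = sum(1 for x, label in test_rows if (x >= threshold) == label)
    return hits / len(test_rows)


best = cross_validate(data, k=5, fit=fit, score=score, param_grid=[2, 5, 8])
print(best)
```

Every row is used for testing exactly once, so no separate test set has to be saved, at the cost of fitting the model k times per parameter setting.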


Pipeline API has many models:

● org.apache.spark.ml.classification
○ LogisticRegression (binary), DecisionTreeClassifier, GBTClassifier, etc.
● org.apache.spark.ml.regression
○ DecisionTreeRegressor, GBTRegressor, IsotonicRegression, LinearRegression, etc.
● org.apache.spark.ml.recommendation
○ ALS

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.classification.package

http://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/regression/package.html

http://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/recommendation/package.html

Exercise 3: Train a new model type

● Your choice!
● If you want to do regression, change what we are predicting

So serving...

● Generally refers to using your model online
○ Generating recommendations...
● In batch mode you can “just” save & use the Spark bits
● Spark’s “native” formats (often parquet w/metadata)
○ Understood by Spark libraries and that’s pretty much it
○ If you are serving in the JVM you can load these, but you need the Spark dependencies (albeit often not a Spark cluster)
● Some models support PMML export
○ https://github.com/jpmml/openscoring etc.
● We can also write our own export & serving by hand :(


https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language


So what models are PMML exportable?

● Right now “old” style models
○ KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and Binary LogisticRegression
○ However if we look in the code we can sometimes find converters to the old style models and use this to export our “new” style model
● Waiting on https://issues.apache.org/jira/browse/SPARK-11171 / https://github.com/apache/spark/pull/9207 for pipeline models
● Not getting in for 2.0 :(




How to PMML export

toPMML
● returns a string or
● takes a path to local fs and saves results or
● takes a SparkContext & a distributed path and saves or
● takes a stream and writes result to stream

Optional* exercise time

● Take a model you trained and save it to PMML
○ You will have to dig around in the Spark code to be able to do this
● Look at the file
● Load it into a serving system and try some predictions
● Note: PMML export currently only includes the model, not any transformations beforehand
● Also: you might need to train a new model
● If you don’t get it don’t worry, hints to follow :)
