Upload
holden-karau
View
633
Download
1
Embed Size (px)
Citation preview
Introduction to Spark MLMachine learning at scale
ApacheCon 2016
Hella-Legit
Who am I?
Holden● I prefer she/her for pronouns● Co-author of the Learning Spark book● Software Engineer at IBM’s Spark Technology Center● @holdenkarau● http://www.slideshare.net/hkarau● https://www.linkedin.com/in/holdenkarau
What we are going to explore together!
● Who I think you all are● Spark’s two different ML APIs● Running through a simple example with one● A brief detour into some codegen funtimes● Exercises!● Model save/load● Discussion of “serving” options
The different pieces of Spark
Apache Spark
SQL & DataFrames Streaming Language
APIs
Scala, Java, Python, & R
Graph Tools
Spark ML bagel & Grah X
MLLib Community Packages
Who do I think you all are?
● Nice people*● Some knowledge of Apache Spark core & maybe SQL● Interested in using Spark for Machine Learning● Familiar-ish with Scala or Java or Python
Amanda
Skipping intro & set-up time :)
But maybe time to upgrade...
● Spark 1.5+ (Spark 1.6 would be best!)○ (built with Hive support if building from source)
Amanda
Some pages to keep open:
http://bit.ly/sparkDocs http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc http://bit.ly/sparkMLGuide https://github.com/holdenk/spark-intro-ml-pipeline-workshop http://www.slideshare.net/hkarau Download census data https://archive.ics.uci.edu/ml/datasets/Adult
Dwight Sipler
Getting some data for working with:
● census data: https://archive.ics.uci.edu/ml/datasets/Adult
● Goal: predict income > 50k● Also included in the github repo● Download that now if you haven’t already● We will add a header to the data
○ http://pastebin.ca/3318687
PROTill Westermayer
So what are the two APIs?
● Traditional and Pipeline○ Pipeline is the new shiny future which will fix all problems*
● Traditional API works on RDDs○ Data preparation work is generally done in traditional Spark
transformations● Pipeline API works on DataFrames
○ Often we want to apply some transformations to our data before feeding to the machine learning algorithm
○ Makes it easy to chain these together
(*until replaced by a newer shinier future)
Steve Jurvetson
So what are DataFrames?
● Spark SQL’s version of RDDs of the world (its for more than just SQL)
● Restricted data types, schema information, compile time untyped
● Restricted operations (more relational style)● Allow lots of fun extra optimizations
○ Tungsten etc.● We’ll talk more about them (& Datasets) when we do
the Spark SQL component of this workshop
Transformers, Estimators and Pipelines
● Transformers transform a DataFrame into another● Estimators can be trained on a DataFrame to produce a
transformer● Pipelines chain together multiple transformers and
estimators
Let’s start with loading some data
● We’ve got some CSV data, we could use textfile and parse by hand
● spark-packages can save by providing the spark-csv package by Hossein Falaki○ If we were building a Java project we can include maven coordinates○ For the Spark shell when launching add:
--packages com.databricks:spark-csv_2.10:1.3.0
Jess Johnson
Loading with sparkSQL & spark-csv
sqlContext.read returns a DataFrameReaderWe can specify general properties & data specific options● option(“key”, “value”)
○ spark-csv ones we will use are header & inferSchema● format(“formatName”)
○ built in formats include parquet, jdbc, etc. today we will use com.databricks.spark.csv
● load(“path”)
Jess Johnson
Loading with sparkSQL & spark-csvdf = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("resources/adult.data")
Jess Johnson
Lets explore training a Decision Tree
● Step 1: Data loading (done!)● Step 2: Data prep (select features, etc.)● Step 3: Train● Step 4: Predict
Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type must be double)
● We need to train with a vector of featuresImports:from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler,
StringIndexer
from pyspark.ml import Pipeline
Huang Yun Chung
Data prep / cleaning continued# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
outputCol="feautres")
# String indexer converts a set of strings into doubles
indexer =
StringIndexer(inputCol="category")
.setOutputCol("category-index")
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
Huang Yun Chung
So a bit more about that pipeline
● Each of our previous components has “fit” & “transform” stage
● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
model=pipeline.fit(df)
prepared = model.transform(df)
Andrey
What does our pipeline look like so far?
Input Data AssemblerInput Data + Vectors StringIndexer
Input Data+Cat ID+ Vectors
While not an ML learning algorithm this still needs to be fit
This is a regular transformer - no fitting required.
Let's train a model on our prepared data:# Specify model
dt = DecisionTreeClassifier(labelCol = "category-index",
featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer,
dt])
pipeline_model = pipeline_and_model.fit(df)
And predict the results on the same data:pipeline_model.transform(df).select("prediction",
"category-index").take(20)
Exercise 1: Go from the index to something useful
● We could manually look up the labels and then write a select statement
● Or we could look at the features on the StringIndexerModel and use IndexToString
● Our pipeline has an array of stages we can use for this
Solution:from pyspark.ml.feature import IndexToString
labels = list(pipeline_model.stages[1].labels())
inverter = IndexToString(inputCol="prediction", outputCol="
prediction-label", labels=labels)
inverter.transform(pipeline_model.transform(df)).select
("prediction-label", "category").take(20)
# Pre Spark 1.6 use SQL if/else or similar
So what could we do for other types of data?
● org.apache.spark.ml.feature has a lot of options○ HashingTF○ Tokenizer○ Word2Vec○ etc.
Exercise 2: Add more features to your tree
● Finished quickly? Help others!● Or tell me if adding these features helped or not…
○ We can download a reserve “test” dataset but how would we know if we couldn’t do that?
cobra libre
And not just for getting data into doubles...
● Maybe a customers cat food preference only matters if the owns_cats boolean is true
● Maybe the scale is _way_ off● Maybe we’ve got stop words● Maybe we know one component has a non-linear
relation● etc.
Cross-validationbecause saving a test set is effort
● Automagically* fit your model params● Because thinking is effort● org.apache.spark.ml.tuning has the tools
○ (not in Python yet so skipping for now)
Jonathan Kotta
Pipeline API has many models:
● org.apache.spark.ml.classification○ BinaryLogisticRegressionClassification, DecissionTreeClassification,
GBTClassifier, etc.● org.apache.spark.ml.regression
○ DecissionTreeRegression, GBTRegressor, IsotonicRegression, LinearRegression, etc.
● org.apache.spark.ml.recommendation○ ALS
PROcarterse Follow
Exercise 3: Train a new model type
● Your choice!● If you want to do regression - change what we are
predicting
So serving...
● Generally refers to using your model online○ Generating recommendations...
● In batch mode you can “just” save & use the Spark bits● Spark’s “native” formats (often parquet w/metadata)
○ Understood by Spark libraries and thats pretty much it○ If you are serving in JVM can load these but need Spark
dependencies (albeit often not a Spark cluster)● Some models support PMML export
○ https://github.com/jpmml/openscoring etc.● We can also write our own export & serving by hand :(
Ambernectar 13
So what models are PMML exportable?
● Right now “old” style models○ KMeans, LinearRegresion, RidgeRegression, Lasso, SVM, and Binary
LogisticRegression ○ However if we look in the code we can sometimes find converters to
the old style models and use this to export our “new” style model● Waiting on https://issues.apache.
org/jira/browse/SPARK-11171 / https://github.com/apache/spark/pull/9207 for pipeline models
● Not getting in for 2.0 :(
How to PMML export
toPMML● returns a string or● takes a path to local fs and saves results or● takes a SparkContext & a distributed path and saves or● takes a stream and writes result to stream
Optional* exercise time
● Take a model you trained and save it to PMML○ You will have to dig around in the Spark code to be able to do this
● Look at the file● Load it into a serving system and try some predictions● Note: PMML export currently only includes the model -
not any transformations beforehand● Also: you might need to train a new model● If you don’t get it don’t worry - hints to follow :)