Distributed Machine Learning 101 using Apache Spark from the Browser
Scala Days 2015, Amsterdam
● What is Machine Learning?
◦ Variables, Variance and Bias
◦ Model selection
● Why Spark for machine learning?
● Spark MLlib by examples
◦ Genomics clustering and classification example
● What for the future?
◦ Streaming
◦ Human Learning
Outline
Andy Petrella
Maths, Scala, Apache Spark
Spark Notebook, Trainer, Data Banana
Xavier Tordoir
Physics, Bioinformatics
Scala, Spark
you cannot prove a vague theory is wrong
[…] Also, if the process of computing the consequences is indefinite, then with a little skill any experimental result can be made to look like the expected consequences.
—Richard Feynman [1964]
What is Machine Learning? Science with data
Surely You’re Joking Mr…
● Modelling without first principle…
What is Machine Learning? Overview
2nd law neither...
● Modelling without first principle…
What is Machine Learning? Overview
Machine learning you do with a Learning Machine
Take that Newton...
● Modelling without first principle…
● Modelling dependencies from the data
What is Machine Learning? Overview
With some “a priori” knowledge
● What is the problem?
● Hypothesis?
● Data Generation Process?
● Collection and Preprocessing
● Interpretation
What is Machine Learning? Learning Machine…
You still need a domain expert…
Like me!
● Estimate dependencies from data
What is Machine Learning? Overview
Machine learning you do with a Learning Machine
[Diagram labels: Samples Generator, System, Learning Machine; x, y, ỹ, z ?]
● Estimate dependencies from data
● Minimize a risk functional over the set given the data
What is Machine Learning? Overview
I like them so much in LaTeX2e
[Diagram labels: Samples Generator, System, Learning Machine; x, y, ỹ, z ?]
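One common way to write such a risk functional, as a sketch: assume a model f(x, w) with parameters w and a loss function L. The joint density p(x, y) and the sample size n are symbols introduced here for illustration and do not appear on the slides.

```latex
R(w) = \int L\bigl(y, f(x, w)\bigr)\, p(x, y)\, dx\, dy
\qquad
R_{\text{emp}}(w) = \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i, w)\bigr)
```

Since p(x, y) is unknown in practice, the empirical risk over the n observed samples usually stands in for the true risk.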
Learning Machine
● Regression: continuous output
○ Risk = Prediction error
● Classification: categorical output
○ Risk = Probability of misclassification
What is Machine Learning? Supervised learning
$L(y, f(x, w)) = (y - f(x, w))^2$ …
WTF?
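The two risks above can be estimated from samples as an average loss. Below is a minimal Python sketch, assuming a squared loss for regression and a 0-1 loss for classification; the function names, the toy data and the two toy models are hypothetical and only serve the illustration.

```python
def empirical_risk(xs, ys, model, loss):
    """Average loss of `model` over the observed (x, y) pairs."""
    return sum(loss(y, model(x)) for x, y in zip(xs, ys)) / len(xs)


def squared_loss(y, y_hat):
    """Regression loss: squared prediction error."""
    return (y - y_hat) ** 2


def zero_one_loss(y, y_hat):
    """Classification loss: 1 for a misclassification, 0 otherwise."""
    return 0.0 if y == y_hat else 1.0


xs = [0.0, 1.0, 2.0, 3.0]

# Regression: toy targets follow y = 2x, the (hypothetical) model predicts 1.5x.
targets = [0.0, 2.0, 4.0, 6.0]
regression_risk = empirical_risk(xs, targets, lambda x: 1.5 * x, squared_loss)

# Classification: a (hypothetical) threshold classifier on the same inputs.
labels = ["a", "b", "a", "b"]
classification_risk = empirical_risk(
    xs, labels, lambda x: "a" if x < 1.5 else "b", zero_one_loss
)

print(regression_risk, classification_risk)
```

With the 0-1 loss, the empirical risk is simply the fraction of misclassified samples, which is the sample estimate of the probability of misclassification.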
What is Machine Learning? Unsupervised learning: no output
I like clusters, especially with roasted nuts
● Clustering
○ Risk = Error Distortion (distances to center)
● Density estimation (probability densities)
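The clustering risk can be sketched in a few lines of Python. This assumes distortion means the mean squared Euclidean distance from each point to its nearest center; the points and centers are made-up toy values.

```python
def distortion(points, centers):
    """Mean squared Euclidean distance from each point to its nearest center."""
    def squared_distance(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    nearest = (min(squared_distance(p, c) for c in centers) for p in points)
    return sum(nearest) / len(points)


# Two well-separated groups of toy points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

two_centers = [(0.0, 1.0), (10.0, 1.0)]   # one center per group
one_center = [(5.0, 1.0)]                 # a single center in the middle

print(distortion(points, two_centers))
print(distortion(points, one_center))
```

A clustering algorithm such as k-means searches for the centers that make this quantity small for a given number of clusters.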
What is Machine Learning? Bias - Variance, Regression illustration
Playtime!
Notebook!
What is Machine Learning? Model selection
all work and no play makes Jack a dull boy
Model Complexity control: Resampling
Because we only see one sample of the universe
Replay it!
Spark for Machine Learning? Model selection
Enough theory boy!
f0, f1, f2
Spark for Machine Learning? Model selection
Enough theory boy!
f0, f1, f2, F3
More Samples
Spark for Machine Learning? Model selection
Enough theory boy!
f0, f1, f2, F3
Bigger Samples
Spark for Machine Learning? Model selection
Nice flag
K-Fold
K = 4
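K-fold resampling with K = 4 can be sketched as follows in plain Python. The helper name `k_fold_indices` and the sample count of 8 are hypothetical, and no shuffling is done to keep the sketch short; real pipelines usually shuffle the indices first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=8, k=4))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample is used for validation exactly once and for training in the other K - 1 folds, so every candidate model gets K error estimates from a single data set.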
Genomics: The data
So… that’s what separates us huh?
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Genomics: The data
Please, don’t mind the colors...
1000 genomes: http://www.1000genomes.org/
~1000 samples
Few samples => Machine Learning
Genomics: The data
Woooow, really, you must be kidding me… ahahahahah
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Few samples => Machine Learning
Lots of Data => Distributed computing
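A rough back-of-the-envelope for these figures, as a Python sketch. The sample and genotype counts are the approximate ones quoted above; the storage cost of one byte per genotype is purely an assumption made for illustration.

```python
samples = 1_000                      # ~1000 samples
genotypes_per_sample = 30_000_000    # ~30M genotypes (features) per sample

total_genotypes = samples * genotypes_per_sample

bytes_per_genotype = 1               # assumption, not a figure from the talk
approx_gigabytes = total_genotypes * bytes_per_genotype / 1e9

print(total_genotypes, approx_gigabytes)
```

There are far more features than samples, which is the "few samples" side, while the raw matrix is already large for a single machine, which is the "lots of data" side.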
Genomics: The data
Oh… damned… hum huh
Data continues to flow
Models must be trained continuously
=> Streaming Machine learning algorithms
Models must be validated
=> Batch machine learning
→ ƛambda ML
What else? Streaming
Lambada?
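The combination of continuous training and batch validation can be sketched in plain Python. This is an illustration only, not the speakers' implementation: it assumes a one-parameter linear model y ≈ w * x updated by stochastic gradient descent, and the stream, the held-out set and the learning rate are all made up.

```python
def sgd_update(w, mini_batch, learning_rate=0.5):
    """Streaming side: update w with one SGD step per (x, y) pair."""
    for x, y in mini_batch:
        w += learning_rate * (y - w * x) * x
    return w


def mean_squared_error(w, data):
    """Batch side: validate the current model on a held-out set."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)


# Hypothetical stream: ten mini-batches generated by y = 2x.
mini_batch = [(i / 10, 2 * i / 10) for i in range(1, 10)]
stream = [mini_batch] * 10
holdout = [(0.25, 0.5), (0.75, 1.5)]

w = 0.0
validation_errors = []
for batch in stream:
    w = sgd_update(w, batch)                                  # train continuously
    validation_errors.append(mean_squared_error(w, holdout))  # validate in batch

print(w, validation_errors[-1])
```

The model keeps improving as data flows in, while the validation step gives a periodic check on held-out data that it is still fit for use.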
Learning probabilistic models
Not only learning which features are important...
but also learning the interactions that effectively explain observations
What else? Probabilistic Programming
I’ll probably program too
That’s all folks
Roooaaar
Q / Option[A] / beers
THANKS!
Xavier Tordoir
@xtordoir
Andy Petrella
@noootsab
http://data-fellas.guru
https://github.com/andypetrella/spark-notebook/
Frank Nothaft
Matt Massie
Matt Gianni
Venkat Krishnamurthy
Look at the Code
The browser part is powered by the Spark Notebook.
The 3 notebooks are:
● mllib/Variance - Bias.snb
● adam/Clustering Genomes using Adam with LDA.snb
● adam/Classifying Genomes using Adam with RF.snb
So grab a Spark Notebook on http://spark-notebook.io/.
Yeaaaaah!