Data science using Big Data. Pragmatic
approach.
Pavel Mesentsev Dmitri Babaev
Tinkoff Bank
Machine learning in business
Other ML tasks examples
Size of data
~ 1 mln obj
~ 1 bln obj
~ 1 trln obj
Small Data
Big Data
Huge Data
Predictive models design
Small data case
Big data case
Big data case
Apache Spark case
How does its works
How does its works
How does its works
How does its works
How does its works
How does its works
How does its works
Spark Sql
Spark Sql
Spark Sql
Spark Sql
Spark Sql
Spark Sql
Spark Sql
Spark Sql
Spark Sql
Apache Spark case
Apache Spark in
Large Scale Machine Learning
LSML issues
• Too much samples to classify
• Training data does not fit in memory
• Too much training samples
• Too much models to train
Why not MLLib?• MLLib is less stable
• too few algorithms comparing to scikit-learn
• ML pipelines are not so mature than in scikit-learn
• e. g. there is no simple way to use logistic regression for feature selection
• MLLib python API falls behind Java/Scala API
• MLLib is actively developed and may be feasible choice in near future
Spark + scikit-learn = ?• Parallel training
• meta-parameter grid search
• parallel one-vs-rest for multi-class models
• same features but different targets
• parallel bagging and ensembles
• parallel learning for multi-step classification
• Parallel prediction
Distributed learning
Distributed learning
Distributed learning
Distributed learning
Distributed learning
Distributed learning
Distributed learning
Distributed learning
Distributed learning
Distributed prediction
Distributed prediction
Distributed prediction
Distributed prediction
Distributed prediction
Distributed prediction
Distributed prediction
Models management
•A complex ML task can be expressed as scikit-learn
pipeline
• e. g. feature scaling, then LR feature selection then
GBT classification
• Trained models can be stored on HDFS / S3 along with
training report and loaded for prediction
Alternative approaches to LSML•partial / online learning
• e. g. stochastic gradient descent
•distributed stochastic gradient descent
• is implemented in MLLib
Spark is everywhere
•Mahout on spark aka Samsara
•H20 Sparkling water
Questions?
Pavel Mezentsev [email protected] Babaev [email protected]