
From Python to PySpark and Back Again - Unifying Single-host and Distributed Machine Learning with Maggy


Page 1

Page 2

From Python to PySpark and Back Again - Unifying Single-host and Distributed Machine Learning with Maggy

Moritz Meister, @morimeister
Software Engineer, Logical Clocks

Jim Dowling, @jim_dowling
Associate Professor, KTH Royal Institute of Technology

Page 3

ML Model Development
A simplified view

Feature Pipelines → Exploration → Experimentation → Model Training → Explainability and Validation → Serving

Page 4

ML Model Development

Explore and Design

Experimentation: Tune and Search

Model Training (Distributed)

Explainability and Ablation Studies

It’s simple - only four steps

Page 5

Artifacts and Non-DRY Code

Explore and Design

Experimentation: Tune and Search

Model Training (Distributed)

Explainability and Ablation Studies

Page 6

What It’s Really Like… not linear but iterative

Page 7

What It’s Really Really Like… not linear but iterative

Page 8

Root Cause: Iterative Development of ML Models

Explore and Design

Experimentation: Tune and Search

Model Training (Distributed)

Explainability and Ablation Studies

Page 9

EDA · HParam Tuning · Training (Dist) · Ablation Studies

Iterative Development Is a Pain, We Need DRY Code!
Each step requires different implementations of the training code.

Page 10

OBLIVIOUS TRAINING FUNCTION

# RUNS ON THE WORKERS
def train():
    def input_fn():
        # return dataset
        ...

    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)

EDA · HParam Tuning · Training (Dist) · Ablation Studies

The Oblivious Training Function

Page 11

Challenge: Obtrusive Framework Artifacts

▪ TF_CONFIG
▪ Distribution Strategy
▪ Dataset (Sharding, DFS)
▪ Integration in Python - hard from inside a notebook
▪ Keras vs. Estimator vs. Custom Training Loop

Example: TensorFlow
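As an illustration of the first bullet: distributed TensorFlow discovers its cluster through a TF_CONFIG environment variable that must be set on every worker before TensorFlow initializes. A minimal sketch of that boilerplate (hostnames and ports are placeholders):

import json, os

# Same cluster spec on every worker; only the task index differs.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['host1:2222', 'host2:2222']},
    'task': {'type': 'worker', 'index': 0}  # 1 on the second worker, etc.
})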

Page 12

Where is Deep Learning headed?

Page 13

Productive High-Level APIs
Or why data scientists love Keras and PyTorch

[Diagram: the Idea → Experiment → Results loop, resting on Framework and Infrastructure layers, with Tracking and Visualization alongside]

Francois Chollet, “Keras: The Next 5 Years”

Page 14

Productive High-Level APIs
Or why data scientists love Keras and PyTorch

[Same diagram as the previous slide, with the open infrastructure question answered: Hopsworks (open source), Databricks, Apache Spark, and the cloud providers]

Francois Chollet, “Keras: The Next 5 Years”

Page 15

How do we keep our high-level APIs transparent and productive?

Page 16

What Is Transparent Code?

import numpy as np
import tensorflow as tf
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import SGD

def dataset(batch_size):
    (x_train, y_train) = load_data()        # load_data() is elided on the slide
    x_train = x_train / np.float32(255)
    x_train = x_train[..., np.newaxis]      # add a channel dim for Conv2D
    y_train = y_train.astype(np.int64)
    train_dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)).shuffle(60000).repeat().batch(batch_size)
    return train_dataset

def build_and_compile_cnn_model(lr):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=SparseCategoricalCrossentropy(from_logits=True),
        optimizer=SGD(learning_rate=lr))
    return model

(The slide shows this code twice, side by side - once for the single-host context and once for the distributed context.)

NO CHANGES!
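To make the "NO CHANGES!" point concrete, here is a minimal sketch (not from the slides) of how a system can run the unchanged functions above in a distributed context - the distribution strategy is applied entirely outside the user code:

# The system, not the data scientist, chooses and applies the context.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():  # variables created here are mirrored across devices
    model = build_and_compile_cnn_model(lr=0.01)
model.fit(dataset(batch_size=64), epochs=3, steps_per_epoch=100)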

Page 17

Building Blocks for Distribution Transparency

Page 18

Distribution Context
Single-host vs. parallel multi-host vs. distributed multi-host

[Diagram: three distribution contexts - Single Host; parallel multi-host, with a Driver as Experiment Controller over Workers 1..N; and distributed multi-host, with the Driver setting TF_CONFIG across Workers 1-8]

Page 19

Distribution Context
Single-host vs. parallel multi-host vs. distributed multi-host

[Same diagram as the previous slide, annotated with the development steps: Explore and Design, Experimentation (Tune and Search), Model Training (Distributed), Explainability and Ablation Studies]

Page 20

Model Development Best Practices

▪ Modularize
▪ Parametrize
▪ Higher-order training functions (sketched below)
▪ Usage of callbacks at runtime

[Diagram: the training code split into three modules - Dataset Generation, Model Generation, Training Logic]
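A minimal sketch of these practices combined (the names train_fn, dataset_fn, and model_fn are illustrative, not Maggy's API): the training logic becomes a higher-order function that receives the dataset and model generators as parameters, with callbacks injected at runtime:

# dataset_fn and model_fn are the generator modules, e.g. dataset() and
# build_and_compile_cnn_model() from the transparent-code slide.
def train_fn(dataset_fn, model_fn, batch_size, lr, callbacks=None):
    train_dataset = dataset_fn(batch_size)        # Dataset Generation
    model = model_fn(lr)                          # Model Generation
    history = model.fit(train_dataset,            # Training Logic
                        epochs=5, steps_per_epoch=100,
                        callbacks=callbacks or [])
    return history.history['loss'][-1]            # metric for the controller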

Page 21

Oblivious Training Function as an Abstraction
Let the system handle the complexities

The system takes care of...
▪ ...fixing parameters and launching the function (single-host)
▪ ...launching trials (parametrized instantiations of the function), generating new trials, and collecting and logging results (experimentation)
▪ ...setting up TF_CONFIG, wrapping the function in a Distribution Strategy, launching it as workers, and collecting results (distributed training)

Page 22

Maggy

Spark+AI Summit 2019

Today
With Hopsworks and Maggy, we provide a unified development and execution environment for distribution-transparent ML model development.

Make the Oblivious Training Function a core abstraction on Hopsworks

Page 23

Hopsworks - Award-Winning Platform

Page 24

Recap: Maggy - Asynchronous Trials on Spark
Spark is bulk-synchronous

[Diagram: three bulk-synchronous Spark stages reading from HopsFS. Each stage runs tasks (Task11..Task1N, Task21..Task2N, Task31..Task3N) and ends at a barrier where the Driver collects Metrics; trials that could be early-stopped keep running until the barrier, leaving wasted compute in every stage]

Page 25

Recap: The Solution
Add communication and long-running tasks

[Diagram: a single stage of long-running tasks (Task11..Task1N) with one final barrier; while running, the tasks send Metrics to the Driver and receive New Trials]
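Conceptually, each long-running task behaves roughly as below (a hypothetical sketch - the driver API names are invented here, not Maggy's internals):

# Hypothetical worker loop: poll the driver for trials, stream metrics back,
# and honor centrally decided early-stopping.
def worker_loop(driver, train):
    while True:
        trial = driver.get_next_trial()           # blocks until a trial or shutdown
        if trial is None:
            break                                 # experiment finished
        for step, metric in train(trial.params):  # training yields metrics per step
            driver.report_metric(trial.id, step, metric)
            if driver.should_stop(trial.id):      # early-stopping signal
                break
        driver.finalize_trial(trial.id)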

Page 26

What’s New?
Worker discovery and distribution context set-up

[Diagram: the Driver lets the long-running tasks (Task11..Task1N) discover each other ("Discover Workers"), then launches the Oblivious Training Function in the chosen distribution context, with a single final barrier]

Page 27

What’s New: Distribution Context

sp = maggy.optimization.Searchspace(...)
dist_strat = tf.distribute.MirroredStrategy(...)
ab = maggy.ablation.AblationStudy(...)

maggy.set_context('optimization')
maggy.lagom(training_function, sp)

maggy.set_context('distributed_training')
maggy.lagom(training_function, dist_strat)

maggy.set_context('ablation')
maggy.lagom(training_function, ab)
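(lagom is Maggy's single launch call across all three contexts; the name is the Swedish word for "just the right amount".)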

Page 28

DEMO

Page 29

What’s Next

Extend the platform to provide a unified development and execution environment for distribution-transparent Jupyter Notebooks.

Page 30

Summary

▪ Moving between distribution contexts requires code rewriting
▪ Factor out obtrusive framework artifacts
▪ Let the system handle the distribution context
▪ Keep productive high-level APIs

Page 31

Thank You!

Get Started
hopsworks.ai
github.com/logicalclocks/maggy

Twitter
@morimeister
@jim_dowling
@logicalclocks
@hopsworks

Web
www.logicalclocks.com

Contributions from colleagues
▪ Sina Sheikholeslami
▪ Robin Andersson
▪ Alex Ormenisan
▪ Kai Jeggle

Thanks to the Logical Clocks Team!

Page 32

Feedback

Your feedback is important to us.

Don’t forget to rate and review the sessions.

Page 33