Learning Jobs TonY: An Orchestrator for Distributed Machine...Machine Learning process...

TonY: An Orchestrator for Distributed Machine Learning Jobs

Agenda● Background: TensorFlow and YARN● What is TonY?● Why use TonY for distributed training?● Next steps

Machine Learning process

• Productionizing machine learning requires many steps

• The focus of this talk will be model training

Data Ingestion

Data Preparation

Model Training

Model Deployment

Model Serving

BackgroundWhat is TensorFlow?

Visualisation with TensorBoardhttps://learningtensorflow.com/Visualisation/

Background

Worker task

Parameter server task

Worker + Parameter Server Model

Worker task

Ring All-Reduce Model

What is distributed TensorFlow?

How to run distributed TensorFlow?● Distribute code/data artifacts across multiple machines in distributed job● Allow tasks in the same distributed job to talk to each other (e.g. tell each

worker where all other worker/parameter servers are)● Ensure your task compute requirements are met before starting

distributed job● Or, have a framework do all of the above for you (Hadoop!)

BackgroundWhat is Hadoop?

Distributed File System

BackgroundWhat is Hadoop?

Distributed File System

Yet Another Resource Negotiator

BackgroundHow to work with YARN?

What is TonY?

What is TonY?● Orchestrates running distributed TensorFlow on Hadoop● Acquires compute resources from Hadoop (memory, CPU, GPU)● Sets up and launches distributed TensorFlow jobs on Hadoop clusters● Manages application

lifecycle○ Fault tolerance

○ Job monitoring

TonY Architecture

TonY Architecture● Entry point for TonY jobs● Package user’s configurations, user’s

model code and submit as YARN application

TonY Architecture● Job setup and lifecycle

management● Negotiates compute resources

from Hadoop● Sets up container environment● Launches and monitors containers

TonY Architecture● Container = Task Executor● Launches user’s provided python

script● Heartbeats to Application Master

for liveness

Why use TonY for distributed training?

Scaling distributed TensorFlow on Hadoop● Leverage YARN’s fine-grained resource management and multi-tenancy

○ Logical resource isolation via queues○ Hardware-based physical resource partitioning (CPU, K80, V100)○ User-based resource limits

Scaling distributed TensorFlow on Hadoop● Native GPU resource awareness● Ensures GPU resource isolation and scheduling

Scaling distributed TensorFlow on Hadoop● One-click TensorBoard access for monitoring training progress

Scaling distributed TensorFlow on Hadoop● Fault tolerance● More workers = more failures

Scaling distributed TensorFlow on Hadoop● Fault tolerance● More workers = more failures● First attempt periodically saves model checkpoints to HDFS

Scaling distributed TensorFlow on Hadoop● Fault tolerance● More workers = more failures● First attempt periodically saves model checkpoints to HDFS● Worker failure -> tear down and restart application

Scaling distributed TensorFlow on Hadoop● Fault tolerance● More workers = more failures● First attempt periodically saves model checkpoints to HDFS● Worker failure -> tear down and restart application● Read checkpoints from HDFS, resume from where previous attempt

left off

Open Sourced!● https://github.com/linkedin/TonY● Engineering blog post: https://bit.ly/2O6L5WD

Contributions Welcome!

Next steps● Dr. Elephant integration● TonY portal for notebook, job history, cross-execution monitoring

Learning Jobs TonY: An Orchestrator for Distributed Machine...Machine Learning process...

Documents

Introduction To Azure Machine Learning...Expected Learning Outcomes Azure Machine Learning @tetranoodle Supervised Machine Learning Machine Learning Algorithms in Azure ML Workflow

Bayesian Methods for Machine Learning - Machine Learning (Theory)

Productionizing Spark and the Spark Job Server

Hadoop World 2010: Productionizing Hadoop: Lessons Learned

Machine Learning on Spark - UC Berkeley AMP Campampcamp.berkeley.edu/.../Machine-Learning-on-Spark... · Machine Learning on Spark Shivaram Venkataraman ... Machine learning algorithms

¿Qué es Machine Learning? Usos de Machine Learning

Automating and Productionizing Machine Learning Pipelines ... · Machine Learning Pipelines for Real-Time Scoring with Apache Spark D a v i d C r e s p i , D a t a S c i e n t i s

Whitepaper - Productionizing Machine Learning Models with

5 Keys to "Productionizing" Big Data

Productionizing dl from the ground up

Machine Learning and Azure Machine Learning

Machine Learning - apiacoa.orgapiacoa.org/publications/teaching/machine-learning/machine... · Machine Learning Fabrice Rossi ... I chess program ... Machine programming I climbing

An Introduction to Machine Learning - GitHub Pages · Introduction Machine learning The machine learning process What’s machine learning History Supervised learning Non-supervised

Productionizing your Streaming Jobs

Methods MACHINE LEARNING FROM MACHINE LEARNING TO …

MathWorks - Introducing Machine Learning · 4 ntroducing Machine Learning How Machine Learning Works Machine learning uses two types of techniques: supervised learning, which trains

Webinar: Productionizing Hadoop: Lessons Learned - 20101208

PRODUCTIONIZING DEMAND FORECASTING MODEL

Machine Learning - 0. Overview€¦ · Machine Learning 1. What is Machine Learning? One area of research, many names (and aspects) machine learning historically, stresses learning

Machine Learning for NLP - Ethics and Machine Learning