TonY: An Orchestrator for Distributed Machine Learning Jobs


Agenda
● Background: TensorFlow and YARN
● What is TonY?
● Why use TonY for distributed training?
● Next steps


Machine Learning process

• Productionizing machine learning requires many steps

• The focus of this talk will be model training

Data Ingestion → Data Preparation → Model Training → Model Deployment → Model Serving


Background: What is TensorFlow?


Visualisation with TensorBoard: https://learningtensorflow.com/Visualisation/


Background: What is distributed TensorFlow?

● Worker + Parameter Server Model (diagram: three worker tasks, two parameter server tasks)
● Ring All-Reduce Model (diagram: four worker tasks)
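
To make the Ring All-Reduce Model concrete, here is a minimal single-process simulation in Python. It is a sketch only, not TensorFlow's implementation: the function name `ring_all_reduce` is hypothetical, the "workers" are plain lists, and it assumes every gradient vector splits evenly into one chunk per worker. Each worker only ever passes a chunk to its right-hand neighbour, so no parameter server is needed.

```python
def ring_all_reduce(worker_grads):
    """Simulate a sum all-reduce over a ring of workers (illustrative only)."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert length % n == 0, "sketch assumes the vector splits evenly into n chunks"
    size = length // n
    # Every worker holds its own copy of the vector, split into n chunks.
    chunks = [[list(g[c * size:(c + 1) * size]) for c in range(n)]
              for g in worker_grads]

    # Phase 1 (scatter-reduce): after n-1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot the payloads first so all sends in a step are simultaneous.
        sends = [(i, (i - step) % n, list(chunks[i][(i - step) % n]))
                 for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2 (all-gather): circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(chunks[i][(i + 1 - step) % n]))
                 for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            chunks[dst][c] = payload

    # Flatten each worker's chunks back into a single vector.
    return [[x for chunk in worker for x in chunk] for worker in chunks]


# Four workers, as in the diagram; every worker ends up with the same sums.
grads = [[1, 2, 3, 4], [10, 20, 30, 40],
         [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
reduced = ring_all_reduce(grads)
```

Each worker sends and receives 2 × (n − 1) chunks in total, regardless of how many workers join the ring, which is the usual argument for this layout over a central parameter server.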


How to run distributed TensorFlow?
● Distribute code/data artifacts across multiple machines in distributed job
● Allow tasks in the same distributed job to talk to each other (e.g. tell each worker where all other worker/parameter servers are)
● Ensure your task compute requirements are met before starting distributed job
● Or, have a framework do all of the above for you (Hadoop!)
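
As an illustration of "tell each worker where all other worker/parameter servers are": TensorFlow's distributed runtime can read the cluster layout from a `TF_CONFIG` environment variable, a JSON document with a `cluster` map and a `task` entry. The sketch below builds that value per task. The helper name `build_tf_config` and the host names are made up for the example, and the slides do not say this is how any particular framework wires tasks together.

```python
import json


def build_tf_config(cluster, task_type, task_index):
    """Return the TF_CONFIG JSON string for one task (hypothetical helper)."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"no {task_type} task with index {task_index}")
    return json.dumps({
        "cluster": cluster,                                # every task's address
        "task": {"type": task_type, "index": task_index},  # who this task is
    })


# Hypothetical addresses: three workers and two parameter servers.
cluster = {
    "worker": ["host-a:2222", "host-b:2222", "host-c:2222"],
    "ps": ["host-d:2222", "host-e:2222"],
}

# Environment handed to the process that will run worker 1.
env_for_worker_1 = {"TF_CONFIG": build_tf_config(cluster, "worker", 1)}
```

Doing this by hand means knowing every host and port before any task starts, which is the bookkeeping a framework can take over.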


Background: What is Hadoop?


● Distributed File System (HDFS)
● Yet Another Resource Negotiator (YARN)


Background: How to work with YARN?


What is TonY?


● Orchestrates running distributed TensorFlow on Hadoop
● Acquires compute resources from Hadoop (memory, CPU, GPU)
● Sets up and launches distributed TensorFlow jobs on Hadoop clusters
● Manages application lifecycle
  ○ Fault tolerance
  ○ Job monitoring


TonY Architecture


TonY Architecture: Client
● Entry point for TonY jobs
● Packages the user’s configurations and model code and submits them as a YARN application


TonY Architecture: Application Master
● Job setup and lifecycle management
● Negotiates compute resources from Hadoop
● Sets up container environment
● Launches and monitors containers


TonY Architecture: Task Executor
● Container = Task Executor
● Launches the user-provided Python script
● Heartbeats to Application Master for liveness
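
The heartbeat-for-liveness idea can be sketched in a few lines of Python. This is not TonY's code: the class `HeartbeatMonitor`, its method names, the task ids and the 10-second timeout are all assumptions chosen for illustration. The monitoring side records when it last heard from each task and treats any task that has been silent longer than the timeout as dead.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per task (hypothetical, illustrative only)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # task id -> time of the most recent heartbeat

    def register(self, task_id, now):
        # Called when a container is launched.
        self.last_seen[task_id] = now

    def heartbeat(self, task_id, now):
        # Called each time a task executor reports in.
        if task_id not in self.last_seen:
            raise KeyError(f"unregistered task: {task_id}")
        self.last_seen[task_id] = now

    def dead_tasks(self, now):
        # Tasks whose last heartbeat is older than the timeout.
        return sorted(task for task, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


# Times are passed in explicitly (seconds) so the example is deterministic.
monitor = HeartbeatMonitor(timeout_s=10)
monitor.register("worker:0", now=0)
monitor.register("worker:1", now=0)
monitor.heartbeat("worker:0", now=8)   # worker:1 stays silent
silent = monitor.dead_tasks(now=15)    # only worker:1 has exceeded the timeout
```

In a real deployment the timestamps would come from a clock such as `time.monotonic()` rather than being passed by hand.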


Why use TonY for distributed training?


Scaling distributed TensorFlow on Hadoop
● Leverage YARN’s fine-grained resource management and multi-tenancy
  ○ Logical resource isolation via queues
  ○ Hardware-based physical resource partitioning (CPU, K80, V100)
  ○ User-based resource limits


Scaling distributed TensorFlow on Hadoop
● Native GPU resource awareness
● Ensures GPU resource isolation and scheduling


Scaling distributed TensorFlow on Hadoop
● One-click TensorBoard access for monitoring training progress


Scaling distributed TensorFlow on Hadoop
● Fault tolerance
● More workers = more failures
● First attempt periodically saves model checkpoints to HDFS
● Worker failure -> tear down and restart application
● Read checkpoints from HDFS, resume from where previous attempt left off
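
The checkpoint-and-restart flow above can be sketched as follows. Everything here is a stand-in chosen for illustration: a local temporary directory plays the role of HDFS, JSON files play the role of model checkpoints, "training" just increments a number, and the names `run_attempt`, `run_with_restarts` and `WorkerFailure` are hypothetical rather than TonY's API.

```python
import json
import os
import tempfile


class WorkerFailure(RuntimeError):
    """Simulated crash of a worker task."""


def latest_checkpoint(ckpt_dir):
    """Return (step, state) of the newest checkpoint, or a fresh start."""
    names = [f for f in os.listdir(ckpt_dir)
             if f.startswith("ckpt-") and f.endswith(".json")]
    if not names:
        return 0, {"weight": 0.0}
    newest = max(names, key=lambda f: int(f[len("ckpt-"):-len(".json")]))
    with open(os.path.join(ckpt_dir, newest)) as fh:
        data = json.load(fh)
    return data["step"], data["state"]


def save_checkpoint(ckpt_dir, step, state):
    # Write to a temporary name first so readers never see a partial file.
    tmp = os.path.join(ckpt_dir, f".tmp-{step}")
    with open(tmp, "w") as fh:
        json.dump({"step": step, "state": state}, fh)
    os.replace(tmp, os.path.join(ckpt_dir, f"ckpt-{step}.json"))


def run_attempt(ckpt_dir, total_steps, save_every, fail_at=None):
    """One application attempt: resume from the newest checkpoint and train."""
    step, state = latest_checkpoint(ckpt_dir)
    while step < total_steps:
        step += 1
        if step == fail_at:
            raise WorkerFailure(f"worker died at step {step}")
        state["weight"] += 1.0  # stand-in for one real training step
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return state


def run_with_restarts(ckpt_dir, total_steps, save_every, fail_at, max_attempts=3):
    """Restart the whole job after a failure; only the first attempt fails."""
    resume_points = []
    for attempt in range(max_attempts):
        resume_points.append(latest_checkpoint(ckpt_dir)[0])
        try:
            state = run_attempt(ckpt_dir, total_steps, save_every,
                                fail_at if attempt == 0 else None)
            return state, resume_points
        except WorkerFailure:
            continue  # tear down and start a new attempt
    raise RuntimeError("job failed on every attempt")


with tempfile.TemporaryDirectory() as ckpt_dir:
    final_state, resume_points = run_with_restarts(
        ckpt_dir, total_steps=10, save_every=3, fail_at=8)
```

In this run the first attempt checkpoints at steps 3 and 6 and fails at step 8, so the second attempt resumes from step 6 and repeats only step 7 instead of starting over.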


Open Sourced!
● https://github.com/linkedin/TonY
● Engineering blog post: https://bit.ly/2O6L5WD

Contributions Welcome!


Next steps
● Dr. Elephant integration
● TonY portal for notebook, job history, cross-execution monitoring


Q & A
