Ray, Distributed Computing, and Machine Learning
Robert Nishihara
11/15/2018
The Machine Learning Ecosystem

● Training
● RL
● Model Serving
● Hyperparameter Search
● Data Processing
● Streaming
Each of these workloads runs on its own special-purpose distributed system:

● Training: Horovod, Distributed TF, Parameter Server
● Model Serving: Clipper, TensorFlow Serving
● Streaming: Flink, many others
● RL: Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL
● Data Processing: MapReduce, Hadoop, Spark
● Hyperparameter Search: Vizier, many internal systems at companies
Why distributed systems?

● More work (computation) than one machine can do in a reasonable amount of time.
● More data than can fit on one machine.
Aspects of a distributed system

● Units of work (“tasks”) executed in parallel
● Scheduling (which tasks run on which machines and when)
● Data transfer
● Failure handling
● Resource management (CPUs, GPUs, memory)
What is Ray?

One general-purpose distributed system (Ray), with libraries on top for Training, Data Processing, Streaming, RL, Model Serving, and Hyperparameter Search.

This requires a very general underlying distributed system.

Generality comes from tasks (functions) and actors (classes).
Ray API: Tasks

Start from ordinary Python functions:

def zeros(shape):
    return np.zeros(shape)

def dot(a, b):
    return np.dot(a, b)

Adding the @ray.remote decorator turns them into tasks, which are invoked with .remote and return IDs immediately:

@ray.remote
def zeros(shape):
    return np.zeros(shape)

@ray.remote
def dot(a, b):
    return np.dot(a, b)

id1 = zeros.remote([5, 5])
id2 = zeros.remote([5, 5])
id3 = dot.remote(id1, id2)
ray.get(id3)

The two zeros calls produce id1 and id2; dot.remote(id1, id2) produces id3, which depends on both; ray.get(id3) blocks until the result is ready.

Show in Jupyter notebook!
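For reference, a self-contained version you could actually run; the import and ray.init() lines are not on the slide:

import numpy as np
import ray

ray.init()  # start Ray locally

@ray.remote
def zeros(shape):
    return np.zeros(shape)

@ray.remote
def dot(a, b):
    return np.dot(a, b)

id1 = zeros.remote([5, 5])
id2 = zeros.remote([5, 5])
id3 = dot.remote(id1, id2)  # runs once id1 and id2 are ready
print(ray.get(id3))         # a 5x5 array of zeros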
Ray API: Actors

Start from an ordinary Python class:

class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

Adding the @ray.remote decorator turns it into an actor (here also requesting one GPU):

@ray.remote(num_gpus=1)
class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter.remote()
id4 = c.inc.remote()
id5 = c.inc.remote()
ray.get([id4, id5])

Counter.remote() creates the actor; each inc.remote() call returns an ID (id4, id5) for a method invocation on the actor's state.
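A runnable variant, dropping num_gpus=1 so it works on a machine without a GPU (that resource request is the only change from the slide):

import ray

ray.init()

@ray.remote  # the slide uses @ray.remote(num_gpus=1) to request a GPU
class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter.remote()        # instantiate the actor
id4 = c.inc.remote()        # method calls return IDs, like tasks
id5 = c.inc.remote()
print(ray.get([id4, id5]))  # [1, 2] -- calls run in order on the actor's state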
How does this work under the hood?

The Ray Architecture

Each machine runs worker processes, an object store, and a scheduler; a Global Control Store holds the system's metadata and connects the machines.

Tracing the tasks example through the system: the two zeros tasks are scheduled onto workers (possibly on different machines), and their outputs (obj1, obj2) are placed in the object stores of the machines that produced them. To execute dot, the scheduler assigns it to a machine; any input that machine is missing (here obj2) is transferred between object stores, dot runs, and its output (obj3) is written to the object store, where ray.get can retrieve it.
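Not on the slides, but the object store is also directly accessible from the API; a small sketch:

import numpy as np
import ray

ray.init()

# ray.put places an object in the local object store and returns an ID;
# ray.get fetches it (from the local store, or over the network if the
# object lives on another machine).
obj_id = ray.put(np.zeros((5, 5)))
print(ray.get(obj_id))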
Parallelism Example: Primality Testing

Write a function to test if p is a prime number (no divisors other than 1 and p).

Can this be parallelized? Yes! “For loops” are great candidates for parallelization.

Check batches of divisors in parallel: 1, 2, 3, ..., k | k+1, ..., 2k | ... | p-k+1, ..., p

Show in Jupyter notebook!
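A minimal sketch of the idea with Ray tasks (not the notebook's code; the helper name no_divisors_in is mine):

import ray

ray.init()

@ray.remote
def no_divisors_in(p, start, end):
    # True if p has no divisor in the half-open range [start, end).
    return all(p % d != 0 for d in range(start, end))

def is_prime(p, batch_size=10_000):
    if p < 2:
        return False
    # One task per batch of candidate divisors, mirroring the slide's
    # 2,...,k | k+1,...,2k | ... split. (Stopping at sqrt(p) would be
    # faster, but this follows the slide.)
    ids = [no_divisors_in.remote(p, s, min(s + batch_size, p))
           for s in range(2, p, batch_size)]
    return all(ray.get(ids))

print(is_prime(1_000_003))  # True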
Parallelism Example: Tree Reduction

(Relevant for the upcoming homework assignment!)

Add a bunch of values (2 at a time).

Done sequentially, each add consumes the previous result: add(1, 2) → id1, add(id1, 3) → id2, ..., add(id6, 8) → id7. That is seven adds in a row, with no opportunity for parallelism.

Done as a tree, the first level adds pairs in parallel: add(1, 2) → id1, add(3, 4) → id2, add(5, 6) → id3, add(7, 8) → id4; the next level computes add(id1, id2) → id5 and add(id3, id4) → id6; the root computes add(id5, id6) → id7. Still seven adds, but only three levels deep.

What does the difference look like in code? (See the sketch below.)
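One way it might look with Ray tasks; a minimal sketch assuming a remote add helper, not the homework's reference solution:

import ray

ray.init()

@ray.remote
def add(a, b):
    return a + b

values = [1, 2, 3, 4, 5, 6, 7, 8]

# Sequential version: each add waits on the previous one (7 steps deep).
result = values[0]
for v in values[1:]:
    result = add.remote(result, v)
print(ray.get(result))  # 36

# Tree version: pair up results each round; adds within a round can run
# in parallel (3 rounds deep for 8 values; assumes a power-of-two count).
ids = list(values)
while len(ids) > 1:
    ids = [add.remote(ids[i], ids[i + 1]) for i in range(0, len(ids), 2)]
print(ray.get(ids[0]))  # 36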
Parallelism Example: Parameter Server

A parameter server actor holds the model parameters. Worker tasks pull the parameters from the actor, compute gradients, and send the gradients back. For scale, the parameters can be sharded across multiple parameter server actors.
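A minimal sketch of the pattern (the gradients and update rule are placeholders, and the names ParameterServer and worker are illustrative, not from the talk):

import numpy as np
import ray

ray.init()

@ray.remote
class ParameterServer(object):
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def get_params(self):
        return self.params

    def update(self, grad, lr=0.1):
        # Apply a gradient step (placeholder update rule).
        self.params -= lr * grad
        return self.params

@ray.remote
def worker(ps, num_steps):
    for _ in range(num_steps):
        params = ray.get(ps.get_params.remote())  # pull parameters
        grad = np.ones_like(params)               # stand-in for a real gradient
        ray.get(ps.update.remote(grad))           # push gradient, wait for it to land

ps = ParameterServer.remote(10)
ray.get([worker.remote(ps, 5) for _ in range(4)])  # 4 workers, 5 steps each
print(ray.get(ps.get_params.remote()))             # each entry is -2.0 (20 steps of -0.1)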
Conclusion

● The ML ecosystem consists of many special-purpose systems.
  ○ We are moving toward a single general-purpose system.
  ○ “Distributed systems” are becoming “libraries” and “applications”.
  ○ Examples like MapReduce, tree reduce, parameter server.
  ○ Interesting applications will combine all of these elements.
● Ray is open source at https://github.com/ray-project/ray
  ○ pip install ray (on Linux/Mac)
  ○ Being developed here at Berkeley!
● Will be used in the next homework assignment.