Soumith Chintala
Usable while performant: the challenges building PyTorch
Problem Statement • Deep Learning Workloads

# training_data: N samples, each of some shape D
for epoch in range(max_epochs):
    for data, target in training_data:     # data: a mini-batch of M samples (M << N), each of shape D
        optimizer.zero_grad()               # clear gradients accumulated in the previous step
        output = model(data)                # model: a neural network with weights
        loss = F.nll_loss(output, target)
        loss.backward()                     # backpropagation: compute derivatives wrt the loss, using the chain rule
        optimizer.step()                    # update the weights using the computed gradients
Types of typical operators • Convolution
Figure by Vincent Dumoulin: https://github.com/vdumoulin/conv_arithmetic

for oc in output_channels:
    for ic in input_channels:
        for h in output_height:
            for w in output_width:
                for kh in kernel_height:
                    for kw in kernel_width:
                        output_pixel += input_pixel * kernel_value
Types of typical operators • Matrix Multiply
Figure by Wikipedia: https://en.wikipedia.org/wiki/Matrix_multiplication
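In the same spirit as the convolution pseudocode above, a matrix multiply is three nested loops; a sketch with illustrative names (A is M×K, B is K×N, C is M×N):

for i in range(M):                       # rows of A
    for j in range(N):                   # columns of B
        for k in range(K):               # shared inner dimension
            C[i][j] += A[i][k] * B[k][j]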
Types of typical operators • Pointwise operations

for (i = 0; i < data_length; i++) {
    output[i] = input1[i] + input2[i];
}
Types of typical operators • Reduction operations

double sum = 0.0;
for (i = 0; i < data_length; i++) {
    sum += input[i];
}
Chained Together

Input → BatchNorm → ReLU → Conv2d → Output
Input → (BatchNorm → ReLU → Conv2d) × 3 → Output
Input → (BatchNorm → ReLU → Conv2d) × N → Output ("deep")
Input → (BatchNorm → ReLU → Conv2d) × N, with a feedback connection back into the chain → Output ("recurrent")
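A chain like this is straightforward to express in PyTorch; a minimal sketch (the channel count and depth are illustrative, not from the slides):

import torch.nn as nn

def block(channels):
    # one BatchNorm -> ReLU -> Conv2d unit from the diagram
    return nn.Sequential(
        nn.BatchNorm2d(channels),
        nn.ReLU(),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    )

# "deep": repeat the block many times
model = nn.Sequential(*[block(16) for _ in range(24)])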
Trained with Gradient Descent

[Figure: the same deep / recurrent chain of BatchNorm → ReLU → Conv2d blocks from Input to Output, trained end-to-end with gradient descent.]
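Gradient descent in its simplest form, as a minimal sketch (lr is an assumed learning-rate variable, and model is assumed from above):

import torch

with torch.no_grad():                # weight updates should not be traced by autograd
    for p in model.parameters():
        p -= lr * p.grad             # step each weight against its gradient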
Problem Statement • Deep Learning Workloads
an easy way to see recurrence:

for epoch in range(max_epochs):
    for data, target in training_data:
        optimizer.zero_grad()
        output, hidden = [], zeros()
        for t in range(data.size(0)):            # step over the time dimension
            out, hidden = model(data[t], hidden) # feed the previous hidden state back in
            output.append(out)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
Problem Statement • Deep Learning Workloads

- Vision models: very deep, straight-line chains with no recurrence; lots of convolutions; typically run on GPUs
- NLP models (LSTM-RNN): 1 to 4 "layers" deep; two matmuls across space and time, along with pointwise ops (see the sketch below); typically run on CPUs if small, GPUs if large
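To make "two matmuls plus pointwise ops" concrete, here is a single LSTM step as a sketch (the weight names and shapes are illustrative, not from the slides):

import torch

def lstm_cell(x, h, c, W_ih, W_hh, b):
    # the two matmuls: input-to-hidden ("space") and hidden-to-hidden ("time")
    gates = x @ W_ih.t() + h @ W_hh.t() + b      # (batch, 4 * hidden_size)
    i, f, g, o = gates.chunk(4, dim=1)
    # everything else is pointwise
    i, f, o = i.sigmoid(), f.sigmoid(), o.sigmoid()
    g = g.tanh()
    c = f * c + i * g
    h = o * c.tanh()
    return h, c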
Deep Learning Frameworks • Make this easy to program

for epoch in range(max_epochs):
    for data, target in training_data:
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
Pre-PyTorch

Caffe: meta-programming
Theano: meta-programming
Torch-7: imperative
Caffe

- define a protobuf, run via a command-line utility
- small, efficient library; could do convnets well
Theano

- meta-program the Theano VM via a Python API
- whole-program optimizations, graph fusion
- graphs took minutes to hours to compile and start
Torch-7

- imperative programming in Lua
- tied closely to the underlying C89 implementations
- Lua lacked good tooling and an ecosystem
What is PyTorch?

- an ndarray library with GPU support (a NumPy alternative)
- an automatic differentiation engine
- a gradient-based optimization package
- utilities (data loading, etc.)

for Deep Learning and Reinforcement Learning
ndarray library

- np.ndarray <-> torch.Tensor
- 200+ operations, similar to numpy
- very fast acceleration on NVIDIA GPUs

[side by side: NumPy code and the equivalent PyTorch code]
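A flavor of that correspondence, as a minimal sketch (not the slide's exact example):

import numpy as np
import torch

x = np.random.randn(3, 4)          # NumPy
y = (x * 2.0).sum(axis=1)

t = torch.randn(3, 4)              # PyTorch: same vocabulary, GPU-capable
u = (t * 2.0).sum(dim=1)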
ndarray / Tensor library
NumPy bridge

Zero memory-copy, very efficient

Seamless GPU Tensors
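Both points in one minimal sketch (the tensor shares memory with the NumPy array, and moves to the GPU with a single call):

import numpy as np
import torch

a = np.ones(5)
t = torch.from_numpy(a)        # zero-copy: t and a share the same memory
a[0] = 7.0                     # ...so writes to one are visible in the other
print(t[0])                    # tensor(7., dtype=torch.float64)

if torch.cuda.is_available():  # seamless GPU tensors
    t_gpu = t.to("cuda")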
PyTorch Autograd
automatic differentiation engine for deep learning and reinforcement learning

W_h = torch.randn(20, 20, requires_grad=True)
W_x = torch.randn(20, 10, requires_grad=True)
x = torch.randn(1, 10)
prev_h = torch.randn(1, 20)

i2h = torch.mm(W_x, x.t())           # MM node in the graph
h2h = torch.mm(W_h, prev_h.t())      # MM node
next_h = i2h + h2h                   # Add node
next_h = next_h.tanh()               # Tanh node

next_h.backward(torch.ones(20, 1))   # seed gradient, same shape as next_h

[computation graph: MM, MM → Add → Tanh]
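As a usage note for the snippet above: after backward(), each requires_grad tensor holds its gradient in .grad:

print(W_h.grad.shape)   # torch.Size([20, 20])
print(W_x.grad.shape)   # torch.Size([20, 10])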
Neural Networks
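The neural-network package wraps these pieces as composable modules; a minimal sketch (the layer sizes are illustrative, not from the slides):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3)
        self.bn = nn.BatchNorm2d(16)
        self.fc = nn.Linear(16 * 26 * 26, 10)

    def forward(self, x):                        # x: (batch, 1, 28, 28)
        x = F.relu(self.bn(self.conv(x)))
        x = x.flatten(1)
        return F.log_softmax(self.fc(x), dim=1)

model = Net()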
Optimization package
SGD, Adagrad, RMSProp, LBFGS, etc.
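Typical usage with the model above, as a sketch (data and target are assumed to come from a loader):

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

optimizer.zero_grad()
loss = F.nll_loss(model(data), target)
loss.backward()
optimizer.step()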
Bootstrapping

Writing dataset loaders
Building models
Implementing the training loop
Checkpointing models
Interfacing with environments
Dealing with GPUs
Building optimizers
Building baselines

Python + PyTorch - an environment to do all of this
bootstrapping the Python tooling stack for good UX
Python is slow, interpreted

- Global interpreter lock
- application logic is an order of magnitude slower than C++
- moved the autograd engine to C++
- moved everything to ATen
- side effect: a clean C++ API
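A rough way to see the interpreter overhead: loop over elements in Python versus one call into a C++ kernel (a sketch; the exact ratio is machine-dependent):

import time
import torch

x = torch.randn(100_000)

t0 = time.time()
s = 0.0
for v in x:                 # one interpreter round-trip per element
    s += float(v)
py_seconds = time.time() - t0

t0 = time.time()
s2 = x.sum().item()         # a single dispatch into the C++ kernel
cpp_seconds = time.time() - t0

print(py_seconds / cpp_seconds)   # typically orders of magnitude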