Soumith Chintala
Usable while performant: the challenges building PyTorch
Problem Statement • Deep Learning Workloads

# training_data: N samples, each of some shape D
for epoch in range(max_epochs):
    for data, target in training_data:     # data: a mini-batch of M samples (M << N), each of shape D
        optimizer.zero_grad()               # clear gradients accumulated in the previous step
        output = model(data)                # model: a neural network with weights
        loss = F.nll_loss(output, target)
        loss.backward()                     # backpropagation: compute derivatives wrt the loss, using the chain rule
        optimizer.step()                    # update the weights using the computed gradients
Types of typical operators • Convolution
Figure by Vincent Dumoulin: https://github.com/vdumoulin/conv_arithmetic

for oc in output_channels:
    for ic in input_channels:
        for h in output_height:
            for w in output_width:
                for kh in kernel_height:
                    for kw in kernel_width:
                        output_pixel += input_pixel * kernel_value
Types of typical operators • Matrix Multiply
Figure by Wikipedia: https://en.wikipedia.org/wiki/Matrix_multiplication
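In the same spirit as the convolution pseudocode above, a matrix multiply is three nested loops; a sketch with illustrative names (A is M×K, B is K×N, C is M×N):

for i in range(M):                       # rows of A
    for j in range(N):                   # columns of B
        for k in range(K):               # shared inner dimension
            C[i][j] += A[i][k] * B[k][j]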
Types of typical operators • Pointwise operations

for (i = 0; i < data_length; i++) {
    output[i] = input1[i] + input2[i];
}
Types of typical operators • Reduction operations

double sum = 0.0;
for (i = 0; i < data_length; i++) {
    sum += input[i];
}
Chained Together

Input → BatchNorm → ReLU → Conv2d → Output
Input → (BatchNorm → ReLU → Conv2d) × 3 → Output
Input → (BatchNorm → ReLU → Conv2d) × N → Output ("deep")
Input → (BatchNorm → ReLU → Conv2d) × N, with a feedback connection back into the chain → Output ("recurrent")
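A chain like this is straightforward to express in PyTorch; a minimal sketch (the channel count and depth are illustrative, not from the slides):

import torch.nn as nn

def block(channels):
    # one BatchNorm -> ReLU -> Conv2d unit from the diagram
    return nn.Sequential(
        nn.BatchNorm2d(channels),
        nn.ReLU(),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    )

# "deep": repeat the block many times
model = nn.Sequential(*[block(16) for _ in range(24)])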
Trained with Gradient Descent

[Figure: the same deep / recurrent chain of BatchNorm → ReLU → Conv2d blocks from Input to Output, trained end-to-end with gradient descent.]
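Gradient descent in its simplest form, as a minimal sketch (lr is an assumed learning-rate variable, and model is assumed from above):

import torch

with torch.no_grad():                # weight updates should not be traced by autograd
    for p in model.parameters():
        p -= lr * p.grad             # step each weight against its gradient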
Problem Statement • Deep Learning Workloads
an easy way to see recurrence:

for epoch in range(max_epochs):
    for data, target in training_data:
        optimizer.zero_grad()
        output, hidden = [], zeros()
        for t in range(data.size(0)):            # step over the time dimension
            out, hidden = model(data[t], hidden) # feed the previous hidden state back in
            output.append(out)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
Problem Statement • Deep Learning Workloads

- Vision models: very deep, straight-line chains with no recurrence; lots of convolutions; typically run on GPUs
- NLP models (LSTM-RNN): 1 to 4 "layers" deep; two matmuls across space and time, along with pointwise ops (see the sketch below); typically run on CPUs if small, GPUs if large
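To make "two matmuls plus pointwise ops" concrete, here is a single LSTM step as a sketch (the weight names and shapes are illustrative, not from the slides):

import torch

def lstm_cell(x, h, c, W_ih, W_hh, b):
    # the two matmuls: input-to-hidden ("space") and hidden-to-hidden ("time")
    gates = x @ W_ih.t() + h @ W_hh.t() + b      # (batch, 4 * hidden_size)
    i, f, g, o = gates.chunk(4, dim=1)
    # everything else is pointwise
    i, f, o = i.sigmoid(), f.sigmoid(), o.sigmoid()
    g = g.tanh()
    c = f * c + i * g
    h = o * c.tanh()
    return h, c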
Deep Learning Frameworks • Make this easy to program

for epoch in range(max_epochs):
    for data, target in training_data:
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
Pre-PyTorch

Caffe: meta-programming
Theano: meta-programming
Torch-7: imperative
Caffe

- define a protobuf, run via a command-line utility
- small, efficient library; could do convnets well
Theano

- meta-program the Theano VM via a Python API
- whole-program optimizations, graph fusion
- graphs took minutes to hours to compile and start
Torch-7

- imperative programming in Lua
- tied closely to the underlying C89 implementations
- Lua lacked good tooling and an ecosystem
What is PyTorch?

- an ndarray library with GPU support (a NumPy alternative)
- an automatic differentiation engine
- a gradient-based optimization package
- utilities (data loading, etc.)

for Deep Learning and Reinforcement Learning
ndarray library

- np.ndarray <-> torch.Tensor
- 200+ operations, similar to numpy
- very fast acceleration on NVIDIA GPUs

[side by side: NumPy code and the equivalent PyTorch code]
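A flavor of that correspondence, as a minimal sketch (not the slide's exact example):

import numpy as np
import torch

x = np.random.randn(3, 4)          # NumPy
y = (x * 2.0).sum(axis=1)

t = torch.randn(3, 4)              # PyTorch: same vocabulary, GPU-capable
u = (t * 2.0).sum(dim=1)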
ndarray / Tensor library
NumPy bridge

Zero memory-copy, very efficient

Seamless GPU Tensors
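Both points in one minimal sketch (the tensor shares memory with the NumPy array, and moves to the GPU with a single call):

import numpy as np
import torch

a = np.ones(5)
t = torch.from_numpy(a)        # zero-copy: t and a share the same memory
a[0] = 7.0                     # ...so writes to one are visible in the other
print(t[0])                    # tensor(7., dtype=torch.float64)

if torch.cuda.is_available():  # seamless GPU tensors
    t_gpu = t.to("cuda")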
PyTorch Autograd
automatic differentiation engine for deep learning and reinforcement learning

W_h = torch.randn(20, 20, requires_grad=True)
W_x = torch.randn(20, 10, requires_grad=True)
x = torch.randn(1, 10)
prev_h = torch.randn(1, 20)

i2h = torch.mm(W_x, x.t())           # MM node in the graph
h2h = torch.mm(W_h, prev_h.t())      # MM node
next_h = i2h + h2h                   # Add node
next_h = next_h.tanh()               # Tanh node

next_h.backward(torch.ones(20, 1))   # seed gradient, same shape as next_h

[computation graph: MM, MM → Add → Tanh]
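As a usage note for the snippet above: after backward(), each requires_grad tensor holds its gradient in .grad:

print(W_h.grad.shape)   # torch.Size([20, 20])
print(W_x.grad.shape)   # torch.Size([20, 10])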
Neural Networks
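The neural-network package wraps these pieces as composable modules; a minimal sketch (the layer sizes are illustrative, not from the slides):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3)
        self.bn = nn.BatchNorm2d(16)
        self.fc = nn.Linear(16 * 26 * 26, 10)

    def forward(self, x):                        # x: (batch, 1, 28, 28)
        x = F.relu(self.bn(self.conv(x)))
        x = x.flatten(1)
        return F.log_softmax(self.fc(x), dim=1)

model = Net()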
Optimization package
SGD, Adagrad, RMSProp, LBFGS, etc.
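Typical usage with the model above, as a sketch (data and target are assumed to come from a loader):

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

optimizer.zero_grad()
loss = F.nll_loss(model(data), target)
loss.backward()
optimizer.step()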
Bootstrapping

Writing dataset loaders
Building models
Implementing the training loop
Checkpointing models
Interfacing with environments
Dealing with GPUs
Building optimizers
Building baselines

Python + PyTorch - an environment to do all of this
bootstrapping the Python tooling stack for good UX
Python is slow, interpreted

- Global interpreter lock
- application logic is an order of magnitude slower than C++
- moved the autograd engine to C++
- moved everything to ATen
- side effect: a clean C++ API
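A rough way to see the interpreter overhead: loop over elements in Python versus one call into a C++ kernel (a sketch; the exact ratio is machine-dependent):

import time
import torch

x = torch.randn(100_000)

t0 = time.time()
s = 0.0
for v in x:                 # one interpreter round-trip per element
    s += float(v)
py_seconds = time.time() - t0

t0 = time.time()
s2 = x.sum().item()         # a single dispatch into the C++ kernel
cpp_seconds = time.time() - t0

print(py_seconds / cpp_seconds)   # typically orders of magnitude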