ML / PyTorch Crash Course
Alex Tamkin, Stanford University


Page 1: ML / PyTorch Crash Course - Stanford University

ML / PyTorch Crash Course
Alex Tamkin

Page 2: ML / PyTorch Crash Course - Stanford University


Prelude

○ Anything you especially want to focus on?

○ Don't expect to understand all of this perfectly from today!

○ Drinking from a firehose
○ Slides will be uploaded

Page 3: ML / PyTorch Crash Course - Stanford University


ML Crash Course

Page 4: ML / PyTorch Crash Course - Stanford University


Neural Network Classifiers

[Diagram: Neural Network → OK: 99.9%, Defective: 0.1%]

Page 5: ML / PyTorch Crash Course - Stanford University


Inside a neural net

[Diagram: input [.2, .4, .5 ...] → lots of matrices → OK: 99.9%, Defective: 0.1%]

Neural nets learn transformations from inputs to outputs

Page 6: ML / PyTorch Crash Course - Stanford University


Training a neural net

○ You don't program the neural net; the data programs the neural net

○ It learns through examples

[Diagram: example images labeled OK, Defective, OK]

Page 7: ML / PyTorch Crash Course - Stanford University


Inside a neural net

○ Simplest neural net: y = softmax(Ax) (see the sketch below)
○ x = input image (size 16x16?)
○ y = label (size 2): [0.0, 1.0] or [1.0, 0.0]
○ A: a matrix that goes from 16x16 to 2
○ Softmax: makes sure the numbers add up to 1, so it's a probability distribution!
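A minimal sketch of this one-matrix classifier in PyTorch (the 16x16 input and 2 classes follow the slide; the tensors here are random placeholders):

import torch
import torch.nn.functional as F

x = torch.rand(16 * 16)       # flattened 16x16 input image
A = torch.randn(2, 16 * 16)   # matrix mapping 256 pixel values to 2 class scores
y = F.softmax(A @ x, dim=0)   # softmax: exponentiate and normalize so the 2 scores sum to 1
print(y)                      # two probabilities that add up to 1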

Page 8: ML / PyTorch Crash Course - Stanford University


Optimization (example: faucet)

○ Loss / objective
  ○ Faucet: how happy you are with the temp/pressure
  ○ NN: how much your network's predictions line up with the labels
  ○ Rewarded based on the probability you assigned to the correct answer

Page 9: ML / PyTorch Crash Course - Stanford University


Optimization (example: faucet)

○ Parameters: determine the behavior of your model
  ○ Start close to 0
  ○ Faucet: positions of the handles (sad, no pressure)
  ○ NN: entries of the matrices

Page 10: ML / PyTorch Crash Course - Stanford University


Optimization (example: faucet)

○ Optimization algorithm
  ○ SGD: Stochastic gradient descent
  ○ Compute loss on a "batch" of data (several examples)
  ○ Gradient: what direction to push each parameter to decrease loss?

Page 11: ML / PyTorch Crash Course - Stanford University


Optimization (example: faucet)

○ Learning rate: how much to push the parameters in the gradient direction (see the sketch below)
  ○ Too big: can overshoot!
    ○ Think of when the handles are sticky and the water suddenly gets too hot
  ○ Too small: takes forever
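A tiny, hedged illustration of one gradient-descent step (the data, loss, and learning rate are made-up numbers):

import torch

w = torch.tensor([0.0, 0.0], requires_grad=True)  # parameters start close to 0
x = torch.tensor([1.0, 2.0])                       # a "batch" of data (just one example here)
target = torch.tensor(3.0)
lr = 0.1                                           # learning rate

loss = (w @ x - target) ** 2                       # loss: how far the prediction is from the label
loss.backward()                                    # gradient: which way to push each parameter
with torch.no_grad():
    w -= lr * w.grad                               # push each parameter a small step downhill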

Page 12: ML / PyTorch Crash Course - Stanford University


Optimization

○ Activation functions
  ○ Needed for nonlinear relationships (can't capture the input-output relationship with just matrices)
  ○ "ReLU" is just max(0, x) (see the sketch below)

[Diagram: input [.2, .4, .5 ...] → Matrix + ReLU → Matrix + ReLU → Matrix + ReLU → Loss]
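A short sketch of stacking matrices with ReLUs in between (the layer sizes are placeholders):

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(256, 64),   # a matrix (plus a bias vector)
    nn.ReLU(),            # ReLU: max(0, x), applied elementwise
    nn.Linear(64, 2),     # another matrix, down to 2 class scores
)
scores = net(torch.rand(1, 256))   # forward pass through the whole stack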

Page 13: ML / PyTorch Crash Course - Stanford University


Optimization

○ Computational graph
  ○ Series of steps traversed from input to output (see the sketch below)

[Diagram: input [.2, .4, .5 ...] → Matrix + ReLU → Matrix + ReLU → Matrix + ReLU → Loss]
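PyTorch records this graph as you compute; a tiny example (the numbers are arbitrary):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.relu(x * 3.0)   # each operation becomes a node in the graph
print(y.grad_fn)          # the last node of the graph (a ReLU backward node)
y.backward()              # walk the graph backwards to get dy/dx
print(x.grad)             # tensor(3.)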


Page 15: ML / PyTorch Crash Course - Stanford University


Can be pretty wacky (InceptionNet)

Page 16: ML / PyTorch Crash Course - Stanford University


Loss curve

Page 17: ML / PyTorch Crash Course - Stanford University


Loss curve

A lot better!

Page 18: ML / PyTorch Crash Course - Stanford University


Distributed representations

○ Vectors between the layers (see the sketch below)
○ Especially towards the end of the NN
○ Group similar things near each other
○ Some insight into what models are doing!

[Diagram: input [.2, .4, .5 ...] → Matrix + ReLU → Matrix + ReLU → Matrix + ReLU → Loss]
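One hedged way to peek at such a vector is to split the model and look at the output of an intermediate layer (the sizes here are placeholders):

import torch
import torch.nn as nn

trunk = nn.Sequential(nn.Linear(256, 64), nn.ReLU())   # everything up to the layer we care about
head = nn.Linear(64, 2)                                 # the final classifier

hidden = trunk(torch.rand(1, 256))   # the distributed representation between the layers
scores = head(hidden)
print(hidden.shape)                  # torch.Size([1, 64])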

Page 19: ML / PyTorch Crash Course - Stanford University


PyTorch Crash Course

Page 20: ML / PyTorch Crash Course - Stanford University


The magic of PyTorch

○ Would be a huge pain to write all the matrices ourselves
○ And an even bigger pain to compute the gradients
○ PyTorch lets us:
  ○ Describe the steps from input to output
  ○ Define the loss, optimizer, learning rate
  ○ Input the data
  ○ Then it updates the parameters accordingly! :)

Page 21: ML / PyTorch Crash Course - Stanford University

Defining the model

○ nn.Module
  ○ Lets PyTorch keep track of params
○ __init__
  ○ Define the parameters in initialization
○ forward
  ○ The "forward pass": how the net goes from input to output (see the sketch below)
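The slide's actual code isn't reproduced here, but a minimal nn.Module following this pattern looks roughly like this (the layer sizes are placeholders):

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # parameters are defined (and registered) in __init__
        self.fc1 = nn.Linear(256, 64)
        self.fc2 = nn.Linear(64, 2)

    def forward(self, x):
        # forward pass: how the net goes from input to output
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)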

Page 22: ML / PyTorch Crash Course - Stanford University

Defining the model

○ Linear
  ○ A linear layer
  ○ "fc": fully connected
  ○ Matrix A and vector b
  ○ Input x, output Ax + b
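For example (the sizes are arbitrary):

import torch
import torch.nn as nn

fc = nn.Linear(256, 2)                 # A is 2x256, b has 2 entries
print(fc.weight.shape, fc.bias.shape)  # torch.Size([2, 256]) torch.Size([2])
out = fc(torch.rand(1, 256))           # computes Ax + b for each row of the batch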

Page 23: ML / PyTorch Crash Course - Stanford University

Defining the model

○ Conv2d
  ○ 2D convolutional layers
  ○ Special layers for images
  ○ Lets us tile a tiny matrix across the image, instead of one big matrix
  ○ Works better and takes less memory
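A hedged example (the channel counts and kernel size are placeholders):

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)  # a 3x3 "tiny matrix" tiled across the image
x = torch.rand(1, 1, 28, 28)   # [batch, channels, height, width]: one grayscale image
out = conv(x)                  # shape [1, 32, 26, 26]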

Page 24: ML / PyTorch Crash Course - Stanford University

Defining the model

○ max_pool2d
  ○ Makes the output of a layer smaller by taking the max over adjacent entries
  ○ Helps get from a large image to a binary decision
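A tiny sketch of what it does (the numbers are invented):

import torch
import torch.nn.functional as F

x = torch.tensor([[[[1.0, 2.0],
                    [3.0, 4.0]]]])     # shape [1, 1, 2, 2]
out = F.max_pool2d(x, kernel_size=2)   # keep only the max of each 2x2 block
print(out)                             # tensor([[[[4.]]]])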

Page 25: ML / PyTorch Crash Course - Stanford University

Defining the model

○ Dropout
  ○ Helps prevent memorization
  ○ Randomly "zeros out" some entries in the matrix each forward pass
  ○ Slightly magic
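For instance (the dropout probability is arbitrary):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each entry is zeroed with probability 0.5 (during training)
x = torch.ones(1, 8)
print(drop(x))             # roughly half the entries become 0; survivors are scaled by 1/(1-p)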

Page 26: ML / PyTorch Crash Course - Stanford University

Defining the model

○ log_softmax
  ○ Softmax takes exp() of every number in the vector
  ○ Then normalizes them to sum to one
  ○ This gets us a probability distribution
  ○ We return the log probabilities for numerical stability
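Roughly, for a made-up vector of scores:

import torch
import torch.nn.functional as F

scores = torch.tensor([[1.0, 2.0, 0.5]])
probs = F.softmax(scores, dim=1)          # exp() then normalize: the row sums to 1
log_probs = F.log_softmax(scores, dim=1)  # the same thing in log space (more numerically stable)
print(probs.sum())                        # tensor(1.)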

Page 27: ML / PyTorch Crash Course - Stanford University

Defining the data

○ datasets.MNIST
  ○ MNIST is a handwriting classification dataset
  ○ Helpful for post offices!
  ○ train=True/False defines the train/test split
    ○ We want to test our model on things it hasn't been trained on
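A hedged sketch (the download directory is a placeholder):

from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_set = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("./data", train=False, download=True, transform=transform)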

Page 28: ML / PyTorch Crash Course - Stanford University

Defining the data

○ Transforms
  ○ In this case, just tensorizes + normalizes
  ○ Can also apply data augmentations to images
    ○ Makes the dataset "bigger"
    ○ Harder to just memorize
    ○ E.g. random flipping, cropping
    ○ Don't want to do this for numbers!
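For example (the normalization mean/std here are commonly quoted MNIST statistics, but treat them as placeholders):

from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),                        # tensorize: image -> float tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)),   # normalize with a mean and std
])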

Page 29: ML / PyTorch Crash Course - Stanford University

Defining the data

○ DataLoader
  ○ Data processing can take a while - don't want your GPU to be waiting
  ○ Applies transformations in parallel
  ○ Returns batches
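A sketch building on the dataset above (the batch size and worker count are arbitrary):

from torch.utils.data import DataLoader

train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=2)   # num_workers > 0: load/transform in parallel
test_loader = DataLoader(test_set, batch_size=1000)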

Page 30: ML / PyTorch Crash Course - Stanford University

Training + testing

○ .to(device)
  ○ Sends it to the GPU, if you have one
○ Optimizer
  ○ Smarter version of SGD
  ○ Tunes the learning rate for each parameter (see the sketch below)
○ Training loop
  ○ Updates parameters on the full dataset, then evaluates it
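A rough outline of this setup; the optimizer choice and hyperparameters are placeholders, and the train/test helpers are sketched after the next slides:

import torch
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Net().to(device)                             # Net from the earlier sketch; .to() moves its parameters to the GPU
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # an adaptive optimizer: per-parameter effective step sizes

for epoch in range(1, 11):                           # each epoch: one pass over the full dataset, then evaluate
    train(model, device, train_loader, optimizer)
    test(model, device, test_loader)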

Page 31: ML / PyTorch Crash Course - Stanford University

Training loop

○ for batch_idx, (data, target) in enumerate(train_loader)
  ○ Fetches a batch
  ○ data is a tensor of size [batch_size, num_channels, height, width]
  ○ target is the label (which number)

Page 32: ML / PyTorch Crash Course - Stanford University

Training loop

○ .train()
  ○ Enables layers only used during training (e.g. dropout)
○ optimizer.zero_grad()
  ○ Discards the gradients computed last batch, for the old parameters
○ output = model(data)
  ○ Runs the model on a batch of data!

Page 33: ML / PyTorch Crash Course - Stanford University

Training loop

○ F.nll_loss(output, target)
  ○ Negative log likelihood: -log(p_correct_answer)
  ○ This is lower the higher the probability you assigned to the correct answer!
○ loss.backward()
  ○ Compute gradients
○ optimizer.step()
  ○ Update params with the gradients!
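Putting the last three slides together, a hedged sketch of a train helper (the function signature is an assumption):

import torch.nn.functional as F

def train(model, device, train_loader, optimizer):
    model.train()                              # enable training-only layers like dropout
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()                  # discard last batch's gradients
        output = model(data)                   # forward pass on the batch
        loss = F.nll_loss(output, target)      # -log p(correct answer), averaged over the batch
        loss.backward()                        # compute gradients
        optimizer.step()                       # update the parameters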

Page 34: ML / PyTorch Crash Course - Stanford University

Evaluation loop

○ model.eval()
  ○ Disable stuff like dropout
○ torch.no_grad()
  ○ Don't keep track of the computational graph (we're not computing gradients)
○ Computes accuracy based on the class with the highest predicted probability
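And a matching sketch of a test helper (the structure is assumed, not copied from the slides):

import torch

def test(model, device, test_loader):
    model.eval()                               # disable dropout etc.
    correct = 0
    with torch.no_grad():                      # no graph needed: we never call backward() here
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            pred = model(data).argmax(dim=1)   # class with the highest (log) probability
            correct += (pred == target).sum().item()
    print(f"Accuracy: {correct / len(test_loader.dataset):.3f}")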

Page 35: ML / PyTorch Crash Course - Stanford University


PyTorch Lightning

○ Organization
  ○ PyTorch is super useful, but can be kinda messy / disorganized
  ○ PL provides a nice way to structure your code (see the sketch below)
○ Functionality
  ○ In PyTorch, you have to do both research code (modeling) and engineering code (loading the model onto the GPU, remembering best practices about data loading)
  ○ PL automates a lot of this (can just set gpus=8 and it will do it for you)
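A hedged sketch of the structure PL expects (exact API details vary by version; the layer sizes and optimizer are placeholders):

import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torch import nn

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

    def training_step(self, batch, batch_idx):
        data, target = batch
        loss = F.cross_entropy(self.net(data), target)
        return loss                    # PL runs backward() and optimizer.step() for you

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# trainer = pl.Trainer(max_epochs=1)   # engineering knobs (GPUs, epochs, ...) live on the Trainer
# trainer.fit(LitClassifier(), train_loader)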

Page 36: ML / PyTorch Crash Course - Stanford University

PyTorch Lightning

Page 37: ML / PyTorch Crash Course - Stanford University


Weights and Biases

Keep track of experiments easily
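The core usage is a couple of calls (the project name and the logged metric are placeholders):

import wandb

wandb.init(project="pytorch-crash-course")   # creates a run on the W&B dashboard
for step in range(100):
    loss = 1.0 / (step + 1)                  # stand-in for a real training loss
    wandb.log({"loss": loss})                # shows up as a live chart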

Page 38: ML / PyTorch Crash Course - Stanford University

Weights and Biases

Page 39: ML / PyTorch Crash Course - Stanford University


Hydra

○ You'll run a lot of experiments with different configurations
○ Hydra is a tool to help you manage these
  ○ Without changing them manually in your code each time! (see the sketch below)
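A hedged sketch of the pattern (the file names and config fields are placeholders):

# conf/config.yaml (hypothetical) might contain:
#   lr: 0.001
#   batch_size: 64

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(cfg.lr, cfg.batch_size)   # values come from the YAML file
    # ... build the model / optimizer from cfg ...

if __name__ == "__main__":
    main()   # override from the command line, e.g.: python train.py lr=0.01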

Page 40: ML / PyTorch Crash Course - Stanford University


Hydra

Page 41: ML / PyTorch Crash Course - Stanford University


Google Cloud

○ Colab is nice, easy, and free
  ○ But can be a pain to use (keep getting disconnected)
○ If you want to train longer, you can use Google Cloud
  ○ You start with $300 free, then we can supplement with an extra $50
  ○ More of a pain to set up, but you get a dedicated GPU
  ○ Really helpful guide: https://github.com/cs231n/gcloud
  ○ Crucial: stop your instances when you're not using them
    ○ Otherwise your credits will keep draining and you'll be sad