ML / PyTorch Crash Course
Alex Tamkin, Stanford University


Page 1: ML / PyTorch Crash Course - Stanford University

ML / PyTorch Crash Course
Alex Tamkin

Page 2: ML / PyTorch Crash Course - Stanford University


Prelude

○ Anything you especially want to focus on?

○ Don't expect to understand all of this perfectly from today!

○ Drinking from a firehose
○ Slides will be uploaded

Page 3: ML / PyTorch Crash Course - Stanford University


ML Crash Course

Page 4: ML / PyTorch Crash Course - Stanford University


Neural Network Classifiers

[Diagram: Neural Network → OK: 99.9%, Defective: 0.1%]

Page 5: ML / PyTorch Crash Course - Stanford University


Inside a neural net

[Diagram: input [.2, .4, .5 ...] → lots of matrices → OK: 99.9%, Defective: 0.1%]

Neural nets learn transformations from inputs to outputs

Page 6: ML / PyTorch Crash Course - Stanford University


Training a neural net

○ You don't program the neural net; the data programs the neural net

○ It learns through examples

[Diagram: example images labeled OK, Defective, OK]

Page 7: ML / PyTorch Crash Course - Stanford University


Inside a neural net

○ Simplest neural net: y = softmax(Ax) (see the sketch below)
○ x = input image (size 16x16?)
○ y = label (size 2): [0.0, 1.0] or [1.0, 0.0]
○ A: a matrix that goes from 16x16 to 2
○ Softmax: makes sure the numbers add up to 1, so it's a probability distribution!
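A minimal sketch of this one-matrix classifier in PyTorch (the 16x16 input and 2 classes follow the slide; the tensors here are random placeholders):

import torch
import torch.nn.functional as F

x = torch.rand(16 * 16)       # flattened 16x16 input image
A = torch.randn(2, 16 * 16)   # matrix mapping 256 pixel values to 2 class scores
y = F.softmax(A @ x, dim=0)   # softmax: exponentiate and normalize so the 2 scores sum to 1
print(y)                      # two probabilities that add up to 1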

Page 8: ML / PyTorch Crash Course - Stanford University


Optimization (example: faucet)

○ Loss / objective
  ○ Faucet: how happy you are with the temp/pressure
  ○ NN: how much your network's predictions line up with the labels
  ○ Rewarded based on the probability you assigned to the correct answer

Page 9: ML / PyTorch Crash Course - Stanford University


Optimization (example: faucet)

○ Parameters: determine the behavior of your model
  ○ Start close to 0
  ○ Faucet: positions of the handles (sad, no pressure)
  ○ NN: entries of the matrices

Page 10: ML / PyTorch Crash Course - Stanford University


Optimization (example: faucet)

○ Optimization algorithm
  ○ SGD: Stochastic gradient descent
  ○ Compute loss on a "batch" of data (several examples)
  ○ Gradient: what direction to push each parameter to decrease loss?

Page 11: ML / PyTorch Crash Course - Stanford University


Optimization (example: faucet)

○ Learning rate: how much to push the parameters in the gradient direction (see the sketch below)
  ○ Too big: can overshoot!
    ○ Think of when the handles are sticky and the water suddenly gets too hot
  ○ Too small: takes forever
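A tiny, hedged illustration of one gradient-descent step (the data, loss, and learning rate are made-up numbers):

import torch

w = torch.tensor([0.0, 0.0], requires_grad=True)  # parameters start close to 0
x = torch.tensor([1.0, 2.0])                       # a "batch" of data (just one example here)
target = torch.tensor(3.0)
lr = 0.1                                           # learning rate

loss = (w @ x - target) ** 2                       # loss: how far the prediction is from the label
loss.backward()                                    # gradient: which way to push each parameter
with torch.no_grad():
    w -= lr * w.grad                               # push each parameter a small step downhill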

Page 12: ML / PyTorch Crash Course - Stanford University


Optimization

○ Activation functions
  ○ Needed for nonlinear relationships (can't capture the input-output relationship with just matrices)
  ○ "ReLU" is just max(0, x) (see the sketch below)

[Diagram: input [.2, .4, .5 ...] → Matrix + ReLU → Matrix + ReLU → Matrix + ReLU → Loss]
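A short sketch of stacking matrices with ReLUs in between (the layer sizes are placeholders):

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(256, 64),   # a matrix (plus a bias vector)
    nn.ReLU(),            # ReLU: max(0, x), applied elementwise
    nn.Linear(64, 2),     # another matrix, down to 2 class scores
)
scores = net(torch.rand(1, 256))   # forward pass through the whole stack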

Page 13: ML / PyTorch Crash Course - Stanford University


Optimization

○ Computational graph
  ○ Series of steps traversed from input to output (see the sketch below)

[Diagram: input [.2, .4, .5 ...] → Matrix + ReLU → Matrix + ReLU → Matrix + ReLU → Loss]
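PyTorch records this graph as you compute; a tiny example (the numbers are arbitrary):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.relu(x * 3.0)   # each operation becomes a node in the graph
print(y.grad_fn)          # the last node of the graph (a ReLU backward node)
y.backward()              # walk the graph backwards to get dy/dx
print(x.grad)             # tensor(3.)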


Page 15: ML / PyTorch Crash Course - Stanford University


Can be pretty wacky (InceptionNet)

Page 16: ML / PyTorch Crash Course - Stanford University


Loss curve

Page 17: ML / PyTorch Crash Course - Stanford University


Loss curve

A lot better!

Page 18: ML / PyTorch Crash Course - Stanford University


Distributed representations

○ Vectors between the layers (see the sketch below)
○ Especially towards the end of the NN
○ Group similar things near each other
○ Some insight into what models are doing!

[Diagram: input [.2, .4, .5 ...] → Matrix + ReLU → Matrix + ReLU → Matrix + ReLU → Loss]
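One hedged way to peek at such a vector is to split the model and look at the output of an intermediate layer (the sizes here are placeholders):

import torch
import torch.nn as nn

trunk = nn.Sequential(nn.Linear(256, 64), nn.ReLU())   # everything up to the layer we care about
head = nn.Linear(64, 2)                                 # the final classifier

hidden = trunk(torch.rand(1, 256))   # the distributed representation between the layers
scores = head(hidden)
print(hidden.shape)                  # torch.Size([1, 64])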

Page 19: ML / PyTorch Crash Course - Stanford University


PyTorch Crash Course

Page 20: ML / PyTorch Crash Course - Stanford University


The magic of PyTorch

○ Would be a huge pain to write all the matrices ourselves
○ And an even bigger pain to compute the gradients
○ PyTorch lets us:
  ○ Describe the steps from input to output
  ○ Define the loss, optimizer, learning rate
  ○ Input the data
  ○ Then it updates the parameters accordingly! :)

Page 21: ML / PyTorch Crash Course - Stanford University

Defining the model

○ nn.Module
  ○ Lets PyTorch keep track of params
○ __init__
  ○ Define the parameters in initialization
○ forward
  ○ The "forward pass": how the net goes from input to output (see the sketch below)
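The slide's actual code isn't reproduced here, but a minimal nn.Module following this pattern looks roughly like this (the layer sizes are placeholders):

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # parameters are defined (and registered) in __init__
        self.fc1 = nn.Linear(256, 64)
        self.fc2 = nn.Linear(64, 2)

    def forward(self, x):
        # forward pass: how the net goes from input to output
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)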

Page 22: ML / PyTorch Crash Course - Stanford University

Defining the model

○ Linear
  ○ A linear layer
  ○ "fc": fully connected
  ○ Matrix A and vector b
  ○ Input x, output Ax + b
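For example (the sizes are arbitrary):

import torch
import torch.nn as nn

fc = nn.Linear(256, 2)                 # A is 2x256, b has 2 entries
print(fc.weight.shape, fc.bias.shape)  # torch.Size([2, 256]) torch.Size([2])
out = fc(torch.rand(1, 256))           # computes Ax + b for each row of the batch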

Page 23: ML / PyTorch Crash Course - Stanford University

Defining the model

○ Conv2d
  ○ 2D convolutional layers
  ○ Special layers for images
  ○ Lets us tile a tiny matrix across the image, instead of one big matrix
  ○ Works better and takes less memory
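A hedged example (the channel counts and kernel size are placeholders):

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)  # a 3x3 "tiny matrix" tiled across the image
x = torch.rand(1, 1, 28, 28)   # [batch, channels, height, width]: one grayscale image
out = conv(x)                  # shape [1, 32, 26, 26]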

Page 24: ML / PyTorch Crash Course - Stanford University

Defining the model

○ max_pool2d
  ○ Makes the output of a layer smaller by taking the max over adjacent entries
  ○ Helps get from a large image to a binary decision
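A tiny sketch of what it does (the numbers are invented):

import torch
import torch.nn.functional as F

x = torch.tensor([[[[1.0, 2.0],
                    [3.0, 4.0]]]])     # shape [1, 1, 2, 2]
out = F.max_pool2d(x, kernel_size=2)   # keep only the max of each 2x2 block
print(out)                             # tensor([[[[4.]]]])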

Page 25: ML / PyTorch Crash Course - Stanford University

Defining the model

○ Dropout
  ○ Helps prevent memorization
  ○ Randomly "zeros out" some entries in the matrix each forward pass
  ○ Slightly magic
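For instance (the dropout probability is arbitrary):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each entry is zeroed with probability 0.5 (during training)
x = torch.ones(1, 8)
print(drop(x))             # roughly half the entries become 0; survivors are scaled by 1/(1-p)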

Page 26: ML / PyTorch Crash Course - Stanford University

Defining the model

○ log_softmax
  ○ Softmax takes exp() of every number in the vector
  ○ Then normalizes them to sum to one
  ○ This gets us a probability distribution
  ○ We return the log probabilities for numerical stability
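Roughly, for a made-up vector of scores:

import torch
import torch.nn.functional as F

scores = torch.tensor([[1.0, 2.0, 0.5]])
probs = F.softmax(scores, dim=1)          # exp() then normalize: the row sums to 1
log_probs = F.log_softmax(scores, dim=1)  # the same thing in log space (more numerically stable)
print(probs.sum())                        # tensor(1.)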

Page 27: ML / PyTorch Crash Course - Stanford University

Defining the data

○ datasets.MNIST
  ○ MNIST is a handwriting classification dataset
  ○ Helpful for post offices!
  ○ train=True/False defines the train/test split
    ○ We want to test our model on things it hasn't been trained on
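A hedged sketch (the download directory is a placeholder):

from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_set = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("./data", train=False, download=True, transform=transform)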

Page 28: ML / PyTorch Crash Course - Stanford University

Defining the data

○ Transforms
  ○ In this case, just tensorizes + normalizes
  ○ Can also apply data augmentations to images
    ○ Makes the dataset "bigger"
    ○ Harder to just memorize
    ○ E.g. random flipping, cropping
    ○ Don't want to do this for numbers!
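For example (the normalization mean/std here are commonly quoted MNIST statistics, but treat them as placeholders):

from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),                        # tensorize: image -> float tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)),   # normalize with a mean and std
])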

Page 29: ML / PyTorch Crash Course - Stanford University

Defining the data

○ DataLoader
  ○ Data processing can take a while - don't want your GPU to be waiting
  ○ Applies transformations in parallel
  ○ Returns batches
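A sketch building on the dataset above (the batch size and worker count are arbitrary):

from torch.utils.data import DataLoader

train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=2)   # num_workers > 0: load/transform in parallel
test_loader = DataLoader(test_set, batch_size=1000)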

Page 30: ML / PyTorch Crash Course - Stanford University

Training + testing

○ .to(device)
  ○ Sends it to the GPU, if you have one
○ Optimizer
  ○ Smarter version of SGD
  ○ Tunes the learning rate for each parameter (see the sketch below)
○ Training loop
  ○ Updates parameters on the full dataset, then evaluates it
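A rough outline of this setup; the optimizer choice and hyperparameters are placeholders, and the train/test helpers are sketched after the next slides:

import torch
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Net().to(device)                             # Net from the earlier sketch; .to() moves its parameters to the GPU
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # an adaptive optimizer: per-parameter effective step sizes

for epoch in range(1, 11):                           # each epoch: one pass over the full dataset, then evaluate
    train(model, device, train_loader, optimizer)
    test(model, device, test_loader)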

Page 31: ML / PyTorch Crash Course - Stanford University

Training loop

○ for batch_idx, (data, target) in enumerate(train_loader)
  ○ Fetches a batch
  ○ data is a tensor of size [batch_size, num_channels, height, width]
  ○ target is the label (which number)

Page 32: ML / PyTorch Crash Course - Stanford University

Training loop

○ .train()
  ○ Enables layers only used during training (e.g. dropout)
○ optimizer.zero_grad()
  ○ Discards the gradients computed last batch, for the old parameters
○ output = model(data)
  ○ Runs the model on a batch of data!

Page 33: ML / PyTorch Crash Course - Stanford University

Training loop

○ F.nll_loss(output, target)
  ○ Negative log likelihood: -log(p_correct_answer)
  ○ This is lower the higher the probability you assigned to the correct answer!
○ loss.backward()
  ○ Compute gradients
○ optimizer.step()
  ○ Update params with the gradients!
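Putting the last three slides together, a hedged sketch of a train helper (the function signature is an assumption):

import torch.nn.functional as F

def train(model, device, train_loader, optimizer):
    model.train()                              # enable training-only layers like dropout
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()                  # discard last batch's gradients
        output = model(data)                   # forward pass on the batch
        loss = F.nll_loss(output, target)      # -log p(correct answer), averaged over the batch
        loss.backward()                        # compute gradients
        optimizer.step()                       # update the parameters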

Page 34: ML / PyTorch Crash Course - Stanford University

Evaluation loop

○ model.eval()
  ○ Disable stuff like dropout
○ torch.no_grad()
  ○ Don't keep track of the computational graph (we're not computing gradients)
○ Computes accuracy based on the class with the highest predicted probability
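And a matching sketch of a test helper (the structure is assumed, not copied from the slides):

import torch

def test(model, device, test_loader):
    model.eval()                               # disable dropout etc.
    correct = 0
    with torch.no_grad():                      # no graph needed: we never call backward() here
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            pred = model(data).argmax(dim=1)   # class with the highest (log) probability
            correct += (pred == target).sum().item()
    print(f"Accuracy: {correct / len(test_loader.dataset):.3f}")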

Page 35: ML / PyTorch Crash Course - Stanford University


PyTorch Lightning

○ Organization
  ○ PyTorch is super useful, but can be kinda messy / disorganized
  ○ PL provides a nice way to structure your code (see the sketch below)
○ Functionality
  ○ In PyTorch, you have to do both research code (modeling) and engineering code (loading the model onto the GPU, remembering best practices about data loading)
  ○ PL automates a lot of this (can just set gpus=8 and it will do it for you)
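A hedged sketch of the structure PL expects (exact API details vary by version; the layer sizes and optimizer are placeholders):

import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torch import nn

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

    def training_step(self, batch, batch_idx):
        data, target = batch
        loss = F.cross_entropy(self.net(data), target)
        return loss                    # PL runs backward() and optimizer.step() for you

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# trainer = pl.Trainer(max_epochs=1)   # engineering knobs (GPUs, epochs, ...) live on the Trainer
# trainer.fit(LitClassifier(), train_loader)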

Page 36: ML / PyTorch Crash Course - Stanford University

PyTorch Lightning

Page 37: ML / PyTorch Crash Course - Stanford University


Weights and Biases

Keep track of experiments easily
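The core usage is a couple of calls (the project name and the logged metric are placeholders):

import wandb

wandb.init(project="pytorch-crash-course")   # creates a run on the W&B dashboard
for step in range(100):
    loss = 1.0 / (step + 1)                  # stand-in for a real training loss
    wandb.log({"loss": loss})                # shows up as a live chart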

Page 38: ML / PyTorch Crash Course - Stanford University

Weights and Biases

Page 39: ML / PyTorch Crash Course - Stanford University


Hydra

○ You'll run a lot of experiments with different configurations
○ Hydra is a tool to help you manage these
  ○ Without changing them manually in your code each time! (see the sketch below)
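A hedged sketch of the pattern (the file names and config fields are placeholders):

# conf/config.yaml (hypothetical) might contain:
#   lr: 0.001
#   batch_size: 64

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(cfg.lr, cfg.batch_size)   # values come from the YAML file
    # ... build the model / optimizer from cfg ...

if __name__ == "__main__":
    main()   # override from the command line, e.g.: python train.py lr=0.01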

Page 40: ML / PyTorch Crash Course - Stanford University


Hydra

Page 41: ML / PyTorch Crash Course - Stanford University


Google Cloud

○ Colab is nice, easy, and free
  ○ But can be a pain to use (keep getting disconnected)
○ If you want to train longer, you can use Google Cloud
  ○ You start with $300 free, then we can supplement with an extra $50
  ○ More of a pain to set up, but you get a dedicated GPU
  ○ Really helpful guide: https://github.com/cs231n/gcloud
  ○ Crucial: stop your instances when you're not using them
    ○ Otherwise your credits will keep draining and you'll be sad