Stanford University
Lecture: Introduction to Deep Learning
Juan Carlos Niebles and Ranjay Krishna, Stanford Vision and Learning Lab
Lecture 19 - 1 6 Dec 2016
Slides adapted from Justin Johnson
Stanford University
So far this quarter
Lecture 19 - 2 6 Dec 2016
Edge detection
RANSAC
SIFT
K-Means
Mean-shift
Linear classifiers
Image features
PCA / Eigenfaces
5 Dec 2016
Stanford University Lecture 19 - 3 6 Dec 2016
Current Research
ECCV 2016 (photo credit: Justin Johnson)
Stanford University
Current Research
Lecture 19 - 4 6 Dec 2016
ECCV 2016 (photo credit: Justin Johnson)
Convolutional neural networks
Recurrent neural networks
Autoencoders
Fully convolutional networks
Multilayer perceptrons
What are these things?
Why are they important?
5 Dec 2016
Stanford University
Deep Learning
● Learning hierarchical representations from data
● End-to-end learning: raw inputs to predictions
● Can use a small set of simple tools to solve many problems
● Has led to rapid progress on many problems
● (Very loosely!) Inspired by the brain
● Today: 1 hour introduction to deep learning fundamentals
● To learn more, take CS 231n in the Spring
Lecture 19 - 5 6 Dec 2016
Stanford University
Deep learning for many different problems
Lecture 19 - 6 6 Dec 2016
Stanford University Lecture 19 - 7 6 Dec 2016
Stanford University Lecture 19 - 8 6 Dec 2016
Slide credit: CS 231n, Lecture 1, 2016
5 Dec 2016
Stanford University Lecture 19 - 9 6 Dec 2016
Slide credit: Kaiming He, ICCV 2015
5 Dec 2016
Stanford University Lecture 19 - 10 6 Dec 2016
Slide credit: Kaiming He, ICCV 2015
Human: ≈5.1 (human benchmark from Russakovsky et al, “ImageNet Large Scale Visual Recognition Challenge”, IJCV 2015)
5 Dec 2016
Stanford University Lecture 19 - 11 6 Dec 2016
Slide credit: Kaiming He, ICCV 2015
5 Dec 2016
Stanford University Lecture 19 - 12 6 Dec 2016
Slide credit: Ross Girshick, ICCV 2015
Deep Learning for Object Detection
5 Dec 2016
Stanford University Lecture 19 - 13 6 Dec 2016
Slide credit: Ross Girshick, ICCV 2015
Deep Learning for Object Detection (Current SOTA: 85.6)
He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
5 Dec 2016
Stanford University
Object segmentation
Lecture 19 - 14 6 Dec 2016
Figure credit: Dai, He, and Sun, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, CVPR 2016
5 Dec 2016
Stanford University
Pose Estimation
Lecture 19 - 15 6 Dec 2016
Figure credit: Cao et al, “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”, arXiv 2016
5 Dec 2016
Stanford University
Deep Learning for Image Captioning
Lecture 19 - 16 6 Dec 2016
Figure credit: Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015
5 Dec 2016
Stanford University
Dense Image Captioning
Lecture 19 - 17 6 Dec 2016
Figure credit: Johnson*, Karpathy*, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
5 Dec 2016
Stanford University
Visual Question Answering
Lecture 19 - 18 6 Dec 2016
Figure credit: Agrawal et al, “VQA: Visual Question Answering”, ICCV 2015
Figure credit: Zhu et al, “Visual7W: Grounded Question Answering in Images”, CVPR 2016
5 Dec 2016
Stanford University
Image Super-Resolution
Lecture 19 - 19 6 Dec 2016
Figure credit: Ledig et al, “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network”, arXiv 2016
5 Dec 2016
Stanford University
Generating Art
Lecture 19 - 20 6 Dec 2016
Figure credit: Gatys, Ecker, and Bethge, “Image Style Transfer using Convolutional Neural Networks”, CVPR 2016
Figure credit: Johnson, Alahi, and Fei-Fei: “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, ECCV 2016, https://github.com/jcjohnson/fast-neural-style
Figure credit: Mordvintsev, Olah, and Tyka, “Inceptionism: Going Deeper into Neural Networks”, https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html
5 Dec 2016
Stanford University
Outside Computer Vision
Lecture 19 - 21 6 Dec 2016
Machine Translation
Figure credit: Bahdanau, Cho, and Bengio, “Neural Machine Translation by jointly learning to align and translate”, ICLR 2015
Speech Recognition
Figure credit: Amodei et al, “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”, arXiv 2015
Speech Synthesis
Figure credit: van den Oord et al, “WaveNet: A Generative Model for Raw Audio”, arXiv 2016, https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Text Synthesis
Figure credit: Sutskever, Martens, and Hinton, “Generating Text with Recurrent Neural Networks”, ICML 2011
5 Dec 2016
Stanford University
Deep Reinforcement Learning
Lecture 19 - 22 6 Dec 2016
Playing Atari games
Mnih et al, “Human-level control through deep reinforcement learning”, Nature 2015
Silver et al, “Mastering the game of Go with deep neural networks and tree search”, Nature 2016
Image credit: http://www.newyorker.com/tech/elements/alphago-lee-sedol-and-the-reassuring-future-of-humans-and-machines
AlphaGo beats Lee Sedol
5 Dec 2016
Stanford University Lecture 19 - 23 6 Dec 2016
Outline
● Applications
● Motivation
● Supervised learning
● Gradient descent
● Backpropagation
● Convolutional Neural Networks
5 Dec 2016
Stanford University Lecture 19 - 24 6 Dec 2016
Outline
● Applications
● Motivation
● Supervised learning
● Gradient descent
● Backpropagation
● Convolutional Neural Networks
5 Dec 2016
Stanford University
Recall: Image Classification
Lecture 19 - 25 6 Dec 2016
Write a function that maps images to labels (also called object recognition)
Dataset: ETH-80, by B. Leibe; Slide credit: L. Lazebnik
No obvious way to implement this function!
Slide credit: CS 231n, Lecture 2, 2016
5 Dec 2016
Stanford University
Recall: Image Classification
Lecture 19 - 26 6 Dec 2016
Example training set
Slide credit: CS 231n, Lecture 2, 2016
5 Dec 2016
Stanford University
Recall: Image Classification
Lecture 19 - 27 6 Dec 2016
Figure credit: D. Hoiem, L. Lazebnik
5 Dec 2016
Stanford University
Recall: Image Classification
Lecture 19 - 28 6 Dec 2016
Features remain fixed; classifier is learned from data
Figure credit: D. Hoiem, L. Lazebnik
5 Dec 2016
Stanford University
Recall: Image Classification
Lecture 19 - 29 6 Dec 2016
Features remain fixed; classifier is learned from data
Problem: How do we know which features to use? We may need different features for each problem!
Figure credit: D. Hoiem, L. Lazebnik
5 Dec 2016
Stanford University
Recall: Image Classification
Lecture 19 - 30 6 Dec 2016
Features remain fixed; classifier is learned from data
Problem: How do we know which features to use? We may need different features for each problem!
Solution: Learn the features jointly with the classifier!
Figure credit: D. Hoiem, L. Lazebnik
5 Dec 2016
Stanford University Lecture 19 - 31 6 Dec 2016
Image Classification: Feature Learning
[Pipeline diagram: Training takes image features plus training labels and produces a learned model (learned features and learned classifier); Prediction reuses the learned features and learned classifier on new images]
5 Dec 2016
Stanford University Lecture 19 - 32 6 Dec 2016
Image Classification: Deep Learning
[Pipeline diagram: Training images and training labels feed a learned model consisting of low-level features → mid-level features → high-level features → classifier; Prediction reuses the same learned stack]
Model is deep: it has many layers
Models can learn hierarchical features
5 Dec 2016
Stanford University Lecture 19 - 33 6 Dec 2016
Image Classification: Deep Learning
[Same pipeline diagram as the previous slide: low-level → mid-level → high-level features → classifier, all learned from data]
Model is deep: it has many layers
Models can learn hierarchical features
Figure credit: Zeiler and Fergus, “Visualizing and Understanding Convolutional Networks”, ECCV 2014
5 Dec 2016
Stanford University Lecture 19 - 34 6 Dec 2016
Outline
● Applications
● Motivation
● Supervised learning
● Gradient descent
● Backpropagation
● Convolutional Neural Networks
5 Dec 2016
Stanford University
Supervised Learning in 3 easy steps: How to learn models from data
Lecture 19 - 35 6 Dec 2016
Step 1: Define Model
5 Dec 2016
Stanford University
Supervised Learning in 3 easy steps: How to learn models from data
Lecture 19 - 36 6 Dec 2016
Input data (Image pixels)
Model weights
Predicted output (image label)
Model structure
Step 1: Define Model
5 Dec 2016
Stanford University
Supervised learning
Lecture 19 - 37 5 Dec 2016
Stanford University
Supervised Learning in 3 easy steps: How to learn models from data
Lecture 19 - 38 6 Dec 2016
Input data (Image pixels)
Model weights
Predicted output (image label)
Model structure
Training input
True output
Step 1: Define Model
Step 2: Collect data
5 Dec 2016
Stanford University
Supervised Learning in 3 easy steps: How to learn models from data
Lecture 19 - 39 6 Dec 2016
Input data (Image pixels)
Model weights
Predicted output (image label)
Model structure
Training input
True output
Step 1: Define Model
Step 2: Collect data
Step 3: Learn the model
5 Dec 2016
Stanford University
Supervised Learning in 3 easy steps: How to learn models from data
Lecture 19 - 40 6 Dec 2016
Input data (Image pixels)
Model weights
Predicted output (image label)
Model structure
Learned weights
Minimize average loss over training set
Loss function: Measures “badness” of prediction
Training input
True output
Predicted output
Step 1: Define Model
Step 2: Collect data
Step 3: Learn the model
Regularizer: Penalizes complex models
5 Dec 2016
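Written as a formula (a generic version of the objective described on this slide, not necessarily the slide's exact notation; $L$ is the loss, $R$ the regularizer, $\lambda$ a regularization strength):

$$ w^{*} = \arg\min_{w}\ \frac{1}{N}\sum_{i=1}^{N} L\big(f(x_i; w),\, y_i\big) \;+\; \lambda\, R(w) $$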
Stanford University Lecture 19 - 41 6 Dec 2016
Regularizer is the Frobenius norm of the matrix (sum of squares of entries)
Learning Problem
Input and output are vectors; loss is Euclidean distance
Linear Regression: Model is just a matrix multiply
Supervised Learning: Linear regression. Sometimes called “Ridge Regression” with this regularizer
5 Dec 2016
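Putting the pieces of this slide together (a standard ridge-regression objective; the slide's own symbols may differ):

$$ \min_{W}\ \sum_{i=1}^{N} \lVert W x_i - y_i \rVert_2^{2} \;+\; \lambda \lVert W \rVert_F^{2} $$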
Stanford University
Supervised learning
Lecture 19 - 42 5 Dec 2016
Stanford University Lecture 19 - 43 6 Dec 2016
Regularizer is the Frobenius norm of the matrix (sum of squares of entries)
Learning Problem
Input and output are vectors; loss is Euclidean distance
Linear Regression: Model is just a matrix multiply
Supervised Learning: Linear regression. Sometimes called “Ridge Regression” with this regularizer
5 Dec 2016
Stanford University Lecture 19 - 44 6 Dec 2016
Input and output are vectors; loss is Euclidean distance
Linear Regression: Model is just a matrix multiply
New Model: Model is two matrix multiplies
Supervised Learning: Neural Network. Sometimes called “Fully-Connected Network” or “Multilayer Perceptron”
5 Dec 2016
Stanford University Lecture 19 - 45 6 Dec 2016
Question: Is the new model “more powerful” than Linear Regression? Can it represent any functions that Linear Regression cannot?
Input and output are vectors; loss is Euclidean distance
Linear Regression: Model is just a matrix multiply
New Model: Model is two matrix multiplies
Supervised Learning: Neural Network. Sometimes called “Fully-Connected Network” or “Multilayer Perceptron”
5 Dec 2016
Stanford University Lecture 19 - 46 6 Dec 2016
Question: Is the new model “more powerful” than Linear Regression? Can it represent any functions that Linear Regression cannot?
Answer: NO! We can write W = W2·W1 (two matrix multiplies compose into a single matrix multiply) and recover Linear Regression
Input and output are vectors; loss is Euclidean distance
Linear Regression: Model is just a matrix multiply
New Model: Model is two matrix multiplies
Supervised Learning: Neural Network. Sometimes called “Fully-Connected Network” or “Multilayer Perceptron”
5 Dec 2016
Stanford University Lecture 19 - 47 6 Dec 2016
Input and output are vectors
Model is two matrix multiplies, with an elementwise nonlinearity
Loss is Euclidean distance
Linear Regression vs. Neural Network
Model is just a matrix multiply
Supervised Learning: Neural Network. Sometimes called “Fully-Connected Network” or “Multilayer Perceptron”
5 Dec 2016
Stanford University Lecture 19 - 48 6 Dec 2016
Input and output are vectors
Model is two matrix multiplies, with an elementwise nonlinearity
Loss is Euclidean distance
Linear Regression vs. Neural Network
Model is just a matrix multiply
Supervised Learning: Neural Network. Sometimes called “Fully-Connected Network” or “Multilayer Perceptron”
Common nonlinearities: Rectified Linear (ReLU); Logistic Sigmoid
5 Dec 2016
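For reference, the standard definitions of the two nonlinearities named above:

$$ \mathrm{ReLU}(z) = \max(0, z), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} $$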
Stanford University Lecture 19 - 49 6 Dec 2016
Input and output are vectors
Model is two matrix multiplies, with an elementwise nonlinearity
Loss is Euclidean distance
Linear Regression vs. Neural Network
Model is just a matrix multiply
Supervised Learning: Neural Network. Sometimes called “Fully-Connected Network” or “Multilayer Perceptron”
Common nonlinearities: Rectified Linear (ReLU); Logistic Sigmoid
Neural Network is more powerful than Linear Regression!
5 Dec 2016
Stanford University Lecture 19 - 50 6 Dec 2016
Neural Networks with many layers. Sometimes called “Fully-Connected Network” or “Multilayer Perceptron”
Two Layer network
5 Dec 2016
Stanford University Lecture 19 - 51 6 Dec 2016
Neural Networks with many layers. Sometimes called “Fully-Connected Network” or “Multilayer Perceptron”
Two Layer network; Three Layer network
5 Dec 2016
Stanford University Lecture 19 - 52 6 Dec 2016
Neural Networks with many layers. Sometimes called “Fully-Connected Network” or “Multilayer Perceptron”
Two Layer network; Three Layer network
Hidden layers are learned feature representations of the input!
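In symbols (a generic form, assuming an elementwise nonlinearity $\phi$ such as ReLU; not necessarily the slide's exact notation): a two-layer network computes $f(x) = W_2\,\phi(W_1 x)$ and a three-layer network computes $f(x) = W_3\,\phi(W_2\,\phi(W_1 x))$. The intermediate activations $\phi(W_1 x)$ and $\phi(W_2\,\phi(W_1 x))$ are the hidden layers, i.e. the learned feature representations.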
Stanford University Lecture 19 - 53 6 Dec 2016
Outline
● Applications
● Motivation
● Supervised learning
● Gradient descent
● Backpropagation
● Convolutional Neural Networks
5 Dec 2016
Stanford University
How to find the best weights w?
Lecture 19 - 54 6 Dec 2016
Stanford University
How to find the best weights w?
Lecture 19 - 55 6 Dec 2016
Slide credit: CS 231n, Lecture 3
5 Dec 2016
Stanford University
How to find the best weights w?
Lecture 19 - 56 6 Dec 2016
Idea #1: Closed Form Solution. Works for some models (linear regression) but in general very hard (even for one-layer net)
Slide credit: CS 231n, Lecture 3
5 Dec 2016
Stanford University
How to find the best weights w?
Lecture 19 - 57 6 Dec 2016
Idea #1: Closed Form Solution. Works for some models (linear regression) but in general very hard (even for one-layer net)
Idea #2: Random Search. Try a bunch of random values for w, keep the best one. Extremely inefficient in high dimensions
Slide credit: CS 231n, Lecture 3
5 Dec 2016
Stanford University
How to find the best weights w?
Lecture 19 - 58 6 Dec 2016
Idea #1: Closed Form Solution. Works for some models (linear regression) but in general very hard (even for one-layer net)
Idea #2: Random Search. Try a bunch of random values for w, keep the best one. Extremely inefficient in high dimensions
Idea #3: Go downhill. Start somewhere random, take tiny steps downhill and repeat. Good idea!
Slide credit: CS 231n, Lecture 3
5 Dec 2016
Stanford University
Which way is downhill?
Lecture 19 - 59 6 Dec 2016
Recall from single-variable calculus:
If w is a scalar the derivative tells us how much g will change if w changes a little bit
5 Dec 2016
Stanford University
Which way is downhill?
Lecture 19 - 60 6 Dec 2016
Recall from single-variable calculus:
If w is a scalar the derivative tells us how much g will change if w changes a little bit
If w is a vector then we can take partial derivatives with respect to each component
The gradient is the vector of all partial derivatives; it has the same shape as w
And multivariable calculus:
5 Dec 2016
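Written out in standard multivariable-calculus notation (for a weight vector with components $w_1, \ldots, w_d$):

$$ \nabla_{w}\, g(w) = \left( \frac{\partial g}{\partial w_1},\ \frac{\partial g}{\partial w_2},\ \ldots,\ \frac{\partial g}{\partial w_d} \right) $$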
Stanford University
Which way is downhill?
Lecture 19 - 61 6 Dec 2016
Recall from single-variable calculus:
If w is a scalar the derivative tells us how much g will change if w changes a little bit
If w is a vector then we can take partial derivatives with respect to each component
The gradient is the vector of all partial derivatives; it has the same shape as w
And multivariable calculus:
Useful fact: The gradient of g at w gives the direction of steepest increase
5 Dec 2016
Stanford University
Gradient descent: The simple algorithm behind all deep learning
Lecture 19 - 62 6 Dec 2016
Initialize w randomly
While true:
    Compute gradient ∇g(w) at current point
    Move downhill a little bit: w ← w − (learning rate) · ∇g(w)
5 Dec 2016
Stanford University
Gradient descent: The simple algorithm behind all deep learning
Lecture 19 - 63 6 Dec 2016
Initialize w randomly
While true:
    Compute gradient ∇g(w) at current point
    Move downhill a little bit: w ← w − (learning rate) · ∇g(w)
Learning rate: How big each step should be
5 Dec 2016
Stanford University
Gradient descent: The simple algorithm behind all deep learning
Lecture 19 - 64 6 Dec 2016
Initialize w randomly
While true:
    Compute gradient ∇g(w) at current point
    Move downhill a little bit: w ← w − (learning rate) · ∇g(w)
Learning rate: How big each step should be
In practice we estimate the gradient using a small minibatch of data
(Stochastic gradient descent)
5 Dec 2016
Stanford University
Gradient descent: The simple algorithm behind all deep learning
Lecture 19 - 65 6 Dec 2016
Initialize w randomly
While true:
    Compute gradient ∇g(w) at current point
    Move downhill a little bit: w ← w − (learning rate) · ∇g(w)
Learning rate: How big each step should be
In practice we estimate the gradient using a small minibatch of data
(Stochastic gradient descent)
The update rule is the formula for updating the weights at each iteration; in practice people use fancier rules (momentum, RMSProp, Adam, etc.)
5 Dec 2016
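A minimal Python sketch of the loop just described. Here loss_and_grad is a hypothetical user-supplied function (not from the slides) that returns the minibatch loss and the gradient with respect to w, and the plain update rule is used rather than momentum/RMSProp/Adam:

import numpy as np

def sgd(loss_and_grad, w, X, Y, learning_rate=1e-3, batch_size=64, num_steps=1000):
    # Plain stochastic gradient descent: estimate the gradient on a small
    # random minibatch of data, then move downhill a little bit.
    for step in range(num_steps):
        idx = np.random.choice(X.shape[0], batch_size, replace=False)
        loss, grad = loss_and_grad(w, X[idx], Y[idx])   # grad has the same shape as w
        w = w - learning_rate * grad                    # update rule: w <- w - lr * gradient
    return w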
Stanford University Lecture 19 - 66 6 Dec 2016
Gradient descent: The simple algorithm behind all deep learning
Image credit: Alec Radford
5 Dec 2016
Stanford University Lecture 19 - 67 6 Dec 2016
Outline
● Applications
● Motivation
● Supervised learning
● Gradient descent
● Backpropagation
● Convolutional Neural Networks
5 Dec 2016
Stanford University Lecture 19 - 68 6 Dec 2016
How to compute gradients?
Work it out on paper?
Linear Regression
Doable for simple models
5 Dec 2016
Stanford University Lecture 19 - 69 6 Dec 2016
How to compute gradients?
Work it out on paper? Three-layer network
Gets hairy for more complex models
5 Dec 2016
Stanford University
Backpropagation
Lecture 19 - 70 6 Dec 2016
e.g. x = -2, y = 5, z = -4
Slide credit: CS 231n, Lecture 4
Computational graph
5 Dec 2016
Stanford University Lecture 19 - 71 6 Dec 2016
e.g. x = -2, y = 5, z = -4
Want:
Backpropagation
Slide credit: CS 231n, Lecture 4
Computational graph
5 Dec 2016
Stanford University Lecture 19 - 72 6 Dec 2016
e.g. x = -2, y = 5, z = -4
Want:
Backpropagation
Slide credit: CS 231n, Lecture 4
Computational graph
5 Dec 2016
Stanford University Lecture 19 - 73 6 Dec 2016
e.g. x = -2, y = 5, z = -4
Want:
Backpropagation
Slide credit: CS 231n, Lecture 4
Computational graph
5 Dec 2016
Stanford University Lecture 19 - 74 6 Dec 2016
e.g. x = -2, y = 5, z = -4
Want:
Backpropagation
Slide credit: CS 231n, Lecture 4
Computational graph
5 Dec 2016
Stanford University Lecture 19 - 75 6 Dec 2016
e.g. x = -2, y = 5, z = -4
Want:
Backpropagation
Slide credit: CS 231n, Lecture 4
Computational graph
5 Dec 2016
Stanford University Lecture 19 - 76 6 Dec 2016
e.g. x = -2, y = 5, z = -4
Want:
Backpropagation
Slide credit: CS 231n, Lecture 4
Computational graph
5 Dec 2016
Stanford University Lecture 19 - 77 6 Dec 2016
e.g. x = -2, y = 5, z = -4
Want:
Backpropagation
Slide credit: CS 231n, Lecture 4
Computational graph
5 Dec 2016
Stanford University Lecture 19 - 78 6 Dec 2016
e.g. x = -2, y = 5, z = -4
Want:
Backpropagation
Slide credit: CS 231n, Lecture 4
Computational graph
5 Dec 2016
Stanford University Lecture 19 - 79 6 Dec 2016
e.g. x = -2, y = 5, z = -4
Want:
Chain rule:
Backpropagation
Slide credit: CS 231n, Lecture 4
Computational graph
5 Dec 2016
Stanford University Lecture 19 - 80 6 Dec 2016
e.g. x = -2, y = 5, z = -4
Want:
Backpropagation
Slide credit: CS 231n, Lecture 4
Computational graph
5 Dec 2016
Stanford University Lecture 19 - 81 6 Dec 2016
e.g. x = -2, y = 5, z = -4
Want:
Chain rule:
Backpropagation
Slide credit: CS 231n, Lecture 4
Computational graph
5 Dec 2016
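The graph and intermediate numbers on these slides are images and are not reproduced in this transcript; a small worked example in the same spirit, assuming the usual CS 231n example f(x, y, z) = (x + y) · z with the values shown (x = -2, y = 5, z = -4):

# Forward pass through the computational graph
x, y, z = -2.0, 5.0, -4.0
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass: apply the chain rule from the output back to the inputs
df_dq = z            # d(q*z)/dq = z  -> -4
df_dz = q            # d(q*z)/dz = q  ->  3
df_dx = df_dq * 1.0  # dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1.0  # dq/dy = 1, so df/dy = -4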
Stanford University
Backpropagation
Lecture 19 - 82 6 Dec 2016
Computational graph nodes can be vectors or matrices or n-dimensional tensors
[Computational graph: input x is multiplied by W1, the result is multiplied by W2, and the output is compared against y]
5 Dec 2016
Stanford University
Backpropagation
Lecture 19 - 83 6 Dec 2016
Computational graph nodes can be vectors or matrices or n-dimensional tensors
[Computational graph: input x is multiplied by W1, the result is multiplied by W2, and the output is compared against y]
Forward pass: Run graph “forward” to compute loss
Backward pass: Run graph “backward” to compute gradients with respect to loss
Easily compute gradients for big, complex models!
5 Dec 2016
Stanford University
Backpropagation
Lecture 19 - 84 6 Dec 2016
Computational graph nodes can be vectors or matrices or n-dimensional tensors
[Computational graph: input x is multiplied by W1, the result is multiplied by W2, and the output is compared against y]
Forward pass: Run graph “forward” to compute loss
Backward pass: Run graph “backward” to compute gradients with respect to loss
Easily compute gradients for big, complex models!
Tensors flow along edges in the graph
Stanford University Lecture 19 - 85 6 Dec 2016
A two-layer neural network in 25 lines of code
5 Dec 2016
Stanford University Lecture 19 - 86 6 Dec 2016
A two-layer neural network in 25 lines of code
Randomly initialize weights
5 Dec 2016
Stanford University Lecture 19 - 87 6 Dec 2016
A two-layer neural network in 25 lines of code
Get a batch of (random) data
5 Dec 2016
Stanford University Lecture 19 - 88 6 Dec 2016
A two-layer neural network in 25 lines of code
Forward pass: compute loss
ReLU nonlinearity
Loss is Euclidean distance
5 Dec 2016
Stanford University Lecture 19 - 89 6 Dec 2016
A two-layer neural network in 25 lines of code
Backward pass: compute gradients
5 Dec 2016
Stanford University Lecture 19 - 90 6 Dec 2016
A two-layer neural network in 25 lines of code
Update weights
5 Dec 2016
Stanford University Lecture 19 - 91 6 Dec 2016
A two-layer neural network in 25 lines of code
When you run this code: loss goes down!
5 Dec 2016
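The code on these slides is shown as an image and is not reproduced in this transcript; below is a numpy sketch that follows the annotated steps (random weight initialization, a batch of random data, forward pass with a ReLU nonlinearity and Euclidean loss, backward pass, weight update). The dimensions and learning rate are illustrative assumptions, not the slide's exact values:

import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10     # batch size, input, hidden, output dims (assumed)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

# Get a batch of (random) data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predictions and loss
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)             # ReLU nonlinearity
    y_pred = h_relu.dot(w2)
    loss = np.square(y_pred - y).sum()    # loss is Euclidean distance

    # Backward pass: compute gradients of the loss w.r.t. w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu * (h > 0)
    grad_w1 = x.T.dot(grad_h)

    # Update weights: take a small step downhill
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    # When you run this: the loss goes down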
Stanford University Lecture 19 - 92 6 Dec 2016
Outline
● Applications
● Motivation
● Supervised learning
● Gradient descent
● Backpropagation
● Convolutional Neural Networks
5 Dec 2016
Stanford University
Convolution Layer
Lecture 19 - 93 6 Dec 2016
[Figure: a 32x32x3 image volume (width 32, height 32, depth 3)]
Slide credit: CS 231n, Lecture 7
5 Dec 2016
Stanford University
Convolution Layer
Lecture 19 - 94 6 Dec 2016
[Figure: a 5x5x3 filter and a 32x32x3 image]
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”
Slide credit: CS 231n, Lecture 7
5 Dec 2016
Stanford University
Convolution Layer
Lecture 19 - 95 6 Dec 2016
[Figure: a 5x5x3 filter and a 32x32x3 image]
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”
Filters always extend the full depth of the input volume
Slide credit: CS 231n, Lecture 7
5 Dec 2016
Stanford University
Convolution Layer
Lecture 19 - 96 6 Dec 2016
[Figure: a 5x5x3 filter positioned on a 32x32x3 image]
1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image(i.e. 5*5*3 = 75-dimensional dot product + bias)
Slide credit: CS 231n, Lecture 7
5 Dec 2016
Stanford University
Convolution Layer
Lecture 19 - 97 6 Dec 2016
[Figure: convolving (sliding) the 5x5x3 filter over all spatial locations of the 32x32x3 image produces a 28x28x1 activation map]
Slide credit: CS 231n, Lecture 7
Translation Invariance!
Stanford University
Convolution Layer
Lecture 19 - 98 6 Dec 2016
[Figure: convolving with a second (green) 5x5x3 filter produces a second 28x28x1 activation map]
consider a second, green filter
Slide credit: CS 231n, Lecture 7
5 Dec 2016
Stanford University
Convolution Layer
Lecture 19 - 99 6 Dec 2016
[Figure: six 28x28x1 activation maps from a 32x32x3 input]
For example, if we have 6 5x5 filters, we’ll get 6 separate activation maps:
We stack these up to get a “new image” of size 28x28x6!
Slide credit: CS 231n, Lecture 7
5 Dec 2016
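A naive numpy sketch of the operation just described: one assumed 5x5x3 filter slides over a 32x32x3 image, taking a 75-dimensional dot product plus bias at each location to produce a 28x28 activation map (real implementations are vectorized):

import numpy as np

image = np.random.randn(32, 32, 3)   # 32x32x3 input volume
filt = np.random.randn(5, 5, 3)      # 5x5x3 filter (extends the full input depth)
bias = 0.0

out = np.zeros((28, 28))             # activation map: 32 - 5 + 1 = 28 per spatial dimension
for i in range(28):
    for j in range(28):
        chunk = image[i:i + 5, j:j + 5, :]          # 5x5x3 chunk of the image
        out[i, j] = np.sum(chunk * filt) + bias     # 5*5*3 = 75-dim dot product + bias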
Stanford University
Convolutional Neural Networks (CNN)
Lecture 19 - 100 6 Dec 2016
A CNN is a sequence of convolution layers and nonlinearities
[Figure: 32x32x3 input → CONV, ReLU (6 5x5x3 filters) → 28x28x6 → CONV, ReLU (10 5x5x6 filters) → 24x24x10 → CONV, ReLU → ...]
Slide credit: CS 231n, Lecture 7
5 Dec 2016
Stanford University Lecture 19 - 101 6 Dec 2016
Case Study: GoogLeNet [Szegedy et al., 2014]
Inception module
ILSVRC 2014 winner (6.7% top 5 error)
Slide credit: CS 231n, Lecture 7
5 Dec 2016
Stanford University Lecture 19 - 102 6 Dec 2016
Recap
● Applications
● Motivation
● Supervised learning
● Gradient descent
● Backpropagation
● Convolutional Neural Networks
5 Dec 2016