Introduction to Deep Learning - unibas.ch...Introduction to Deep Learning Standard feed-forward neural network with 3 hidden layers A convolutional neural network (AlexNet, Krizhevsky


Introduction to Deep Learning

Standard feed-forward neural network with 3 hidden layers

A convolutional neural network (AlexNet, Krizhevsky et al ‘12)


Example applications

Image classification (AlexNet, Krizhevsky et al ‘12)

Generating image descriptions (Karpathy et al ‘15)

Translation (Wu et al ‘15)

Face generation (Berthelot et al ‘17)


Very Brief History

(90s and earlier: Neural networks) → Support Vector Machines, kernel methods, … → 2012: AlexNet → Deep Neural Networks


Seminar plan

Meeting #   Speaker        Topic
1           David Belius   Intro to Machine Learning
2           Marko Thiel    Intro to Artificial Neural Networks
3           tba            DL Basics: Regularization
4           tba            DL Basics: Optimization
5           tba            DL Basics: Convolutional neural networks
6           tba            DL Basics: Recurrent Neural Networks
7-11        tba            Advanced topics


Introduction to ML


ML: Learn to generalize from data

● MNIST: 60k handwritten digits, 28x28 grayscale pixels
  – x: image of a handwritten digit
  – y: digit label, e.g. 0, 8, 3, 7
● CIFAR100: 50k images of objects from 100 classes, 32x32 RGB pixels
  – x: image of an object
  – y: class label, e.g. train, sunflower, elephant, cow
● Europarl EN-DE: 1.7m sentence pairs
  – Frau Präsidentin, können Sie mir sagen, warum sich dieses Parlament nicht...
    Madam President, can you tell me why this Parliament does not...
  – Deswegen können wir nicht eindeutig ja sagen.
    It is why we cannot say a clear yes.


ML: Learn to generalize from data

● Supervised learning: labeled data
  – Data point (x, y): x lies in the space of “inputs” (e.g. image of handwritten digit), y in the space of labels/“outputs” (e.g. digit 0-9)
● Unsupervised learning: unlabeled data
  – Data point x lies in the space of data
  – E.g. clustering, dimensionality reduction


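Probabilistic model of data

Data set $(x_1, y_1), \dots, (x_n, y_n)$: the data points are iid samples from an unknown probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$ (inputs times labels; for unlabeled data, $x_1, \dots, x_n$ are iid samples from $P$ on $\mathcal{X}$)

Goal of learning: Get information about $P$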


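Basic ML tasks: Supervised learning

Classification, regression
– Predict y from x, i.e. learn the conditional distribution of $Y$ given $X$
– Often assume Y deterministic function of X
– Then have “truth” $f: \mathcal{X} \to \mathcal{Y}$ with $Y = f(X)$
– Seek estimate $\hat f$ of $f$
– Classification
  ● $\mathcal{Y}$ is finite set of labels, e.g. $\{0, 1, \dots, 9\}$
  ● Want $\hat f(x) = f(x)$ for most $x$
– Regression
  ● $\mathcal{Y} = \mathbb{R}$
  ● Want $|\hat f(x) - f(x)|$ small for most $x$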


Basic ML tasks: Unsupervised learning

– Density estimation
  ● $p$: probability density of the data distribution $P$ on $\mathcal{X}$
  ● Seek estimate $\hat p$
  ● E.g. outlier detection: flag $x$ as an outlier when $\hat p(x)$ is small
– Sampling/synthesis
  ● Learn how to simulate a sample from a probability law that approximates $P$
  ● E.g.: Learn to generate an image of a realistic looking human face


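Evaluating performance

● Classification
  – True error for fixed $\hat f$: $P(\hat f(X) \neq Y)$
  – True error is unknown
  – If $(x_1, y_1), \dots, (x_m, y_m)$ are iid samples from $P$, then $\frac{1}{m} \sum_{i=1}^{m} 1\{\hat f(x_i) \neq y_i\}$ is an unbiased estimator for the true error.
  – Warning: Only true if the $(x_i, y_i)$ were not used to construct $\hat f$!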


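Evaluating performance

● Regression
  – True Mean Squared Error (MSE) for fixed $\hat f$: $E[(\hat f(X) - Y)^2]$
  – Not known, but unbiased estimate: $\frac{1}{m} \sum_{i=1}^{m} (\hat f(x_i) - y_i)^2$, if the iid samples $(x_i, y_i)$ were not used to construct $\hat f$!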
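The two held-out estimates described on the evaluation slides can be sketched in a few lines. This is a minimal illustration, not code from the seminar: the function names, the toy classifier and regressor, and the sample values are all assumptions made for the example.

```python
def error_rate(f_hat, samples):
    """Fraction of held-out pairs (x, y) that f_hat gets wrong."""
    return sum(1 for x, y in samples if f_hat(x) != y) / len(samples)


def mean_squared_error(f_hat, samples):
    """Average squared difference between f_hat(x) and y on held-out pairs."""
    return sum((f_hat(x) - y) ** 2 for x, y in samples) / len(samples)


def classifier(x):
    # Hypothetical fixed classifier: threshold at 0.5.
    return 1 if x > 0.5 else 0


def regressor(x):
    # Hypothetical fixed regression estimate.
    return 2.0 * x


# Made-up held-out samples; they must not have been used to build the estimates.
held_out_labels = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 1)]
held_out_values = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2)]

clf_error = error_rate(classifier, held_out_labels)
reg_error = mean_squared_error(regressor, held_out_values)
print(clf_error, reg_error)
```

Both functions only average over the samples they are given, so the unbiasedness warning applies unchanged: feeding them the training pairs would typically give an optimistic number.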


Train data and test data

● Split data set into
  – Training set (~80%)
  – Test set (~20%)
● Construct $\hat f$ using training set.
● Evaluate performance using test set.
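A random 80/20 split can be sketched as follows. The helper name `train_test_split`, its parameters and the toy data are assumptions for illustration only; shuffling before splitting is one common choice, not something the slides prescribe.

```python
import random


def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and return (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data set of (x, y) pairs, made up for the example.
data = [(x, 2 * x) for x in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))
```

The estimate is then constructed from `train` only, and `test` is touched once, at the end, to measure performance.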


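How to learn

● Non-parametric algorithms
  – k-nearest neighbours classification, decision trees, k-means clustering
● Parametric algorithms (“fitting”)
  – Hypothesis set $\mathcal{H} = \{f_\theta : \theta \in \mathbb{R}^p\}$ of potential estimates, parametrized by some number $p$ of real parameters
  – Error of $f_\theta$ on training set (regression): $L(\theta) = \frac{1}{n} \sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$
  – Learning: find $\hat\theta$ with small error on training set and set $\hat f = f_{\hat\theta}$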


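Crucial to restrict class $\mathcal{H}$ somehow

Without a restriction, an estimate that simply memorizes the training set (e.g. $\hat f(x) = y_i$ if $x = x_i$ for some $i$, and $\hat f(x) = 0$ otherwise) has zero error on training set.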


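Example: Linear regression

● $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}$
● $f_{a,b}(x) = a x + b$
● Find $a, b$ that minimize $L(a, b) = \frac{1}{n} \sum_{i=1}^{n} (a x_i + b - y_i)^2$

[Figure: line of best fit]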


Example: Linear regression

● Recall: there is a closed form formula for the optimal parameters (least squares, normal equations)
● But also: the training error $L$ is a smooth function of the parameters → Loss function
● Furthermore $L$ is convex
  – Has a unique global minimum (provided the $x_i$ are not all equal)
  – which can be found by numerical optimization: gradient descent


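Example: Linear regression

● Gradient descent on a loss $L(\theta)$ with parameters $\theta$
  – $\theta_0$ arbitrary (random)
  – $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$
  – $\eta > 0$ small (step size/learning rate)
● “Always” finds global minimum of smooth, convex loss function
● But typically not for non-convex function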
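The two routes to the line of best fit can be compared on a toy data set. The sketch below assumes one-dimensional inputs and the model a*x + b; the data values, the step size 0.05 and the number of steps are arbitrary choices made for illustration, and all function names are hypothetical.

```python
def gradient(a, b, xs, ys):
    """Gradient of the mean squared error of the line a*x + b."""
    n = len(xs)
    da = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / n
    return da, db


def gradient_descent(xs, ys, eta=0.05, steps=2000, a=0.0, b=0.0):
    """Repeat the update theta <- theta - eta * grad L(theta)."""
    for _ in range(steps):
        da, db = gradient(a, b, xs, ys)
        a, b = a - eta * da, b - eta * db
    return a, b


def least_squares(xs, ys):
    """Closed-form least-squares slope and intercept for 1D inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


# Made-up, roughly linear data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

a_gd, b_gd = gradient_descent(xs, ys)
a_ls, b_ls = least_squares(xs, ys)
print(a_gd, b_gd)
print(a_ls, b_ls)
```

Because this loss is smooth and convex, the iterates approach the closed-form solution as long as the step size is small enough; with a much larger `eta` the same loop would diverge.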


General recipe of parametric ML

1. Define hypothesis set $\mathcal{H} = \{f_\theta\}$
2. Define smooth loss function $L(\theta)$
3. Numerically minimize $L$ to find estimate $\hat f = f_{\hat\theta}$

● Traditional ML: make sure $L$ is convex to have guarantees for numerical minimization
● Deep Learning/Neural Networks:
  – $L$ highly non-convex
  – Somehow, still works. Gradient descent finds “good” minima.


Example: Linear regression

● Can fit data that is basically linear
● Can’t fit other relationships
● Solution: Make hypothesis set richer!


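Example: Polynomial regression

● $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}$
● $f_w(x) = \sum_{j=0}^{d} w_j x^j$ with coefficients $w = (w_0, \dots, w_d)$
● Loss $L(w) = \frac{1}{n} \sum_{i=1}^{n} (f_w(x_i) - y_i)^2$ is smooth and convex.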


Capacity, overfitting, underfitting

● Capacity: the “richness” of the hypothesis set
● Mathematical definitions exist (e.g. VC-dimension) but often used as intuitive notion
● Polynomial regression: the degree controls capacity (low degree: less capacity; high degree: more capacity)
● Too little capacity: can’t fit train data → underfitting
● Too much capacity: generalize badly → overfitting


Polynomial regression: underfitting/overfitting

(Credit: Francois Fleuret, EPFL)
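The effect of the degree can be explored with a small experiment. The sketch below is not the code behind the credited figure: the target function sin(3x), the noise level, the sample sizes and the degrees 1, 3, 9 are all assumptions chosen for illustration, and the least-squares fit uses a hand-written solver for the normal equations so that only the standard library is needed.

```python
import math
import random


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients w_0..w_degree from the normal equations."""
    k = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(A, b)


def mse(w, xs, ys):
    """Mean squared error of the polynomial with coefficients w."""
    return sum((sum(wj * x ** j for j, wj in enumerate(w)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)


def sample(n, rng):
    """Noisy samples of an assumed target function sin(3x) on [-1, 1]."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [math.sin(3.0 * x) + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys


rng = random.Random(0)
x_train, y_train = sample(15, rng)
x_test, y_test = sample(200, rng)

errors = {}
for degree in (1, 3, 9):
    w = fit_polynomial(x_train, y_train, degree)
    errors[degree] = (mse(w, x_train, y_train), mse(w, x_test, y_test))
    print(degree, errors[degree])
```

The train error can only go down as the degree grows, because each hypothesis set contains the previous one. Whether the test error follows, or turns back up at high degree, is what distinguishes a good fit from overfitting.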


Capacity, overfitting, underfitting

● Underfitting: train error large, test error large
● Overfitting: train error small, test error large
● Trade-off: must find appropriate level of capacity for data distribution

[Figure: train error and test error plotted against capacity (less → more); underfitting at low capacity, overfitting at high capacity, best compromise in between]


Capacity, overfitting, underfitting

● Traditional ML: low to moderate capacity
● Deep Learning: Enormous capacity.
  – Millions of parameters (>> # training examples)
  – Still don’t overfit. Why?


Model selection (Hyperparameter selection)

● If test set is used to evaluate performance of different hypothesis sets (different models):
  – Test error no longer unbiased estimator of true error!

[Figure: three workflows labelled “Good”, “Still good” and “Bad”]

(Credit: Francois Fleuret, EPFL)


Model selection (Hyperparameter selection)

● Solution: Further split train data into
  – Training set (~80%)
  – Validation set (~20%)
● Pick algorithm that evaluates the best on validation set
● Report performance on test set

[Figure: workflow labelled “Good”]

(Credit: Francois Fleuret, EPFL)
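The train/validation/test workflow can be sketched with a small hyperparameter search. Everything concrete here is an assumption for illustration: the hyperparameter is the k of a k-nearest-neighbours classifier, the data are synthetic 1D points with 10% label noise, and the candidate values 1, 5, 15 are arbitrary.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (labels 0/1)."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in neighbours)
    return 1 if 2 * votes > k else 0


def error_rate(train, samples, k):
    """Fraction of `samples` misclassified by k-NN built on `train`."""
    return sum(knn_predict(train, x, k) != y for x, y in samples) / len(samples)


def sample(n, rng):
    """Synthetic data: label is the sign of x, flipped with probability 0.1."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 1 if x > 0 else 0
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data


rng = random.Random(0)
data = sample(300, rng)
test = data[:60]            # ~20% of all data, set aside until the very end
validation = data[60:108]   # ~20% of the remaining data
train = data[108:]          # ~80% of the remaining data

candidates = [1, 5, 15]     # odd values of k avoid tied votes
val_errors = {k: error_rate(train, validation, k) for k in candidates}
best_k = min(candidates, key=lambda k: val_errors[k])
test_error = error_rate(train, test, best_k)
print(val_errors, best_k, test_error)
```

The test set is used exactly once, after `best_k` has been fixed, which is what keeps the reported test error an honest estimate.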


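Loss for categorical data

● True error for classification: $P(\hat f(X) \neq Y)$
● Empirical training error $\frac{1}{n} \sum_{i=1}^{n} 1\{f_\theta(x_i) \neq y_i\}$ is not a smooth function of the parameters $\theta$! Can’t be used as loss for gradient based numerical optimization.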


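Loss for categorical data

● Solution: Formulate as really predicting conditional distribution of $Y$ given $X = x$
● Specify probability distribution on $\mathcal{Y} = \{1, \dots, K\}$, as vector of probabilities $q = (q_1, \dots, q_K)$
● True cond. distribution is $p(x) = (P(Y = k \mid X = x))_{k = 1, \dots, K}$
● Seek estimate $\hat p(x)$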


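Loss for categorical data

● To quantify error made in prediction:
  – Use relative entropy/Kullback-Leibler divergence as distance between prob. measures.
  – Recall: $D_{KL}(p \,\|\, q) = \sum_{k} p_k \log \frac{p_k}{q_k}$
● Error made in prediction for fixed $x$: $D_{KL}(p(x) \,\|\, \hat p(x))$, with $p(x)$ the true conditional distribution and $\hat p(x)$ the estimate
● Unknown true error: $E[D_{KL}(p(X) \,\|\, \hat p(X))]$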


Loss for categorical data

● Unknown true error: $E[D_{KL}(p(X) \,\|\, \hat p(X))]$
● Unbiased estimator: $\frac{1}{n} \sum_{i=1}^{n} D_{KL}(p(x_i) \,\|\, \hat p(x_i))$
  – Here: $p(x_i)$ is the one-hot encoding of $y_i$ (the vector with a 1 in position $y_i$ and 0 elsewhere), since Y is assumed to be a deterministic function of X
● Concretely, estimator equals: $-\frac{1}{n} \sum_{i=1}^{n} \log \hat p(x_i)_{y_i}$
● The training loss function $L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i)_{y_i}$ is smooth! → Can use numerical optimization
● Remarks:
  – MSE loss can be justified in terms of predicting a Gaussian dist.
  – Not all loss functions derived in such a principled way.
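The reduction from Kullback-Leibler divergence to a single log-probability can be checked numerically. This is a minimal sketch with hypothetical helper names; labels are indexed from 0 and the predicted vector `q` is made up for the example.

```python
import math


def kl_divergence(p, q):
    """Relative entropy sum_k p_k * log(p_k / q_k), with 0 * log 0 taken as 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def one_hot(label, num_classes):
    """Probability vector putting all mass on `label`."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def cross_entropy(label, q):
    """Negative log of the probability that q assigns to the true label."""
    return -math.log(q[label])


q = [0.1, 0.2, 0.7]                      # a predicted distribution over 3 classes
kl = kl_divergence(one_hot(2, 3), q)     # divergence from the one-hot "truth"
ce = cross_entropy(2, q)
print(kl, ce)
```

When the first argument is one-hot, every term of the sum but one vanishes, so the two quantities agree; this is why the training loss only needs the predicted probability of the observed label.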


Example: Logistic regression

● Hypothesis set: $p_{W,b}(x) = \mathrm{softmax}(Wx + b)$, with $W \in \mathbb{R}^{K \times d}$, $b \in \mathbb{R}^{K}$
● Loss $L(W, b) = -\frac{1}{n} \sum_{i=1}^{n} \log p_{W,b}(x_i)_{y_i}$ is convex in W,b! → Can find global minimum with gradient descent.
● To predict one class: output $\arg\max_k \, p_{W,b}(x)_k$
  – Equivalently: output k with largest $(Wx + b)_k$
● Logistic regression can fit linearly separable data

[Figure: decision boundary; a linearly separable data set (“Can fit”) and one that is not (“Can’t fit”)]
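A compact sketch of multi-class logistic regression trained by gradient descent is given below. It assumes the softmax parametrization with a weight matrix W and bias vector b; the three Gaussian clusters, the step size 0.1 and the 300 steps are arbitrary choices for illustration, and none of the function names come from the slides.

```python
import math
import random


def softmax(z):
    """Turn a score vector into probabilities (shifted by max for stability)."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict_proba(W, b, x):
    """Class probabilities softmax(W x + b)."""
    scores = [sum(w * xj for w, xj in zip(Wk, x)) + bk for Wk, bk in zip(W, b)]
    return softmax(scores)


def loss(W, b, data):
    """Average negative log-probability of the true labels."""
    return -sum(math.log(predict_proba(W, b, x)[y]) for x, y in data) / len(data)


def train(data, num_classes, dim, eta=0.1, steps=300):
    """Plain gradient descent on the loss, starting from all zeros."""
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    n = len(data)
    for _ in range(steps):
        gW = [[0.0] * dim for _ in range(num_classes)]
        gb = [0.0] * num_classes
        for x, y in data:
            p = predict_proba(W, b, x)
            for k in range(num_classes):
                err = p[k] - (1.0 if k == y else 0.0)  # gradient w.r.t. score k
                gb[k] += err / n
                for j in range(dim):
                    gW[k][j] += err * x[j] / n
        for k in range(num_classes):
            b[k] -= eta * gb[k]
            for j in range(dim):
                W[k][j] -= eta * gW[k][j]
    return W, b


def predict(W, b, x):
    """Output the class with the largest predicted probability."""
    p = predict_proba(W, b, x)
    return max(range(len(p)), key=lambda k: p[k])


# Synthetic, well separated 2D clusters, one per class.
rng = random.Random(0)
centers = [(-2.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
data = [((cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)), k)
        for k, (cx, cy) in enumerate(centers) for _ in range(30)]

initial_loss = loss([[0.0] * 2 for _ in range(3)], [0.0] * 3, data)
W, b = train(data, num_classes=3, dim=2)
final_loss = loss(W, b, data)
accuracy = sum(predict(W, b, x) == y for x, y in data) / len(data)
print(initial_loss, final_loss, accuracy)
```

Since softmax is increasing in each score, `predict` would return the same class if it compared the raw scores instead of the probabilities.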


Example: Logistic regression

● If data not linearly separable, can look for transformation $\phi$ that makes it more so
● Train on data $(\phi(x_i), y_i)$
● $\phi(x)$ = a representation of $x$
● Traditional ML: Construct representation by hand
● Deep learning: Algorithm finds good representation during training
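A hand-made representation can be illustrated on a classic toy case: one class inside a disc, the other on a surrounding ring. No straight line separates them in the original coordinates, but after mapping each point to its squared distance from the origin a single threshold does. The data, the map `phi` and the threshold 2.25 are assumptions made for this sketch.

```python
import math
import random


def phi(x):
    """Hand-made representation: squared distance from the origin."""
    return x[0] ** 2 + x[1] ** 2


def sample_ring(radius_low, radius_high, n, rng):
    """Random 2D points whose radius lies between the two bounds."""
    points = []
    for _ in range(n):
        r = rng.uniform(radius_low, radius_high)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points


rng = random.Random(0)
inner = [(x, 0) for x in sample_ring(0.0, 1.0, 50, rng)]  # class 0: disc
outer = [(x, 1) for x in sample_ring(2.0, 3.0, 50, rng)]  # class 1: ring
data = inner + outer

threshold = 2.25  # any value between 1**2 and 2**2 separates the classes
predictions = [1 if phi(x) > threshold else 0 for x, _ in data]
accuracy = sum(p == y for p, (_, y) in zip(predictions, data)) / len(data)
print(accuracy)
```

Here the representation was designed by looking at the data, which is the traditional route; a deep network would have to discover a comparable map from the raw coordinates during training.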


Example: Logistic regression

● MNIST: 60k handwritten digits, 28x28 grayscale pixels
  – x: image of a handwritten digit
  – y: digit label, e.g. 0, 8, 3, 7
● Logistic regression on MNIST: test error rate ~7%


Classification: overfitting/underfitting

[Figure: three decision boundaries labelled “Underfitting”, “Good fit” and “Overfitting”]

(Credit: Wikipedia)


Data encoding

● MNIST: 60k handwritten digits, 28x28 grayscale pixels
  – x: image of a handwritten digit
  – y: digit label, e.g. 0, 8, 3, 7 (One-hot)
● CIFAR100: 50k images of objects from 100 classes, 32x32 RGB pixels
  – x: image of an object
  – y: class label, e.g. train, sunflower, elephant, cow (One-hot)
● Europarl EN-DE: 1.7m sentence pairs (One-hot)
  – Frau Präsidentin, können Sie mir sagen, warum sich dieses Parlament nicht...
    Madam President, can you tell me why this Parliament does not...
  – Deswegen können wir nicht eindeutig ja sagen.
    It is why we cannot say a clear yes.


Feature engineering / Representation engineering

● Traditional ML: Use hand-engineered features as inputs to algorithms

● Deep Learning: Feed algorithm raw data (pixels, character level text, …)


Standard data sets: Used as benchmarks

[Figures: Performance on MNIST; Performance on CIFAR10]

(Credit: https://srconstantin.wordpress.com/) (Credit: Francois Fleuret, EPFL)


Collective overfitting of test set by ML community

● Recall
  – Good: [figure]
  – For heavily used dataset: [figure]
● Need new datasets to appear periodically

(Credit: Francois Fleuret, EPFL)


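Bias-variance tradeoff

● Related to underfitting/overfitting
● Fix one $x$
● Fit $\hat f(x)$ is random variable (depends on train data)
● Decompose true MSE error at $x$, writing $Y = f(x) + \varepsilon$ with noise $\varepsilon$ of mean 0 and variance $\sigma^2$, independent of the train data:
  $E[(\hat f(x) - Y)^2] = \mathrm{Var}(\hat f(x)) + (E[\hat f(x)] - f(x))^2 + \sigma^2$
  – Variance: $\mathrm{Var}(\hat f(x))$
  – Bias (squared): $(E[\hat f(x)] - f(x))^2$
  – Irreducible error: $\sigma^2$
● Small variance, high bias: underfit
● Large variance, low bias: overfit
● Small bias, small variance is hard → Tradeoff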
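The decomposition can be observed in a simulation that refits the same model on many independent training sets. The setup below is entirely assumed for illustration: the truth is f(x) = x², the model is a straight line (so it is too simple and biased), the noise standard deviation is 0.1, and the error is examined at the single point x0 = 0.9.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for 1D inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def f(x):
    """Assumed true regression function."""
    return x * x


rng = random.Random(0)
sigma, x0, trials = 0.1, 0.9, 2000

# Prediction at x0 from a line fitted to a fresh training set, many times over.
predictions = []
for _ in range(trials):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(10)]
    ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]
    a, b = fit_line(xs, ys)
    predictions.append(a * x0 + b)

mean_pred = sum(predictions) / trials
variance = sum((p - mean_pred) ** 2 for p in predictions) / trials
bias_sq = (mean_pred - f(x0)) ** 2
mse_vs_truth = sum((p - f(x0)) ** 2 for p in predictions) / trials

# Adding the noise variance gives the estimated MSE against a noisy label at x0.
expected_test_mse = variance + bias_sq + sigma ** 2
print(variance, bias_sq, mse_vs_truth, expected_test_mse)
```

The error against the noiseless truth splits exactly into the empirical variance plus the squared bias; the noise variance is added on top as the irreducible part.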


Bias-variance tradeoff

(Credit: Francois Fleuret, EPFL)


Bias-variance tradeoff

[Figure: test error, variance and bias plotted against capacity (less → more); underfitting at low capacity, overfitting at high capacity]

Test error = variance + squared bias (+ irreducible error)


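Maximum likelihood interpretation

● Example: Logistic regression
● True cond. distribution is $p(x) = (P(Y = k \mid X = x))_k$
● Seek estimate $p_\theta(x)$
● Likelihood of train y given train x: $\prod_{i=1}^{n} p_\theta(x_i)_{y_i}$
● Loss is neg. log-likelihood (up to the factor $1/n$): $L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i)_{y_i}$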


Bayesian interpretation

● Consider model parameters as random with a prior distribution

● Bayes’ rule gives posterior distribution on parameters conditioned on data
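In symbols, Bayes' rule for the parameters can be written as below. The notation is an assumption, since no formula survives on this slide: $\theta$ denotes the parameters, $\pi(\theta)$ the prior density, and $p(y_i \mid x_i, \theta)$ the probability the model with parameters $\theta$ assigns to label $y_i$ given input $x_i$, with the inputs treated as fixed.

```latex
\[
  \pi(\theta \mid \text{data})
  = \frac{\pi(\theta)\,\prod_{i=1}^{n} p(y_i \mid x_i, \theta)}
         {\int \pi(\theta')\,\prod_{i=1}^{n} p(y_i \mid x_i, \theta')\,d\theta'}
  \;\propto\; \pi(\theta)\,\prod_{i=1}^{n} p(y_i \mid x_i, \theta).
\]
```

The product over $i$ is the likelihood of the train labels given the train inputs, so the posterior is the prior reweighted by that likelihood and normalized.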


Deep Learning

● Parametric ML with hypothesis class:


General references

● Goodfellow, Bengio, Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org
● EPFL course slides and videos
  – Prof. Francois Fleuret
  – https://documents.epfl.ch/users/f/fl/fleuret/www/dlc/


Organisation

Meeting #    Speaker          Topic
Today        David Belius     Intro to Machine Learning
Next week    Marko Thiel      Intro to Artificial Neural Networks
March 8      Master Student   Regularization (Bengio Ch 7)
March 15     Master Student   Optimization (Bengio Ch 8)
March 22     Master Student   Convolutional neural networks (Bengio Ch 9)
April 12     Master Student   Recurrent Neural Networks (Bengio Ch 10)
April 26+    PhD Students     Advanced topics

● First four student talks
  – Master student speakers: e-mail me any preferences
  – Set up preliminary meeting with Marko and me
● Optional practical sessions (programming)
  – E-mail me if interested
● No meeting April 19