Introduction to Deep Learning - unibas.ch

Standard feed-forward neural network with 3 hidden layers

A convolutional neural network (AlexNet, Krizhevsky et al ‘12)

Example applications

Image classification (AlexNet, Krizhevsky et al ‘12)

Generating image descriptions (Karpathy et al ‘15)

Translation (Wu et al ‘15)

Face generation (Berthelot et al ‘17)

Very Brief History

(90s and earlier: Neural networks) → Support Vector Machines, Kernel methods, … → 2012: AlexNet → Deep Neural Networks

Seminar plan

Meeting #   Speaker        Topic
1           David Belius   Intro to Machine Learning
2           Marko Thiel    Intro to Artificial Neural Networks
3           tba            DL Basics: Regularization
4           tba            DL Basics: Optimization
5           tba            DL Basics: Convolutional neural networks
6           tba            DL Basics: Recurrent Neural Networks
7-11        tba            Advanced topics

Introduction to ML

ML: Learn to generalize from data

● MNIST: 60k handwritten digits, 28x28 grayscale pixels
– x: image of a handwritten digit
– y: the digit shown, e.g. 0, 8, 3, 7

● CIFAR100: 50k images of objects from 100 classes, 32x32 RGB pixels
– x: image of an object
– y: the class of the object, e.g. train, sunflower, elephant, cow

● Europarl EN-DE: 1.7m sentence pairs
– Frau Präsidentin, können Sie mir sagen, warum sich dieses Parlament nicht...
  Madam President, can you tell me why this Parliament does not….
– It is why we cannot say a clear yes.
  Deswegen können wir nicht eindeutig ja sagen.

Labeled data (supervised learning)

– Space of “inputs” $\mathcal{X}$ (e.g. image of handwritten digit)
– Space of labels/“outputs” $\mathcal{Y}$ (e.g. digit 0-9)
– Data point $(x, y) \in \mathcal{X} \times \mathcal{Y}$

Unlabeled data (unsupervised learning)

– Space of data $\mathcal{X}$
– Data point $x \in \mathcal{X}$
– E.g. clustering, dimensionality reduction

ML: Learn to generalize from data

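Probabilistic model of data

Data set $(x_1, y_1), \dots, (x_N, y_N)$ (or $x_1, \dots, x_N$ if unlabeled)

are iid samples from unknown probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$ (or on $\mathcal{X}$ if unlabeled)

Goal of learning: Get information about $P$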

Basic ML tasks: Supervised learning

Classification, regression
– Predict y from x, i.e. learn the conditional distribution of $Y$ given $X$
– Often assume Y is a deterministic function of X
– Then have “truth” $f: \mathcal{X} \to \mathcal{Y}$ with $Y = f(X)$
– Seek estimate $\hat{f}$ of $f$
– Classification
  ● $\mathcal{Y}$ is finite set of labels, e.g. $\mathcal{Y} = \{0, 1, \dots, 9\}$
  ● Want $\hat{f}(x) = f(x)$ for most $x$
– Regression
  ● $\mathcal{Y} = \mathbb{R}$
  ● Want $|\hat{f}(x) - f(x)|$ small for most $x$

Basic ML tasks: Unsupervised learning

– Density estimation
  ● $p$ probability density of $P$ on $\mathcal{X}$
  ● Seek estimate $\hat{p}$ of $p$
  ● E.g. outlier detection: flag $x$ as an outlier if $\hat{p}(x)$ is small
– Sampling/synthesis
  ● Learn how to simulate a sample from a probability law that approximates $P$
  ● E.g.: Learn to generate an image of a realistic looking human face

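Evaluating performance

● Classification
– True error for fixed $\hat{f}$: $P(\hat{f}(X) \neq Y)$
– True error is unknown
– If $(x'_1, y'_1), \dots, (x'_M, y'_M)$ are iid samples from $P$, then
  $$\frac{1}{M} \sum_{i=1}^{M} 1_{\{\hat{f}(x'_i) \neq y'_i\}}$$
  is an unbiased estimator for the true error.
– Warning: Only true if the $(x'_i, y'_i)$ were not used to construct $\hat{f}$!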

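Evaluating performance

● Regression
– True Mean Squared Error (MSE) for fixed $\hat{f}$: $E[(\hat{f}(X) - Y)^2]$
– Not known, but unbiased estimate from iid samples $(x'_i, y'_i)$, $i = 1, \dots, M$:
  $$\frac{1}{M} \sum_{i=1}^{M} (\hat{f}(x'_i) - y'_i)^2$$
  if the $(x'_i, y'_i)$ were not used to construct $\hat{f}$!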
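The two error estimates above can be sketched in a few lines of Python. This is only an illustration: the function names, the toy classifier, the toy regression line and the held-out samples are all hypothetical and not taken from the slides.

```python
def error_rate(predict, samples):
    """Fraction of held-out (x, y) pairs that predict() labels incorrectly."""
    return sum(1 for x, y in samples if predict(x) != y) / len(samples)


def mean_squared_error(predict, samples):
    """Average squared difference between predict(x) and y on held-out pairs."""
    return sum((predict(x) - y) ** 2 for x, y in samples) / len(samples)


def threshold_classifier(x):
    # Hypothetical fixed estimate: predicts class 1 for x > 1.
    return 1 if x > 1.0 else 0


def fitted_line(x):
    # Hypothetical fixed regression estimate.
    return 2.0 * x + 1.0


# Hypothetical held-out samples (not used to construct the estimates).
held_out_labels = [(-2.0, 0), (-0.5, 0), (0.5, 1), (3.0, 1)]
held_out_values = [(0.0, 1.0), (1.0, 3.0), (2.0, 6.0)]

classification_error = error_rate(threshold_classifier, held_out_labels)
regression_mse = mean_squared_error(fitted_line, held_out_values)
```

Both functions only average over the samples they are given, so the unbiasedness caveat is the caller's responsibility: the samples passed in must not have been used to build the estimate.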

Train data and test data

● Split data set into
– Training set (~80%)
– Test set (~20%)
● Construct $\hat{f}$ using training set.
● Evaluate performance using test set.
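A minimal sketch of such an 80/20 split, assuming the data set is a Python list of (x, y) pairs. The function name `train_test_split`, its parameters and the toy data are choices made for this illustration.

```python
import random


def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split it into (train, test)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set of 100 (x, y) pairs with y = 3x.
data = [(i / 100, 3 * i / 100) for i in range(100)]
train, test = train_test_split(data)
```

Shuffling before splitting matters when the data is stored in some order (for example sorted by label), since otherwise the test set would not look like the training set.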

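How to learn

● Non-parametric algorithms
– k-nearest neighbours classification, decision trees, k-means clustering
● Parametric algorithms (“fitting”)
– Hypothesis set $\mathcal{H} = \{f_\theta : \theta \in \mathbb{R}^p\}$ of potential estimates, parametrized by some number $p$ of real parameters
– Error of $f_\theta$ on training set $(x_1, y_1), \dots, (x_N, y_N)$ (regression): $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} (f_\theta(x_i) - y_i)^2$
– Learning: find $\hat{\theta}$ with small error on training set and set $\hat{f} = f_{\hat{\theta}}$
● Crucial to restrict class $\mathcal{H}$ somehow: without restrictions, a function that simply passes through every training point has zero error on training set.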

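Example: Linear regression

● $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}$
● $\mathcal{H} = \{f_{w,b}(x) = wx + b : w, b \in \mathbb{R}\}$
● Find $w, b$ that minimize $L(w, b) = \frac{1}{N} \sum_{i=1}^{N} (w x_i + b - y_i)^2$
● $\hat{f} = f_{\hat{w}, \hat{b}}$: Line of best fit
● Recall: there is a closed-form formula for the optimal $w, b$ (least squares, normal equations)
● But also: $L(w, b)$ is a smooth function of $(w, b)$ → Loss function
● Furthermore $L$ is convex
– Has a unique global minimum (as long as the $x_i$ are not all equal)
– which can be found by numerical optimization: gradient descent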
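For reference, the closed-form solution can be written out in the one-dimensional case. The notation is an assumption of this sketch: $w$ is the slope, $b$ the intercept, $L(w, b) = \frac{1}{N} \sum_{i} (w x_i + b - y_i)^2$ the mean squared training error, and $\bar{x}$, $\bar{y}$ the sample means of the $x_i$ and $y_i$.

```latex
\begin{aligned}
\frac{\partial L}{\partial b} &= \frac{2}{N}\sum_{i=1}^{N} (w x_i + b - y_i) = 0
  \quad\Longrightarrow\quad \hat{b} = \bar{y} - \hat{w}\,\bar{x}, \\
\frac{\partial L}{\partial w} &= \frac{2}{N}\sum_{i=1}^{N} (w x_i + b - y_i)\,x_i = 0
  \quad\Longrightarrow\quad
  \hat{w} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}.
\end{aligned}
```

The denominator is nonzero exactly when the $x_i$ are not all equal.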

Example: Linear regression

● Gradient descent
– $\theta_0 = (w_0, b_0)$ arbitrary (random)
– $\eta > 0$ small (step size/learning rate)
– $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$
● “Always” finds global minimum of smooth, convex loss function (for small enough step size)
● But typically not for non-convex function
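Gradient descent for the line of best fit can be sketched as follows. The mean squared error loss, the starting point (0, 0), the learning rate, the number of steps and the toy data are all assumptions of this example.

```python
def gradient_descent_line(xs, ys, learning_rate=0.1, steps=2000):
    """Minimise the mean squared error of y = w*x + b by gradient descent."""
    w, b = 0.0, 0.0  # arbitrary starting point
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(residual^2) with respect to w and b.
        grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical noise-free data on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent_line(xs, ys)
```

On this data the iterates approach the slope 2 and intercept 1 that generated it. A much larger learning rate would make the iteration diverge, which is why the step size has to be small.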

General recipe of parametric ML

1. Define hypothesis set $\mathcal{H} = \{f_\theta\}$
2. Define smooth loss function $L(\theta)$
3. Numerically minimize $L$ to find estimate $\hat{\theta}$

● Traditional ML: make sure $L$ is convex to have guarantees for numerical minimization
● Deep Learning/Neural Networks:
– $L$ highly non-convex
– Somehow, still works. Gradient descent finds “good” minima.
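To see why convexity matters for the guarantees, here is a small sketch of gradient descent on a non-convex function of one variable. The function $x^4 - 3x^2 + x$, the learning rate and the two starting points are hypothetical choices for illustration only.

```python
def loss(x):
    # Non-convex: one global minimum near x = -1.30, one local minimum near x = 1.13.
    return x ** 4 - 3 * x ** 2 + x


def loss_gradient(x):
    return 4 * x ** 3 - 6 * x + 1


def gradient_descent(x0, learning_rate=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= learning_rate * loss_gradient(x)
    return x


from_left = gradient_descent(-2.0)   # ends in the global minimum
from_right = gradient_descent(2.0)   # ends in the worse, local minimum
```

The result depends on the starting point: the run started at 2.0 stops in a local minimum whose loss is higher than the one found from -2.0.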

Example: Linear regression

● Can fit data that is basically linear
● Can’t fit other relationships
● Solution: Make hypothesis set richer!

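Example: Polynomial regression

● $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}$
● $\mathcal{H} = \{f_\theta(x) = \theta_0 + \theta_1 x + \dots + \theta_k x^k : \theta \in \mathbb{R}^{k+1}\}$ for a fixed degree $k$
● Loss $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} (f_\theta(x_i) - y_i)^2$ is smooth and convex.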

Capacity, overfitting, underfitting

● Capacity: the “richness” of hypothesis set $\mathcal{H}$
● Mathematical definitions exist (e.g. VC-dimension) but often used as intuitive notion
● Polynomial regression: capacity increases with the degree of the polynomial
● Too little capacity: can’t fit train data → underfitting
● Too much capacity: generalize badly → overfitting

Polynomial regression: underfitting/overfitting

[Figure: polynomial fits ranging from less capacity to more capacity. Credit: Francois Fleuret, EPFL]


● Underfitting: train error large, test error large
● Overfitting: train error small, test error large
● Trade-off: must find appropriate level of capacity for data distribution

[Figure: train error and test error against capacity (less to more), with underfitting at low capacity, overfitting at high capacity and the best compromise in between]
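The effect of capacity on train and test error can be tried out with polynomial regression. The sketch below is an illustration under stated assumptions: the quadratic ground truth, the noise level, the sample sizes and the degrees 1, 2 and 7 are hypothetical, and the least-squares fit solves the normal equations with a small hand-written Gaussian elimination to avoid dependencies.

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients via the normal equations."""
    n = degree + 1
    # Normal equations A c = r with A[j][k] = sum x^(j+k), r[j] = sum y * x^j.
    A = [[sum(x ** (j + k) for x in xs) for k in range(n)] for j in range(n)]
    r = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[pivot] = A[pivot], A[col]
        r[col], r[pivot] = r[pivot], r[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            r[row] -= factor * r[col]
    # Back substitution.
    coeffs = [0.0] * n
    for j in reversed(range(n)):
        s = sum(A[j][k] * coeffs[k] for k in range(j + 1, n))
        coeffs[j] = (r[j] - s) / A[j][j]
    return coeffs


def evaluate(coeffs, x):
    return sum(c * x ** j for j, c in enumerate(coeffs))


def mse(coeffs, xs, ys):
    return sum((evaluate(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def truth(x):
    return x ** 2  # hypothetical ground truth


rng = random.Random(0)
train_x = [-1 + 2 * i / 7 for i in range(8)]
test_x = [-0.95 + 1.9 * i / 49 for i in range(50)]
train_y = [truth(x) + rng.gauss(0, 0.05) for x in train_x]
test_y = [truth(x) + rng.gauss(0, 0.05) for x in test_x]

errors = {}
for degree in (1, 2, 7):
    coeffs = fit_polynomial(train_x, train_y, degree)
    errors[degree] = (mse(coeffs, train_x, train_y), mse(coeffs, test_x, test_y))
    print(degree, errors[degree])
```

Degree 1 cannot follow the quadratic trend, so both of its errors are large. Degree 7 has as many coefficients as there are training points and therefore drives the training error to essentially zero, which says nothing about how it behaves between those points.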

● Traditional ML: low to moderate capacity
● Deep Learning: Enormous capacity.
– Millions of parameters (>> # training examples)
– Still don’t overfit. Why?


Model selection (Hyperparameter selection)

● If test set is used to evaluate performance of different hypothesis sets (different models):
– Test error no longer unbiased estimator of true error!

[Diagrams labelled “Good”, “Still good” and “Bad”]


● Solution: Further split train data into
– Training set (~80%)
– Validation set (~20%)
● Pick algorithm that evaluates the best on validation set
● Report performance on test set

[Diagram labelled “Good”]
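A sketch of this procedure, using the number of neighbours k in k-nearest-neighbour regression as the hyperparameter. The model family, the candidate values of k, the toy data and the split sizes are assumptions made for the example.

```python
import random


def knn_predict(train, k, x):
    """Average the y-values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, k, samples):
    return sum((knn_predict(train, k, x) - y) ** 2 for x, y in samples) / len(samples)


rng = random.Random(0)
# Hypothetical data: noisy samples of y = x^2 on [-1, 1].
data = []
for _ in range(200):
    x = rng.uniform(-1, 1)
    data.append((x, x ** 2 + rng.gauss(0, 0.1)))

test, rest = data[:40], data[40:]          # ~20% test set, put aside
validation, train = rest[:32], rest[32:]   # ~20% of the rest for validation

candidates = (1, 5, 25, 100)
validation_errors = {k: mse(train, k, validation) for k in candidates}
best_k = min(validation_errors, key=validation_errors.get)
test_error = mse(train, best_k, test)      # computed once, after selection
```

The test set is touched exactly once, after `best_k` has been fixed, so `test_error` remains an honest estimate for the selected model.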


Loss for categorical data

● True error for classification: $P(\hat{f}(X) \neq Y)$
● Empirical training error $\frac{1}{N} \sum_{i=1}^{N} 1_{\{f_\theta(x_i) \neq y_i\}}$ is not a smooth function of $\theta$! Can’t be used as loss for gradient-based numerical optimization.


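● Solution: Formulate as really predicting the conditional distribution of $Y$ given $X = x$
● Specify probability distribution on $\mathcal{Y} = \{1, \dots, K\}$ as vector of probabilities $q = (q_1, \dots, q_K)$
● $q_k \geq 0$, $\sum_{k=1}^{K} q_k = 1$
● True cond. distribution is $p(x) = (P(Y = k \mid X = x))_{k = 1, \dots, K}$
● Seek estimate $\hat{p}(x)$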


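● To quantify error made in prediction:
– Use relative entropy/Kullback-Leibler divergence as distance between prob. measures.
– Recall: $D_{KL}(p \,\|\, q) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k}$
● Error made in prediction for fixed $x$: $D_{KL}(p(x) \,\|\, \hat{p}(x))$
● Unknown true error: $E[D_{KL}(p(X) \,\|\, \hat{p}(X))]$
● Unbiased estimator from iid samples $(x'_i, y'_i)$, $i = 1, \dots, M$: $\frac{1}{M} \sum_{i=1}^{M} D_{KL}(p(x'_i) \,\|\, \hat{p}(x'_i))$
– Here: if $Y$ is a deterministic function of $X$, then $p(x'_i)$ is the one-hot encoding of $y'_i$ (probability 1 on the class $y'_i$, 0 elsewhere)
● Concretely, estimator equals: $-\frac{1}{M} \sum_{i=1}^{M} \log \hat{p}(x'_i)_{y'_i}$
● The training loss function $L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{p}_\theta(x_i)_{y_i}$ is smooth! → Can use numerical optimization
● Remarks:
– MSE loss can be justified in terms of predicting a Gaussian dist.
– Not all loss functions derived in such a principled way.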
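The following sketch checks numerically that the Kullback-Leibler divergence between a one-hot vector and a predicted distribution reduces to minus the log of the probability assigned to the observed label. The function names and the two predicted distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) for probability vectors, with 0 log 0 = 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def one_hot(label, num_classes):
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def cross_entropy_loss(predictions, labels):
    """Average of -log(predicted probability of the observed label)."""
    return -sum(math.log(q[y]) for q, y in zip(predictions, labels)) / len(labels)


# Hypothetical predicted distributions over 3 classes and the observed labels.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]

loss = cross_entropy_loss(predictions, labels)
mean_kl = sum(
    kl_divergence(one_hot(y, 3), q) for q, y in zip(predictions, labels)
) / len(labels)
```

Here `loss` and `mean_kl` agree up to rounding, because every term of the KL sum with a zero one-hot entry vanishes. This loss is often called the cross-entropy loss, a name that does not appear on the slides.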

Example: Logistic regression

● Hypothesis set $\mathcal{H} = \{\hat{p}_{W,b}(x) = \mathrm{softmax}(Wx + b)\}$, where $\mathrm{softmax}(z)_k = e^{z_k} / \sum_{j=1}^{K} e^{z_j}$
● Loss $L(W, b) = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{p}_{W,b}(x_i)_{y_i}$ is convex in W,b! → Can find global minimum with gradient descent.
● To predict one class: output $\arg\max_k \hat{p}_{W,b}(x)_k$
– Equivalently: output $k$ with largest $(Wx + b)_k$

● Logistic regression can fit linearly separable data

[Figure: two data sets with a linear decision boundary, one labelled “Can fit”, the other “Can’t fit”]
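A minimal sketch of logistic (softmax) regression trained by gradient descent on a tiny linearly separable data set. The softmax parametrisation, the zero initialisation, the learning rate, the number of steps and the six toy points are assumptions of this example.

```python
import math


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def logits(W, b, x):
    return [sum(wkj * xj for wkj, xj in zip(wk, x)) + bk for wk, bk in zip(W, b)]


def train(points, labels, num_classes, learning_rate=0.5, steps=200):
    dim = len(points[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    n = len(points)
    for _ in range(steps):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(points, labels):
            p = softmax(logits(W, b, x))
            for k in range(num_classes):
                # Derivative of -log p[y] with respect to the k-th logit.
                delta = p[k] - (1.0 if k == y else 0.0)
                grad_b[k] += delta / n
                for j in range(dim):
                    grad_W[k][j] += delta * x[j] / n
        for k in range(num_classes):
            b[k] -= learning_rate * grad_b[k]
            for j in range(dim):
                W[k][j] -= learning_rate * grad_W[k][j]
    return W, b


def predict(W, b, x):
    z = logits(W, b, x)
    return max(range(len(z)), key=lambda k: z[k])


# Hypothetical linearly separable data: class 0 lower left, class 1 upper right.
points = [(-2.0, -1.0), (-1.0, -2.0), (-1.5, -1.5), (2.0, 1.0), (1.0, 2.0), (1.5, 1.5)]
labels = [0, 0, 0, 1, 1, 1]
W, b = train(points, labels, num_classes=2)
accuracy = sum(predict(W, b, x) == y for x, y in zip(points, labels)) / len(points)
```

`predict` takes the largest logit directly rather than the largest softmax probability. The two agree because softmax preserves the ordering of its inputs.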

Example: Logistic regression

● If data not linearly separable, can look for transformation $\phi$ that makes it more so
● Train on data $(\phi(x_1), y_1), \dots, (\phi(x_N), y_N)$
● $\phi(x)$ = a representation of $x$
● Traditional ML: Construct representation by hand
● Deep learning: Algorithm finds good representation during training
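A hand-made representation can be illustrated in one dimension. In this sketch the data, the transformation `phi` (squaring) and the helper `separable_by_threshold` are all hypothetical. In one dimension, "linearly separable" simply means that a single threshold splits the two classes.

```python
def separable_by_threshold(values, labels):
    """True if one threshold puts all 0-labels on one side and all 1-labels on the other."""
    zeros = [v for v, y in zip(values, labels) if y == 0]
    ones = [v for v, y in zip(values, labels) if y == 1]
    return max(zeros) < min(ones) or max(ones) < min(zeros)


def phi(x):
    # Hand-constructed representation: distance from the origin, squared.
    return x ** 2


# Hypothetical 1D data: label 1 when the point is far from the origin.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [1, 1, 0, 0, 0, 1, 1]

raw_ok = separable_by_threshold(xs, ys)
transformed_ok = separable_by_threshold([phi(x) for x in xs], ys)
```

The raw inputs are not separable because class 1 lies on both sides of class 0. After squaring, class 0 sits below class 1 and a single threshold suffices.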

Example: Logistic regression

● MNIST: 60k handwritten digits, 28x28 grayscale pixels
– x: image of a handwritten digit
– y: the digit shown, e.g. 0, 8, 3, 7
● Logistic regression on MNIST: test error rate ~7%

Classification: overfitting/underfitting

[Figure: three decision boundaries, labelled “Underfitting”, “Good fit” and “Overfitting”. Credit: Wikipedia]

Data encoding

● MNIST: 60k handwritten digits, 28x28 grayscale pixels
– x: vector of the 28·28 = 784 pixel intensities
– y: digit, encoded as a vector in $\{0, 1\}^{10}$ (One-hot)
● CIFAR100: 50k images of objects from 100 classes, 32x32 RGB pixels
– x: vector of the 32·32·3 = 3072 pixel values
– y: class, encoded as a vector in $\{0, 1\}^{100}$ (One-hot)
● Europarl EN-DE: 1.7m sentence pairs
– x, y: sentences, encoded as sequences of vectors (One-hot)
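One-hot encoding itself is short to write down. In this sketch the four-class list and the character vocabulary are hypothetical stand-ins: CIFAR100 has 100 classes, and the slides do not say whether text is encoded per character or per word.

```python
def one_hot(index, size):
    """Vector of zeros of the given size with a single 1 at position index."""
    vector = [0] * size
    vector[index] = 1
    return vector


# A digit label as a vector of length 10.
digit_label = one_hot(3, 10)

# A class label, using a hypothetical list of only 4 classes.
classes = ["cow", "elephant", "sunflower", "train"]
class_label = one_hot(classes.index("sunflower"), len(classes))

# Text as a sequence of one-hot vectors over a hypothetical character vocabulary.
alphabet = sorted(set("ja sagen"))
encoded_text = [one_hot(alphabet.index(ch), len(alphabet)) for ch in "ja"]
```

One-hot vectors treat all classes as equally far apart, which avoids suggesting that, say, class 3 is closer to class 4 than to class 9.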

Feature engineering / Representation engineering

● Traditional ML: Use hand-engineered features as inputs to algorithms

● Deep Learning: Feed algorithm raw data (pixels, character-level text, …)

Standard data sets: Used as benchmarks

[Figures: Performance on MNIST; Performance on CIFAR10] (Credit: https://srconstantin.wordpress.com/) (Credit: Francois Fleuret, EPFL)

Collective overfitting of test set by ML community

● Recall
– Good: [diagram]
– For heavily used dataset: [diagram]
● Need new datasets to appear periodically

(Credit: Francois Fleuret, EPFL)

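Bias-variance tradeoff

● Related to underfitting/overfitting
● Fix one $x$
● Fit $\hat{f}(x)$ is random variable (depends on train data)
● Decompose true MSE error at $x$, assuming $Y = f(x) + \varepsilon$ with noise $\varepsilon$ of mean 0 and variance $\sigma^2$, independent of the train data:

$$E[(\hat{f}(x) - Y)^2] = \underbrace{\mathrm{Var}(\hat{f}(x))}_{\text{Variance}} + \underbrace{(E[\hat{f}(x)] - f(x))^2}_{\text{Bias (squared)}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$

● Small variance, high bias: underfit
● Large variance, low bias: overfit
● Small bias, small variance is hard → Tradeoff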
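The decomposition can be derived in three lines. The noise model is an assumption of this sketch: $Y = f(x) + \varepsilon$, where $\varepsilon$ has mean 0 and variance $\sigma^2$ and is independent of the train data (and hence of $\hat{f}(x)$).

```latex
\begin{aligned}
E\big[(\hat{f}(x) - Y)^2\big]
&= E\big[(\hat{f}(x) - f(x) - \varepsilon)^2\big] \\
&= E\big[(\hat{f}(x) - f(x))^2\big]
   - 2\,E\big[\hat{f}(x) - f(x)\big]\,E[\varepsilon]
   + E[\varepsilon^2] \\
&= \underbrace{E\big[(\hat{f}(x) - E\hat{f}(x))^2\big]}_{\text{variance}}
   + \underbrace{\big(E\hat{f}(x) - f(x)\big)^2}_{\text{bias}^2}
   + \underbrace{\sigma^2}_{\text{irreducible error}}
\end{aligned}
```

The second line uses the independence of $\varepsilon$ and $\hat{f}(x)$, so the cross term vanishes because $E[\varepsilon] = 0$. The third line applies $E[Z^2] = \mathrm{Var}(Z) + (E Z)^2$ to $Z = \hat{f}(x) - f(x)$.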

Bias-variance tradeoff

[Figure: test error, variance and bias against capacity (less to more), with underfitting at low capacity and overfitting at high capacity]

Test error = variance + bias$^2$ (+ irreducible error)

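Maximum likelihood interpretation

● Example: Logistic regression
● Model: $P_\theta(Y = k \mid X = x) = \hat{p}_\theta(x)_k$
● True cond. distribution is $p(x)$
● Seek estimate $\hat{p}_{\hat{\theta}}(x)$
● Likelihood of train y given train x: $\prod_{i=1}^{N} \hat{p}_\theta(x_i)_{y_i}$
● Loss is neg. log-likelihood (normalized by $N$): $L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{p}_\theta(x_i)_{y_i}$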

Bayesian interpretation

● Consider model parameters as random with a prior distribution

● Bayes’ rule gives posterior distribution on parameters conditioned on data
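In symbols, Bayes' rule for the parameters reads as follows. The notation is assumed for this sketch: $\theta$ for the parameters, $D$ for the data and $\pi$ for the prior density. The maximum a posteriori (MAP) estimate and the Gaussian prior in the last two lines are one common way of using the posterior and are not stated on the slide.

```latex
\begin{aligned}
p(\theta \mid D) &= \frac{p(D \mid \theta)\,\pi(\theta)}{p(D)}
  \;\propto\; p(D \mid \theta)\,\pi(\theta), \\
\hat{\theta}_{\mathrm{MAP}} &= \arg\max_\theta \, p(\theta \mid D)
  = \arg\min_\theta \big[ -\log p(D \mid \theta) - \log \pi(\theta) \big], \\
\pi = \mathcal{N}(0, \tau^2 I) \;&\Longrightarrow\;
  -\log \pi(\theta) = \frac{1}{2\tau^2}\,\lVert \theta \rVert^2 + \text{const}.
\end{aligned}
```

Under this assumed Gaussian prior, the MAP estimate minimizes the negative log-likelihood plus a squared-norm penalty on the parameters.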

Deep Learning

● Parametric ML with hypothesis class: neural networks

General references

● Goodfellow, Bengio, Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org

● EPFL course slides and videos
– Prof. Francois Fleuret
– https://documents.epfl.ch/users/f/fl/fleuret/www/dlc/

Organisation

Meeting #    Speaker          Topic
Today        David Belius     Intro to Machine Learning
Next week    Marko Thiel      Intro to Artificial Neural Networks
March 8      Master Student   Regularization (Bengio Ch 7)
March 15     Master Student   Optimization (Bengio Ch 8)
March 22     Master Student   Convolutional neural networks (Bengio Ch 9)
April 12     Master Student   Recurrent Neural Networks (Bengio Ch 10)
April 26+    PhD Students     Advanced topics

● First four student talks
– Master student speakers: e-mail me any preferences
– Set up preliminary meeting with Marko and me

● Optional practical sessions (programming)
– E-mail me if interested

● No meeting April 19
