Short Trip In The Valley of Deep Learning
Pantelis Vlachas, Guido Novati
Computational Science and Engineering Lab, ETH Zürich



Page 1

Short Trip In The Valley of Deep Learning
Pantelis Vlachas, Guido Novati

Computational Science and Engineering Lab ETH Zürich

Page 2

Motivation - What is machine learning?

Classical Machine Learning: data → feature extraction → regression/classification/etc. → result

Deep Learning: data → feature extraction + regression/classification/etc. → result

Page 3

Deep Learning

Algorithms
• Backpropagation
• Backpropagation through time (BPTT)
• Variational Inference (Bayesian)
• GEMM (General Matrix to Matrix Multiplication)

Sophisticated Architectures
• LeNet of Yann LeCun et al., 1998
• LSTM, 1997
• GRU, 2014

Hardware
• Graphics Processing Units (GPUs)

Page 4

Convolutional Neural Networks

• Heavily based on GEMM (General Matrix to Matrix Multiplication)

• Parametric models suited for image processing (classification, object detection, etc.)

• Applications in self-driving cars, robotics, healthcare, physics, image and video recognition, recommender systems, image classification, medical image analysis, natural language processing, financial time series, etc.

LeNet of Yann LeCun et al., 1998

Page 5

Biological Intuition

• Very roughly, biological brains have neurons that activate when they recognize a triggering pattern in their input

• Each unit does “simple” pattern recognition

• Complexity emerges from sheer numbers

Page 6

Convolutional Model of a part of a Fruit-Fly's brain (Jonathan Schneider et al., 2018)

Page 7

Convolutional Neural Network

Page 8

What is a Convolution?

• Input is a tensor of size $d_{IY} \times d_{IX} \times d_{IC}$
• Parameters are a tensor of size $d_{KY} \times d_{KX} \times d_{IC} \times d_{KC}$ (we have $d_{KC}$ filters, indexed $K_C = 1, 2, 3, 4, \ldots$)
• Output is a tensor of size $d_{OY} \times d_{OX} \times d_{KC}$
• Mapping an image to another image
• Feature sizes $d_{IC}$, $d_{KC}$ can be any numbers
• Parameters are called "filters" or "kernels"

Page 9

Convolution Operation (filtering, sliding)

• Kernels of size $d_{KY} \times d_{KX} \times d_{IC} \times d_{KC}$
• Sliding a kernel along the spatial dimensions (iterating along $I_X$ and $I_Y$)
• At each position, and for each filter in the $K_C$ dimension, we compute the scalar product between the filter (of size $d_{KY} \times d_{KX} \times d_{IC}$) and a "patch" of the image of size $d_{KY} \times d_{KX} \times d_{IC}$
• The output of the scalar product is a number, which is written into a single pixel (channel) of the output image
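A minimal NumPy sketch of this sliding scalar product (stride 1, no padding; the explicit loops are for illustration only, since real frameworks lower the operation to GEMM, as noted earlier):

```python
import numpy as np

def conv2d_naive(img, kernels):
    """img: (d_IY, d_IX, d_IC); kernels: (d_KY, d_KX, d_IC, d_KC)."""
    d_IY, d_IX, d_IC = img.shape
    d_KY, d_KX, _, d_KC = kernels.shape
    d_OY, d_OX = d_IY - d_KY + 1, d_IX - d_KX + 1   # no padding, stride 1
    out = np.zeros((d_OY, d_OX, d_KC))
    for oy in range(d_OY):                  # slide along the spatial dimensions
        for ox in range(d_OX):
            patch = img[oy:oy + d_KY, ox:ox + d_KX, :]
            for kc in range(d_KC):          # one scalar product per filter
                out[oy, ox, kc] = np.sum(patch * kernels[:, :, :, kc])
    return out
```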

Page 10

1D Convolution, 1D Filter

Input: [1, -1, 2, 0, 1]; Filter: [1, 0, -1]

First position: align the filter with the first three entries and take the scalar product: (1)(1) + (-1)(0) + (2)(-1)

Page 11

1D Convolution, 1D Filter

Input: [1, -1, 2, 0, 1]; Filter: [1, 0, -1]

(1)(1) + (-1)(0) + (2)(-1) = -1, the first entry of the output

Page 12

1D Convolution, 1D Filter

Input: [1, -1, 2, 0, 1]; Filter: [1, 0, -1]

Sliding the filter across the remaining positions gives (-1)(1) + (2)(0) + (0)(-1) = -1 and (2)(1) + (0)(0) + (1)(-1) = 1, so the full output is [-1, -1, 1]
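NumPy's sliding dot product reproduces this result directly (np.correlate implements exactly this filter-sliding operation, whereas np.convolve would flip the filter first):

```python
import numpy as np

x = np.array([1, -1, 2, 0, 1])
k = np.array([1, 0, -1])
print(np.correlate(x, k, mode="valid"))   # [-1 -1  1]
```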

Page 13

Padding

• What if we want to keep the output equal to the input in the spatial dimensions ($d_{IX} = d_{OX}$, $d_{IY} = d_{OY}$)?
• The image is extended in both directions by $d_{PX}$ and $d_{PY}$
• Usually zero padding

Example: padding the input [1, -1, 2, 0, 1] with one zero on each side gives [0, 1, -1, 2, 0, 1, 0]; sliding the filter [1, 0, -1] over it yields [1, -1, -1, 1, 0], the same length as the original input.

Page 14

Stride

• The convolution does not have to be computed in increments of 1 pixel
• Stride (skip, stepping): $d_{SY}$ and $d_{SX}$
• Here padding $d_P = 1$, stride $d_S = 2$: the filter [1, 0, -1] visits every second position of the padded input [0, 1, -1, 2, 0, 1, 0], producing the output [1, -1, 0]
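Both effects are easy to reproduce in NumPy (the use of np.pad and a strided slice is my own illustration, not from the slides):

```python
import numpy as np

x = np.array([1, -1, 2, 0, 1])
k = np.array([1, 0, -1])

# Zero padding: one zero on each side keeps the "same" output length (stride 1).
x_pad = np.pad(x, 1)                          # [0, 1, -1, 2, 0, 1, 0]
same = np.correlate(x_pad, k, mode="valid")
print(same)                                   # [ 1 -1 -1  1  0]

# Stride 2: keep every second output of the stride-1 result.
print(same[::2])                              # [ 1 -1  0]
```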

Page 15

2-D Convolutions

• If the input is $d_{IY} \times d_{IX} \times d_{IC}$
• The filters (kernels) are $d_{KY} \times d_{KX} \times d_{IC} \times d_{KC}$
• With strides $d_{SY}$, $d_{SX}$
• Padding $d_{PY}$, $d_{PX}$
• The output image has size:

$d_{OY} = \dfrac{d_{IY} - d_{KY} + 2 d_{PY}}{d_{SY}} + 1, \qquad d_{OX} = \dfrac{d_{IX} - d_{KX} + 2 d_{PX}}{d_{SX}} + 1, \qquad d_{OC} = d_{KC}$
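As a quick sanity check on the formula (a small helper of my own, not from the slides), using the 1D example from the previous pages:

```python
def conv_output_size(d_I, d_K, d_P=0, d_S=1):
    """Output size along one spatial dimension: (d_I - d_K + 2*d_P) // d_S + 1."""
    return (d_I - d_K + 2 * d_P) // d_S + 1

print(conv_output_size(5, 3))                 # 3  (valid convolution)
print(conv_output_size(5, 3, d_P=1))          # 5  ("same" padding)
print(conv_output_size(5, 3, d_P=1, d_S=2))   # 3  (padding 1, stride 2)
```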

Page 16

Convolutional Neural Network

Page 17

Pooling Operations (Subsampling)
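The slide shows this as a figure; in code, a pooling layer slides a window like a convolution and keeps a summary statistic of each patch, typically the maximum. A minimal non-overlapping 2x2 max-pooling sketch (single channel; my own illustration):

```python
import numpy as np

def max_pool2x2(img):
    """Non-overlapping 2x2 max pooling over a (H, W) array (H, W even)."""
    H, W = img.shape
    return img.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 8]])
print(max_pool2x2(x))   # [[4 2] [2 8]]
```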

Page 18

Convolutional Neural Network

Page 19

Short detour in classification…


Page 20

Classification

INPUT (object, image, etc.) → ALGORITHM → OUTPUT: 0/1 (binary)

Page 21

Logistic Regression

INPUT $x \in \mathbb{R}^{d_x}$ → ALGORITHM (weights $W$) → OUTPUT $y \in \{0, 1\}$

• Training examples $\{(x_1, y_1), \ldots, (x_{N_{\mathrm{train}}}, y_{N_{\mathrm{train}}})\}$
• Testing examples $\{(x_1, y_1), \ldots, (x_{N_{\mathrm{test}}}, y_{N_{\mathrm{test}}})\}$

Page 22

Logistic Regression

INPUT $x \in \mathbb{R}^{d_x}$ → ALGORITHM (weights $W$) → OUTPUT $\hat{y} = f_W(x) \in \mathbb{R}$

• Output is a real number (one class)
• Ideally we want $\hat{y} = P(y = 1 \mid x)$
• Training data $\{(x_1, y_1), \ldots, (x_{N_{\mathrm{train}}}, y_{N_{\mathrm{train}}})\}$, $y_n \in \{0, 1\}$
• Model 1: linear regression: $\hat{y} = f_w(x) = w^T x + b$
• Model 2: sigmoid output layer: $\hat{y} = f_w(x) = \sigma(w^T x + b)$, with $\sigma(x) = \frac{1}{1 + e^{-x}}$
• $W^\star = \arg\min_W L(y, \hat{y})$; which loss $L(y, \hat{y})$? Squared loss $L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$?
• Cross-entropy loss: $L(y, \hat{y}) = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right)$
• If $y = 1$: $L(y, \hat{y}) = -\log \hat{y} = -\log P(y = 1 \mid x)$ → maximum log likelihood!
• If $y = 0$: $L(y, \hat{y}) = -\log(1 - \hat{y}) = -\log P(y = 0 \mid x)$ → maximum log likelihood!
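A compact NumPy sketch of Model 2 trained with the cross-entropy loss by gradient descent (the toy data, learning rate, and iteration count are illustrative assumptions; the update uses the standard identity $\partial L / \partial z = \hat{y} - y$ for $z = w^T x + b$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))                  # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)         # toy binary labels

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # sigmoid(w^T x + b)
    grad_z = y_hat - y                             # dL/dz for cross-entropy
    w -= lr * X.T @ grad_z / len(y)
    b -= lr * grad_z.mean()

print(np.mean((y_hat > 0.5) == y))                 # training accuracy
```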

Page 23

Logistic Regression

INPUT $x \in \mathbb{R}^{d_x}$ → ALGORITHM (weights $W$) → OUTPUT $\hat{y} = f_W(x) \in \mathbb{R}$

• Output is a real number (one class); ideally we want $\hat{y} = P(y = 1 \mid x)$
• Training data $\{(x_1, y_1), \ldots, (x_{N_{\mathrm{train}}}, y_{N_{\mathrm{train}}})\}$, $y_n \in \{0, 1\}$
• Model 2: sigmoid output layer: $\hat{y} = f_w(x) = \sigma(w^T x + b)$, with $W^\star = \arg\min_W L(y, \hat{y})$
• Cross-entropy loss: $L(y, \hat{y}) = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right)$
• How can we classify an object into more than 2 classes?

Page 24

Classification

INPUT (object, image, etc.) → ALGORITHM → OUTPUT: multi-class, e.g. the one-hot vector [1, 0, 0, 0] over the classes Dog, Cat, Mouse, Elephant

Page 25

Back to CNNs…

Page 26

Classification on Images

Input to the network is an image → CNN → { 0.02 0.03 0.01 0.01 0.70 0.02 0.02 0.01 0.06 0.12 }

Probability that the image is the digit 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, respectively.

The output of the network is the probability of the input image being one of the digits (belonging to one of the target classes).

Page 27

Classification Layer

• SoftMax output layer, applied to the last hidden layer $h_l$: $o = \mathrm{softmax}(W h_l + b)$, with $f(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{10} \exp(x_j)}$
• Sum of outputs is equal to 1
• The outputs represent probabilities for the target classes
• Loss function? Cross-entropy loss: $L(f, \hat{f}) = -\sum_{i=1}^{10} f_i \log \hat{f}(x_i)$, with a one-hot target, e.g. $f = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]$ for the digit 4
• A measure of dissimilarity between distributions
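A numerically stable version of these two formulas (subtracting the maximum before exponentiating is a standard trick and does not change the result, since softmax is shift-invariant):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))        # shift for numerical stability
    return e / e.sum()

def cross_entropy(f, f_hat):
    return -np.sum(f * np.log(f_hat))

logits = np.array([0.1, 0.2, 0.0, 0.0, 2.5, 0.1, 0.1, 0.0, 0.9, 1.4])
f = np.zeros(10); f[4] = 1.0         # one-hot target for the digit 4
f_hat = softmax(logits)
print(f_hat.sum(), cross_entropy(f, f_hat))   # 1.0 and -log f_hat[4]
```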

Page 28

Automatic Feature Detection

• Low-level features (edges, circles, mesh, text, etc.)
• Second-layer features
• High-level features

Page 29

Architectures

• LeNet by Yann LeCun et al., 1998
• AlexNet by Alex Krizhevsky et al., 2012
• VGG Net by Oxford's Visual Geometry Group, 2014
• GoogLeNet by Christian Szegedy et al., 2014
• ResNet (Residual Network) by Kaiming He et al., 2015
• DenseNet by Gao Huang et al., 2016

(figures: LeNet of Yann LeCun et al., 1998; AlexNet of Alex Krizhevsky et al., 2012)

Page 30

Heuristics for Deep Learning

Data Preprocessing
• Scaling (e.g. zero mean, unit variance)
• Random cropping
• Flipping data
• PCA whitening
• Noise

Initialization of Weights
• Scale the weights of each layer by the inverse of the square root of the number of input neurons: $1/\sqrt{N_l}$
• Xavier initialization

Activation Functions
• tanh
• sigmoid
• ReLU
• ELU

Page 31

Regularization

(figure: Dropout applied to a fully-connected network)

Page 32

Operating on Sequences

• In many applications, the data have temporal order (language, time series, etc.)
• Fully-connected networks and CNNs do not take this feature into account and have fixed input and output sizes
• Recurrent Neural Networks: networks with feedback loops

Page 33

Operating on Sequences

(figure: an RNN cell unrolled over time; at each step the previous hidden state $h_{t-1}$ and the input $x_t$ pass through a tanh layer to produce the new hidden state $h_t$)
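As a sketch in NumPy (the concatenation $[h_{t-1}, x_t]$ and the weight shapes are illustrative assumptions consistent with the figure):

```python
import numpy as np

def rnn_step(x, h_prev, W, b):
    """h_t = tanh(W [h_{t-1}, x_t] + b) -- the same W, b at every step."""
    return np.tanh(W @ np.concatenate([h_prev, x]) + b)

d_h, d_x = 8, 3
rng = np.random.default_rng(0)
W, b = rng.standard_normal((d_h, d_h + d_x)) * 0.1, np.zeros(d_h)
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_x)):   # unroll over a 5-step sequence
    h = rnn_step(x, h, W, b)
```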

Page 34

Weight Sharing in Time

(figure: the unrolled RNN; starting from $h_0$, each step applies the same cell with shared weights $W, b$, mapping the inputs $x_1, x_2, x_3, \ldots$ to hidden states $h_1, h_2, h_3, \ldots, h_T$ and outputs $y_1, y_2, y_3, \ldots, y_T$)

Page 35

Weight Sharing in Time

(figure: the same unrolled RNN, now with a loss $L_t$ at every step comparing the prediction $\hat{y}_t$ with the target $y_t$, e.g. $L_1 = |y_1 - \hat{y}_1|^2$)

The total loss averages over the sequence: $L = \frac{1}{T} \sum_{t=1}^{T} L_t$

Page 36

Backpropagation Through Time (BPTT)

FORWARD PASS - entire sequence, compute loss $L$
BACKWARD PASS - entire sequence, compute gradients

Page 37

Truncated BPTT

"Carry" the hidden state forward forever
BACKWARD PASS - on some smaller number of steps
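A minimal PyTorch-style sketch of truncated BPTT (the framework, model sizes, and chunk length are assumptions, not from the slides): the hidden state is carried across chunks, while the gradient graph is cut at chunk boundaries.

```python
import torch

rnn = torch.nn.RNN(input_size=1, hidden_size=32, batch_first=True)
head = torch.nn.Linear(32, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(1, 100, 1)   # toy sequence: (batch, time, features)
y = torch.randn(1, 100, 1)   # toy targets
h = None
chunk = 10                   # backpropagate through 10 steps at a time

for t0 in range(0, 100, chunk):
    out, h = rnn(x[:, t0:t0 + chunk], h)
    loss = ((head(out) - y[:, t0:t0 + chunk]) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()           # carry the state, cut the gradient graph
```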

Page 38

Vanishing Gradients Problem

(figure: a chain of $\tanh(W \cdot)$ cells unrolled over four steps, mapping $(h_0, x_1) \to h_1$, $(h_1, x_2) \to h_2$, $(h_2, x_3) \to h_3$, $(h_3, x_4) \to h_4$)

• Computing the gradient of the loss w.r.t. $h_0$ involves many factors of $W$ and repeated $\tanh$
• In the case of a linear activation and no bias, you would have factors like $W(W(\ldots(W h_0)))$
• The gradient vanishes (explodes) if the largest singular value of $W$ is $< 1$ ($> 1$)
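A small numerical illustration of this (my own, not from the slides): repeatedly applying $W$, as in $W(W(\ldots(W h_0)))$, shrinks or blows up a vector depending on whether the largest singular value of $W$ is below or above 1.

```python
import numpy as np

rng = np.random.default_rng(0)
h0 = rng.standard_normal(4)

for scale in (0.9, 1.1):      # largest singular value < 1, then > 1
    # An orthogonal matrix scaled by `scale` has all singular values = scale.
    W = scale * np.linalg.svd(rng.standard_normal((4, 4)))[0]
    h = h0.copy()
    for _ in range(50):
        h = W @ h
    print(scale, np.linalg.norm(h))   # ~0.9**50 vs ~1.1**50 times ||h0||
```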

Page 39

Gating Architectures

Long Short-Term Memory Cell
(figure: input, forget, and output gates $g^i_t$, $g^f_t$, $g^o_t$ control how the cell state $c_{t-1}$ is updated to $c_t$ through elementwise products, a sum, and tanh nonlinearities, and how the new hidden state $h_t$ is produced from $h_{t-1}$, the input, and $\tanh(c_t)$)

Gated Recurrent Unit
(figure: a reset gate $r_t$ and an update gate $z_t$ blend the previous hidden state $h_{t-1}$ with a tanh candidate state $\tilde{h}_t$, as $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$)
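A minimal NumPy sketch of one LSTM cell step, following the standard equations with the slide's gate names $g^i_t$, $g^f_t$, $g^o_t$ (the concatenation $[h_{t-1}, x_t]$ is an assumption about the layout):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, Wi, Wf, Wo, Wc, bi, bf, bo, bc):
    z = np.concatenate([h_prev, x])
    g_i = sigmoid(Wi @ z + bi)          # input gate
    g_f = sigmoid(Wf @ z + bf)          # forget gate
    g_o = sigmoid(Wo @ z + bo)          # output gate
    c_tilde = np.tanh(Wc @ z + bc)      # candidate cell state
    c = g_f * c_prev + g_i * c_tilde    # cell state: elementwise update only
    h = g_o * np.tanh(c)                # new hidden state
    return h, c
```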

Page 40

Gating Architectures

(figure: four LSTM cells unrolled over the inputs $x_1, x_2, x_3, x_4$; the cell state passes along the top of the sequence as $C_0, C_1, C_2, C_3, C_4$, touched only by elementwise gate operations, while the hidden states $h_0, h_1, h_2, h_3, h_4$ are passed below)

Uninterrupted gradient flow!

Page 41

RNN structure

• One-to-one: e.g. classification
• Many-to-one: e.g. sentiment analysis
• One-to-many: e.g. image captioning, video generation
• Many-to-many: e.g. machine translation, time-series prediction

Page 42

Prediction of Chaotic Dynamics

• Forecasting the state of the Kuramoto-Sivashinsky equation

$\frac{\partial u}{\partial t} = -\nu \frac{\partial^4 u}{\partial x^4} - \frac{\partial^2 u}{\partial x^2} - u \frac{\partial u}{\partial x}$

• RNNs can be chaotic!
• They are dynamical systems

Page 43

Word embeddings

• Words can be represented by numbers (vectors) that encode semantic meaning
• E.g. Word2Vec
• Input: a LARGE CORPUS OF TEXT
• Learns a vector space where each word is assigned a vector
• How? Predict a word (target) from its neighboring words (context), or vice versa
• Encodes context information

(figure: in the embedding space, the offset from France to Paris matches the offset from Greece to Athens, the closest word)
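A hedged gensim sketch of the France : Paris :: Greece : ? analogy (gensim and the two-sentence toy corpus are assumptions, not from the slides; a corpus this small will not reliably learn the analogy, a large corpus is needed):

```python
from gensim.models import Word2Vec

corpus = [["paris", "is", "the", "capital", "of", "france"],
          ["athens", "is", "the", "capital", "of", "greece"]]
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=200)

# "France is to Paris as Greece is to ?" -- with a real corpus the closest
# word should be "athens".
print(model.wv.most_similar(positive=["paris", "greece"], negative=["france"]))
```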

Page 44

Word embeddings

Page 45

Applications
The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches, 2018

• Object detection
• Object localisation
• Image/video segmentation
• Autonomous driving
• Brain cancer detection
• Skin cancer recognition
• Speech recognition
• Machine translation
• Image/video captioning
• Medicine/biology

Page 46

Outlook

Why Deep Learning?
• Universal approach for learning problems
• Robust approach; does not require "much" expert knowledge
• Generalization, scalability

Challenges?
• Big data and scalability
• Generalization, transfer learning, multi-task learning
• Generating new "artificial" datasets for applications where data is scarce (generative models)
• Understanding/explainable models, incorporating physics
• Causality, not plain pattern recognition/correlations
• Energy-efficient implementations on mobiles/FPGAs, etc.

(figure: performance vs. amount of data, for classical ML and deep learning)