
Réseaux de Neurones Profonds, Apprentissage de Représentations (Deep Neural Networks, Representation Learning)

Thierry Artières

ECM, LIF-AMU

December 14, 2017


1 Introduction

2 MLPs

3 Deeper in MLPs

4 DNNs in brief


Outline

1 Introduction
2 MLPs
3 Deeper in MLPs
4 DNNs in brief


History

Key dates
- 1980s: Back-propagation [Rumelhart and Hinton]
- 1990s: Convolutional networks [LeCun et al.]
- 1990s: Long Short-Term Memory networks [Hochreiter and Schmidhuber]
- 2006: Paper on Deep Learning in Science [Hinton et al.]
- 2012: ImageNet challenge win [Krizhevsky, Sutskever, and Hinton]
- 2013: First edition of ICLR
- 2013: Memory networks [Weston et al.]
- 2014: Adversarial networks [Goodfellow et al.]
- 2014: GoogLeNet [Szegedy et al.]
- 2015: Residual networks [He et al.]


Deep Learning today

Spectacular breakthroughs, fast industrial transfer
- Images, videos, audio, speech, texts

Successful setting
- Structured data (temporal, spatial, ...)
- Huge volumes of data
- Huge models (millions of parameters)


The Grail

[Gallant et al., ...]


The key: features

[Honglak et al.]


Annotating real visual scenes

[Farabet et al., 2012]


Automatic captioning

[Honglak et al., 2014]


Outline

1 Introduction
2 MLPs
3 Deeper in MLPs
4 DNNs in brief


A single Neuron

One Neuron

Elementary computation

activation: a(x) = w^T x = Σ_j w_j x_j + w_0
output = g(a(x))

Non-linearity g:
- Sigmoid, hyperbolic tangent, Gaussian
- Rectified Linear Unit (ReLU): f(x) = 0 if x ≤ 0, x otherwise
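To make the computation concrete, here is a minimal NumPy sketch of the single-neuron forward pass above, with ReLU as the non-linearity g; the input and weight values are purely illustrative.

    import numpy as np

    def relu(a):
        # Rectified Linear Unit: g(a) = 0 if a <= 0, a otherwise
        return np.maximum(0.0, a)

    def neuron(x, w, w0, g=relu):
        # activation a(x) = w^T x + w0 ; output = g(a(x))
        a = np.dot(w, x) + w0
        return g(a)

    # Illustrative values (not from the slides)
    x = np.array([1.0, -2.0, 0.5])
    w = np.array([0.3, 0.1, -0.4])
    print(neuron(x, w, w0=0.2))   # a = 0.3 - 0.2 - 0.2 + 0.2 = 0.1 ; relu(0.1) = 0.1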


Multi Layer Perceptron (MLP)

Structure

Organization in successive layers:
- Input layer
- Hidden layers
- Output layer

Function implemented by an MLP:
g(W^o · g(W^h x))

Inference: forward propagation from the input layer to the output layer
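A minimal sketch of the one-hidden-layer function g(W^o · g(W^h x)); the sigmoid non-linearity and the weight shapes are illustrative assumptions, and biases are omitted as in the formula above.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def mlp_one_hidden(x, W_h, W_o, g=sigmoid):
        # Forward propagation: output = g(W_o . g(W_h . x))
        h = g(W_h @ x)      # hidden-layer representation
        return g(W_o @ h)   # output layer

    # Illustrative shapes: 3 inputs, 4 hidden units, 2 outputs
    rng = np.random.default_rng(0)
    y = mlp_one_hidden(rng.normal(size=3), rng.normal(size=(4, 3)), rng.normal(size=(2, 4)))
    print(y.shape)  # (2,)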


MLP : Forward propagation

Forward propagation of activities, for an input example x:
- Fill the input layer with x: h^0 = x
- Iterate from the first hidden layer to the last one:
  a^l = W^l h^{l−1}
  h^l = g(a^l)
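The layer-by-layer recursion above can be written as a short loop; this is a sketch assuming a list of weight matrices [W^1, ..., W^L] and tanh as g (both illustrative choices). The pre-activations a^l are kept because backpropagation will need them later.

    import numpy as np

    def forward(x, weights, g=np.tanh):
        # h^0 = x ; for l = 1..L : a^l = W^l h^{l-1}, h^l = g(a^l)
        h, pre_acts = x, []
        for W in weights:
            a = W @ h
            pre_acts.append(a)
            h = g(a)
        return h, pre_acts   # network output h^L and the pre-activations a^1..a^L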


MLP Usage for Regression

Notation

- y_ij: target (ideal) output of the j-th neuron of the output layer when the input is example i
- o_ij: actual output of the j-th neuron of the output layer when the input is example i
- N: number of samples
- O: number of outputs of the model = size of the output layer

Training criterion: Mean Squared Error
MSE = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{O} ‖y_ij − o_ij‖²

Inference: forward propagation from the input layer to the output layer; output: (o_ij)_{j=1..O}
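A direct transcription of the MSE criterion, assuming the targets y_ij and outputs o_ij are stored as two N × O arrays (hypothetical names).

    import numpy as np

    def mse(Y, O):
        # (1/N) * sum_i sum_j (y_ij - o_ij)^2
        return np.mean(np.sum((Y - O) ** 2, axis=1))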


MLP Usage for Classification

Training
- One-hot encoding of outputs: as many outputs as there are classes
- MSE criterion, as for regression problems
- Cross-entropy criterion:
  - Transform the outputs o_ij into a probability distribution with a softmax: p_ij = exp(o_ij) / Σ_{k=1}^{O} exp(o_ik)
  - New outputs of the model: p_ij = output of the j-th neuron of the output layer when the input is example i
  - Criterion: cross-entropy = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{O} y_ij log(p_ij)

Inference
- Forward propagation from the input layer to the output layer
- Decision based on the maximum value among output cells: c = argmax_{j=1..O} p_ij
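A sketch of the classification pipeline above: softmax over the raw outputs (standard positive-exponent convention), cross-entropy against one-hot targets, and the arg-max decision. The max-subtraction and the eps term are numerical-stability details not on the slide.

    import numpy as np

    def softmax(O):
        # p_ij = exp(o_ij) / sum_k exp(o_ik), computed row by row
        Z = np.exp(O - O.max(axis=1, keepdims=True))   # subtract the row max for stability
        return Z / Z.sum(axis=1, keepdims=True)

    def cross_entropy(Y, P, eps=1e-12):
        # -(1/N) * sum_i sum_j y_ij log(p_ij)
        return -np.mean(np.sum(Y * np.log(P + eps), axis=1))

    def decide(P):
        # c = argmax_j p_ij for each example i
        return np.argmax(P, axis=1)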


Learning an MLP

Learning as an optimization problem
Objective function of the parameter set w for a given training set T:
C(w) = F(w) + R(w) = Σ_{(x,y)∈T} L(x, y, w) + ‖w‖²
Gradient descent optimization: w ← w − ε ∂C(w)/∂w

Backpropagation
Use the chain rule to compute the derivative of the loss with respect to all weights in the NN.


Gradient Descent Optimization

- Initialize the weights (randomly)
- Iterate (until convergence): re-estimate w_{t+1} = w_t − ε ∂C(w)/∂w |_{w_t}
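The iteration above as a small loop; the gradient oracle grad_C, the step ε and the stopping test are illustrative assumptions.

    import numpy as np

    def gradient_descent(w0, grad_C, eps=0.01, n_iters=1000, tol=1e-6):
        # Initialize weights, then iterate w_{t+1} = w_t - eps * dC/dw |_{w_t}
        w = w0
        for _ in range(n_iters):
            g = grad_C(w)
            w = w - eps * g
            if np.linalg.norm(g) < tol:   # crude convergence test
                break
        return w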


Gradient Descent: Tuning the Learning rate

Two-class classification problem: weight trajectory for two different gradient step settings.

Images from [LeCun et al.]


Gradient Descent: Tuning the Learning rate

Effect of the learning rate setting
- Assuming the gradient direction is good, there is an optimal value for the learning rate
- A smaller value slows down convergence and may prevent it
- A larger value makes convergence chaotic and may cause divergence

Images from [LeCun et al.]


Gradient Descent: Stochastic, Batch and Mini-batch

Objective: minimize C(w) = Σ_{i=1..N} L_w(i), with L_w(i) = L(x_i, y_i, w)

Batch vs. Stochastic vs. Mini-batch
Batch gradient descent
- Use ∇C(w)
- At every iteration, all samples are used to compute the gradient direction and amplitude
Stochastic gradient
- Use ∇L_w(i)
- At every iteration, one sample (randomly chosen) is used to compute the gradient direction and amplitude
- Introduces randomization in the process: C(w) is minimized by successively minimizing parts of it
- Allows faster convergence, helps avoid local minima, etc.
Mini-batch
- Use ∇ Σ_{j∈B} L_w(j) for a small batch B of samples
- At every iteration, a batch of samples (randomly chosen) is used to compute the gradient direction and amplitude
- Introduces randomization in the process
- Exploits the GPU's computation ability
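A sketch contrasting with the full-batch loop above: each update uses a randomly chosen mini-batch. The gradient oracle grad_loss, the batch size and the number of epochs are illustrative assumptions.

    import numpy as np

    def sgd_minibatch(w, X, Y, grad_loss, eps=0.01, batch_size=32, n_epochs=10):
        # grad_loss(w, Xb, Yb): gradient of sum_{j in batch} L_w(j)
        N = X.shape[0]
        for _ in range(n_epochs):
            order = np.random.permutation(N)              # randomize the sample order
            for start in range(0, N, batch_size):
                idx = order[start:start + batch_size]     # one random mini-batch
                w = w - eps * grad_loss(w, X[idx], Y[idx])
        return w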


Gradient Computation: Chain rule

Gradient of a function
z = 2 f(x + 3y) + 6 g(5x) h(y)
⇒ ∂z/∂x |_{x,y} = 2 f′(x + 3y) + 30 g′(5x) h(y)

Equivalent computation with the chain rule
Set a(x, y) = f(x + 3y) and b(x) = g(5x), so that z = 2 a(x, y) + 6 b(x) h(y).
⇒ ∂z/∂x |_{x,y} = ∂z/∂a |_{x,y} × ∂a/∂x |_{x,y} + ∂z/∂b |_{x,y} × ∂b/∂x |_{x,y}
with:
∂z/∂a |_{x,y} = 2 and ∂a/∂x |_{x,y} = f′(x + 3y)
∂z/∂b |_{x,y} = 6 h(y) and ∂b/∂x |_{x,y} = 5 g′(5x)
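A quick numerical check of the derivative above, with illustrative choices f = sin, g = cos, h = exp; the analytic chain-rule expression is compared with a central finite difference.

    import numpy as np

    f = np.sin                    # f and its derivative f'
    fp = np.cos
    g = np.cos                    # g and its derivative g'
    gp = lambda u: -np.sin(u)
    h = np.exp

    def z(x, y):
        return 2 * f(x + 3 * y) + 6 * g(5 * x) * h(y)

    def dz_dx(x, y):
        # dz/dx = 2 f'(x + 3y) + 30 g'(5x) h(y)
        return 2 * fp(x + 3 * y) + 30 * gp(5 * x) * h(y)

    x0, y0, eps = 0.7, -0.2, 1e-6
    numeric = (z(x0 + eps, y0) - z(x0 - eps, y0)) / (2 * eps)   # central finite difference
    print(dz_dx(x0, y0), numeric)                               # the two values should agree closely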


Gradient computation in MLPs: Stochastic case

Notations
- Activation function on every layer: g — number of layers: L
- Activity of neuron i in layer l: a^l_i — output of neuron i in layer l: h^l_i = g(a^l_i), and o^L_i = g(a^L_i)
- Weight from a neuron j of layer l−1 to neuron i in layer l: w^l_ij
- Example considered for computing the gradient: (x, y)
- Squared loss: C(w) = ‖o^L − y‖²

Gradient wrt. last layer weights
- Gradient wrt. a cell's output: ∂C(w)/∂o^L_i = 2 (o^L_i − y_i)
- Gradient wrt. a cell's activity: δ^L_i = ∂C(w)/∂a^L_i = ∂C(w)/∂o^L_i × ∂o^L_i/∂a^L_i = 2 (o^L_i − y_i) g′(a^L_i)
- Gradient wrt. the weights arriving at output cells: ∂C(w)/∂w^L_ij = ∂C(w)/∂a^L_i × ∂a^L_i/∂w^L_ij = δ^L_i × h^{L−1}_j
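The output-layer formulas as code, assuming tanh for g and the squared loss above; o_L, y, a_L, h_prev stand for the output, target, output-layer pre-activations and last-hidden-layer outputs (names are illustrative).

    import numpy as np

    def gprime(a):
        # derivative of g = tanh at the pre-activation a (illustrative choice of g)
        return 1.0 - np.tanh(a) ** 2

    def output_layer_grads(o_L, y, a_L, h_prev):
        delta_L = 2.0 * (o_L - y) * gprime(a_L)   # delta^L_i = 2 (o^L_i - y_i) g'(a^L_i)
        grad_W_L = np.outer(delta_L, h_prev)      # dC/dw^L_ij = delta^L_i * h^{L-1}_j
        return delta_L, grad_W_L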


Gradient computation in MLPs: Stochastic case (continued)

Gradient wrt. last hidden layer (LHL) weights
- Gradient wrt. an LHL cell's activity: δ^{L−1}_j = ∂C(w)/∂a^{L−1}_j = Σ_i ∂C(w)/∂a^L_i × ∂a^L_i/∂a^{L−1}_j = Σ_i δ^L_i w^L_ij g′(a^{L−1}_j)
- Gradient wrt. the weights arriving at an LHL cell: ∂C(w)/∂w^{L−1}_jk = δ^{L−1}_j × h^{L−2}_k
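Continuing the sketch above for the last hidden layer (the same pattern applies to any hidden layer): the output deltas are folded back through W^L; gprime is the tanh derivative from the previous sketch, passed in explicitly.

    import numpy as np

    def hidden_layer_grads(delta_L, W_L, a_prev, h_prev2, gprime):
        # delta^{L-1}_j = ( sum_i delta^L_i w^L_ij ) * g'(a^{L-1}_j)
        delta_prev = (W_L.T @ delta_L) * gprime(a_prev)
        grad_W_prev = np.outer(delta_prev, h_prev2)   # dC/dw^{L-1}_jk = delta^{L-1}_j * h^{L-2}_k
        return delta_prev, grad_W_prev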


Gradient computation in MLPs

Forward propagation of activities, for an input example x
- Fill the input layer with x: h^0 = x
- Iterate from the first hidden layer to the last one: a^l = W^l h^{l−1}, h^l = g(a^l)

Backward computation of the error
- Compute the output error δ^L
- Iterate from the last hidden layer to the first one: compute δ^{l−1} from δ^l

Computing the gradient
- For each weight w^l_jk of every layer, compute the gradient using δ^l_j and h^{l−1}_k
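Putting the three steps together, a compact sketch of backpropagation for the squared loss, assuming a list of weight matrices and tanh activations (illustrative choices); it returns ∂C/∂W^l for every layer.

    import numpy as np

    def backprop(x, y, weights, g=np.tanh, gprime=lambda a: 1.0 - np.tanh(a) ** 2):
        # Forward pass: store the layer outputs h^l and pre-activations a^l
        hs, acts = [x], []
        for W in weights:
            acts.append(W @ hs[-1])
            hs.append(g(acts[-1]))
        # Backward pass: delta^L from the loss, then delta^{l-1} from delta^l
        grads = [None] * len(weights)
        delta = 2.0 * (hs[-1] - y) * gprime(acts[-1])      # output error delta^L
        for l in range(len(weights) - 1, -1, -1):
            grads[l] = np.outer(delta, hs[l])              # dC/dW^l = delta^l (h^{l-1})^T
            if l > 0:
                delta = (weights[l].T @ delta) * gprime(acts[l - 1])
        return grads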


Lots of tricks to favor convergence

And more...
- Weight initialization
- Gradient step setting
- ...
⇒ Despite appearances, NNs are still not fully usable by non-experts


Outline

1 Introduction
2 MLPs
3 Deeper in MLPs
4 DNNs in brief


A single Neuron

One Neuron

Elementary computation

activation: a(x) = w^T x = Σ_j w_j x_j + w_0
output = g(a(x))

Non-linearity g:
- Sigmoid, hyperbolic tangent, Gaussian
- Rectified Linear Unit (ReLU): f(x) = 0 if x ≤ 0, x otherwise


What an MLP may compute

What does a hidden neuron do?
- It divides the input space in two.

Combining multiple hidden neurons
- Allows identifying complex areas of the input space
- Gives a new (distributed) representation of the input
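A toy illustration of the two points above, using hard-threshold units for readability (real MLPs use a smooth g): each hidden neuron selects one side of a hyperplane, and an output unit that requires all three to fire carves out a triangular region of the input space. All weights here are made up for the example.

    import numpy as np

    def half_space(x, w, b):
        # One hidden neuron with a hard threshold: which side of the hyperplane w.x + b = 0?
        return 1.0 if np.dot(w, x) + b > 0 else 0.0

    def in_triangle(x):
        # Three half-spaces (three hidden neurons) combined by an AND-like output unit
        h1 = half_space(x, np.array([1.0, 0.0]), 0.0)    # x1 > 0
        h2 = half_space(x, np.array([0.0, 1.0]), 0.0)    # x2 > 0
        h3 = half_space(x, np.array([-1.0, -1.0]), 1.0)  # x1 + x2 < 1
        return float(h1 + h2 + h3 > 2.5)                 # output neuron: all three must fire

    print(in_triangle(np.array([0.2, 0.3])), in_triangle(np.array([0.8, 0.8])))  # 1.0 0.0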


Distributed representations

Might be much more efficient than non-distributed ones


MLP = Universal approximators

One hidden layer is enough!
Theorem [Cybenko, 1989]: Let φ(·) be a nonconstant, bounded, and monotonically increasing continuous function. Let I_m denote the m-dimensional unit hypercube [0, 1]^m, and let C(I_m) denote the space of continuous functions on I_m. Then, given any ε > 0 and any function f ∈ C(I_m), there exist an integer N, real constants v_i, b_i ∈ R and real vectors w_i ∈ R^m, i = 1, ..., N, such that
F(x) = Σ_{i=1}^{N} v_i φ(w_i^T x + b_i)
satisfies |F(x) − f(x)| < ε for all x ∈ I_m. In other words, functions of the form F(x) are dense in C(I_m).
- Existence theorem only
- Many reasons for not getting good results in practice
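The theorem is only an existence result, but a small experiment in its spirit is easy to run: random w_i, b_i with a sigmoidal φ, and the v_i fitted by least squares on [0, 1]. The fitting procedure and all constants are illustrative assumptions, not part of the theorem.

    import numpy as np

    rng = np.random.default_rng(0)
    phi = lambda a: 1.0 / (1.0 + np.exp(-a))           # sigmoidal phi

    x = np.linspace(0.0, 1.0, 200)                     # I_1 = [0, 1]
    f = np.sin(2 * np.pi * x)                          # some f in C(I_1)

    N = 50                                             # number of hidden units
    w = rng.normal(scale=10.0, size=N)                 # w_i (scalars since m = 1)
    b = rng.normal(scale=10.0, size=N)                 # b_i
    H = phi(np.outer(x, w) + b)                        # phi(w_i x + b_i), shape (200, N)
    v, *_ = np.linalg.lstsq(H, f, rcond=None)          # fit the output weights v_i

    print(np.max(np.abs(H @ v - f)))                   # sup-norm error of F(x) = sum_i v_i phi(w_i x + b_i)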


Outline

1 Introduction
2 MLPs
3 Deeper in MLPs
4 DNNs in brief


What are deep models?

NNs with more than one hidden layer! A series of hidden layers.

Computes a complex function of the input:
y = g(W^k × g(W^{k−1} × g(... g(W^1 × x))))

Computes new representations of the input:
h^i(x) = g(W^i × h^{i−1}(x))
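The "new representation per layer" view in one short sketch (same assumptions as the earlier forward-propagation sketch: a list of weight matrices and tanh as g); each element of the returned list is the representation h^i(x).

    import numpy as np

    def representations(x, weights, g=np.tanh):
        # h^0(x) = x ; h^i(x) = g(W^i h^{i-1}(x))
        reps = [x]
        for W in weights:
            reps.append(g(W @ reps[-1]))
        return reps   # reps[i] is the representation of x after layer i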


Machine Learning vs. Deep Learning


Feature hierarchy: from low to high level

What does a feature hierarchy mean?
- Low-level features are shared among categories
- High-level features are more global and more invariant

([Krizhevsky et al., 2012])


Examples of architectures

AlexNet [Krizhevsky et al., 2012] (top) and Network In Network [Lin et al., 2013] (bottom)
