ICSD Summer School 2016
Data Science - Week 4
Gabriella Contardo
LIP6, University Pierre et Marie Curie, Paris, France
August 8, 2016
Outline of the week
Course 1: Reminders of the learning paradigm, neural networks / multi-layer perceptron.
Course 2: Deep learning: Convolutional Neural Networks.
Course 3: Tips on deep learning - Unsupervised learning: clustering (K-means), EM.
Course 4: Unsupervised learning - PCA, matrix factorization and recommender systems.
Course 5: Unsupervised learning with (deep) neural networks: auto-encoders, RNNs. Word embeddings.
References
On-line course material for today. Thanks to:
Patrick Gallinari - Professor at UPMC - course "Apprentissage Statistique" (Statistical Learning)
Fei-Fei Li's course at Stanford, CS231n: Convolutional Neural Networks for Visual Recognition (lecture: introduction to neural nets, backpropagation)
Feel free to read their supplementary notes available on the website: http://cs231n.stanford.edu/syllabus.html
Also interesting (generally speaking):
Machine Learning course by Andrew Ng on Coursera
Lectures by Nando de Freitas (Oxford) - videos available
Outline of the day
Reminders and definitions about the learning problem(s)
Brief history of machine learning (ML) / neural networks
Perceptron → multi-layer perceptron (= neural network)
What is inside?
How do I learn this?
Learning with examples
3 main "components":
− Data {z1, ..., zN}
− Machine or model Fθ
− Criterion C (learning and evaluation)
Goal: extract relevant information from the data, both for the task we study and for other data of the same type.
Use: inference on new data (= examples).
Different learning families: supervised, unsupervised, semi-supervised, reinforcement
Slides from AS course of P. Gallinari
Examples of some learning tasks/problems
Speech / handwriting recognition
Data: (signal, (transcription))
Goal: recognize the signal
Criterion: # of words accurately recognized
Driving an autonomous car
Data: (road images, (steering-wheel commands)) - e.g. S. Thrun, DARPA Challenge, Google car
Goal: stay on the road
Criterion: distance driven
Textual information retrieval
Data: (text + query, (relevant information)) - text corpus
Goal: return information matching the query
Criterion: precision and recall
Slides from AS course of P. Gallinari
Examples of some learning tasks/problems
User modeling
Data: (user activity logs)
Goal: model / analyze the user's behavior
Examples: customer targeting, personalized ads, recommender systems (e.g. movies), personal assistants (e.g. Google Now, Cortana)
Criterion: ? Evaluation: ?
Example: Google Now
Google Now keeps track of searches, calendar events, locations, and travel patterns. It then synthesizes all that info and alerts you—either through notifications in the menu bar or cards on the search screen—of transit alerts for your commute, box scores for your favorite sports team, nearby watering holes, and more. You can assume it will someday suggest a lot more.
Slides from AS course of P. Gallinari
Examples of some learning tasks/problems
More complex:
Translation (side note: novel approaches are promising)
Scene (visual) or text understanding: extracting some "meaning"
Meta-learning, transfer learning, "learning to learn"
Discovery ("curiosity"), e.g. using databases or the web
Data: what information representation? Goal? Criterion? Evaluation?
Slides from AS course of P. Gallinari
Data: diversity
Slides from AS course of P. Gallinari
4 "types" of learning problems
(Machine) learning provides tools to tackle generic problems
Transverse to a wide variety of applications (finance, advertising, computer vision, ...)
4 families of learning: supervised, unsupervised, semi-supervised, reinforcement
Each family handles a particular set of generic problems. An example of such a set:
Supervised: classification, regression, ranking
Slides from AS course of P. Gallinari
Supervised Learning
Training data: a set of couples (input, expected output)
Goal: learn to associate inputs to outputs
We expect the model to generalize well: i.e. good predictions on examples (inputs) outside of the dataset used for learning, but with the same (or a close) data origin
Use: classification, regression, ranking (a minimal sketch follows below)
Slides from AS course of P. Gallinari
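To make the supervised setting concrete, here is a minimal sketch (an illustration, not from the slides) using scikit-learn's fit/predict interface on a toy labeled dataset:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy couples (input, expected output): here the logical AND
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])

model = LogisticRegression()
model.fit(X, y)                      # learning: associate inputs to outputs
print(model.predict([[0.9, 0.8]]))   # inference on a new example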
Unsupervised Learning
Training data: only inputs, no desired outputs
Goal: extract/detect patterns in the data, learn some structure in the data (similarities, underlying factors linking the data, ...)
Use: density estimation, clustering, latent factors, feature learning, generative models, ... (a minimal clustering sketch follows below)
Slides from AS course of P. Gallinari
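As an illustration (clustering itself is covered later in the week), a minimal scikit-learn sketch where only inputs are given and K-means finds structure:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled inputs: two groups around (0, 0) and (5, 5)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])

kmeans = KMeans(n_clusters=2, random_state=0).fit(X)  # no desired outputs
print(kmeans.cluster_centers_)   # discovered structure: the two group centers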
Semi-supervised learning
Training data: a few labeled examples, and a lot of unlabeled ones (= no output)
Goal: extract information (e.g. patterns, cf. unsupervised) from the unlabeled examples to help labeling. Learn jointly from the two sets of examples.
Use: huge datasets where labels are costly
Slides from AS course of P. Gallinari
Reinforcement Learning
Training data:
An environment with states (inputs), actions that take the system from one state to another, and a qualitative desired output
Paradigm: learning by exploring the environment, guided by rewards (e.g. for good predictions). Trade-off between exploration and exploitation.
Use: guidance, sequential decision making, robotics, games with 2 (or more) players, ...
Example: backgammon (TD-Gammon, Tesauro 1992)
Trained on 1.5M games
Plays against itself
More recently: Atari (DeepMind), AlphaGo (mixing deep learning and reinforcement learning)
(A minimal Q-learning sketch follows below.)
Slides from AS course of P. Gallinari
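For illustration only (not from the slides), a minimal tabular Q-learning update with epsilon-greedy action choice, showing the explore/exploit trade-off; the state/action sizes are arbitrary assumptions:

import numpy as np

n_states, n_actions = 10, 2
Q = np.zeros((n_states, n_actions))   # estimated value of (state, action)
alpha, gamma, eps = 0.1, 0.9, 0.1     # learning rate, discount, exploration rate

def choose_action(s, rng):
    # Explore with probability eps, otherwise exploit the current estimate
    if rng.rand() < eps:
        return rng.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    # Move Q(s, a) toward the reward plus the discounted best future value
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])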
Brief history of neural networks
1943: McCulloch & Pitts: artificial neuron, "A logical calculus of the ideas immanent in nervous activity"
1940-45: Wiener (USA), Kolmogorov (USSR), Turing (UK)
1948-50: Von Neumann: cellular automata
1949: Hebb's rule: neuroscience / computer science - adaptation of neurons during learning
Slides from AS course of P. Gallinari
Brief history of neural networks
1955-60:
Rosenblatt: Perceptron
Widrow-Hoff: Adaline
1969: Minsky / AI winter
1970-80: (auto)associative memories, ART, SOM, ...
1980-85:
Non-linear networks: Hopfield networks, Boltzmann machines, multi-layer perceptron
AI winter 2
2006-...:
Deep neural networks, restricted Boltzmann machines, ... Representation learning
Slides from AS course of P. Gallinari
Ok now back to business
We still don't know what a neural network is... How do I build it? What is inside? How does this work?
− "Intuition" with playground.tensorflow.org
− Forward pass and stuff
How do I learn this?
− Gradient descent! Keep calm and backpropagate
Food for thought
playground.tensorflow.org demos (screenshots):
− Perceptron
− A neural net: multi-layer perceptron - still a linear prediction...
− Kernel trick - after a few learning steps...
− Non-linear activation function - after a few learning steps
Other activation functions besides tanh and sigmoid:
Rectified linear units
g(x) = max(0, b + w·x)
Rectifier units can drive activations to exactly 0 (used for sparse representations)
Maxout
g(x) = max_i (b_i + w_i·x)
Generalizes the rectifier unit; there are multiple weight vectors for each unit
Softmax
Used for classification with a 1-out-of-p coding (p classes)
Ensures that the predicted outputs sum to 1
g(x)_k = exp(x_k) / Σ_j exp(x_j)
(Numpy sketches of these functions follow below.)
Slides from AS course of P. Gallinari
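These activations as a minimal numpy sketch (illustrative; z stands for the pre-activation b + w·x, and maxout receives one pre-activation per weight vector):

import numpy as np

def relu(z):
    # Rectifier: element-wise max(0, z); drives activations to exactly 0
    return np.maximum(0.0, z)

def maxout(zs):
    # zs has shape (k, n_units): one pre-activation per weight vector
    return np.max(zs, axis=0)

def softmax(z):
    # Subtract the max for numerical stability; outputs are positive and sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()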
Learning neural networks
Finding the weights W that best match inputs to expected outputs.
Learning: minimizing the cost function E(W, {X, Y})
How? Gradient descent! But how? Backpropagation of the gradient!
− NB: a gradient ∇W E = (∂E/∂w_1, ..., ∂E/∂w_n) points in the direction of steepest increase of E.
Reminder – Gradient algorithm
Goal: optimize a cost function E(w) with parameters w
Principle:
Initialize w
Iterate until convergence: w(t+1) = w(t) + ε(t) D(t)
The descent direction D is computed from local information on the cost function E(w), i.e. a 1st- or 2nd-order approximation
Batch example, with training dataset D = {(x1, y1), ..., (xN, yN)}:
Initialize w0
Iterate: w(t+1) = w(t) − ε ∇w E(w(t))
where E = Σ_{i=1..N} c(xi) is the total cost over the training set
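A minimal numpy sketch of this batch iteration on a least-squares cost (the linear model and the learning rate are illustrative assumptions):

import numpy as np

def batch_gradient_descent(X, y, eps=0.1, n_iters=200):
    # Linear model f(x) = x·w, cost E(w) = 1/N Σ_i (x_i·w − y_i)²
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)  # ∇w E over ALL examples
        w -= eps * grad                          # step against the gradient
    return w

# Toy usage: should recover w ≈ [2, −1]
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
print(batch_gradient_descent(X, X @ np.array([2.0, -1.0])))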
Forward pass (illustration)
(image credit: N. Baskiotis, "ARF" course, UPMC)
Learning Algorithm
Feed an input (or several) forward: compute the predicted output
Compute the error from the cost function E, the desired output and the predicted output
Compute the gradients of all weights (see next slide)
Update the weights according to the gradients
Repeat until a stopping criterion is met (a complete toy example follows below)
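As a concrete (illustrative) instance of these four steps, a complete numpy loop for a tiny network with one tanh hidden layer and a mean squared error; the sizes, seed and learning rate are assumptions:

import numpy as np

rng = np.random.RandomState(0)
W1, b1 = rng.randn(2, 3) * 0.5, np.zeros(3)   # input -> hidden
W2, b2 = rng.randn(3, 1) * 0.5, np.zeros(1)   # hidden -> output
eps = 0.5                                      # learning rate

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[-1.], [1.], [1.], [-1.]])       # desired outputs (XOR-like)

for step in range(5000):
    h = np.tanh(X @ W1 + b1)                   # 1. forward pass
    y = h @ W2 + b2                            #    predicted output
    delta_y = 2 * (y - D) / len(X)             # 2. gradient of E = mean((y − d)²)
    gW2, gb2 = h.T @ delta_y, delta_y.sum(0)   # 3. gradients (chain rule)
    delta_h = (delta_y @ W2.T) * (1 - h ** 2)  #    tanh'(a) = 1 − tanh(a)²
    gW1, gb1 = X.T @ delta_h, delta_h.sum(0)
    W2 -= eps * gW2; b2 -= eps * gb2           # 4. update the weights
    W1 -= eps * gW1; b1 -= eps * gb1

print(np.tanh(X @ W1 + b1) @ W2 + b2)          # predictions should approach D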
Backpropagation
Goal of the gradient: update the weights in the "right" direction (to lower the error):
− Which weight is responsible for the error, and by what "amount"?
Backpropagation principle: the gradient of a layer's weights is computed using the "deltas" from the next layer (chain rule)
Neural Networks - Backpropagation
Update of layer k (diagram in the original slides): layer k has outputs out_k1, ..., out_kn and inputs in_{k−1,1}, ..., in_{k−1,m}.
− Gradient w.r.t. the parameters of layer k: (δout/δparam) · δnext
− Delta backpropagated to input j of the previous layer: (δout/δin_{k−1,j}) · δnext
where δnext is the gradient coming from the next layer.
What does it look like if we compute backpropagation on the previously seen network with an example x_i = [1, −1], y_i = −1 and a mean squared error?
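A worked numeric sketch of that question (the 2-2-1 tanh architecture and the weight values are assumptions, since the slides develop the computation on the board):

import numpy as np

x = np.array([1.0, -1.0]); d = -1.0        # example x_i and desired output y_i
W1 = np.array([[0.5, -0.5],                # assumed input -> hidden weights
               [0.25, 0.75]])
w2 = np.array([0.3, -0.2])                 # assumed hidden -> output weights

# Forward pass
h = np.tanh(x @ W1)                        # hidden activations
y = h @ w2                                 # linear output
E = (y - d) ** 2                           # squared error on this example

# Backward pass (chain rule)
delta_y = 2 * (y - d)                      # δE/δy
g_w2 = h * delta_y                         # (δy/δw2) · δnext
delta_h = w2 * delta_y * (1 - h ** 2)      # backpropagate through tanh
g_W1 = np.outer(x, delta_h)                # (δh/δW1) · δnext

print(E, g_w2, g_W1)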
Neural Networks - Backpropagation
Modularity
No need to compute all the gradients yourself: build the network as layers of modules (OOP).
Distinguish activation-function modules from parametric modules (weight layers).
Basically, a module should have (at least) three functions:
Forward: given an input, predict the output.
Backward: given an input and δnext (the gradient backpropagated from the next module), return the gradient δto_prev to backpropagate to the previous module: δto_prev = (δout/δin) · δnext
Update: given the input and δnext, update the parameters if needed (e.g. not necessary for a "tanh" module, which has no parameters): compute the gradient (δout/δparams) · δnext
Rely on matrix computations.
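A minimal numpy sketch of this module design (class and method names are illustrative; they mirror the Forward/Backward/Update contract above, for a single example):

import numpy as np

class Linear:
    # Parametric module: out = x @ W
    def __init__(self, n_in, n_out, rng):
        self.W = rng.randn(n_in, n_out) * 0.1
    def forward(self, x):
        return x @ self.W
    def backward(self, x, delta_next):
        return delta_next @ self.W.T              # δto_prev = (δout/δin)·δnext
    def update(self, x, delta_next, eps):
        self.W -= eps * np.outer(x, delta_next)   # (δout/δparams)·δnext

class Tanh:
    # Activation module: no parameters, so update does nothing
    def forward(self, x):
        return np.tanh(x)
    def backward(self, x, delta_next):
        return (1 - np.tanh(x) ** 2) * delta_next
    def update(self, x, delta_next, eps):
        pass

A network is then a list of such modules: the forward pass walks the list storing each input, and the backward pass walks it in reverse, chaining the deltas.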
Brain neural cell / Neural network cell
(image credit: Karpathy's CS231n course, Stanford)
Criterion - Loss functions
Different cost functions can be used, depending on the problem or the model:
LMSE: regression (but often used in classification)
Classification: hinge, logistic - these approximate the classification error
Also: cross-entropy / log-likelihood
Notation: d = desired output, y = predicted output. Classification: d ∈ {−1, 1}
Figure from Bishop 2006. Abscissa: z = y·d (desired × predicted output)
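These losses as a minimal numpy sketch (illustrative; the logistic loss is written in its usual log(1 + e^(−z)) form, consistent with the Bishop figure):

import numpy as np

def squared_loss(y, d):
    return (y - d) ** 2

def hinge_loss(y, d):
    # Zero once the margin z = y·d exceeds 1, linear penalty otherwise
    return np.maximum(0.0, 1.0 - y * d)

def logistic_loss(y, d):
    # Smooth, strictly positive approximation of the classification error
    return np.log(1.0 + np.exp(-y * d))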
Misc.
Batch / mini-batch / stochastic gradient descent:
→ Batch: compute the gradient on all examples at each iteration. Good when the loss is convex.
→ Stochastic: compute it on one randomly picked example at each iteration. The added variance can help escape local minima.
→ Mini-batch: compute it on a small subset of examples (a compromise between the two).
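The three variants differ only in how many examples feed each update; a hedged numpy sketch (the gradient function and default sizes are illustrative):

import numpy as np

def gradient_step_loop(grad_fn, w, X, y, eps=0.01, n_iters=1000,
                       batch_size=1, rng=None):
    # batch_size = len(X): batch GD; 1: stochastic; in between: mini-batch
    rng = rng or np.random.RandomState(0)
    for _ in range(n_iters):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        w = w - eps * grad_fn(w, X[idx], y[idx])  # gradient on the sampled subset
    return w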
Back to the playground: training loss vs test loss
→ Generalization / overfitting
→ Regularization
Overfitting (illustration)
Regularization (illustrations)
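A common regularizer is an L2 penalty on the weights ("weight decay"); a minimal sketch of how it changes the cost and the gradient step (the coefficient lam is an illustrative hyperparameter):

import numpy as np

def regularized_cost(data_cost, w, lam=1e-3):
    # E(w) = data cost + lam · ||w||² : discourages large weights
    return data_cost + lam * np.sum(w ** 2)

def regularized_step(w, data_grad, eps=0.1, lam=1e-3):
    # The penalty adds 2·lam·w to the gradient: weights decay toward 0
    return w - eps * (data_grad + 2 * lam * w)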
Neural Networks Architectures (illustrations)
(image credit: Karpathy's CS231n course, Stanford)
Neural Networks
Questions?
How to implement them? Several (recent) libraries provide GPU implementations:
– Torch (Lua)
– Theano (Python)
– Caffe (Python/C++)
– TensorFlow (Python)
– ...
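For illustration, a minimal TensorFlow sketch in the 1.x graph style (the sizes and data are assumptions): the graph only defines the forward pass and the cost, and the library derives the gradients automatically:

import numpy as np
import tensorflow as tf  # TensorFlow 1.x-style API

x = tf.placeholder(tf.float32, [None, 2])
d = tf.placeholder(tf.float32, [None, 1])
W1 = tf.Variable(tf.random_normal([2, 3], stddev=0.1))
W2 = tf.Variable(tf.random_normal([3, 1], stddev=0.1))

y = tf.matmul(tf.tanh(tf.matmul(x, W1)), W2)   # forward pass
loss = tf.reduce_mean(tf.square(y - d))        # mean squared error
train = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]], dtype=np.float32)
    D = np.array([[-1.], [1.], [1.], [-1.]], dtype=np.float32)
    for _ in range(1000):
        sess.run(train, feed_dict={x: X, d: D})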
Exercises
Consider a network with one hidden layer. We note fw(x) the output of the network for input x (w are the parameters of the network).
An input x_i = {x_i_j}_{j=1,...,d}, its label y_i, a training dataset D = {(x_i, y_i)}_{i=1,...,N}.
Weights from the input to the hidden layer are noted w0 = {w_jh}_{j=1,...,d, h=1,...,H}.
Weights from the hidden layer to the output layer are w1 = {w_hk}_{h=1,...,H, k=1,...,K}.
The activation functions of the two layers are noted g1, g2.
Questions
How many neurons are there in the network? How many outputs? Draw the network. Explain why the number of outputs can be greater than 1.
Write the output fw(x) with regard to the components of x and w.
Exercises
Questions
How many neurons are there in the network? How many outputs? Draw the network. Explain why the number of outputs can be greater than 1.
Write the output fw(x) with regard to the components of x and w.
Write the (mean squared) cost w.r.t. the training set D. Write its theoretical formulation (using an expected value).
Exercises
Questions
Write the output fw(x) with regard to the components of x and w:
fw(x)_k = g2(Σ_h w_hk g1(Σ_j w_jh x_j))
Write the (mean squared) cost w.r.t. the training set D. Write its theoretical formulation (using an expected value):
R(fw) = E_{x,y}[(fw(x) − y)²] (≈ 1/N Σ_i (fw(x_i) − y_i)²)
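The same formula as (illustrative) numpy code, with shapes matching the notation above:

import numpy as np

def f_w(x, w0, w1, g1=np.tanh, g2=lambda z: z):
    # w0: (d, H) input -> hidden weights, w1: (H, K) hidden -> output weights
    # fw(x)_k = g2( Σ_h w_hk · g1( Σ_j w_jh · x_j ) )
    return g2(g1(x @ w0) @ w1)

# Toy check: d = 3 inputs, H = 4 hidden units, K = 2 outputs
rng = np.random.RandomState(0)
print(f_w(rng.randn(3), rng.randn(3, 4), rng.randn(4, 2)))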