ICSD Summer School 2016
Data Science - Week 4
Gabriella Contardo
LIP6, University Pierre et Marie Curie, Paris, France
August 8, 2016
Outline of the week
Course 1: Reminders of the learning paradigm, neural networks / multi-layer perceptron.
Course 2: Deep learning: Convolutional Neural Networks.
Course 3: Tips on deep learning - Unsupervised learning: clustering (K-means), EM.
Course 4: Unsupervised learning - PCA, matrix factorization and recommender systems.
Course 5: Unsupervised learning with (deep) neural networks: auto-encoders, RNNs. Word embeddings.
References
On-line course material for today. Thanks to:
Patrick Gallinari - Professor at UPMC - course "Apprentissage Statistique" (Statistical Learning)
Fei-Fei Li's course at Stanford, CS231n: Convolutional Neural Networks for Visual Recognition (lecture: introduction to neural nets, backpropagation)
Feel free to read their supplementary notes available on the website: http://cs231n.stanford.edu/syllabus.html
Also interesting (generally speaking):
Machine Learning course by Andrew Ng on Coursera
Lectures by Nando de Freitas (Oxford) - videos available
Outline of the day
Reminders and definitions about the learning problem(s)
Brief history of machine learning (ML) / neural networks
Perceptron → multi-layer perceptron (= neural network)
What is inside?
How do I learn this?
Learning with examples
3 main "components":
− Data {z1, ..., zN}
− Machine or model Fθ
− Criterion C (learning and evaluation)
Goal: extract relevant information from the data, both for the task we study and for other data of the same type.
Use: inference on new data (= examples).
Different learning families: supervised, unsupervised, semi-supervised, reinforcement
Slides from AS course of P. Gallinari
Examples of some learning tasks/problems
Speech / handwriting recognition
Data: (signal, (transcription))
Goal: recognize the signal
Criterion: # of words accurately recognized
Driving an autonomous car
Data: (road images, (steering-wheel commands)) - e.g. S. Thrun, DARPA Challenge, Google car
Goal: stay on the road
Criterion: distance driven
Textual information retrieval
Data: (text + query, (relevant information)) - text corpus
Goal: return information matching the query
Criterion: precision and recall
Slides from AS course of P. Gallinari
Examples of some learning tasks/problems
User modeling
Data: (user activity logs)
Goal: model / analyze the user's behavior
Examples: customer targeting, personalized ads, recommender systems (e.g. movies), personal assistants (e.g. Google Now, Cortana)
Criterion: ? Evaluation: ?
Example: Google Now
Google Now keeps track of searches, calendar events, locations, and travel patterns. It then synthesizes all that info and alerts you—either through notifications in the menu bar or cards on the search screen—of transit alerts for your commute, box scores for your favorite sports team, nearby watering holes, and more. You can assume it will someday suggest a lot more.
Slides from AS course of P. Gallinari
Examples of some learning tasks/problems
More complex:
Translation (side note: novel approaches are promising)
Scene (visual) or text understanding: extracting some "meaning"
Meta-learning, transfer learning, "learning to learn"
Discovery ("curiosity"), e.g. using databases or the web
Data: what information representation? Goal? Criterion? Evaluation?
Slides from AS course of P. Gallinari
Data: diversity
Slides from AS course of P. Gallinari
4 "types" of learning problems
(Machine) learning provides tools to tackle generic problems
Transverse to a wide variety of applications (finance, advertising, computer vision, ...)
4 families of learning: supervised, unsupervised, semi-supervised, reinforcement
Each family handles a particular set of generic problems. An example of such a set:
Supervised: classification, regression, ranking
Slides from AS course of P. Gallinari
Supervised Learning
Training data: a set of couples (input, expected output)
Goal: learn to associate inputs to outputs
We expect the model to generalize well: i.e. good predictions on examples (inputs) outside of the dataset used for learning, but with the same (or a close) data origin
Use: classification, regression, ranking (a minimal sketch follows below)
Slides from AS course of P. Gallinari
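To make the supervised setting concrete, here is a minimal sketch (an illustration, not from the slides) using scikit-learn's fit/predict interface on a toy labeled dataset:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy couples (input, expected output): here the logical AND
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])

model = LogisticRegression()
model.fit(X, y)                      # learning: associate inputs to outputs
print(model.predict([[0.9, 0.8]]))   # inference on a new example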
Unsupervised Learning
Training data: only inputs, no desired outputs
Goal: extract/detect patterns in the data, learn some structure in the data (similarities, underlying factors linking the data, ...)
Use: density estimation, clustering, latent factors, feature learning, generative models, ... (a minimal clustering sketch follows below)
Slides from AS course of P. Gallinari
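As an illustration (clustering itself is covered later in the week), a minimal scikit-learn sketch where only inputs are given and K-means finds structure:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled inputs: two groups around (0, 0) and (5, 5)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])

kmeans = KMeans(n_clusters=2, random_state=0).fit(X)  # no desired outputs
print(kmeans.cluster_centers_)   # discovered structure: the two group centers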
Semi-supervised learning
Training data: a few labeled examples, and a lot of unlabeled ones (= no output)
Goal: extract information (e.g. patterns, cf. unsupervised) from the unlabeled examples to help labeling. Learn jointly from the two sets of examples.
Use: huge datasets where labels are costly
Slides from AS course of P. Gallinari
Reinforcement Learning
Training data:
An environment with states (inputs), actions that take the system from one state to another, and a qualitative desired output
Paradigm: learning by exploring the environment, guided by rewards (e.g. for good predictions). Trade-off between exploration and exploitation.
Use: guidance, sequential decision making, robotics, games with 2 (or more) players, ...
Example: backgammon (TD-Gammon, Tesauro 1992)
Trained on 1.5M games
Plays against itself
More recently: Atari (DeepMind), AlphaGo (mixing deep learning and reinforcement learning)
(A minimal Q-learning sketch follows below.)
Slides from AS course of P. Gallinari
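For illustration only (not from the slides), a minimal tabular Q-learning update with epsilon-greedy action choice, showing the explore/exploit trade-off; the state/action sizes are arbitrary assumptions:

import numpy as np

n_states, n_actions = 10, 2
Q = np.zeros((n_states, n_actions))   # estimated value of (state, action)
alpha, gamma, eps = 0.1, 0.9, 0.1     # learning rate, discount, exploration rate

def choose_action(s, rng):
    # Explore with probability eps, otherwise exploit the current estimate
    if rng.rand() < eps:
        return rng.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    # Move Q(s, a) toward the reward plus the discounted best future value
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])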
Brief history of neural networks
1943: McCulloch & Pitts: artificial neuron, "A logical calculus of the ideas immanent in nervous activity"
1940-45: Wiener (USA), Kolmogorov (USSR), Turing (UK)
1948-50: Von Neumann: cellular automata
1949: Hebb's rule: neuroscience / computer science - adaptation of neurons during learning
Slides from AS course of P. Gallinari
Brief history of neural networks
1955-60:
Rosenblatt: Perceptron
Widrow-Hoff: Adaline
1969: Minsky / AI winter
1970-80: (auto)associative memories, ART, SOM, ...
1980-85:
Non-linear networks: Hopfield networks, Boltzmann machines, multi-layer perceptron
AI winter 2
2006-...:
Deep neural networks, restricted Boltzmann machines, ... Representation learning
Slides from AS course of P. Gallinari
Ok now back to business
We still don't know what a neural network is... How do I build it? What is inside? How does this work?
− "Intuition" with playground.tensorflow.org
− Forward pass and stuff
How do I learn this?
− Gradient descent! Keep calm and backpropagate
Food for thought
playground.tensorflow.org demos (screenshots):
− Perceptron
− A neural net: multi-layer perceptron - still a linear prediction...
− Kernel trick - after a few learning steps...
− Non-linear activation function - after a few learning steps
Other activation functions besides tanh and sigmoid:
Rectified linear units
g(x) = max(0, b + w·x)
Rectifier units can drive activations to exactly 0 (used for sparse representations)
Maxout
g(x) = max_i (b_i + w_i·x)
Generalizes the rectifier unit; there are multiple weight vectors for each unit
Softmax
Used for classification with a 1-out-of-p coding (p classes)
Ensures that the predicted outputs sum to 1
g(x)_k = exp(x_k) / Σ_j exp(x_j)
(Numpy sketches of these functions follow below.)
Slides from AS course of P. Gallinari
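These activations as a minimal numpy sketch (illustrative; z stands for the pre-activation b + w·x, and maxout receives one pre-activation per weight vector):

import numpy as np

def relu(z):
    # Rectifier: element-wise max(0, z); drives activations to exactly 0
    return np.maximum(0.0, z)

def maxout(zs):
    # zs has shape (k, n_units): one pre-activation per weight vector
    return np.max(zs, axis=0)

def softmax(z):
    # Subtract the max for numerical stability; outputs are positive and sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()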
Learning neural networks
Finding the weights W that best match inputs to expected outputs.
Learning: minimizing the cost function E(W, {X, Y})
How? Gradient descent! But how? Backpropagation of the gradient!
− NB: a gradient ∇W E = (∂E/∂w_1, ..., ∂E/∂w_n) points in the direction of steepest increase of E.
Reminder – Gradient algorithm
Goal: optimize a cost function E(w) with parameters w
Principle:
Initialize w
Iterate until convergence: w(t+1) = w(t) + ε(t) D(t)
The descent direction D is computed from local information on the cost function E(w), i.e. a 1st- or 2nd-order approximation
Batch example, with training dataset D = {(x1, y1), ..., (xN, yN)}:
Initialize w0
Iterate: w(t+1) = w(t) − ε ∇w E(w(t))
where E = Σ_{i=1..N} c(xi) is the total cost over the training set
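A minimal numpy sketch of this batch iteration on a least-squares cost (the linear model and the learning rate are illustrative assumptions):

import numpy as np

def batch_gradient_descent(X, y, eps=0.1, n_iters=200):
    # Linear model f(x) = x·w, cost E(w) = 1/N Σ_i (x_i·w − y_i)²
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)  # ∇w E over ALL examples
        w -= eps * grad                          # step against the gradient
    return w

# Toy usage: should recover w ≈ [2, −1]
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
print(batch_gradient_descent(X, X @ np.array([2.0, -1.0])))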
Forward pass (illustration)
(image credit: N. Baskiotis, "ARF" course, UPMC)
Learning Algorithm
Feed an input (or several) forward: compute the predicted output
Compute the error from the cost function E, the desired output and the predicted output
Compute the gradients of all weights (see next slide)
Update the weights according to the gradients
Repeat until a stopping criterion is met (a complete toy example follows below)
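As a concrete (illustrative) instance of these four steps, a complete numpy loop for a tiny network with one tanh hidden layer and a mean squared error; the sizes, seed and learning rate are assumptions:

import numpy as np

rng = np.random.RandomState(0)
W1, b1 = rng.randn(2, 3) * 0.5, np.zeros(3)   # input -> hidden
W2, b2 = rng.randn(3, 1) * 0.5, np.zeros(1)   # hidden -> output
eps = 0.5                                      # learning rate

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[-1.], [1.], [1.], [-1.]])       # desired outputs (XOR-like)

for step in range(5000):
    h = np.tanh(X @ W1 + b1)                   # 1. forward pass
    y = h @ W2 + b2                            #    predicted output
    delta_y = 2 * (y - D) / len(X)             # 2. gradient of E = mean((y − d)²)
    gW2, gb2 = h.T @ delta_y, delta_y.sum(0)   # 3. gradients (chain rule)
    delta_h = (delta_y @ W2.T) * (1 - h ** 2)  #    tanh'(a) = 1 − tanh(a)²
    gW1, gb1 = X.T @ delta_h, delta_h.sum(0)
    W2 -= eps * gW2; b2 -= eps * gb2           # 4. update the weights
    W1 -= eps * gW1; b1 -= eps * gb1

print(np.tanh(X @ W1 + b1) @ W2 + b2)          # predictions should approach D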
Backpropagation
Goal of the gradient: update the weights in the "right" direction (to lower the error):
− Which weight is responsible for the error, and by what "amount"?
Backpropagation principle: the gradient of a layer's weights is computed using the "deltas" from the next layer (chain rule)
Neural Networks - Backpropagation
Update of layer k (diagram in the original slides): layer k has outputs out_k1, ..., out_kn and inputs in_{k−1,1}, ..., in_{k−1,m}.
− Gradient w.r.t. the parameters of layer k: (δout/δparam) · δnext
− Delta backpropagated to input j of the previous layer: (δout/δin_{k−1,j}) · δnext
where δnext is the gradient coming from the next layer.
What does it look like if we compute backpropagation on the previously seen network with an example x_i = [1, −1], y_i = −1 and a mean squared error?
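A worked numeric sketch of that question (the 2-2-1 tanh architecture and the weight values are assumptions, since the slides develop the computation on the board):

import numpy as np

x = np.array([1.0, -1.0]); d = -1.0        # example x_i and desired output y_i
W1 = np.array([[0.5, -0.5],                # assumed input -> hidden weights
               [0.25, 0.75]])
w2 = np.array([0.3, -0.2])                 # assumed hidden -> output weights

# Forward pass
h = np.tanh(x @ W1)                        # hidden activations
y = h @ w2                                 # linear output
E = (y - d) ** 2                           # squared error on this example

# Backward pass (chain rule)
delta_y = 2 * (y - d)                      # δE/δy
g_w2 = h * delta_y                         # (δy/δw2) · δnext
delta_h = w2 * delta_y * (1 - h ** 2)      # backpropagate through tanh
g_W1 = np.outer(x, delta_h)                # (δh/δW1) · δnext

print(E, g_w2, g_W1)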
Neural Networks - Backpropagation
Modularity
No need to compute all the gradients yourself: build the network as layers of modules (OOP).
Distinguish activation-function modules from parametric modules (weight layers).
Basically, a module should have (at least) three functions:
Forward: given an input, predict the output.
Backward: given an input and δnext (the gradient backpropagated from the next module), return the gradient δto_prev to backpropagate to the previous module: δto_prev = (δout/δin) · δnext
Update: given the input and δnext, update the parameters if needed (e.g. not necessary for a "tanh" module, which has no parameters): compute the gradient (δout/δparams) · δnext
Rely on matrix computations.
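A minimal numpy sketch of this module design (class and method names are illustrative; they mirror the Forward/Backward/Update contract above, for a single example):

import numpy as np

class Linear:
    # Parametric module: out = x @ W
    def __init__(self, n_in, n_out, rng):
        self.W = rng.randn(n_in, n_out) * 0.1
    def forward(self, x):
        return x @ self.W
    def backward(self, x, delta_next):
        return delta_next @ self.W.T              # δto_prev = (δout/δin)·δnext
    def update(self, x, delta_next, eps):
        self.W -= eps * np.outer(x, delta_next)   # (δout/δparams)·δnext

class Tanh:
    # Activation module: no parameters, so update does nothing
    def forward(self, x):
        return np.tanh(x)
    def backward(self, x, delta_next):
        return (1 - np.tanh(x) ** 2) * delta_next
    def update(self, x, delta_next, eps):
        pass

A network is then a list of such modules: the forward pass walks the list storing each input, and the backward pass walks it in reverse, chaining the deltas.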
Brain neural cell / Neural network cell
(image credit: Karpathy's CS231n course, Stanford)
Criterion - Loss functions
Different cost functions can be used, depending on the problem or the model:
LMSE: regression (but often used in classification)
Classification: hinge, logistic - these approximate the classification error
Also: cross-entropy / log-likelihood
Notation: d = desired output, y = predicted output. Classification: d ∈ {−1, 1}
Figure from Bishop 2006. Abscissa: z = y·d (desired × predicted output)
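These losses as a minimal numpy sketch (illustrative; the logistic loss is written in its usual log(1 + e^(−z)) form, consistent with the Bishop figure):

import numpy as np

def squared_loss(y, d):
    return (y - d) ** 2

def hinge_loss(y, d):
    # Zero once the margin z = y·d exceeds 1, linear penalty otherwise
    return np.maximum(0.0, 1.0 - y * d)

def logistic_loss(y, d):
    # Smooth, strictly positive approximation of the classification error
    return np.log(1.0 + np.exp(-y * d))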
Misc.
Batch / mini-batch / stochastic gradient descent:
→ Batch: compute the gradient on all examples at each iteration. Good when the loss is convex.
→ Stochastic: compute it on one randomly picked example at each iteration. The added variance can help escape local minima.
→ Mini-batch: compute it on a small subset of examples (a compromise between the two).
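The three variants differ only in how many examples feed each update; a hedged numpy sketch (the gradient function and default sizes are illustrative):

import numpy as np

def gradient_step_loop(grad_fn, w, X, y, eps=0.01, n_iters=1000,
                       batch_size=1, rng=None):
    # batch_size = len(X): batch GD; 1: stochastic; in between: mini-batch
    rng = rng or np.random.RandomState(0)
    for _ in range(n_iters):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        w = w - eps * grad_fn(w, X[idx], y[idx])  # gradient on the sampled subset
    return w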
Back to the playground: training loss vs test loss
→ Generalization / overfitting
→ Regularization
Overfitting (illustration)
Regularization (illustrations)
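A common regularizer is an L2 penalty on the weights ("weight decay"); a minimal sketch of how it changes the cost and the gradient step (the coefficient lam is an illustrative hyperparameter):

import numpy as np

def regularized_cost(data_cost, w, lam=1e-3):
    # E(w) = data cost + lam · ||w||² : discourages large weights
    return data_cost + lam * np.sum(w ** 2)

def regularized_step(w, data_grad, eps=0.1, lam=1e-3):
    # The penalty adds 2·lam·w to the gradient: weights decay toward 0
    return w - eps * (data_grad + 2 * lam * w)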
Neural Networks Architectures (illustrations)
(image credit: Karpathy's CS231n course, Stanford)
Neural Networks
Questions?
How to implement them? Several (recent) libraries provide GPU implementations:
– Torch (Lua)
– Theano (Python)
– Caffe (Python/C++)
– TensorFlow (Python)
– ...
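For illustration, a minimal TensorFlow sketch in the 1.x graph style (the sizes and data are assumptions): the graph only defines the forward pass and the cost, and the library derives the gradients automatically:

import numpy as np
import tensorflow as tf  # TensorFlow 1.x-style API

x = tf.placeholder(tf.float32, [None, 2])
d = tf.placeholder(tf.float32, [None, 1])
W1 = tf.Variable(tf.random_normal([2, 3], stddev=0.1))
W2 = tf.Variable(tf.random_normal([3, 1], stddev=0.1))

y = tf.matmul(tf.tanh(tf.matmul(x, W1)), W2)   # forward pass
loss = tf.reduce_mean(tf.square(y - d))        # mean squared error
train = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]], dtype=np.float32)
    D = np.array([[-1.], [1.], [1.], [-1.]], dtype=np.float32)
    for _ in range(1000):
        sess.run(train, feed_dict={x: X, d: D})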
Exercises
Consider a network with one hidden layer. We note fw(x) the output of the network for input x (w are the parameters of the network).
An input x_i = {x_i_j}_{j=1,...,d}, its label y_i, a training dataset D = {(x_i, y_i)}_{i=1,...,N}.
Weights from the input to the hidden layer are noted w0 = {w_jh}_{j=1,...,d, h=1,...,H}.
Weights from the hidden layer to the output layer are w1 = {w_hk}_{h=1,...,H, k=1,...,K}.
The activation functions of the two layers are noted g1, g2.
Questions
How many neurons are there in the network? How many outputs? Draw the network. Explain why the number of outputs can be greater than 1.
Write the output fw(x) with regard to the components of x and w.
Exercises
Questions
How many neurons are there in the network? How many outputs? Draw the network. Explain why the number of outputs can be greater than 1.
Write the output fw(x) with regard to the components of x and w.
Write the (mean squared) cost w.r.t. the training set D. Write its theoretical formulation (using an expected value).
Exercises
Questions
Write the output fw(x) with regard to the components of x and w:
fw(x)_k = g2(Σ_h w_hk g1(Σ_j w_jh x_j))
Write the (mean squared) cost w.r.t. the training set D. Write its theoretical formulation (using an expected value):
R(fw) = E_{x,y}[(fw(x) − y)²] (≈ 1/N Σ_i (fw(x_i) − y_i)²)
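The same formula as (illustrative) numpy code, with shapes matching the notation above:

import numpy as np

def f_w(x, w0, w1, g1=np.tanh, g2=lambda z: z):
    # w0: (d, H) input -> hidden weights, w1: (H, K) hidden -> output weights
    # fw(x)_k = g2( Σ_h w_hk · g1( Σ_j w_jh · x_j ) )
    return g2(g1(x @ w0) @ w1)

# Toy check: d = 3 inputs, H = 4 hidden units, K = 2 outputs
rng = np.random.RandomState(0)
print(f_w(rng.randn(3), rng.randn(3, 4), rng.randn(4, 2)))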