A Brief Introduction to Deep Learning and its Application to Vision Recognition
Shangwen Li (Advisor: C.-C. Jay Kuo)
What is Deep Learning?
3 Key Ideas
• Hierarchical Representation
• The structure of the system naturally matches the problem, which is inherently hierarchical.
• Called “Deep Learning” because the hierarchy is “DEEEEEEP”!
• Feature Learning
• Features are learned from data rather than hand-crafted
• End to End Learning
• Features and classifiers are learned jointly using data
Hierarchical Representation
• Traditional object classification (so-called “shallow learning”): Pixels → SIFT/HOG → K-Means → Classifier → “Car”
• Deep learning based: recognizing the object hierarchically, Pixels → Edge → Texture → Pattern → Part → Object
Feature Learning
Pipeline: Input → 1st Layer → 2nd Layer → 3rd Layer → Classifier → “Car”
• Features are extracted hierarchically (i.e., layer by layer)
• Low-level features are shared among categories
• High-level features are more global and more invariant
Zeiler, Matthew D., and Rob Fergus. "Visualizing and Understanding Convolutional Neural Networks." arXiv preprint arXiv:1311.2901 (2013).
End to End Learning
Pipeline: Image → 1st Layer → 2nd Layer → 3rd Layer → Classifier → “Car”
• One end is the image, the other end is the class label
• The layers and the classifier form a single connected network and are trained jointly
• Each layer can be seen as a non-linear transformation of its input
• Goal: learn a non-linear mapping function between image and label (a minimal sketch follows)
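A minimal PyTorch sketch of end-to-end learning (my illustration; PyTorch postdates this talk): the feature layers and the classifier form one connected network, and a single gradient step updates all of them jointly from the loss between the prediction for an image and its class label.

import torch
from torch import nn

model = nn.Sequential(                                             # layers + classifier as one network
    nn.Conv2d(3, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 1st layer
    nn.Conv2d(8, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 2nd layer
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 10),                                     # classifier over 10 classes (e.g. "car")
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 32, 32)        # a toy batch of 32x32 RGB images
labels = torch.randint(0, 10, (8,))       # their class labels

loss = loss_fn(model(images), labels)     # image in, class label out
opt.zero_grad()
loss.backward()                           # gradients flow back through every layer
opt.step()                                # all layers and the classifier are updated jointly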
Why use Deep Learning?
Theoretically (Without Proof!!)
• Simplest Answer: More efficient
• Simpler Answer: More efficient for representing complicated mapping functions
• More Technically: Trade breadth for depth
Practically
• Learning good features
“Shallow” Theory of Deep Learning
• Let us look at a common non-linear mapping: the Kernel Machine
• A Kernel Machine can be considered a two-layer non-linear mapping
• What does deep learning try to learn? A hierarchy of non-linear mappings (K layers)
• Intuitively, deep architectures are more efficient for representing complex functions, though there is no solid proof
If we only study models for which we can prove things, we wouldn't have speech, handwriting, and visual object recognition systems today.
— Yann LeCun
Trade breadth for depth
• “So-called” logic circuit example for calculating N-bit parity
Use multiple layers
• A tree of XOR gates: N−1 XOR gates in a tree of depth log(N) (see the sketch below)
Use two layers?
• Decompose into “AND” and “OR”: a wide layer of AND gates feeding a layer of OR gates
• It has been proved that O(exp(N)) gate elements are needed to achieve this
• Shorter but wider
Bengio, Yoshua. "Learning deep architectures for AI." Foundations and trends® in Machine Learning 2.1 (2009): 1-127.
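A minimal Python sketch (my illustration, not from the talk) of the depth/breadth trade-off: N-bit parity computed with a tree of pairwise XORs of depth about log2(N), using N−1 XOR operations in total.

from functools import reduce

def parity_tree(bits):
    """Reduce the bit list pairwise, layer by layer, like a tree of XOR gates."""
    layer = list(bits)
    while len(layer) > 1:
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            nxt.append(layer[i] ^ layer[i + 1])   # one XOR gate
        if len(layer) % 2 == 1:                   # odd element carried up to the next layer
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

bits = [1, 0, 1, 1, 0, 1, 0, 0]
assert parity_tree(bits) == reduce(lambda a, b: a ^ b, bits)  # same answer as a flat XOR chain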
What are Good Features?
• The Manifold Hypothesis:
• Natural data lives in a low-dimensional (non-linear) manifold
• Because variables in natural data are mutually dependent
Source: Yann LeCun. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16
What are Good Features?
• Example: all face images of a person
• 1000x1000 pixels = 1,000,000 dimensions
• But the face has 3 Cartesian coordinates and 3 Euler angles
• And humans have fewer than about 50 muscles in the face
• Hence the manifold of face images for a person has fewer than 56 dimensions
• The perfect representation of a face image:
• Its coordinates on the face manifold
• Its coordinates away from the manifold
• No general method exists to learn this kind of representation
Source: Yann LeCun. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16
What are Good Features?
• The Ideal Disentangling Feature Extractor
• Deep learning aims at learning this kind of good feature extractor
Source: Yann LeCun. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16
Current Status
• Despite lacking a theoretical foundation, it works much better than traditional algorithms in vision recognition.
• All participants in ILSVRC (the Large Scale Visual Recognition Challenge hosted by ImageNet) adopt deep architectures in their frameworks
Interesting Demo: http://www.clarifai.com/
clarifai Example
Screenshots of the clarifai demo (http://www.clarifai.com/), including one run on a photo of Yann LeCun.
Aside: “Avengers” in Machine Learning
SVM
Deep Belief Net
Convolutional Neural Network
Deep Auto Encoder
Deconvolutional Network
Space of Machine Learning Algorithms
Figure: algorithms laid out along a Deep vs. Shallow axis and a Supervised vs. Unsupervised axis, with probabilistic models marked: RNN, CNN, RBM, GMM, Sparse Coding, Auto Encoder, SVM, Boosting, Decision Tree, Perceptron, DBN, Deep Auto Encoder, Neural Network, Sum-Product Network.
Source: Marc'Aurelio Ranzato. "Deep Learning for Object Category Recognition." Guest Lecture, Stanford, 11 February 2014
Neural Network
Basic Neuron
• Connection weight vector W, bias b, activation function f
• Activation a = h_{W,b}(x) = f(W^T x + b)
Simplest Neural Network (Shallow)
• Only one hidden layer
• The connection weights between layers form a matrix in general
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
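A minimal NumPy sketch of the basic neuron above (my illustration; the notation a = h_{W,b}(x) = f(W^T x + b) follows the UFLDL tutorial), using a sigmoid activation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])      # input
W = np.array([0.5, -0.2, 0.1])     # connection weight vector
b = 0.3                            # bias
a = sigmoid(W @ x + b)             # activation a = h_{W,b}(x)
print(a)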
Multilayer Neural Network
• When we stack several hidden layers together, we obtain a deep architecture
• Goal: train the network (obtain the optimal connection weights that minimize a certain cost function)
Neural Network with Two Hidden Layers
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
Neural Network Training
• Optimize the connection weights with respect to a certain cost function
• For example an MSE cost, J(W, b) = (1/2) Σ_i ||h_{W,b}(x_i) − y_i||²
• What is the first optimization algorithm that comes to mind? Gradient descent
• Is it easy to calculate the derivative of the cost function?
• No, the final output is a hierarchy of non-linear mappings, i.e. a complicated composite function (a small gradient descent sketch follows)
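A minimal sketch (my illustration) of gradient descent, w ← w − η dJ/dw, on a toy one-dimensional cost J(w) = (w − 3)²:

def grad_J(w):
    return 2.0 * (w - 3.0)      # derivative of (w - 3)^2

w, eta = 0.0, 0.1               # initial weight and learning rate
for step in range(100):
    w -= eta * grad_J(w)        # gradient descent update
print(round(w, 4))              # converges toward the minimizer w = 3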
Calculate Gradient
Back Propagation = Use the Chain Rule
Network: x → W1 → h1 → W2 → h2 → W3 → h3; the output h3 (e.g. a predicted “Ship”) is compared with the true label (“Car”) by a loss
• The loss L(W) can be:
• MSE for a regression problem
• Cross-entropy for a classification problem
• Assuming we can calculate the loss gradient with respect to the output h3, the chain rule propagates it backward through W3, W2, and W1 (a NumPy sketch follows the source note)
Source: Marc'Aurelio Ranzato. "Deep Learning for Object Category Recognition" Guest Lecture, Stanford, 11 February 2014
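A minimal NumPy sketch (my illustration, not the talk's code) of back propagation via the chain rule for the three-layer network above, with tanh hidden layers and a softmax + cross-entropy loss:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                   # input column vector
y = np.array([[0.0], [1.0], [0.0]])           # one-hot true label ("Car")
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(5, 5))
W3 = rng.normal(size=(3, 5))

# Forward pass
h1 = np.tanh(W1 @ x)
h2 = np.tanh(W2 @ h1)
logits = W3 @ h2
h3 = np.exp(logits) / np.exp(logits).sum()    # softmax output

# Backward pass: apply the chain rule layer by layer
d_logits = h3 - y                             # dL/d(logits) for softmax + cross-entropy
dW3 = d_logits @ h2.T
d_h2 = W3.T @ d_logits
d_z2 = d_h2 * (1 - h2 ** 2)                   # through tanh: f'(z) = 1 - tanh(z)^2
dW2 = d_z2 @ h1.T
d_h1 = W2.T @ d_z2
d_z1 = d_h1 * (1 - h1 ** 2)
dW1 = d_z1 @ x.T

# One gradient descent step on each weight matrix
lr = 0.1
W1, W2, W3 = W1 - lr * dW1, W2 - lr * dW2, W3 - lr * dW3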
Auto Encoder
• Essentially a neural network
• The network aims at reconstructing its own input with minimum error
• If we force the hidden layer to have fewer units than the input layer, what does this remind you of?
• Compression! This is why it is called an “encoder”
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
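A minimal NumPy sketch (my illustration) of a one-hidden-layer auto encoder: the network is trained to reconstruct its own input, and a hidden layer smaller than the input acts as a learned compression of the data.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))               # 100 inputs of dimension 64
W1 = 0.1 * rng.normal(size=(16, 64))         # encoder: 64 -> 16 (bottleneck)
W2 = 0.1 * rng.normal(size=(64, 16))         # decoder: 16 -> 64
lr = 0.01

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    H = sigmoid(X @ W1.T)                    # hidden code (the "compressed" input)
    X_hat = H @ W2.T                         # linear reconstruction
    err = X_hat - X                          # reconstruction error
    dW2 = err.T @ H / len(X)
    dH = err @ W2 * H * (1 - H)              # back-propagate through the sigmoid
    dW1 = dH.T @ X / len(X)
    W1, W2 = W1 - lr * dW1, W2 - lr * dW2    # minimize mean squared reconstruction error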
Sparse Auto Encoder
• Even if we do not force the hidden layer to be smaller than the input layer, we can still obtain some meaningful structure from the data
• Enforce sparse outputs of the hidden layer units
• Enforce the average activation of each hidden unit to be small, approximately 0.05 (a sketch of this penalty follows the link below)
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
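A minimal sketch of the sparsity penalty (my illustration, following the UFLDL tutorial's formulation): the average activation rho_hat of each hidden unit is pushed toward a small target rho (e.g. 0.05) by adding a KL-divergence term to the reconstruction cost.

import numpy as np

def sparsity_penalty(H, rho=0.05):
    """H: matrix of hidden activations, one row per example, one column per hidden unit."""
    rho_hat = H.mean(axis=0)                              # average activation of each unit
    return np.sum(rho * np.log(rho / rho_hat) +
                  (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# total cost = reconstruction error + beta * sparsity_penalty(H), for some weight beta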
Deep Sparse Auto Encoder
• We can stack several such layers together to train a deep sparse auto encoder
• A layer-wise training method can be used to train the auto encoder
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
Convolutional Neural Network
• Example: a 200x200 image
• Fully connected, 400,000 hidden units = 16 billion parameters
• Locally connected, 400,000 hidden units with 10x10 fields = 40 million parameters
• Local connections capture local dependencies (a quick arithmetic check follows the source note)
Source: Yann LeCun. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16
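A quick arithmetic check (my illustration) of the parameter counts quoted above for a 200x200 image with 400,000 hidden units:

n_inputs, n_hidden = 200 * 200, 400_000
fully_connected = n_inputs * n_hidden             # every unit sees every pixel
locally_connected = n_hidden * (10 * 10)          # every unit sees only a 10x10 patch
print(fully_connected)    # 16000000000  (~16 billion parameters)
print(locally_connected)  # 40000000     (~40 million parameters)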
Convolutional Neural Network
• Stationarity: statistics are similar at different locations
• Features that are useful in one part of the image are probably useful elsewhere
• All units share the same set of weights
• Shift-equivariant processing: when the input shifts, the output also shifts but stays otherwise unchanged
• The filtered “image” is called a feature map (a minimal convolution sketch follows the source note)
Source: Yann LeCun. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16
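A minimal NumPy sketch (my illustration) of weight sharing: one 3x3 filter is slid over the whole image, producing a feature map; the same weights are applied at every location.

import numpy as np

def conv2d_valid(image, kernel):
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)  # same weights everywhere
    return out

image = np.random.default_rng(0).normal(size=(8, 8))
edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])          # a simple vertical-edge detector
feature_map = conv2d_valid(image, edge_filter)   # the feature map, shape (6, 6)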
Convolutional Neural Network
• Detects multiple features at each location
• The collection of units looking at the same patch is akin to a feature vector for that patch
• The result is a 3D array, where each slice is a feature map
Source: Yann LeCun. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16
Convolutional Neural Network - Pooling
• Let us assume the filter is an “eye” detector
• How can we make the detection robust to the exact location of the eye?
• By “pooling” (e.g., taking the max of) filter responses at different locations, we gain robustness to the exact spatial location of features (a small max-pooling sketch follows the source note)
Source: Yann LeCun. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16
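A minimal NumPy sketch (my illustration) of 2x2 max pooling over a feature map: only the strongest response within each 2x2 window is kept, so small shifts of the detected feature leave the pooled output unchanged.

import numpy as np

def max_pool_2x2(fmap):
    H, W = fmap.shape
    H2, W2 = H // 2, W // 2
    return fmap[:H2 * 2, :W2 * 2].reshape(H2, 2, W2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))   # [[ 5.  7.]
                            #  [13. 15.]]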
Deep Convolutional Neural Network
WINNER of ILSVRC2010
Source: http://deeplearning.net/ & Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012.
Three Types of Training Framework
• Purely Supervised
• Initialize parameters randomly
• Train in supervised mode with Stochastic Gradient Descent
• Good when there is lots of labeled data
• Layer-wise Unsupervised + Supervised Classifier
• Train each layer unsupervised, in sequence
• Hold the feature extractor fixed, train a linear classifier on the features
• Good when labeled data is scarce but there is lots of unlabeled data
• Layer-wise Unsupervised + Supervised ALL
• First do layer-wise unsupervised training plus a supervised classifier
• Then retrain the whole framework in supervised mode
• Better performance
Source: Yann LeCun. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16
WARNING: A Bunch of Training Tricks
• Use ReLU non-linearities (tanh and logistic are falling out of favor)
• Use cross-entropy loss for classification
• Use Stochastic Gradient Descent on minibatches
• Shuffle the training samples
• Normalize the input variables (zero mean, unit variance)
• Schedule to decrease the learning rate
• Use a bit of L1 or L2 regularization on the weights (or a combination), but it's best to turn it on after a couple of epochs
• Use “dropout” for regularization (Hinton et al. 2012, http://arxiv.org/abs/1207.0580)
• Lots more in LeCun et al., "Efficient BackProp" (1998)
• Lots, lots more in "Neural Networks: Tricks of the Trade" (2012 edition), edited by G. Montavon, G. B. Orr, and K.-R. Müller (Springer)
A short training-loop sketch using several of these tricks follows the source note.
Source: Yann LeCun. "Deep Learning Tutorial." ICML, Atlanta, 2013-06-16
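A minimal PyTorch training-loop sketch (my illustration; PyTorch postdates this talk) combining several of the tricks above: ReLU, cross-entropy loss, shuffled minibatch SGD, input normalization, a learning-rate schedule, L2 weight decay, and dropout.

import torch
from torch import nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.ReLU(),       # ReLU non-linearity
    nn.Dropout(p=0.5),                        # dropout regularization
    nn.Linear(256, 10),
)
loss_fn = nn.CrossEntropyLoss()               # cross-entropy for classification
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)   # L2 penalty
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)  # decaying learning rate

# Toy data standing in for a real dataset, normalized to zero mean and unit variance.
X = torch.randn(1024, 1, 28, 28)
y = torch.randint(0, 10, (1024,))
X = (X - X.mean()) / X.std()

loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=64, shuffle=True)  # shuffled minibatches

for epoch in range(20):
    for xb, yb in loader:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()                       # back propagation
        opt.step()                            # SGD update
    sched.step()                              # decrease the learning rate on schedule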
WARNING: Non-Convex Optimization
Deep learning optimization is like walking on a ridge between valleys
Source: Marc'Aurelio Ranzato. "Deep Learning for Object Category Recognition" Guest Lecture, Stanford, 11 February 2014
Introduction
Object classification is hard!
67% accuracy on Caltech-101 vs. 36% on Caltech-256
• Performance decreases dramatically as the number of classes increases
• More training data per class helps to increase the performance
Griffin, Gregory, Alex Holub, and Pietro Perona. "Caltech-256 object category dataset." (2007).
Related Work
ILSVRC2010 (Large Scale Visual Recognition Challenge 2010)
• Winner: a research group from the University of Toronto
• Performance: 62.5% recognition accuracy on the classification task over 1000 categories
• Their secret:
• A deep convolutional neural network (CNN)
• The non-linearity of each activation function within the network
• A massive amount of labeled data provided by ImageNet
• Problem: such a massive amount of labeled data seldom exists in the real world!
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." In NIPS, vol. 1, no. 2, p. 4. 2012.
Dataset Problem
• Consider a situation where we want to train a classifier to distinguish between images of cats and dogs
Figures: what the training dataset looks like vs. what real-world data look like
Dataset Problem
• To create a training dataset:
• Retrieve images that contain only dogs and cats
• Manually label each image with “cat” or “dog”
• Time-consuming and laborious
• Labeled training data are a limited but valuable resource
• Can we learn something from unlabeled data and thus boost our performance on object classification?
• This is called Unsupervised Feature Learning
Dataset
STL-10 dataset
• Contains 10 classes of objects: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck
• Also provides a large number of unlabeled images that come from a wide range of other classes
ILSVRC2010 dataset
• Focuses on classifying very detailed classes located at the leaf nodes of the concept tree
• Its huge size (150 GB) goes beyond our handling capability
Methodology
Classification Accuracy and Training Time Comparison
Supervised training
• Convolutional neural network trained using the labeled training dataset (500 images per class)
• Network trained jointly using back propagation
Unsupervised training
• Convolutional filters trained using 100,000 image patches from unlabeled data
• Multinomial logistic regression still trained using the labeled dataset
Theoretically, unsupervised training should have advantages:
• It utilizes more unlabeled data and can thus extract more representative features: better classification
• It avoids back propagation when training the convolutional filter kernels: faster training
Supervised Training CNN Framework
• Train the whole network (convolution filters and the multinomial logistic regression model) jointly using back propagation
• Slow due to the convolution process during back propagation
Pipeline: Image → Convolution Layer → Pooling Layer → Stacked Feature Vector → Multinomial Logistic Regression → “Cat”
Convolution filters and multinomial logistic regression weights are trained jointly
Unsupervised Feature Learning Framework
• First train a sparse auto encoder on randomly sampled image patches (patches → hidden layer → reconstructed patches); the trained filter kernels are used in the following stages
• The learned weights are reused for convolution, so there is no need for back propagation through the convolutional layers
• Only the multinomial logistic regression weights at the end (the “Cat” classifier) still need to be trained
A sketch of reusing auto encoder weights as convolution filters is given below.
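A minimal NumPy sketch (my illustration, not the author's code) of reusing sparse auto encoder weights as convolution filters: each hidden unit's weight vector, learned on small patches, is reshaped into a small filter and convolved with the full image to produce one feature map.

import numpy as np

patch_size = 8
n_hidden = 16
# W_encoder would come from training the sparse auto encoder on 8x8 patches;
# random values stand in for the learned weights here.
W_encoder = np.random.default_rng(0).normal(size=(n_hidden, patch_size * patch_size))

def conv2d_valid(image, kernel):
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(1).normal(size=(96, 96))       # e.g. an STL-10 sized image
feature_maps = [conv2d_valid(image, w.reshape(patch_size, patch_size))
                for w in W_encoder]                           # one feature map per hidden unit
# The feature maps would then be pooled, stacked, and fed to the logistic regression.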
Preliminary Experiment Results
• 2,000 training images and 3,200 test images from 4 classes in the dataset
• 100,000 image patches randomly sampled from unlabeled data

Method         Classification Accuracy    Training Time
Supervised     65.156%                    27,964 s (~8 h)
Unsupervised   80.406%                    5,023 s (~1.5 h)

Figures: trained features and convolved feature maps
Mini Conclusion for my work
• The current good performance of deep learning methods in object classification may be partly due to the large amount of training data
• Unsupervised feature learning is promising for boosting performance on the object classification task, in terms of both classification accuracy and training speed
Conclusion
Conceptually
• What is Deep Learning: learning features in a deep hierarchy
• Why Deep Learning seems good: a cascade of non-linear transformations is more efficient
• Current status: it lacks a theoretical foundation and requires tricky training techniques
Technically, hope you remember
• Back propagation = using the chain rule to calculate gradients
• What an Auto Encoder and a Convolutional Neural Network are