
Page 1: Understanding Convolutional Neural Networks

Understanding Convolutional Neural Networks

Jeremy Nixon

Page 2: Understanding Convolutional Neural Networks

Jeremy Nixon● Machine Learning Engineer at the Spark Technology Center● Contributor to MLlib, dedicated to scalable deep learning

○ Author of Deep Neural Network Regression

● Previously, Applied Mathematics to Computer Science & Economics at Harvard

Page 3: Understanding Convolutional Neural Networks

Structure
1. Introduction / About
2. Motivation
   a. Comparison with major machine learning algorithms
   b. Tasks achieving State of the Art
   c. Applications / Specific Concrete Use Cases
3. The Model / Forward Pass
4. Framing Deep Learning
   a. Automated Feature Engineering
   b. Non-local generalization
   c. Compositionality
      i. Hierarchical Learning
      ii. Exponential Model Flexibility
   d. Learning Representation
      i. Transformation for Linear Separability
      ii. Input Space Contortion
   e. Extreme flexibility allowing benefits to large datasets
5. Optimization / Backward Pass
6. Conclusion

Page 4: Understanding Convolutional Neural Networks

Many Successes of Deep Learning
1. CNNs - State of the Art
   a. Object Recognition
   b. Object Localization
   c. Image Segmentation
   d. Image Restoration
   e. Music Recommendation
2. RNNs (LSTM) - State of the Art
   a. Speech Recognition
   b. Question Answering
   c. Machine Translation
   d. Text Summarization
   e. Named Entity Recognition
   f. Natural Language Generation
   g. Word Sense Disambiguation
   h. Image / Video Captioning
   i. Sentiment Analysis

Page 5: Understanding Convolutional Neural Networks

Ever trained a Linear Regression Model?

Page 6: Understanding Convolutional Neural Networks

Linear Regression Models

Major Downsides:

Cannot discover non-linear structure in data.

Requires manual feature engineering by the data scientist, which is time-consuming and can be infeasible for high-dimensional data.

Page 7: Understanding Convolutional Neural Networks

Decision Tree-Based Model? (Random Forests, Gradient Boosting)

Page 8: Understanding Convolutional Neural Networks

Decision Tree Models

Upside:

Capable of automatically picking up on non-linear structure.

Downsides:

Incapable of generalizing outside of the range of the input data.

Restricted to axis-aligned cut points when modeling relationships.

Thankfully, there’s an algorithmic solution.

Page 9: Understanding Convolutional Neural Networks

Neural Networks

Properties
1. Non-local generalization
2. Learning non-linear structure
3. Automated feature generation

Page 10: Understanding Convolutional Neural Networks

Generalization Outside Data Range

Page 11: Understanding Convolutional Neural Networks

Feedforward Neural Network

X = Normalized Data; W1, W2 = Weights; b1, b2 = Biases

Forward:

1. Multiply data by first layer weights | (X*W1 + b1)
2. Put output through non-linear activation | max(0, X*W1 + b1)
3. Multiply output by second layer weights | max(0, X*W1 + b1) * W2 + b2
4. Return predicted outputs
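
A minimal sketch of this forward pass in NumPy (an assumed choice for illustration; shapes and values are toy):

import numpy as np

# Two-layer forward pass matching the four steps above.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))         # 4 normalized examples, 3 features
W1 = rng.standard_normal((3, 5)) * 0.1  # first-layer weights
b1 = np.zeros(5)
W2 = rng.standard_normal((5, 1)) * 0.1  # second-layer weights
b2 = np.zeros(1)

hidden = np.maximum(0, X @ W1 + b1)     # steps 1-2: affine transform + ReLU
predictions = hidden @ W2 + b2          # step 3: second affine transform
print(predictions.shape)                # step 4: (4, 1) predicted outputs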

Page 12: Understanding Convolutional Neural Networks

The Model / Forward Pass

● Forward
   ○ Convolutional layer
      ■ Procedure + Implementation
      ■ Parameter sharing
      ■ Sparse interactions
      ■ Priors & Assumptions
   ○ Nonlinearity
      ■ ReLU
      ■ Tanh
   ○ Pooling Layer
      ■ Procedure + Implementation
      ■ Extremely strong prior on the image: invariance to small translations.
   ○ Fully Connected + Output Layer
   ○ Putting it All Together

Page 13: Understanding Convolutional Neural Networks

Convolutional Layer

Input Components:

1. Input Image / Feature Map
2. Convolutional Filter / Kernel / Parameters / Weights

Output Component:

1. Computed Output Image / Feature Map
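
A toy sketch of how the output feature map is computed, assuming a single-channel input, one 3x3 filter, stride 1, and no padding (NumPy, for illustration only; CNN libraries compute this same sliding-window operation, just far more efficiently):

import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image and sum element-wise products
    # ("valid" output: no padding, stride 1).
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input feature map
kernel = np.array([[1., 0., -1.],                  # toy 3x3 filter (vertical edge detector)
                   [1., 0., -1.],
                   [1., 0., -1.]])
print(conv2d_valid(image, kernel).shape)           # (3, 3) computed output feature map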

Page 14: Understanding Convolutional Neural Networks

Convolutional Layer

Goodfellow, Bengio, Courville

Page 15: Understanding Convolutional Neural Networks

Convolutional Layer

Leow Wee Kheng

Page 16: Understanding Convolutional Neural Networks

Convolutional Layer

Page 17: Understanding Convolutional Neural Networks

Parameter Sharing

1. Every filter weight is used over the entire input.
   a. This differs strongly from a fully connected network, where each weight corresponds to a single feature.

2. Rather than learning a separate set of parameters for each location, we learn a single set.

3. This dramatically reduces the number of parameters we need to store.
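
A rough illustration of the savings, assuming a 32x32 single-channel input and a single 3x3 filter (the numbers are purely illustrative):

# Fully connected: one weight per (input pixel, output unit) pair,
# with an output map the same size as the input.
input_pixels = 32 * 32
fully_connected_params = input_pixels * input_pixels   # 1,048,576 weights
shared_filter_params = 3 * 3                           # one 3x3 kernel reused at every location
print(fully_connected_params, shared_filter_params)    # 1048576 vs. 9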

Page 18: Understanding Convolutional Neural Networks

Bold Assumptions
1. Convolution can be thought of as a fully connected layer with an infinitely strong prior probability that
   a. The weights for one hidden unit must be identical to the weights of its neighbor (Parameter Sharing)
   b. Weights must be zero except in a small receptive field (Sparse Interactions)
2. Prior assumption of invariance to locality
   a. The assumptions overcome the need for data augmentation with translational shifts
      i. Other useful transformations include rotations, flips, color perturbations, etc.
   b. Equivariant to translation as a result of parameter sharing, but not to rotation or scale (closer in / farther)
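
A one-dimensional sketch of point 1: a single shared 3-tap filter is equivalent to a fully connected layer whose rows repeat the same weights (parameter sharing) and are zero outside each row's small receptive field (sparse interactions). NumPy, toy values:

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
w = np.array([1., 0., -1.])                        # one shared 3-tap filter

W = np.zeros((len(x) - len(w) + 1, len(x)))        # equivalent dense weight matrix
for i in range(W.shape[0]):
    W[i, i:i + len(w)] = w                         # identical weights, shifted per output unit

dense_out = W @ x                                  # fully connected view
conv_out = np.array([w @ x[i:i + len(w)] for i in range(W.shape[0])])  # sliding-window view
print(np.allclose(dense_out, conv_out))            # True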

Page 19: Understanding Convolutional Neural Networks

Sparse Interactions

Strong prior on the locality of information.

Units in deeper layers still end up indirectly connected to most of the input, since receptive fields grow with depth.

Page 20: Understanding Convolutional Neural Networks

Non-Linearities

● Element-wise transformation (applied individually over every element)

ReLU / Tanh
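
A minimal sketch of both activations applied element-wise (NumPy, toy values):

import numpy as np

z = np.array([[-2.0, 0.5], [1.5, -0.1]])   # toy pre-activation map
relu_out = np.maximum(0, z)                # ReLU: clamps negatives to zero
tanh_out = np.tanh(z)                      # Tanh: squashes values into (-1, 1)
print(relu_out, tanh_out, sep="\n")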

Page 21: Understanding Convolutional Neural Networks

Max Pooling

Downsampling.

Takes the max value of regions of the input image or filter map.

Imposes extremely strong prior of invariance to translation.
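
A toy sketch of 2x2 max pooling with stride 2, assuming even input dimensions (NumPy, for illustration):

import numpy as np

def max_pool_2x2(feature_map):
    # Downsample by taking the max over non-overlapping 2x2 regions.
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))   # 2x2 output; small shifts of the input often leave it unchanged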

Page 22: Understanding Convolutional Neural Networks

Mean Pooling

Page 23: Understanding Convolutional Neural Networks

Output Layer

● Output for classification is often a softmax function + cross-entropy loss.
● Output for regression is a single output from a linear (identity) layer with a sum-of-squared-error loss.
● The feature map can be flattened into a vector to transition to a fully connected layer / softmax.
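
A minimal sketch of the classification head: flatten, fully connected layer, softmax, cross-entropy loss (NumPy; the feature values and weight shapes are toy assumptions):

import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(probs, label):
    # Negative log-probability of the true class.
    return -np.log(probs[label])

flattened = np.array([0.2, -1.3, 0.5, 2.0])                  # flattened feature map (toy values)
W = np.random.default_rng(1).standard_normal((4, 3)) * 0.1   # fully connected weights: 4 features -> 3 classes
probs = softmax(flattened @ W)
print(cross_entropy(probs, label=2))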

Page 24: Understanding Convolutional Neural Networks

Putting it All Together

We can construct architectures that combine convolution, pooling, and fully connected layers similar to the examples given here.
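
One possible way to express such an architecture, assuming Keras as the framework purely for illustration (the slides do not prescribe a library); the layer sizes here are arbitrary:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                          # single-channel input image
    layers.Conv2D(32, kernel_size=3, activation='relu'),     # convolutional layer + ReLU
    layers.MaxPooling2D(pool_size=2),                        # 2x2 max pooling
    layers.Flatten(),                                        # flatten feature maps to a vector
    layers.Dense(64, activation='relu'),                     # fully connected layer
    layers.Dense(10, activation='softmax'),                  # softmax output over 10 classes
])
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.summary()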

Page 25: Understanding Convolutional Neural Networks

Framing Deep Learning
1. Automated Feature Engineering
2. Non-local generalization
3. Compositionality
   a. Hierarchical Learning
   b. Exponential Model Flexibility
4. Extreme flexibility opens up benefits to large datasets
5. Learning Representation
   a. Input Space Contortion
   b. Transformation for Linear Separability

Page 26: Understanding Convolutional Neural Networks

Automated Feature Generation

● Pixel - Edges - Shapes - Parts - Objects : Prediction
● Learns features that are optimized for the data

Page 27: Understanding Convolutional Neural Networks

Non-Local Generalization

Page 28: Understanding Convolutional Neural Networks

Hierarchical Learning

● Pixel - Edges - Shapes - Parts - Objects : Prediction

Page 29: Understanding Convolutional Neural Networks

Hierarchical Learning

● Pixel - Edges - Shapes - Parts - Objects : Prediction

Page 30: Understanding Convolutional Neural Networks

Exponential Model Flexibility

● Deep Learning assumes data was generated by a composition of factors or features.

○ DL has been most successful when this assumption holds.

● Exponential gain in the number of relationships that can be efficiently modeled through composition.

Page 31: Understanding Convolutional Neural Networks

Model Flexibility and Dataset Size

Large datasets allow the fitting of extremely wide & deep models, which would have overfit in the past.

A combination of large datasets, large & flexible models, and regularization techniques (dropout, early stopping, weight decay) are responsible for success.

Page 32: Understanding Convolutional Neural Networks

Learning Representation: Transform for Linear Separability

Hidden Layer + Nonlinearity

Chris Olah: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
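
A hand-built sketch of the idea: XOR labels are not linearly separable in the input space, but after one hidden layer with a ReLU nonlinearity (weights chosen by hand here, purely for illustration) a single linear rule separates the classes:

import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 1, 1, 0])                 # XOR labels

W1 = np.array([[1., 1.], [1., 1.]])        # both hidden units sum the two inputs
b1 = np.array([0., -1.])                   # second unit only fires when both inputs are on
hidden = np.maximum(0, X @ W1 + b1)        # hidden representation of each point

scores = hidden @ np.array([-0.5, 1.0])    # a single linear rule in the hidden space
print(hidden)
print((scores < -0.25).astype(int) == y)   # all True: the transformed points are linearly separable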

Page 33: Understanding Convolutional Neural Networks

Backward Pass / Optimization

The goal:

Iteratively improve the filter weights so that they generate correct predictions.

We receive an error signal from the difference between our predictions and the true outcome.

Our weights are adjusted to reduce that difference.

The process of computing the correct adjustment to our weights at each layer is called backpropagation.
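
A minimal sketch of this loop for a single linear layer with squared error (NumPy, toy data); backpropagation applies the same gradient rule layer by layer via the chain rule:

import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((8, 4))        # toy inputs
y = rng.standard_normal((8, 1))        # toy targets
W = rng.standard_normal((4, 1)) * 0.1  # weights to be improved
lr = 0.1                               # learning rate

for step in range(5):
    preds = X @ W                      # forward pass
    error = preds - y                  # error signal: prediction minus true outcome
    loss = np.mean(error ** 2)
    grad_W = 2 * X.T @ error / len(X)  # gradient of the loss with respect to W
    W -= lr * grad_W                   # adjust weights to reduce the difference
    print(step, round(loss, 4))        # loss decreases over the iterations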

Page 34: Understanding Convolutional Neural Networks

Convolutional Neural Networks

State of the Art in:

● Computer Vision Applications
   ○ Autonomous Cars
      ■ Navigation System
      ■ Pedestrian Detection / Localization
      ■ Car Detection / Localization
      ■ Traffic Sign Recognition
   ○ Facial Recognition Systems
   ○ Augmented Reality
      ■ Visual Language Translation
   ○ Character Recognition

Page 35: Understanding Convolutional Neural Networks

Convolutional Neural Networks

State of the Art in:

● Computer Vision Applications
   ○ Video Content Analysis
   ○ Object Counting
   ○ Mobile Mapping
   ○ Gesture Recognition
   ○ Human Facial Emotion Recognition
   ○ Automatic Image Annotation
   ○ Mobile Robots
   ○ Many, many more

Page 36: Understanding Convolutional Neural Networks

References

● CS 231: http://cs231n.github.io/
● Goodfellow, Bengio, Courville: http://www.deeplearningbook.org/
● Detection as DNN Regression: http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf
● Object Localization: http://arxiv.org/pdf/1312.6229v4.pdf
● Pose Regression: https://www.robots.ox.ac.uk/~vgg/publications/2014/Pfister14a/pfister14a.pdf
● Yuhao Yang CNN: https://issues.apache.org/jira/browse/SPARK-9273
● Neural Network Image: http://cs231n.github.io/assets/nn1/neural_net.jpeg
● Zeiler / Fergus: https://arxiv.org/pdf/1311.2901v3.pdf