Deep Learning: A brief explanation
Centre for Digital Music, Queen Mary University of London, UK
1 Introduction
2 Machine Learning
3 Deep learning: Overview, Nonlinearity, Weights, SGD, Training
4 Issues: Overfitting, Batch processing, Back-propagation, Other architectures, ImageNet
5 Summary
Keunwoo Choi
PhD, QMUL, EECS, C4DM, 2014-present
  Supervised by Mark Sandler and George Fazekas
  Music recommendation, (deep) machine learning
  Internship, Naver Labs, July-Oct 2015
  Visiting PhD, New York University, July-Dec 2016
ETRI, 2011-2014
  3D audio (WFS)
Master's, SNU EECS, 2009-2011
  Applied Acoustics Laboratory, 3D audio, music signal processing
Bachelor's, SNU EECS, 2005-2009
Research Topics
Music feature extraction
  Analysis of deep CNNs (ISMIR LBD 2015, MLSP 2016)
  Auto-tagging using deep CNN (ISMIR 2016)
Playlist generation
  RNN-based playlist generation (ICML workshop 2016)
Music captioning
Automatic composition
  Text-based chords and drums (CSMC 2016)
Machine Learning: more correctly, supervised learning
Given a goal
Given data x, y
Train an algorithm that best matches x → y
and validate it using unseen x (good generalisation): "Do not memorise the examples!"
Conventional approach: feature extraction + classifier
  Researchers and experts hand-craft the features
  A classifier (e.g. SVM) is trained to achieve the goal
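As a concrete sketch of this conventional pipeline (numpy and scikit-learn assumed; the random data and the toy extractor stand in for a real dataset and real hand-crafted features such as MFCCs):

    # A minimal sketch of the conventional pipeline: hand-crafted
    # features + SVM. The data is random and stands in for a real set.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_raw = rng.normal(size=(100, 1024))   # 100 raw signals
    y = rng.integers(0, 2, size=100)       # binary labels

    def hand_crafted_features(x):
        # Stand-in for an expert-designed extractor (e.g. MFCCs):
        # here, just magnitude-spectrum summary statistics.
        spectrum = np.abs(np.fft.rfft(x))
        return np.array([spectrum.mean(), spectrum.std(), spectrum.max()])

    X = np.array([hand_crafted_features(x) for x in X_raw])
    clf = SVC().fit(X, y)                  # classifier trained on the features
    print(clf.score(X, y))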
Machine Learning: problems of the conventional approaches
Hand-crafting takes resources
E.g. MFCCs (speech recognition); Histograms of Oriented Gradients, SIFT (computer vision)
Hand-crafting is not automatically optimisable; it is a matter of craftsmanship (Jang-in-jeong-sin, "artisan spirit")
Is an artisan (Jang-in) better than machines?
NN vs. DNN [1]
Logistic regression: No hidden layer
Neural Networks: 1 hidden layer
Deep NN: N hidden layers (N>1)
[1] extremetech.com
DEMO: TensorFlow Playground
Logistic regression: No hidden layer
Neural Networks: 1 hidden layer
Deep NN: N hidden layers (N>1)
Demo cases: logistic regression; logistic regression fails; NN works well!; NN fails; shallow NN is okay; the bigger, the better
DL Overview: a motivation for deep learning
Brain and human sensory system
Neurons are identical
Many (100B) identical neurons with suitable structures
Humans learn from examples
Human sensory systems are deep
Parallel and serial neuron structure
DL Overview: a motivation for deep learning
No need to hand-craft features
The black box includes [feature extraction → classifier]
The whole procedure is computationally optimised to achieve the goal
  by iterative, computation-heavy methods
  and has outperformed many artisans (Jang-in)
Comparison: example task is speech recognition

Method     | Conventional ML                                  | Deep Learning
Feature    | MFCCs (FFT → mel-scale aggregation → DCT →       | FFT → NN
           | time-derivative → ignore first coefficient → ...) |
Classifier | SVM, GMM                                         | NN

Every computation, parameter, and weight is automatically decided during training
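To make the table concrete, a hedged sketch of the two feature paths (librosa is assumed to be installed; 'example.wav' is a placeholder file name):

    # Sketch of the two feature paths in the table above.
    import numpy as np
    import librosa

    y, sr = librosa.load('example.wav', sr=None)  # placeholder audio file

    # Conventional ML: the hand-designed MFCC pipeline
    # (FFT -> mel-scale aggregation -> log -> DCT), then an SVM/GMM.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

    # Deep learning: stop after the FFT and let the network learn the rest.
    spectrogram = np.abs(librosa.stft(y))  # this goes straight into a NN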
DL: Nonlinearity
A single layer performs a nonlinear mapping using σ. Let x = input vector, y = output vector:
  NN: y = σ2(W2σ1(W1x))
DNN: stacked (= deep) layers perform a more nonlinear and complex mapping:
  y = σ6(W6σ5(W5σ4(W4σ3(W3σ2(W2σ1(W1x))))))
Stacked layers = stacked nonlinearity! [2]
Otherwise, multiple linear layers can be compressed into one layer
[2] Best explained in Colah's blog
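A quick numpy check of that last point, as a sketch: without a nonlinearity between them, two stacked linear layers are exactly one linear layer (the weights here are random placeholders):

    # Two stacked linear layers with no nonlinearity collapse into one.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)
    W1 = rng.normal(size=(4, 3))
    W2 = rng.normal(size=(2, 4))

    y_stacked = W2 @ (W1 @ x)                 # two linear layers
    y_single = (W2 @ W1) @ x                  # one equivalent layer
    print(np.allclose(y_stacked, y_single))   # True

    # With a nonlinearity (ReLU), the collapse no longer holds:
    relu = lambda z: np.maximum(z, 0.0)
    y_nonlinear = W2 @ relu(W1 @ x)
    print(np.allclose(y_nonlinear, y_single))  # generally False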
DL: Weights (= parameters)
NN = nonlinear σ() and weights W
For σ(), we use ReLU and its variants
DNN = a combination of ReLU and many Wi's
We want...
  the network to be trained to do all the dirty work, feature extraction and classification (= Wi's that do what we order them to do)
  the network to learn by examples (= find the optimal W using the training data)
How do we train? → SGD
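For reference, a small sketch of ReLU and one common variant, assuming numpy (the leaky slope 0.01 is just a typical choice):

    # ReLU and one of its variants, as used for the nonlinearity sigma().
    import numpy as np

    def relu(z):
        # Passes positives through, zeroes out negatives.
        return np.maximum(z, 0.0)

    def leaky_relu(z, alpha=0.01):
        # A common variant: a small slope instead of a hard zero.
        return np.where(z > 0, z, alpha * z)

    z = np.array([-2.0, -0.5, 0.0, 1.5])
    print(relu(z))        # [0.  0.  0.  1.5]
    print(leaky_relu(z))  # [-0.02  -0.005  0.  1.5]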
Deep Learning: How it learns - by SGD!
SGD: Stochastic Gradient Descent
SGD computationally finds w so that J(w) is minimised
SGD iteratively finds w so that J(w) is minimised
SGD gradually finds w so that J(w) is minimised
w is updated to minimise J(w):
  w ← w - η·∂J(w)/∂w   (η: learning rate)
...if J(w) is differentiable
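A toy numeric sketch of this update rule on a made-up 1-D loss, J(w) = (w - 3)^2, whose minimum is at w = 3:

    # Gradient descent on a toy loss J(w) = (w - 3)^2.
    def dJ_dw(w):
        return 2 * (w - 3)      # derivative of (w - 3)^2

    w = 0.0                     # arbitrary starting point
    eta = 0.1                   # learning rate
    for step in range(100):
        w = w - eta * dJ_dw(w)  # the update rule from the slide
    print(w)                    # converges towards 3.0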
Deep Learning: How it learns - by SGD!
Loss function J(w)
A function that we want to minimise to achieve the goal
y_estimation = σ4(W4σ3(W3σ2(W2σ1(W1x))))
y_true is given in the dataset
E.g. L2: J(w) = (y_estimation - y_true)^2
The loss function measures how well the current algorithm is performing
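A minimal sketch of that L2 loss and its gradient with respect to the estimate, the quantity SGD needs (numpy assumed; the numbers are illustrative):

    # The L2 loss from the slide and its gradient w.r.t. the estimate.
    import numpy as np

    def l2_loss(y_estimation, y_true):
        return np.sum((y_estimation - y_true) ** 2)

    def l2_grad(y_estimation, y_true):
        # d/dy_est of (y_est - y_true)^2 -- the signal backprop starts from.
        return 2 * (y_estimation - y_true)

    y_true = np.array([1.0, 0.0])
    y_estimation = np.array([0.8, 0.3])
    print(l2_loss(y_estimation, y_true))   # 0.13
    print(l2_grad(y_estimation, y_true))   # [-0.4  0.6]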
Deep Learning: How it learns
We have (a set of) x and y_true (aka a dataset)
We decide a loss function
y_estimation = σ4(W4σ3(W3σ2(W2σ1(W1x))))
J(w) = a function of (y_estimation, y_true)
w is updated and becomes better weights
  = training is performed by SGD
  = the DNN is optimised
Deep Learning: The whole learning procedure
Prepare a training dataset (x, y)
Get a DNN configured (number of layers, nodes, loss function)
Then:
    for many times:
        for every x, y:   (do SGD)
            compute y_estimation = f(x, w)   (go through the DNN)
            update W according to the current loss, loss(y_true, y_estimation)
Done!
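The same procedure, expressed as a hedged Keras sketch rather than the slide's own code (tf.keras is assumed; the layer sizes and the random data are placeholders):

    # The whole procedure as a Keras sketch: configure, pick a loss, fit.
    import numpy as np
    import tensorflow as tf

    # Prepare a (random, placeholder) training dataset (x, y).
    x = np.random.normal(size=(1000, 20)).astype('float32')
    y = (x.sum(axis=1) > 0).astype('float32').reshape(-1, 1)

    # Get a DNN configured: layers, nodes, loss function.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='sgd', loss='mse')

    # "for many times, for every x, y": epochs of SGD over the dataset.
    model.fit(x, y, epochs=10, batch_size=32)
    # Done!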
Break!
Q&A
playground.tensorflow.org
Overfitting
When the network memorises the training data and fails to generalise [3]
A general problem in ML
Example: (figure from cs231n omitted)
[3] cs231n, http://cs231n.github.io/neural-networks-3/
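One common guard, sketched here with a held-out validation set and Keras early stopping (continuing the placeholder model, x, and y from the training sketch above):

    # Guarding against overfitting: hold out unseen data and stop
    # training when validation loss stops improving.
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',        # watch performance on unseen data
        patience=3,                # tolerate 3 bad epochs before stopping
        restore_best_weights=True,
    )
    model.fit(x, y, epochs=100, batch_size=32,
              validation_split=0.2,     # 20% held out, never trained on
              callbacks=[early_stop])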
Batch Gradient Descent
Compute GD while looking at more than 1 example simultaneously
Every computation of y_estimation = σ4(W4σ3(W3σ2(W2σ1(W1x)))) is done as matrix computations
Quicker on a GPU (because GPUs are specialised for large matrix computations)
Less zig-zag in the descent path [4]
[4] www.holehouse.org
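A numpy sketch of why a batch maps to a single matrix computation (the shapes are illustrative):

    # Batch processing: a whole minibatch goes through the layer as one
    # matrix multiplication instead of a Python loop over examples.
    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(64, 20))          # one layer's weights
    relu = lambda z: np.maximum(z, 0.0)

    X_batch = rng.normal(size=(32, 20))     # 32 examples at once
    H = relu(X_batch @ W1.T)                # single matmul: shape (32, 64)
    print(H.shape)

    # Equivalent but slower: one example at a time.
    H_loop = np.stack([relu(W1 @ x) for x in X_batch])
    print(np.allclose(H, H_loop))           # True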
Back-propagation (aka backprop) [5]
The essence inside gradient descent for NNs
The way to compute the derivatives of all weights, ∂J(w)/∂w,
so that w can be updated as w - η·∂J(w)/∂w
Popularised by Rumelhart, Hinton, and Williams (1986)
[5] extremetech.com
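A minimal backprop sketch for a two-layer net with the L2 loss from earlier, done by hand with the chain rule (numpy assumed; sizes and data are placeholders):

    # Backprop by hand for a 2-layer net: y_est = W2 relu(W1 x),
    # J = (y_est - y_true)^2. The chain rule gives every dJ/dW.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)
    y_true = np.array([1.0])
    W1 = rng.normal(size=(4, 3))
    W2 = rng.normal(size=(1, 4))

    # Forward pass, keeping intermediates for the backward pass.
    z1 = W1 @ x
    h1 = np.maximum(z1, 0.0)                # ReLU
    y_est = W2 @ h1
    J = np.sum((y_est - y_true) ** 2)

    # Backward pass: apply the chain rule layer by layer.
    dJ_dy = 2 * (y_est - y_true)            # from the L2 loss
    dJ_dW2 = np.outer(dJ_dy, h1)            # dJ/dW2
    dJ_dh1 = W2.T @ dJ_dy                   # push gradient through W2
    dJ_dz1 = dJ_dh1 * (z1 > 0)              # ReLU gate
    dJ_dW1 = np.outer(dJ_dz1, x)            # dJ/dW1

    # One SGD step with the computed derivatives.
    eta = 0.01
    W1 -= eta * dJ_dW1
    W2 -= eta * dJ_dW2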
Other architectures
Convolutional networks
  By LeCun (at Facebook AI Research and NYU)
  Inspired by biological visual systems
  Very widely used in almost every DL problem
Recurrent networks
  For sequences (text) and time-series data (speech, weather, stock prices, ...)
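To give the convolutional idea above some substance, a tiny numpy sketch of one filter sliding over an input (the kernel values are made up):

    # The core of a convolutional layer: one small filter slides over
    # the input, reusing the same weights at every position.
    import numpy as np

    signal = np.array([0., 0., 1., 1., 1., 0., 0.])  # toy 1-D input
    kernel = np.array([1., -1.])                     # a made-up edge detector

    edges = np.convolve(signal, kernel, mode='valid')
    print(edges)   # responds at the rising (+1) and falling (-1) edges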
ImageNet competition [6]
14M images in 1K categories
Has enabled researchers to test new DL algorithms
[6] Slide from NVIDIA
Resources
Deeplearning4j tutorials (Korean)
ML lecture on Coursera (Stanford)
cs231n from Stanford