Deep neural networks
Outline
• What's new in ANNs in the last 5-10 years?
  – Deeper networks, more data, and faster training
    • Scalability and use of GPUs ✔
    • Symbolic differentiation
      – reverse-mode automatic differentiation
      – "Generalized backprop"
    • Some subtle changes to cost function, architectures, optimization methods ✔
• What types of ANNs are most successful and why?
  – Convolutional networks (CNNs)
  – Long term / short term memory networks (LSTM)
  – Word2vec and embeddings
• What are the hot research topics for deep learning?
Recap: multi-layer networks
• Classifier is a multi-layer network of logistic units
• Each unit takes some inputs and produces one output using a logistic classifier
• Output of one unit can be the input of another
[Figure: a network with an input layer (x1, x2, and a constant bias input 1), a hidden layer of logistic units v1 = S(wᵀx) and v2 = S(wᵀx) with weights w0,1, w1,1, w2,1, w0,2, w1,2, w2,2, and an output layer unit z1 = S(wᵀv) with weights w1, w2.]
Recap: Learning a multi-layer network
• Define a loss which is squared error
  – over a network of logistic units
• Minimize loss with gradient descent
  – output is network output
J_X,y(w) = Σ_i (y_i − ŷ_i)²
Recap: weight updates for multi-layer ANN
For nodes k in the output layer:   δ_k ≡ (t_k − a_k) a_k (1 − a_k)
For nodes j in the hidden layer:   δ_j ≡ (Σ_k δ_k w_kj) a_j (1 − a_j)
For all weights:   w_kj = w_kj + ε δ_k a_j    w_ji = w_ji + ε δ_j a_i
"Propagate errors backward": BACKPROP
Can carry this recursion out further if you have multiple hidden layers
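For concreteness, a minimal numpy sketch of one such update, for a single example and a single hidden layer (variable names here are made up; squared-error loss, logistic units as above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hid, W_out, eps=0.1):
    # forward pass: hidden activations a_j, output activations a_k
    a_j = sigmoid(x @ W_hid)                    # hidden layer, shape (H,)
    a_k = sigmoid(a_j @ W_out)                  # output layer, shape (K,)
    # delta_k = (t_k - a_k) a_k (1 - a_k)  for output nodes k
    delta_k = (t - a_k) * a_k * (1 - a_k)
    # delta_j = (sum_k delta_k w_kj) a_j (1 - a_j)  for hidden nodes j
    delta_j = (delta_k @ W_out.T) * a_j * (1 - a_j)
    # gradient-descent weight updates
    W_out += eps * np.outer(a_j, delta_k)       # w_kj += eps * delta_k * a_j
    W_hid += eps * np.outer(x, delta_j)         # w_ji += eps * delta_j * x_i
    return W_hid, W_out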
How can we generalize BackProp to other ANNs?
Generalizing backprop
• Starting point: a function of n variables
• Step 1: code your function as a series of assignments
• Step 2: backpropagate by going thru the list in reverse order, starting with…
• …and using the chain rule
e.g.
[Figure: an example function, its Wengert list (inputs, a series of assignments, return), the forward evaluation (Step 1), and the backprop sweep (Step 2), in which each derivative is a function of values computed in the previous step.]
https://justindomke.wordpress.com/
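As a concrete made-up example, take f(x1, x2) = (x1 + x2) * x2. Coding it as a series of assignments (a Wengert list) and then walking the list in reverse gives:

# Step 1: forward -- the Wengert list for f(x1, x2) = (x1 + x2) * x2
def f_forward(x1, x2):
    z1 = x1 + x2
    z2 = z1 * x2
    return z1, z2

# Step 2: backprop -- walk the list in reverse, starting with dz2/dz2 = 1
def f_backward(x1, x2):
    z1, z2 = f_forward(x1, x2)
    d_z2 = 1.0
    d_z1 = d_z2 * x2          # from z2 = z1 * x2
    d_x2 = d_z2 * z1 + d_z1   # x2 is used in both assignments, so contributions add
    d_x1 = d_z1               # from z1 = x1 + x2
    return d_x1, d_x2

# check: f = x1*x2 + x2^2, so df/dx1 = x2 and df/dx2 = x1 + 2*x2
print(f_backward(3.0, 5.0))   # -> (5.0, 13.0)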
Recap: logistic regression with SGD
Let X be a matrix with k examples Let wi be the input weights for the i-th hidden unit Then Z = X W is output (pre-sigmoid) for all m units for all k examples
[Figure: the weight matrix W with one column per hidden unit (w1 w2 w3 … wm), the data matrix X with one example per row (x1 … xk), and the product XW, whose (i, j) entry is xi·wj — the first row is x1·w1, x1·w2, …, x1·wm and the last is xk·w1, …, xk·wm.]
There are lots of opportunities to do this in parallel… with parallel matrix multiplication
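A numpy sketch of that batched computation (the sizes are made up):

import numpy as np

k, d, m = 4, 3, 5                  # k examples, d features, m hidden units
X = np.random.randn(k, d)          # one example per row: x1 ... xk
W = np.random.randn(d, m)          # one unit's input weights per column: w1 ... wm

Z = X @ W                          # Z[i, j] = xi . wj -- pre-sigmoid output of unit j on example i
A = 1.0 / (1.0 + np.exp(-Z))       # element-wise sigmoid: all k*m outputs in one shot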
Example: 2-layer neural network
Step 1: forward
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)     // matrix mult
Z1b = add*(Z1a, B1)  // add bias vec
A1  = tanh(Z1b)      // element-wise
Z2a = mul(A1, W2)
Z2b = add*(Z2a, B2)
A2  = tanh(Z2b)      // element-wise
P   = softMax(A2)    // vec to vec
C   = crossEntY(P)   // cost function
Step 2: backprop
Target Y; N examples; K outs; D feats, H hidden
X is N*D, W1 is D*H, B1 is 1*H, W2 is H*K, …
Z1a is N*H Z1b is N*H
A1 is N*H
Z2a is N*K Z2b is N*K
A2 is N*K P is N*K
C is a scalar
Example: 2-layer neural network
Step 1: forward
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)     // matrix mult
Z1b = add*(Z1a, B1)  // add bias vec
A1  = tanh(Z1b)      // element-wise
Z2a = mul(A1, W2)
Z2b = add*(Z2a, B2)
A2  = tanh(Z2b)      // element-wise
P   = softMax(A2)    // vec to vec
C   = crossEntY(P)   // cost function
Step 2: backprop
dC/dC = 1
dC/dP = dC/dC * dCrossEntY/dP
dC/dA2 = dC/dP * dsoftmax/dA2
dC/dZ2b = dC/dA2 * dtanh/dZ2b
dC/dZ2a = dC/dZ2b * dadd*/dZ2a  (= dC/dZ2b * 1)
dC/dB2 = dC/dZ2b * dadd*/dB2
dC/dA1 = dC/dZ2a * dmul/dA1
dC/dW2 = dC/dZ2a * dmul/dW2
dC/dZ1b = …
Target Y; N rows; K outs; D feats, H hidden
Shapes: dC/dP is N*K; applying dsoftmax/dA2 is an (N*K) * (K*K) product giving N*K; dC/dA1 is N*H.
dCrossEntY/dP = −(1/N) Σ_i (p_i − y_i) / (y_i (1 − y_i))
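A numpy sketch of the same forward pass and backward sweep, assuming a one-hot target matrix Y. Unlike the slide, it folds dCrossEntY/dP and dsoftmax/dA2 into the single standard expression (P − Y)/N instead of forming the softmax Jacobian:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward_backward(X, Y, W1, B1, W2, B2):
    N = X.shape[0]
    # Step 1: forward (same operations as the list above)
    Z1a = X @ W1                      # N*H
    Z1b = Z1a + B1                    # add bias vec (broadcast over rows)
    A1  = np.tanh(Z1b)
    Z2a = A1 @ W2                     # N*K
    Z2b = Z2a + B2
    A2  = np.tanh(Z2b)
    P   = softmax(A2)                 # N*K
    C   = -np.sum(Y * np.log(P)) / N  # cross-entropy cost
    # Step 2: backprop, walking the list in reverse
    dA2  = (P - Y) / N                # softmax + cross-entropy gradients combined
    dZ2b = dA2 * (1 - A2 ** 2)        # dtanh/dZ2b
    dZ2a = dZ2b                       # add passes the gradient through
    dB2  = dZ2b.sum(axis=0)           # the bias was broadcast over rows, so sum over them
    dA1  = dZ2a @ W2.T                # dmul/dA1
    dW2  = A1.T @ dZ2a                # dmul/dW2
    dZ1b = dA1 * (1 - A1 ** 2)
    dW1  = X.T @ dZ1b
    dB1  = dZ1b.sum(axis=0)
    return C, (dW1, dB1, dW2, dB2)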
Example: 2-layer neural network
Step 1: forward
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)    // matrix mult
Z1b = add(Z1a, B1)  // add bias vec
A1  = tanh(Z1b)     // element-wise
Z2a = mul(A1, W2)
Z2b = add(Z2a, B2)
A2  = tanh(Z2b)     // element-wise
P   = softMax(A2)   // vec to vec
C   = crossEntY(P)  // cost function
Step 2: backprop
dC/dC = 1
dC/dP = dC/dC * dCrossEntY/dP
dC/dA2 = dC/dP * dsoftmax/dA2
dC/dZ2b = dC/dA2 * dtanh/dZ2b
dC/dZ2a = dC/dZ2b * dadd/dZ2a  (= dC/dZ2b * 1)
dC/dB2 = dC/dZ2b * dadd/dB2
dC/dA1 = dC/dZ2a * dmul/dA1
dC/dW2 = dC/dZ2a * dmul/dW2
dC/dZ1b = …
Target Y; N rows; K outs; D feats, H hidden
dCrossEntY/dP = −(1/N) Σ_i (p_i − y_i) / (y_i (1 − y_i))   [dC/dP is N*K; dC/dA1 is N*H]
Need a backward form for each matrix operation used in forward
Example: 2-layer neural network
Step 1: forward
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)    // matrix mult
Z1b = add(Z1a, B1)  // add bias vec
A1  = tanh(Z1b)     // element-wise
Z2a = mul(A1, W2)
Z2b = add(Z2a, B2)
A2  = tanh(Z2b)     // element-wise
P   = softMax(A2)   // vec to vec
C   = crossEntY(P)  // cost function
Step 2: backprop
dC/dC = 1
dC/dP = dC/dC * dCrossEntY/dP
dC/dA2 = dC/dP * dsoftmax/dA2
dC/dZ2b = dC/dA2 * dtanh/dZ2b
dC/dZ2a = dC/dZ2b * dadd/dZ2a  (= dC/dZ2b * 1)
dC/dB2 = dC/dZ2b * dadd/dB2
dC/dA1 = dC/dZ2a * dmul/dA1
dC/dW2 = dC/dZ2a * dmul/dW2
dC/dZ1b = …
Target Y; N rows; K outs; D feats, H hidden
Need a backward form for each matrix operation used in forward, with respect to each argument
Example: 2-layer neural network
Step 1: forward
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)    // matrix mult
Z1b = add(Z1a, B1)  // add bias vec
A1  = tanh(Z1b)     // element-wise
Z2a = mul(A1, W2)
Z2b = add(Z2a, B2)
A2  = tanh(Z2b)     // element-wise
P   = softMax(A2)   // vec to vec
C   = crossEntY(P)  // cost function
Target Y; N rows; K outs; D feats, H hidden
Need a backward form for each matrix operation used in forward, with respect to each argument
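For example, the mul and broadcast-add (add*) operations might expose one forward function and one backward function per argument, roughly like this (a sketch, not any particular library's API):

import numpy as np

def mul_forward(A, W):               # forward implementation
    return A @ W

def mul_backward_A(dOut, A, W):      # backward implementation w.r.t. the first argument
    return dOut @ W.T

def mul_backward_W(dOut, A, W):      # backward implementation w.r.t. the second argument
    return A.T @ dOut

def addstar_forward(Z, B):           # add a bias row vector to every row of Z
    return Z + B

def addstar_backward_Z(dOut, Z, B):
    return dOut

def addstar_backward_B(dOut, Z, B):
    return dOut.sum(axis=0)          # B was broadcast over rows, so sum their gradients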
An autodiff package usually includes
• A collection of matrix-oriented operations (mul, add*, …)
• For each operation
  • A forward implementation
  • A backward implementation for each argument
• It's still a little awkward to program with a list of assignments so…
Generalizing backprop
• Starting point: a function of n variables
• Step 1: code your function as a series of assignments
[Figure: the same example Wengert list (inputs, assignments, return) and its forward evaluation (Step 1).]
https://justindomke.wordpress.com/
• Better plan: overload your matrix operators so that when you use them in-line they build an expression graph
• Convert the expression graph to a Wengert list when necessary
Example: Theano
[Figure: Theano code that generates some training data and defines a small logistic-regression model with predicted probability p_1.]
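The code in the figure is essentially the standard Theano logistic-regression tutorial; a sketch along those lines (the sizes and the simple cost are choices made here, not taken from the slide):

import numpy as np
import theano
import theano.tensor as T

rng = np.random
N, D = 400, 784
# generate some training data
data_x = rng.randn(N, D)
data_y = rng.randint(low=0, high=2, size=N).astype("float64")

x = T.dmatrix("x")
y = T.dvector("y")
w = theano.shared(rng.randn(D), name="w")
b = theano.shared(0., name="b")

p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))             # predicted probability of class 1
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)   # per-example cross-entropy
cost = xent.mean()
gw, gb = T.grad(cost, [w, b])                       # gradients come from autodiff, not hand-coding

train = theano.function(inputs=[x, y], outputs=cost,
                        updates=[(w, w - 0.1 * gw), (b, b - 0.1 * gb)])
for step in range(100):                             # plain SGD loop
    train(data_x, data_y)

Note that p_1, xent, and cost are symbolic expressions: the in-line operators build an expression graph, and theano.function compiles it (and its gradients) into runnable code.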
Example: 2-layer neural network
Step 1: forward
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)    // matrix mult
Z1b = add(Z1a, B1)  // add bias vec
A1  = tanh(Z1b)     // element-wise
Z2a = mul(A1, W2)
Z2b = add(Z2a, B2)
A2  = tanh(Z2b)     // element-wise
P   = softMax(A2)   // vec to vec
C   = crossEntY(P)  // cost function
An autodiff package usually includes
• A collection of matrix-oriented operations (mul, add*, …)
• For each operation
  • A forward implementation
  • A backward implementation for each argument
• A way of composing operations into expressions (often using operator overloading) which evaluate to expression trees
• Expression simplification/compilation
Some tools that use autodiff
• Theano:
  – Univ Montreal system, Python-based, first version 2007, now v0.8, integrated with GPUs
  – Many libraries build over Theano (Lasagne, Keras, Blocks, …)
• Torch:
  – Collobert et al, used heavily at Facebook, based on Lua. Similar performance-wise to Theano
• TensorFlow:
  – Google system, possibly not well-tuned for the GPU systems used by academics
  – Also supported by Keras
• Autograd
  – New system which is very Pythonic, doesn't support GPUs (yet)
• …
Outline
• What's new in ANNs in the last 5-10 years?
  – Deeper networks, more data, and faster training
    • Scalability and use of GPUs ✔
    • Symbolic differentiation ✔
    • Some subtle changes to cost function, architectures, optimization methods ✔
• What types of ANNs are most successful and why?
  – Convolutional networks (CNNs)
  – Long term / short term memory networks (LSTM)
  – Word2vec and embeddings
• What are the hot research topics for deep learning?
WHAT'S A CONVOLUTIONAL NEURAL NETWORK?
Model of vision in animals
Vision with ANNs
What's a convolution?
1-D
https://en.wikipedia.org/wiki/Convolution
What's a convolution?
http://matlabtricks.com/post-5/3x3-convolution-kernels-with-online-demo
What's a convolution?
• Basic idea:
  – Pick a 3x3 matrix F of weights
  – Slide this over an image and compute the "inner product" (similarity) of F and the corresponding field of the image, and replace the pixel in the center of the field with the output of the inner product operation
• Key point:
  – Different convolutions extract different types of low-level "features" from an image
  – All that we need to vary to generate these different features is the weights of F
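A direct numpy sketch of that sliding inner product (no padding, stride 1; strictly this is cross-correlation, since CNNs usually don't bother flipping the kernel):

import numpy as np

def convolve2d(image, F):
    # slide the 3x3 weight matrix F over the image; each output pixel is the
    # inner product of F with the 3x3 field centered there
    H, W = image.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(F * image[i:i+3, j:j+3])
    return out

img = np.random.rand(28, 28)                 # stand-in for a grayscale image
F_vertical_edges = np.array([[-1, 0, 1],     # one choice of weights: a vertical edge detector
                             [-1, 0, 1],
                             [-1, 0, 1]])
features = convolve2d(img, F_vertical_edges)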
How do we convolve an image with an ANN?
Note that the parameters in the matrix defining the convolution are tied across all places that it is used
How do we do many convolutions of an image with an ANN?
Example: 6 convolutions of a digit
http://scs.ryerson.ca/~aharley/vis/conv/
CNNs typically alternate convolutions, non-linearity, and then downsampling
Downsampling is usually averaging or (more common in recent CNNs) max-pooling
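For example, 2x2 max-pooling of a single feature map can be written as (a numpy sketch):

import numpy as np

def max_pool_2x2(fmap):
    # downsample by taking the max over each non-overlapping 2x2 block
    H, W = fmap.shape
    fmap = fmap[:H - H % 2, :W - W % 2]      # trim odd edges
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))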
Why do max-pooling?
• Saves space
• Reduces overfitting?
• Because I'm going to add more convolutions after it!
  – Allows the short-range convolutions to extend over larger subfields of the images
    • So we can spot larger objects
    • E.g., a long horizontal line, or a corner, or …
Another CNN visualization
https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html
Why do max-pooling?
• Saves space
• Reduces overfitting?
• Because I'm going to add more convolutions after it!
  – Allows the short-range convolutions to extend over larger subfields of the images
    • So we can spot larger objects
    • E.g., a long horizontal line, or a corner, or …
• At some point the feature maps start to get very sparse and blobby – they are indicators of some semantic property, not a recognizable transformation of the image
• Then just use them as features in a "normal" ANN
Alternating convolution and downsampling
5 layers up
The subfield in a large dataset that gives the strongest output for a neuron
Outline
• What's new in ANNs in the last 5-10 years?
  – Deeper networks, more data, and faster training
    • Scalability and use of GPUs ✔
    • Symbolic differentiation ✔
    • Some subtle changes to cost function, architectures, optimization methods ✔
• What types of ANNs are most successful and why?
  – Convolutional networks (CNNs) ✔
  – Long term / short term memory networks (LSTM)
  – Word2vec and embeddings
• What are the hot research topics for deep learning?
RECURRENT NEURAL NETWORKS
Motivation: what about sequence prediction?
What can I do when input size and output size vary?
Motivation: what about sequence prediction?
[Figure: a recurrent unit A maps an input x to an output h.]
Architecture for an RNN
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[Figure: an unrolled RNN. A sequence of inputs, bracketed by a start-of-sequence marker and an end-of-sequence marker, produces a sequence of outputs; some information is passed from one subunit to the next.]
Architecture for a 1980's RNN
…
Problem with this: it’s extremely deep and very hard to train
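A sketch of such a recurrent unit unrolled over a sequence (shapes are made up) makes the depth problem visible: gradients have to flow back through one tanh per time step.

import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    # a plain 1980s-style RNN: the same weights are applied at every time step,
    # and the hidden state h carries information from one subunit to the next
    h, ys = h0, []
    for x in xs:                              # one input vector per time step
        h = np.tanh(W_xh @ x + W_hh @ h)      # new hidden state
        ys.append(W_hy @ h)                   # output at this step
    return ys, h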
Architecture for an LSTM
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long term / short term model
σ: output in [0,1]; tanh: output in [-1,+1]
[Figure annotations: decide what to forget; decide what to insert; combine with transformed x_t; the cell state holds "bits of memory".]
Walkthrough
What part of memory to "forget" – zero means forget this bit
Walkthrough
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
What bits to insert into the next states
What content to store into the next state
Walkthrough
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Next memory cell content – mixture of not-forgotten part of previous cell and insertion
Walkthrough
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
What part of cell to output
tanh maps bits to [-1,+1] range
Architecture for an LSTM
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[Figure: the LSTM cell annotated with the forget gate f_t, input gate i_t, output gate o_t, previous cell state C_{t-1}, and equations (1)–(3).]
Implementing an LSTM
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[Figure: the LSTM gate equations (1)–(3).]
For t = 1,…,T:
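Written out, each step applies the standard gate equations from the cited post (a numpy sketch; the parameter names and the [h, x] packing are choices of this sketch):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)              # forget gate: what to keep from C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)              # input gate: what to insert
    C_tilde = np.tanh(W_C @ z + b_C)          # candidate content to store
    C_t = f_t * C_prev + i_t * C_tilde        # new "bits of memory"
    o_t = sigmoid(W_o @ z + b_o)              # output gate: what part of the cell to output
    h_t = o_t * np.tanh(C_t)                  # tanh maps the cell back to [-1,+1]
    return h_t, C_t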
SOME FUN LSTM EXAMPLES
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
LSTMs can be used for other sequence tasks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
sequence classification, translation, named entity recognition, image captioning
Character-level language model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Test time:
• pick a seed character sequence
• generate the next character
• then the next
• then the next …
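A sketch of that test-time loop, assuming a trained recurrent step function (here a closure step(x, h, C) over the learned parameters), an output projection W_hy, b_y, and char/index vocabularies; all of these names are hypothetical:

import numpy as np

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def sample(seed_text, n_chars, step, h, C, char_to_ix, ix_to_char, W_hy, b_y):
    out = list(seed_text)
    for ch in seed_text:                        # warm up on the seed character sequence
        h, C = step(one_hot(char_to_ix[ch], len(char_to_ix)), h, C)
    for _ in range(n_chars):
        scores = W_hy @ h + b_y                 # softmax over the vocabulary
        p = np.exp(scores - scores.max())
        p /= p.sum()
        ix = np.random.choice(len(p), p=p)      # sample the next character
        out.append(ix_to_char[ix])
        h, C = step(one_hot(ix, len(p)), h, C)  # feed it back in, then repeat
    return "".join(out)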
Character-level language model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Character-level language model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Yoav Goldberg: order-10 unsmoothed character n-grams
Character-level language model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Character-level language model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
LaTeX “almost compiles”
Character-level language model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Outline
• What's new in ANNs in the last 5-10 years?
  – Deeper networks, more data, and faster training
    • Scalability and use of GPUs ✔
    • Symbolic differentiation ✔
    • Some subtle changes to cost function, architectures, optimization methods ✔
• What types of ANNs are most successful and why?
  – Convolutional networks (CNNs) ✔
  – Long term / short term memory networks (LSTM)
  – Word2vec and embeddings
• What are the hot research topics for deep learning?
WORD2VEC AND WORD EMBEDDINGS
Basic idea behind skip-gram embeddings
From an input word w(t) in a document, construct a hidden layer that "encodes" that word, so that the hidden layer will predict likely nearby words w(t-K), …, w(t+K). The final step of this prediction is a softmax over lots of outputs.
Basic idea behind skip-gram embeddings
Training data: positive examples are pairs of words w(t), w(t+j) that co-occur
Training data: negative examples are samples of pairs of words w(t), w(t+j) that don’t co-occur
You want to train over a very large corpus (100M words+) and hundreds+ dimensions
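One SGD step of this scheme (skip-gram with negative sampling) can be sketched as follows; W_in holds the hidden-layer embeddings, W_out the output vectors, and all names are made up:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_step(center, context, negatives, W_in, W_out, lr=0.025):
    # (center, context) is a positive pair; negatives are word ids that did not co-occur
    v = W_in[center]                              # embedding ("hidden layer") of the input word
    dv = np.zeros_like(v)
    for w, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        p = sigmoid(v @ W_out[w])                 # predicted probability that the pair co-occurs
        g = p - label                             # gradient of the logistic loss w.r.t. the score
        dv += g * W_out[w]
        W_out[w] -= lr * g * v
    W_in[center] -= lr * dv

In word2vec itself the negative words are drawn from a smoothed unigram distribution over the corpus.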
Results from word2vec
https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
Results from word2vec
https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
Results from word2vec
https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
Outline
• What's new in ANNs in the last 5-10 years?
  – Deeper networks, more data, and faster training
    • Scalability and use of GPUs ✔
    • Symbolic differentiation ✔
    • Some subtle changes to cost function, architectures, optimization methods ✔
• What types of ANNs are most successful and why?
  – Convolutional networks (CNNs) ✔
  – Long term / short term memory networks (LSTM)
  – Word2vec and embeddings
• What are the hot research topics for deep learning?
Some current hot topics
• Multi-task learning
  – Does it help to learn to predict many things at once? e.g., POS tags and NER tags in a word sequence?
  – Similar to word2vec learning to produce all context words
• Extensions of LSTMs that model memory more generally
  – e.g. for question answering about a story
Some current hot topics
• Optimization methods (>> SGD)
• Neural models that include "attention"
  – Ability to "explain" a decision
Examples of attention
https://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
Examples of attention
https://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
Basic idea: similarly to the way an LSTM chooses what to “forget” and “insert” into memory, allow a network to choose a path to focus on in the visual field
Some current hot topics
• Knowledge-base embedding: extending word2vec to embed large databases of facts about the world into a low-dimensional space.
  – TransE, TransR, …
• "NLP from scratch": sequence-labeling and other NLP tasks with a minimal amount of feature engineering, only networks and character- or word-level embeddings
Some current hot topics
• Computer vision: complex tasks like generating a natural language caption from an image or understanding a video clip
• Machine translation
  – English to Spanish, …
• Using neural networks to perform tasks
  – Driving a car
  – Playing games (like Go or …)