Deep neural networks
Outline
• What's new in ANNs in the last 5-10 years?
  – Deeper networks, more data, and faster training
    • Scalability and use of GPUs ✔
    • Symbolic differentiation
      – reverse-mode automatic differentiation
      – "Generalized backprop"
    • Some subtle changes to cost function, architectures, optimization methods ✔
• What types of ANNs are most successful and why?
  – Convolutional networks (CNNs)
  – Long term / short term memory networks (LSTM)
  – Word2vec and embeddings
• What are the hot research topics for deep learning?
Recap: multi-layer networks
• Classifier is a multi-layer network of logistic units
• Each unit takes some inputs and produces one output using a logistic classifier
• Output of one unit can be the input of another
[Figure: a network with an input layer (x1, x2, and a constant bias input 1), a hidden layer of logistic units v1 = S(wᵀx) and v2 = S(wᵀx) with weights w0,1, w1,1, w2,1, w0,2, w1,2, w2,2, and an output layer unit z1 = S(wᵀv) with weights w1, w2.]
Recap: Learning a multi-layer network
• Define a loss which is squared error
  – over a network of logistic units
• Minimize loss with gradient descent
  – output is network output
J_X,y(w) = Σ_i (y_i − ŷ_i)²
Recap: weight updates for multi-layer ANN
For nodes k in the output layer:   δ_k ≡ (t_k − a_k) a_k (1 − a_k)
For nodes j in the hidden layer:   δ_j ≡ (Σ_k δ_k w_kj) a_j (1 − a_j)
For all weights:   w_kj = w_kj + ε δ_k a_j    w_ji = w_ji + ε δ_j a_i
"Propagate errors backward": BACKPROP
Can carry this recursion out further if you have multiple hidden layers
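For concreteness, a minimal numpy sketch of one such update, for a single example and a single hidden layer (variable names here are made up; squared-error loss, logistic units as above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hid, W_out, eps=0.1):
    # forward pass: hidden activations a_j, output activations a_k
    a_j = sigmoid(x @ W_hid)                    # hidden layer, shape (H,)
    a_k = sigmoid(a_j @ W_out)                  # output layer, shape (K,)
    # delta_k = (t_k - a_k) a_k (1 - a_k)  for output nodes k
    delta_k = (t - a_k) * a_k * (1 - a_k)
    # delta_j = (sum_k delta_k w_kj) a_j (1 - a_j)  for hidden nodes j
    delta_j = (delta_k @ W_out.T) * a_j * (1 - a_j)
    # gradient-descent weight updates
    W_out += eps * np.outer(a_j, delta_k)       # w_kj += eps * delta_k * a_j
    W_hid += eps * np.outer(x, delta_j)         # w_ji += eps * delta_j * x_i
    return W_hid, W_out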
How can we generalize BackProp to other ANNs?
Generalizing backprop
• Starting point: a function of n variables
• Step 1: code your function as a series of assignments
• Step 2: backpropagate by going thru the list in reverse order, starting with…
• …and using the chain rule
e.g.
[Figure: an example function, its Wengert list (inputs, a series of assignments, return), the forward evaluation (Step 1), and the backprop sweep (Step 2), in which each derivative is a function of values computed in the previous step.]
https://justindomke.wordpress.com/
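As a concrete made-up example, take f(x1, x2) = (x1 + x2) * x2. Coding it as a series of assignments (a Wengert list) and then walking the list in reverse gives:

# Step 1: forward -- the Wengert list for f(x1, x2) = (x1 + x2) * x2
def f_forward(x1, x2):
    z1 = x1 + x2
    z2 = z1 * x2
    return z1, z2

# Step 2: backprop -- walk the list in reverse, starting with dz2/dz2 = 1
def f_backward(x1, x2):
    z1, z2 = f_forward(x1, x2)
    d_z2 = 1.0
    d_z1 = d_z2 * x2          # from z2 = z1 * x2
    d_x2 = d_z2 * z1 + d_z1   # x2 is used in both assignments, so contributions add
    d_x1 = d_z1               # from z1 = x1 + x2
    return d_x1, d_x2

# check: f = x1*x2 + x2^2, so df/dx1 = x2 and df/dx2 = x1 + 2*x2
print(f_backward(3.0, 5.0))   # -> (5.0, 13.0)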
Recap: logistic regression with SGD
Let X be a matrix with k examples Let wi be the input weights for the i-th hidden unit Then Z = X W is output (pre-sigmoid) for all m units for all k examples
[Figure: the weight matrix W with one column per hidden unit (w1 w2 w3 … wm), the data matrix X with one example per row (x1 … xk), and the product XW, whose (i, j) entry is xi·wj — the first row is x1·w1, x1·w2, …, x1·wm and the last is xk·w1, …, xk·wm.]
There are lots of opportunities to do this in parallel… with parallel matrix multiplication
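A numpy sketch of that batched computation (the sizes are made up):

import numpy as np

k, d, m = 4, 3, 5                  # k examples, d features, m hidden units
X = np.random.randn(k, d)          # one example per row: x1 ... xk
W = np.random.randn(d, m)          # one unit's input weights per column: w1 ... wm

Z = X @ W                          # Z[i, j] = xi . wj -- pre-sigmoid output of unit j on example i
A = 1.0 / (1.0 + np.exp(-Z))       # element-wise sigmoid: all k*m outputs in one shot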
Example: 2-layer neural network
Step 1: forward
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)     // matrix mult
Z1b = add*(Z1a, B1)  // add bias vec
A1  = tanh(Z1b)      // element-wise
Z2a = mul(A1, W2)
Z2b = add*(Z2a, B2)
A2  = tanh(Z2b)      // element-wise
P   = softMax(A2)    // vec to vec
C   = crossEntY(P)   // cost function
Step 2: backprop
Target Y; N examples; K outs; D feats, H hidden
X is N*D, W1 is D*H, B1 is 1*H, W2 is H*K, …
Z1a is N*H Z1b is N*H
A1 is N*H
Z2a is N*K Z2b is N*K
A2 is N*K P is N*K
C is a scalar
Example: 2-layer neural network
Step 1: forward
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)     // matrix mult
Z1b = add*(Z1a, B1)  // add bias vec
A1  = tanh(Z1b)      // element-wise
Z2a = mul(A1, W2)
Z2b = add*(Z2a, B2)
A2  = tanh(Z2b)      // element-wise
P   = softMax(A2)    // vec to vec
C   = crossEntY(P)   // cost function
Step 2: backprop
dC/dC = 1
dC/dP = dC/dC * dCrossEntY/dP
dC/dA2 = dC/dP * dsoftmax/dA2
dC/dZ2b = dC/dA2 * dtanh/dZ2b
dC/dZ2a = dC/dZ2b * dadd*/dZ2a  (= dC/dZ2b * 1)
dC/dB2 = dC/dZ2b * dadd*/dB2
dC/dA1 = dC/dZ2a * dmul/dA1
dC/dW2 = dC/dZ2a * dmul/dW2
dC/dZ1b = …
Target Y; N rows; K outs; D feats, H hidden
Shapes: dC/dP is N*K; applying dsoftmax/dA2 is an (N*K) * (K*K) product giving N*K; dC/dA1 is N*H.
dCrossEntY/dP = −(1/N) Σ_i (p_i − y_i) / (y_i (1 − y_i))
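A numpy sketch of the same forward pass and backward sweep, assuming a one-hot target matrix Y. Unlike the slide, it folds dCrossEntY/dP and dsoftmax/dA2 into the single standard expression (P − Y)/N instead of forming the softmax Jacobian:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward_backward(X, Y, W1, B1, W2, B2):
    N = X.shape[0]
    # Step 1: forward (same operations as the list above)
    Z1a = X @ W1                      # N*H
    Z1b = Z1a + B1                    # add bias vec (broadcast over rows)
    A1  = np.tanh(Z1b)
    Z2a = A1 @ W2                     # N*K
    Z2b = Z2a + B2
    A2  = np.tanh(Z2b)
    P   = softmax(A2)                 # N*K
    C   = -np.sum(Y * np.log(P)) / N  # cross-entropy cost
    # Step 2: backprop, walking the list in reverse
    dA2  = (P - Y) / N                # softmax + cross-entropy gradients combined
    dZ2b = dA2 * (1 - A2 ** 2)        # dtanh/dZ2b
    dZ2a = dZ2b                       # add passes the gradient through
    dB2  = dZ2b.sum(axis=0)           # the bias was broadcast over rows, so sum over them
    dA1  = dZ2a @ W2.T                # dmul/dA1
    dW2  = A1.T @ dZ2a                # dmul/dW2
    dZ1b = dA1 * (1 - A1 ** 2)
    dW1  = X.T @ dZ1b
    dB1  = dZ1b.sum(axis=0)
    return C, (dW1, dB1, dW2, dB2)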
Example: 2-layer neural network
Step 1: forward
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)    // matrix mult
Z1b = add(Z1a, B1)  // add bias vec
A1  = tanh(Z1b)     // element-wise
Z2a = mul(A1, W2)
Z2b = add(Z2a, B2)
A2  = tanh(Z2b)     // element-wise
P   = softMax(A2)   // vec to vec
C   = crossEntY(P)  // cost function
Step 2: backprop
dC/dC = 1
dC/dP = dC/dC * dCrossEntY/dP
dC/dA2 = dC/dP * dsoftmax/dA2
dC/dZ2b = dC/dA2 * dtanh/dZ2b
dC/dZ2a = dC/dZ2b * dadd/dZ2a  (= dC/dZ2b * 1)
dC/dB2 = dC/dZ2b * dadd/dB2
dC/dA1 = dC/dZ2a * dmul/dA1
dC/dW2 = dC/dZ2a * dmul/dW2
dC/dZ1b = …
Target Y; N rows; K outs; D feats, H hidden
dCrossEntY/dP = −(1/N) Σ_i (p_i − y_i) / (y_i (1 − y_i))   [dC/dP is N*K; dC/dA1 is N*H]
Need a backward form for each matrix operation used in forward
Example: 2-layer neural network
Step 1: forward
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)    // matrix mult
Z1b = add(Z1a, B1)  // add bias vec
A1  = tanh(Z1b)     // element-wise
Z2a = mul(A1, W2)
Z2b = add(Z2a, B2)
A2  = tanh(Z2b)     // element-wise
P   = softMax(A2)   // vec to vec
C   = crossEntY(P)  // cost function
Step 2: backprop
dC/dC = 1
dC/dP = dC/dC * dCrossEntY/dP
dC/dA2 = dC/dP * dsoftmax/dA2
dC/dZ2b = dC/dA2 * dtanh/dZ2b
dC/dZ2a = dC/dZ2b * dadd/dZ2a  (= dC/dZ2b * 1)
dC/dB2 = dC/dZ2b * dadd/dB2
dC/dA1 = dC/dZ2a * dmul/dA1
dC/dW2 = dC/dZ2a * dmul/dW2
dC/dZ1b = …
Target Y; N rows; K outs; D feats, H hidden
Need a backward form for each matrix operation used in forward, with respect to each argument
Example: 2-layer neural network
Step 1: forward
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)    // matrix mult
Z1b = add(Z1a, B1)  // add bias vec
A1  = tanh(Z1b)     // element-wise
Z2a = mul(A1, W2)
Z2b = add(Z2a, B2)
A2  = tanh(Z2b)     // element-wise
P   = softMax(A2)   // vec to vec
C   = crossEntY(P)  // cost function
Target Y; N rows; K outs; D feats, H hidden
Need a backward form for each matrix operation used in forward, with respect to each argument
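For example, the mul and broadcast-add (add*) operations might expose one forward function and one backward function per argument, roughly like this (a sketch, not any particular library's API):

import numpy as np

def mul_forward(A, W):               # forward implementation
    return A @ W

def mul_backward_A(dOut, A, W):      # backward implementation w.r.t. the first argument
    return dOut @ W.T

def mul_backward_W(dOut, A, W):      # backward implementation w.r.t. the second argument
    return A.T @ dOut

def addstar_forward(Z, B):           # add a bias row vector to every row of Z
    return Z + B

def addstar_backward_Z(dOut, Z, B):
    return dOut

def addstar_backward_B(dOut, Z, B):
    return dOut.sum(axis=0)          # B was broadcast over rows, so sum their gradients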
An autodiff package usually includes
• A collection of matrix-oriented operations (mul, add*, …)
• For each operation
  • A forward implementation
  • A backward implementation for each argument
• It's still a little awkward to program with a list of assignments so…
Generalizing backprop
• Starting point: a function of n variables
• Step 1: code your function as a series of assignments
[Figure: the same example Wengert list (inputs, assignments, return) and its forward evaluation (Step 1).]
https://justindomke.wordpress.com/
• Better plan: overload your matrix operators so that when you use them in-line they build an expression graph
• Convert the expression graph to a Wengert list when necessary
Example: Theano
[Figure: Theano code that generates some training data and defines a small logistic-regression model with predicted probability p_1.]
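The code in the figure is essentially the standard Theano logistic-regression tutorial; a sketch along those lines (the sizes and the simple cost are choices made here, not taken from the slide):

import numpy as np
import theano
import theano.tensor as T

rng = np.random
N, D = 400, 784
# generate some training data
data_x = rng.randn(N, D)
data_y = rng.randint(low=0, high=2, size=N).astype("float64")

x = T.dmatrix("x")
y = T.dvector("y")
w = theano.shared(rng.randn(D), name="w")
b = theano.shared(0., name="b")

p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))             # predicted probability of class 1
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)   # per-example cross-entropy
cost = xent.mean()
gw, gb = T.grad(cost, [w, b])                       # gradients come from autodiff, not hand-coding

train = theano.function(inputs=[x, y], outputs=cost,
                        updates=[(w, w - 0.1 * gw), (b, b - 0.1 * gb)])
for step in range(100):                             # plain SGD loop
    train(data_x, data_y)

Note that p_1, xent, and cost are symbolic expressions: the in-line operators build an expression graph, and theano.function compiles it (and its gradients) into runnable code.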
Example: 2-layer neural network
Step 1: forward
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)    // matrix mult
Z1b = add(Z1a, B1)  // add bias vec
A1  = tanh(Z1b)     // element-wise
Z2a = mul(A1, W2)
Z2b = add(Z2a, B2)
A2  = tanh(Z2b)     // element-wise
P   = softMax(A2)   // vec to vec
C   = crossEntY(P)  // cost function
An autodiff package usually includes
• A collection of matrix-oriented operations (mul, add*, …)
• For each operation
  • A forward implementation
  • A backward implementation for each argument
• A way of composing operations into expressions (often using operator overloading) which evaluate to expression trees
• Expression simplification/compilation
Some tools that use autodiff
• Theano:
  – Univ Montreal system, Python-based, first version 2007, now v0.8, integrated with GPUs
  – Many libraries build over Theano (Lasagne, Keras, Blocks, …)
• Torch:
  – Collobert et al, used heavily at Facebook, based on Lua. Similar performance-wise to Theano
• TensorFlow:
  – Google system, possibly not well-tuned for the GPU systems used by academics
  – Also supported by Keras
• Autograd
  – New system which is very Pythonic, doesn't support GPUs (yet)
• …
Outline
• What's new in ANNs in the last 5-10 years?
  – Deeper networks, more data, and faster training
    • Scalability and use of GPUs ✔
    • Symbolic differentiation ✔
    • Some subtle changes to cost function, architectures, optimization methods ✔
• What types of ANNs are most successful and why?
  – Convolutional networks (CNNs)
  – Long term / short term memory networks (LSTM)
  – Word2vec and embeddings
• What are the hot research topics for deep learning?
WHAT'S A CONVOLUTIONAL NEURAL NETWORK?
Model of vision in animals
Vision with ANNs
What's a convolution?
1-D
https://en.wikipedia.org/wiki/Convolution
What's a convolution?
http://matlabtricks.com/post-5/3x3-convolution-kernels-with-online-demo
What's a convolution?
• Basic idea:
  – Pick a 3x3 matrix F of weights
  – Slide this over an image and compute the "inner product" (similarity) of F and the corresponding field of the image, and replace the pixel in the center of the field with the output of the inner product operation
• Key point:
  – Different convolutions extract different types of low-level "features" from an image
  – All that we need to vary to generate these different features is the weights of F
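A direct numpy sketch of that sliding inner product (no padding, stride 1; strictly this is cross-correlation, since CNNs usually don't bother flipping the kernel):

import numpy as np

def convolve2d(image, F):
    # slide the 3x3 weight matrix F over the image; each output pixel is the
    # inner product of F with the 3x3 field centered there
    H, W = image.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(F * image[i:i+3, j:j+3])
    return out

img = np.random.rand(28, 28)                 # stand-in for a grayscale image
F_vertical_edges = np.array([[-1, 0, 1],     # one choice of weights: a vertical edge detector
                             [-1, 0, 1],
                             [-1, 0, 1]])
features = convolve2d(img, F_vertical_edges)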
How do we convolve an image with an ANN?
Note that the parameters in the matrix defining the convolution are tied across all places that it is used
How do we do many convolutions of an image with an ANN?
Example: 6 convolutions of a digit
http://scs.ryerson.ca/~aharley/vis/conv/
CNNs typically alternate convolutions, non-linearity, and then downsampling
Downsampling is usually averaging or (more common in recent CNNs) max-pooling
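For example, 2x2 max-pooling of a single feature map can be written as (a numpy sketch):

import numpy as np

def max_pool_2x2(fmap):
    # downsample by taking the max over each non-overlapping 2x2 block
    H, W = fmap.shape
    fmap = fmap[:H - H % 2, :W - W % 2]      # trim odd edges
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))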
Why do max-pooling?
• Saves space
• Reduces overfitting?
• Because I'm going to add more convolutions after it!
  – Allows the short-range convolutions to extend over larger subfields of the images
    • So we can spot larger objects
    • E.g., a long horizontal line, or a corner, or …
Another CNN visualization
https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html
Why do max-pooling?
• Saves space
• Reduces overfitting?
• Because I'm going to add more convolutions after it!
  – Allows the short-range convolutions to extend over larger subfields of the images
    • So we can spot larger objects
    • E.g., a long horizontal line, or a corner, or …
• At some point the feature maps start to get very sparse and blobby – they are indicators of some semantic property, not a recognizable transformation of the image
• Then just use them as features in a "normal" ANN
Alternating convolution and downsampling
5 layers up
The subfield in a large dataset that gives the strongest output for a neuron
Outline
• What's new in ANNs in the last 5-10 years?
  – Deeper networks, more data, and faster training
    • Scalability and use of GPUs ✔
    • Symbolic differentiation ✔
    • Some subtle changes to cost function, architectures, optimization methods ✔
• What types of ANNs are most successful and why?
  – Convolutional networks (CNNs) ✔
  – Long term / short term memory networks (LSTM)
  – Word2vec and embeddings
• What are the hot research topics for deep learning?
RECURRENT NEURAL NETWORKS
Motivation: what about sequence prediction?
What can I do when input size and output size vary?
Motivation: what about sequence prediction?
[Figure: a recurrent unit A maps an input x to an output h.]
Architecture for an RNN
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[Figure: an unrolled RNN. A sequence of inputs, bracketed by a start-of-sequence marker and an end-of-sequence marker, produces a sequence of outputs; some information is passed from one subunit to the next.]
Architecture for a 1980's RNN
…
Problem with this: it’s extremely deep and very hard to train
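A sketch of such a recurrent unit unrolled over a sequence (shapes are made up) makes the depth problem visible: gradients have to flow back through one tanh per time step.

import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    # a plain 1980s-style RNN: the same weights are applied at every time step,
    # and the hidden state h carries information from one subunit to the next
    h, ys = h0, []
    for x in xs:                              # one input vector per time step
        h = np.tanh(W_xh @ x + W_hh @ h)      # new hidden state
        ys.append(W_hy @ h)                   # output at this step
    return ys, h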
Architecture for an LSTM
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long term / short term model
σ: output in [0,1]; tanh: output in [-1,+1]
[Figure annotations: decide what to forget; decide what to insert; combine with transformed x_t; the cell state holds "bits of memory".]
Walkthrough
What part of memory to "forget" – zero means forget this bit
Walkthrough
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
What bits to insert into the next states
What content to store into the next state
Walkthrough
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Next memory cell content – mixture of not-forgotten part of previous cell and insertion
Walkthrough
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
What part of cell to output
tanh maps bits to [-1,+1] range
Architecture for an LSTM
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[Figure: the LSTM cell annotated with the forget gate f_t, input gate i_t, output gate o_t, previous cell state C_{t-1}, and equations (1)–(3).]
Implementing an LSTM
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[Figure: the LSTM gate equations (1)–(3).]
For t = 1,…,T:
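Written out, each step applies the standard gate equations from the cited post (a numpy sketch; the parameter names and the [h, x] packing are choices of this sketch):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)              # forget gate: what to keep from C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)              # input gate: what to insert
    C_tilde = np.tanh(W_C @ z + b_C)          # candidate content to store
    C_t = f_t * C_prev + i_t * C_tilde        # new "bits of memory"
    o_t = sigmoid(W_o @ z + b_o)              # output gate: what part of the cell to output
    h_t = o_t * np.tanh(C_t)                  # tanh maps the cell back to [-1,+1]
    return h_t, C_t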
SOME FUN LSTM EXAMPLES
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
LSTMs can be used for other sequence tasks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
sequence classification, translation, named entity recognition, image captioning
Character-level language model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Test time:
• pick a seed character sequence
• generate the next character
• then the next
• then the next …
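A sketch of that test-time loop, assuming a trained recurrent step function (here a closure step(x, h, C) over the learned parameters), an output projection W_hy, b_y, and char/index vocabularies; all of these names are hypothetical:

import numpy as np

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def sample(seed_text, n_chars, step, h, C, char_to_ix, ix_to_char, W_hy, b_y):
    out = list(seed_text)
    for ch in seed_text:                        # warm up on the seed character sequence
        h, C = step(one_hot(char_to_ix[ch], len(char_to_ix)), h, C)
    for _ in range(n_chars):
        scores = W_hy @ h + b_y                 # softmax over the vocabulary
        p = np.exp(scores - scores.max())
        p /= p.sum()
        ix = np.random.choice(len(p), p=p)      # sample the next character
        out.append(ix_to_char[ix])
        h, C = step(one_hot(ix, len(p)), h, C)  # feed it back in, then repeat
    return "".join(out)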
Character-level language model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Character-level language model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Yoav Goldberg: order-10 unsmoothed character n-grams
Character-level language model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Character-level language model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
LaTeX “almost compiles”
Character-level language model
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Outline
• What's new in ANNs in the last 5-10 years?
  – Deeper networks, more data, and faster training
    • Scalability and use of GPUs ✔
    • Symbolic differentiation ✔
    • Some subtle changes to cost function, architectures, optimization methods ✔
• What types of ANNs are most successful and why?
  – Convolutional networks (CNNs) ✔
  – Long term / short term memory networks (LSTM)
  – Word2vec and embeddings
• What are the hot research topics for deep learning?
WORD2VEC AND WORD EMBEDDINGS
Basic idea behind skip-gram embeddings
From an input word w(t) in a document, construct a hidden layer that "encodes" that word, so that the hidden layer will predict likely nearby words w(t-K), …, w(t+K). The final step of this prediction is a softmax over lots of outputs.
Basic idea behind skip-gram embeddings
Training data: positive examples are pairs of words w(t), w(t+j) that co-occur
Training data: negative examples are samples of pairs of words w(t), w(t+j) that don’t co-occur
You want to train over a very large corpus (100M words+) and hundreds+ dimensions
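One SGD step of this scheme (skip-gram with negative sampling) can be sketched as follows; W_in holds the hidden-layer embeddings, W_out the output vectors, and all names are made up:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_step(center, context, negatives, W_in, W_out, lr=0.025):
    # (center, context) is a positive pair; negatives are word ids that did not co-occur
    v = W_in[center]                              # embedding ("hidden layer") of the input word
    dv = np.zeros_like(v)
    for w, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        p = sigmoid(v @ W_out[w])                 # predicted probability that the pair co-occurs
        g = p - label                             # gradient of the logistic loss w.r.t. the score
        dv += g * W_out[w]
        W_out[w] -= lr * g * v
    W_in[center] -= lr * dv

In word2vec itself the negative words are drawn from a smoothed unigram distribution over the corpus.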
Results from word2vec
https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
Results from word2vec
https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
Results from word2vec
https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
Outline
• What's new in ANNs in the last 5-10 years?
  – Deeper networks, more data, and faster training
    • Scalability and use of GPUs ✔
    • Symbolic differentiation ✔
    • Some subtle changes to cost function, architectures, optimization methods ✔
• What types of ANNs are most successful and why?
  – Convolutional networks (CNNs) ✔
  – Long term / short term memory networks (LSTM)
  – Word2vec and embeddings
• What are the hot research topics for deep learning?
Some current hot topics
• Multi-task learning
  – Does it help to learn to predict many things at once? e.g., POS tags and NER tags in a word sequence?
  – Similar to word2vec learning to produce all context words
• Extensions of LSTMs that model memory more generally
  – e.g. for question answering about a story
Some current hot topics
• Optimization methods (>> SGD)
• Neural models that include "attention"
  – Ability to "explain" a decision
Examples of attention
https://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
Examples of attention
https://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
Basic idea: similarly to the way an LSTM chooses what to “forget” and “insert” into memory, allow a network to choose a path to focus on in the visual field
Some current hot topics
• Knowledge-base embedding: extending word2vec to embed large databases of facts about the world into a low-dimensional space.
  – TransE, TransR, …
• "NLP from scratch": sequence-labeling and other NLP tasks with a minimal amount of feature engineering, only networks and character- or word-level embeddings
Some current hot topics
• Computer vision: complex tasks like generating a natural language caption from an image or understanding a video clip
• Machine translation
  – English to Spanish, …
• Using neural networks to perform tasks
  – Driving a car
  – Playing games (like Go or …)