Natural Language Processing with Deep Learning CS224N/Ling284 Christopher Manning and Richard Socher Lecture 14: Tree Recursive Neural Networks and Constituency Parsing


Page 1:

Natural Language Processing with Deep Learning

CS224N/Ling284

Christopher Manning and Richard Socher

Lecture 14: Tree Recursive Neural Networks and Constituency Parsing

Page 2:

Lecture Plan

1.  Motivation: Compositionality and Recursion
2.  Structure prediction with a simple Tree RNN: Parsing
3.  Research highlight: Deep Reinforcement Learning for Dialogue Generation
4.  Backpropagation through Structure
5.  More complex units

Reminders/comments: Learn up on GPUs, Azure, Docker. Assignment 4: Get something working, using a GPU for the milestone. Final project discussions: come meet with us!

Page 3:

1. The spectrum of language in CS

Page 4:

Semantic interpretation of language – Not just word vectors

How can we know when larger units are similar in meaning?

•  The snowboarder is leaping over a mogul

•  A person on a snowboard jumps into the air

People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements

Page 5:

Compositionality

Page 6:
Page 7:

Language understanding – & Artificial Intelligence – requires being able to understand bigger things from knowing about smaller parts

Page 8:

Page 9:

Are languages recursive?

•  Cognitively somewhat debatable
•  But: recursion is natural for describing language
•  [The man from [the company that you spoke with about [the project] yesterday]]
•  noun phrase containing a noun phrase containing a noun phrase
•  Arguments for now: 1) Helpful in disambiguation:

Page 10:

Is recursion useful?

2) Helpful for some tasks to refer to specific phrases:
•  John and Jane went to a big festival. They enjoyed the trip and the music there.
•  “they”: John and Jane
•  “the trip”: went to a big festival
•  “there”: big festival

3) Works better for some tasks to use grammatical tree structure
•  It’s a powerful prior for language structure

Page 11:

Building on Word Vector Space Models

How can we represent the meaning of longer phrases?

By mapping them into the same vector space!

[Figure: 2-D word vector space with points for Monday (9, 2), Tuesday (9.5, 1.5), France (2, 2.5), Germany (1, 3), and the phrases “the country of my birth” (1, 5) and “the place where I was born” (1.1, 4)]

Page 12:

How should we map phrases into a vector space?

Use the principle of compositionality: The meaning (vector) of a sentence is determined by
(1) the meanings of its words and
(2) the rules that combine them.

Models in this section can jointly learn parse trees and compositional vector representations.

[Figure: “the country of my birth” and “the place where I was born” composed bottom-up from word vectors and mapped into the same 2-D space as Monday, Tuesday, France, and Germany]

Page 13:

Constituency Sentence Parsing: What we want

[Figure: parse tree for “The cat sat on the mat.” with S, VP, PP, and NP nodes, and a 2-D vector at each word]

Page 14:

Learn Structure and Representation

[Figure: the same parse tree for “The cat sat on the mat.”, now with a learned vector at every internal node (S, VP, PP, NP)]

Page 15:

Recursive vs. recurrent neural networks

[Figure: “the country of my birth” composed two ways: a recursive (tree-structured) network that merges constituents bottom-up, and a recurrent (chain-structured) network that consumes the words left to right]

Page 16:

Recursive vs. recurrent neural networks

•  Recursive neural nets require a parser to get the tree structure

•  Recurrent neural nets cannot capture phrases without prefix context, and often capture too much of the last words in the final vector

[Figure: the same tree-structured vs. chain-structured composition of “the country of my birth”]

Page 17:

From RNNs to CNNs

•  RNN: Get compositional vectors for grammatical phrases only

•  CNN: Computes vectors for every possible phrase
•  Example: “the country of my birth” computes vectors for:
•  the country, country of, of my, my birth, the country of, country of my, of my birth, the country of my, country of my birth

•  Regardless of whether each is grammatical – many don’t make sense

•  Don’t need a parser
•  But maybe not very linguistically or cognitively plausible

Page 18:

Relationship between RNNs and CNNs

•  CNN vs. RNN

Page 19:

Relationship between RNNs and CNNs

[Figure: “people there speak slowly” composed by a CNN over all word windows vs. by an RNN over parse-tree constituents]

Page 20:

2. Recursive Neural Networks for Structure Prediction

[Figure: candidate children from “on the mat.” with their vectors; a neural network merges two candidates into a parent vector with plausibility score 1.3]

Inputs: two candidate children’s representations
Outputs:
1.  The semantic representation if the two nodes are merged.
2.  Score of how plausible the new node would be.

Page 21:

Recursive Neural Network Definition

score = U^T p

p = tanh(W [c1; c2] + b)

Same W parameters at all nodes of the tree

[Figure: a neural network merges children c1 and c2 into parent p with score 1.3]
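A minimal numpy sketch of this definition (illustrative only: toy dimensions, random parameters, and the names compose/score are not from the slides):

    import numpy as np

    d = 4                                        # toy embedding dimension
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(d, 2 * d))   # composition matrix, shared at all tree nodes
    b = np.zeros(d)                              # bias
    U = rng.normal(scale=0.1, size=d)            # scoring vector

    def compose(c1, c2):
        # p = tanh(W [c1; c2] + b)
        return np.tanh(W @ np.concatenate([c1, c2]) + b)

    def score(p):
        # score = U^T p
        return float(U @ p)

    c1, c2 = rng.normal(size=d), rng.normal(size=d)
    print(score(compose(c1, c2)))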

Page 22:

Parsing a sentence with an RNN

[Figure: “The cat sat on the mat.” – the network scores each adjacent pair of words (e.g. 0.1, 0.4, 2.3, 3.1, 0.3)]

Page 23:

Parsing a sentence

[Figure: the pair with the highest score is merged into a new node; the remaining candidate merges are rescored (e.g. 1.1, 0.1, 0.4, 2.3)]

Page 24:

Parsing a sentence

[Figure: merging continues greedily; the next best candidate merge scores 3.6]

Page 25:

Parsing a sentence

[Figure: the completed parse of “The cat sat on the mat.” with a vector at every node]
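A sketch of the greedy merging loop illustrated on slides 22–25, reusing compose and score from the sketch after Page 21 (an illustration of the greedy search, not the lecture's own code):

    def greedy_parse(leaves):
        # leaves: list of word vectors; returns (root vector, total score of all merge decisions)
        nodes, total = list(leaves), 0.0
        while len(nodes) > 1:
            # score every adjacent candidate pair
            candidates = [(score(compose(nodes[i], nodes[i + 1])), i)
                          for i in range(len(nodes) - 1)]
            best, i = max(candidates)
            total += best
            nodes[i:i + 2] = [compose(nodes[i], nodes[i + 1])]  # replace the pair with its parent
        return nodes[0], total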

Page 26:

Max-Margin Framework - Details

•  The score of a tree is computed as the sum of the parsing decision scores at each node (see the reconstruction below):

•  x is the sentence; y is the parse tree
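The equation itself is an image in the source; the standard form from this line of work (a hedged reconstruction, with s_n the score the RNN assigns to node n) is:

    s(x, y) = \sum_{n \in \mathrm{nodes}(y)} s_n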

[Figure: an RNN merges children into a parent with score 1.3]

Page 27:

Max-Margin Framework - Details

•  Similar to max-margin parsing (Taskar et al. 2004), a supervised max-margin objective

•  The loss penalizes all incorrect decisions

•  Structure search for A(x) was greedy (join best nodes each time)
•  Instead: Beam search with chart
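The objective is also an image in the source; the usual structured max-margin form (a hedged reconstruction, with A(x_i) the set of candidate trees for sentence x_i and \Delta(y, y_i) a margin penalty counting incorrect decisions) is:

    J(\theta) = \sum_i \Big[ s(x_i, y_i) - \max_{y \in A(x_i)} \big( s(x_i, y) + \Delta(y, y_i) \big) \Big],

maximized so that the gold tree y_i outscores every alternative tree by at least its margin.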

Page 28:

Backpropagation Through Structure

Introduced by Goller & Küchler (1996). Principally the same as general backpropagation. Three differences resulting from the recursion and tree structure:

1.  Sum derivatives of W from all nodes (like RNN)
2.  Split derivatives at each node (for tree)
3.  Add error messages from parent + node itself

[Appendix embedded on this slide: an excerpt from the course’s backpropagation notes, reconstructed in LaTeX.]

The second derivative in eq. 28 for output units is simply

    \frac{\partial a_i^{(n_l)}}{\partial W_{ij}^{(n_l-1)}}
    = \frac{\partial}{\partial W_{ij}^{(n_l-1)}} z_i^{(n_l)}
    = \frac{\partial}{\partial W_{ij}^{(n_l-1)}} \Big( W_{i\cdot}^{(n_l-1)} a^{(n_l-1)} \Big)
    = a_j^{(n_l-1)}.   (46)

We adopt standard notation and introduce the error \delta related to an output unit:

    \frac{\partial E_n}{\partial W_{ij}^{(n_l-1)}}
    = (y_i - t_i)\, a_j^{(n_l-1)}
    = \delta_i^{(n_l)} a_j^{(n_l-1)}.   (47)

So far, we only computed errors for output units; now we will derive \delta's for normal hidden units and show how these errors are backpropagated to compute weight derivatives of lower levels. We will start with second-to-top layer weights, from which a generalization to arbitrarily deep layers will become obvious. Similar to eq. 28, we start with the error derivative:

    \frac{\partial E}{\partial W_{ij}^{(n_l-2)}}
    = \sum_n \underbrace{\frac{\partial E_n}{\partial a^{(n_l)}}}_{\delta^{(n_l)}}
      \frac{\partial a^{(n_l)}}{\partial W_{ij}^{(n_l-2)}} + \lambda W_{ij}^{(n_l-2)}.   (48)

Now,

    (\delta^{(n_l)})^T \frac{\partial a^{(n_l)}}{\partial W_{ij}^{(n_l-2)}}
    = (\delta^{(n_l)})^T \frac{\partial z^{(n_l)}}{\partial W_{ij}^{(n_l-2)}}   (49)
    = (\delta^{(n_l)})^T \frac{\partial}{\partial W_{ij}^{(n_l-2)}} W^{(n_l-1)} a^{(n_l-1)}   (50)
    = (\delta^{(n_l)})^T \frac{\partial}{\partial W_{ij}^{(n_l-2)}} W_{\cdot i}^{(n_l-1)} a_i^{(n_l-1)}   (51)
    = (\delta^{(n_l)})^T W_{\cdot i}^{(n_l-1)} \frac{\partial}{\partial W_{ij}^{(n_l-2)}} a_i^{(n_l-1)}   (52)
    = (\delta^{(n_l)})^T W_{\cdot i}^{(n_l-1)} \frac{\partial}{\partial W_{ij}^{(n_l-2)}} f(z_i^{(n_l-1)})   (53)
    = (\delta^{(n_l)})^T W_{\cdot i}^{(n_l-1)} \frac{\partial}{\partial W_{ij}^{(n_l-2)}} f(W_{i\cdot}^{(n_l-2)} a^{(n_l-2)})   (54)
    = (\delta^{(n_l)})^T W_{\cdot i}^{(n_l-1)} f'(z_i^{(n_l-1)})\, a_j^{(n_l-2)}   (55)
    = \Big( (\delta^{(n_l)})^T W_{\cdot i}^{(n_l-1)} \Big) f'(z_i^{(n_l-1)})\, a_j^{(n_l-2)}   (56)
    = \underbrace{\Bigg( \sum_{j=1}^{s_{l+1}} W_{ji}^{(n_l-1)} \delta_j^{(n_l)} \Bigg) f'(z_i^{(n_l-1)})}_{\delta_i^{(n_l-1)}}\, a_j^{(n_l-2)}   (57)
    = \delta_i^{(n_l-1)} a_j^{(n_l-2)},   (58)

where we used in the first line that the top layer is linear. This is a very detailed account of essentially just the chain rule.

So, we can write the \delta errors of all layers l (except the top layer) in vector format, using the Hadamard product \circ:

    \delta^{(l)} = \Big( (W^{(l)})^T \delta^{(l+1)} \Big) \circ f'(z^{(l)}),   (59)

where the sigmoid derivative from eq. 14 gives f'(z^{(l)}) = (1 - a^{(l)}) a^{(l)}. Using that definition, we get the hidden layer backprop derivatives:

    \frac{\partial}{\partial W_{ij}^{(l)}} E_R = a_j^{(l)} \delta_i^{(l+1)} + \lambda W_{ij}^{(l)},   (60)

which in one simplified vector notation becomes:

    \frac{\partial}{\partial W^{(l)}} E_R = \delta^{(l+1)} (a^{(l)})^T + \lambda W^{(l)}.   (62)

In summary, the backprop procedure consists of four steps:

1. Apply an input x_n and forward propagate it through the network to get the hidden and output activations using eq. 18.

2. Evaluate \delta^{(n_l)} for output units using eq. 42.

3. Backpropagate the \delta's to obtain a \delta^{(l)} for each hidden layer in the network using eq. 59.

4. Evaluate the required derivatives with eq. 62 and update all the weights using an optimization procedure such as conjugate gradient or L-BFGS. CG seems to be faster and work better when using mini-batches of training data to estimate the derivatives.

If you have any further questions or found errors, please send an email to [email protected]

5 Recursive Neural Networks

Same as backprop in the previous section, but splitting error derivatives and noting that the derivatives of the same W at each node can all be added up. Lastly, the deltas from the parent node and possible deltas from a softmax classifier at each node are just added.

Page 29:

BTS: 1) Sum derivatives of all nodes

You can actually assume it’s a different W at each node. Intuition via example: if we take separate derivatives of each occurrence, we get the same result:
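The worked example is an image in the source; the underlying fact (a hedged reconstruction) is the multivariable chain rule. For a function that uses W twice, e.g. h = f(W g(W x)), name the two occurrences W_1 and W_2, differentiate with respect to each as if the other were constant, then set W_1 = W_2 = W and add:

    \frac{\partial h}{\partial W} = \frac{\partial h}{\partial W_1}\bigg|_{W_1=W_2=W} + \frac{\partial h}{\partial W_2}\bigg|_{W_1=W_2=W}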

Page 30:

BTS: 2) Split derivatives at each node

During forward prop, the parent is computed using the two children:

p = tanh(W [c1; c2] + b)

Hence, the errors need to be computed with respect to each of them, where each child’s error is n-dimensional.

[Figure: parent (8, 3) computed from children c1 = (8, 5) and c2 = (3, 3)]

Page 31:

BTS: 3) Add error messages

•  At each node:
•  What came up (fprop) must come down (bprop)
•  Total error messages = error messages from parent + error message from own score

[Figure: parent with children c1 and c2; the parent receives an error message from above and from its own score]

Page 32:

BTS Python Code: forwardProp

[The slide’s code is an image and did not survive transcription.]
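A minimal sketch of what the forward pass might look like (a reconstruction under the slides' definitions, not the slide's actual code; the Node class and function names are assumptions):

    import numpy as np

    class Node:
        # Binary tree node: a leaf carries a given word vector; an internal node has two children.
        def __init__(self, vec=None, left=None, right=None):
            self.vec, self.left, self.right = vec, left, right
            self.score = 0.0

    def forward_prop(node, W, b, U):
        # Bottom-up pass: compute each internal node's vector and score; return the tree's total score.
        if node.left is None:                       # leaf: its word vector is given
            return 0.0
        total = forward_prop(node.left, W, b, U) + forward_prop(node.right, W, b, U)
        c = np.concatenate([node.left.vec, node.right.vec])
        node.vec = np.tanh(W @ c + b)               # p = tanh(W [c1; c2] + b)
        node.score = float(U @ node.vec)            # score = U^T p
        return total + node.score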

Page 33:

BTS Python Code: backProp

[The slide’s code is also an image; the underlying math is in the notes excerpt under Page 28.]
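A matching sketch of the backward pass, implementing the three BTS points above: the shared W's derivatives are summed over nodes, the error is split between the two children, and the parent's message is added to the node's own score error (again a reconstruction, not the slide's code; grads is a dict of zero arrays shaped like W, b, U, and the root is called with parent_delta = zeros):

    def back_prop(node, W, U, parent_delta, grads):
        # Computes d(total score)/dW, db, dU for the tree produced by forward_prop.
        if node.left is None:
            return
        # total error at this node = message from parent + message from own score,
        # passed through the tanh derivative (1 - p^2)
        delta = (parent_delta + U) * (1.0 - node.vec ** 2)
        c = np.concatenate([node.left.vec, node.right.vec])
        grads["W"] += np.outer(delta, c)            # 1) sum derivatives of the shared W over all nodes
        grads["b"] += delta
        grads["U"] += node.vec                      # d(score)/dU = p at every internal node
        child_delta = W.T @ delta                   # 2) split the error message between the children
        n = node.left.vec.shape[0]
        back_prop(node.left, W, U, child_delta[:n], grads)
        back_prop(node.right, W, U, child_delta[n:], grads)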


Page 34:

BTS: Optimization

•  As before, we can plug the gradients into a standard off-the-shelf optimizer such as L-BFGS or SGD

•  Best results with AdaGrad (Duchi et al. 2011); see the update rule below

•  For a non-continuous objective, use the subgradient method (Ratliff et al. 2007)
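The update on the slide is presumably the standard AdaGrad rule (Duchi et al. 2011), which per parameter i at step t is:

    \theta_{t+1,i} = \theta_{t,i} - \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^{2}}} \; g_{t,i}

i.e. each parameter gets its own step size, shrunk by the history of its squared gradients.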


Page 35:

Deep Reinforcement Learning for Dialogue Generation Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao and Dan Jurafsky

Page 36:

Seq2Seq for Dialogue

Encode previous message(s) into a vector; decode the vector into a response

[Figure: an encoder reads “How are you”; a decoder generates “I am fine”]

Page 37:

Seq2Seq for Dialogue

Encode previous message(s) into a vector; decode the vector into a response

[Figure: the same encoder–decoder reading “How are you” and generating “I am fine”]

Train by maximizing p(response | input),

where the response is produced by a human
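Spelled out, this is ordinary maximum likelihood over human input–response pairs (x, y):

    \max_\theta \sum_{(x,y)} \log p_\theta(y \mid x)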

Page 38:

Problems with Seq2Seq

How old are you?

Page 39:

Problems with Seq2Seq

How old are you?

16?

I’m 16

Page 40:

Problems with Seq2Seq

How old are you?

16?

I’m 16

I don’t know what you’re talking about

You don’t know what you’re saying

I don’t know what you’re talking about

You don’t know what you’re saying

Page 41:

Problems with Seq2Seq

How old are you?

16?

I’m 16

I don’t know what you’re talking about

You don’t know what you’re saying

I don’t know what you’re talking about

You don’t know what you’re saying

reasonable, but unhelpful

generic

Page 42:

probable response != good response

Page 43:

What is a good response?

•  Reasonable: p(response | input) is high according to the seq2seq model

•  Nonrepetitive: similarity between the response and previous messages is low

•  Easy to answer: p(“i don’t know” | response) is low

Page 44:

What is a good response?

•  Reasonable: p(response | input) is high according to the seq2seq model

•  Nonrepetitive: similarity between the response and previous messages is low

•  Easy to answer: p(“i don’t know” | response) is low

Scoring function: R(response) = reasonable_score + nonrepetitive_score + easy_to_answer_score (a sketch follows below)
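A sketch of how such a reward could be assembled (the callables log_p and similarity are stand-ins for the seq2seq log-probability and an utterance-similarity measure; the actual paper weights the three terms, so this is illustrative only):

    def reward(response, input_msg, log_p, similarity, dull=("i don't know",)):
        # log_p(y, x): log-probability of utterance y given x under a seq2seq model (assumed callable)
        # similarity(a, b): similarity of two utterances in [0, 1] (assumed callable)
        reasonable = log_p(response, input_msg)                  # high if the response fits the input
        nonrepetitive = -similarity(response, input_msg)         # low similarity to context is better
        easy_to_answer = -sum(log_p(d, response) for d in dull)  # dull follow-ups should be unlikely
        return reasonable + nonrepetitive + easy_to_answer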

Page 45:

Reinforcement Learning

Learn from rewards instead of from examples

1. Encode the input into a vector

[Figure: encoder reading “How are you”]

Page 46:

Reinforcement Learning

Learn from rewards instead of from examples

2. Have the system generate a response

[Figure: the decoder generates “I don’t know”]

Page 47:

Reinforcement Learning

Learn from rewards instead of from examples

3. Receive reward R(response), and train the system to maximize the reward

[Figure: the generated response “I don’t know” receives reward R = −5]
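The update behind step 3 is not spelled out in the transcription; the paper optimizes expected reward with a policy-gradient (REINFORCE-style) estimator, schematically:

    \nabla_\theta J(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)} \big[ R(y) \, \nabla_\theta \log p_\theta(y \mid x) \big]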

Page 48:

Quantitative Results

Page 49:

Qualitative Results

How old are you?

I thought you were 12

I’m 16. Why are you asking?

What made you think so?

Page 50:

Qualitative Results

How old are you?

I thought you were 12

I’m 16. Why are you asking?

What made you think so?

You don’t know what you’re saying

I don’t know what you’re talking about

You don’t know what you’re saying

Page 51:

Conclusion

•  Reinforcement learning is useful when we want our model to do more than produce a probable human label

•  Many more applications of RL to NLP!

Information extraction, question answering, task-oriented dialogue, coreference resolution, and more

Page 52:

Discussion: Simple RNN

•  Decent results with a single-matrix Tree RNN

•  A single-weight-matrix Tree RNN could capture some phenomena, but is not adequate for more complex, higher-order composition and for parsing long sentences

•  There is no real interaction between the input words

•  The composition function is the same for all syntactic categories, punctuation, etc.

[Figure: children c1 and c2 composed via W into p; a score s computed from p via W^score]

Page 53:

Version 2: Syntactically-Untied RNN

•  A symbolic Context-Free Grammar (CFG) backbone is adequate for basic syntactic structure

•  We use the discrete syntactic categories of the children to choose the composition matrix

•  A Tree RNN can do better with a different composition matrix for different syntactic environments

•  The result gives us better semantics

Page 54:

Compositional Vector Grammars

•  Problem: Speed. Every candidate score in beam search needs a matrix-vector product.

•  Solution: Compute scores only for a subset of trees coming from a simpler, faster model (PCFG)
• Prunes very unlikely candidates for speed
• Provides coarse syntactic categories of the children for each beam candidate

•  Compositional Vector Grammar = PCFG + Tree RNN

Page 55:

Details: Compositional Vector Grammar

•  Scores at each node are computed by a combination of PCFG and SU-RNN (see below):

•  Interpretation: Factoring discrete and continuous parsing in one model

•  Socher et al. (2013)
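The combination equation is an image in the source; in Socher et al. (ACL 2013) the score of a node p whose children have syntactic categories B and C is, roughly:

    s(p) = \big( v^{(B,C)} \big)^{T} p + \log P(P \to B\ C)

where v^{(B,C)} is the category-pair-specific scoring vector of the SU-RNN and P(P \to B C) is the PCFG rule probability.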

Page 56:

Related work for recursive neural networks

Pollack (1990): Recursive auto-associative memories
Previous Recursive Neural Networks work by Goller & Küchler (1996) and Costa et al. (2003) assumed fixed tree structure and used one-hot vectors.
Hinton (1990) and Bottou (2011): Related ideas about recursive models, and recursive operators as smooth versions of logic operations

Page 57:

Related work for parsing

•  The resulting CVG Parser is related to previous work that extends PCFG parsers

•  Klein and Manning (2003a): manual feature engineering
•  Petrov et al. (2006): learning algorithm that splits and merges syntactic categories
•  Lexicalized parsers (Collins, 2003; Charniak, 2000): describe each category with a lexical item
•  Hall and Klein (2012) combine several such annotation schemes in a factored parser
•  CVGs extend these ideas from discrete representations to richer continuous ones

Page 58:

Experiments
•  Standard WSJ split, labeled F1
•  Based on a simple PCFG with fewer states
•  Fast pruning of search space, few matrix-vector products
•  3.8% higher F1, 20% faster than the Stanford factored parser

Parser                                                      Test, All Sentences
Stanford PCFG (Klein and Manning, 2003a)                    85.5
Stanford Factored (Klein and Manning, 2003b)                86.6
Factored PCFGs (Hall and Klein, 2012)                       89.4
Collins (Collins, 1997)                                     87.7
SSN (Henderson, 2004)                                       89.4
Berkeley Parser (Petrov and Klein, 2007)                    90.1
CVG (RNN) (Socher et al., ACL 2013)                         85.0
CVG (SU-RNN) (Socher et al., ACL 2013)                      90.4
Charniak - Self Trained (McClosky et al. 2006)              91.0
Charniak - Self Trained - ReRanked (McClosky et al. 2006)   92.1

Page 59:

SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]

Learns a soft notion of head words
Initialization:

[Figure: learned composition matrices for the category pairs NP-CC, NP-PP, PP-NP, and PRP$-NP]

Page 60:

SU-RNN / CVG [Socher, Bauer, Manning, Ng 2013]

[Figure: learned composition matrices for the category pairs ADJP-NP, ADVP-ADJP, JJ-NP, and DT-NP]

Page 61:

Analysis of resulting vector representations

(each group shows a phrase followed by its nearest neighbors)

All the figures are adjusted for seasonal variations
1. All the numbers are adjusted for seasonal fluctuations
2. All the figures are adjusted to remove usual seasonal patterns

Knight-Ridder wouldn’t comment on the offer
1. Harsco declined to say what country placed the order
2. Coastal wouldn’t disclose the terms

Sales grew almost 7% to $UNK m. from $UNK m.
1. Sales rose more than 7% to $94.9 m. from $88.3 m.
2. Sales surged 40% to UNK b. yen from UNK b.

Page 62:

SU-RNN Analysis

•  Can transfer semantic information from a single related example

•  Train sentences:
• He eats spaghetti with a fork.
• She eats spaghetti with pork.

•  Test sentences:
• He eats spaghetti with a spoon.
• He eats spaghetti with meat.

Page 63:

SU-RNN Analysis

Page 64:

Labeling in Recursive Neural Networks

• We can use each node’s representation as features for a softmax classifier:

•  Training is similar to the model in part 1, with standard cross-entropy error + scores

[Figure: a softmax layer on top of a node vector predicts the label NP]
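The classifier equation is an image in the source; the standard per-node softmax (a hedged reconstruction, with W^{label} the classifier weights and p the node's vector) is:

    y = \mathrm{softmax}(W^{label} p)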

Page 65:

Version 3: Compositionality Through Recursive Matrix-Vector Spaces

One way to make the composition function more powerful was by untying the weights W. But what if words act mostly as an operator, e.g. “very” in “very good”?

Proposal: A new composition function

Before: p = tanh(W [c1; c2] + b)

Page 66:

Version 3: Matrix-vector RNNs [Socher, Huval, Bhat, Manning, & Ng, 2012]

[Figure: each word is paired with both a vector and a matrix; composition produces a parent vector p]

Page 67:

Compositionality Through Recursive Matrix-Vector Recursive Neural Networks

Before: p = tanh(W [c1; c2] + b)

Now: p = tanh(W [C2 c1; C1 c2] + b)

where C1 and C2 are the matrices associated with the two children: each child’s matrix transforms the other child’s vector before composition.

Page 68:

Matrix-vector RNNs [Socher, Huval, Bhat, Manning, & Ng, 2012]

The parent matrix P is composed from the child matrices A and B:

P = W_M [A; B]

Page 69:

Predicting Sentiment Distributions

Good example for non-linearity in language

Page 70:

Classification of Semantic Relationships

•  Can an MV-RNN learn how a large syntactic context conveys a semantic relationship?

•  My [apartment]e1 has a pretty large [kitchen]e2 → component-whole relationship (e2, e1)

•  Build a single compositional semantics for the minimal constituent including both terms

Page 71:

Classification of Semantic Relationships

Classifier  Features                                                      F1
SVM         POS, stemming, syntactic patterns                             60.1
MaxEnt      POS, WordNet, morphological features, noun compound
            system, thesauri, Google n-grams                              77.6
SVM         POS, WordNet, prefixes, morphological features,
            dependency parse features, Levin classes, PropBank,
            FrameNet, NomLex-Plus, Google n-grams, paraphrases,
            TextRunner                                                    82.2
RNN         –                                                             74.8
MV-RNN      –                                                             79.1
MV-RNN      POS, WordNet, NER                                             82.4

Page 72:

Scene Parsing

•  The meaning of a scene image is also a function of smaller regions,

•  how they combine as parts to form larger objects,

•  and how the objects interact.

Similar principle of compositionality.

Page 73:

Algorithm for Parsing Images

Same Recursive Neural Network as for natural language parsing! (Socher et al. ICML 2011)

[Figure: parsing natural scene images – image segments (grass, tree, people, building) with features are merged bottom-up into semantic representations]

Page 74:

Multi-class segmentation

Method                                               Accuracy
Pixel CRF (Gould et al., ICCV 2009)                  74.3
Classifier on superpixel features                    75.9
Region-based energy (Gould et al., ICCV 2009)        76.4
Local labelling (Tighe & Lazebnik, ECCV 2010)        76.9
Superpixel MRF (Tighe & Lazebnik, ECCV 2010)         77.5
Simultaneous MRF (Tighe & Lazebnik, ECCV 2010)       77.5
Recursive Neural Network                             78.1

Stanford Background Dataset (Gould et al. 2009)

Page 75:

QCD-Aware Recursive Neural Networks for Jet Physics
Gilles Louppe, Kyunghyun Cho, Cyril Becot, Kyle Cranmer