
1

Language Representation and Modeling

Kapil Thadani
[email protected]


2

Schedule

◦ Language representation and modeling ................... March 1
  - Sparse and distributional representations
  - word2vec and doc2vec
  - Task-specific representations
  - Language modeling

◦ Encoder-decoder frameworks .............................. March 8
  - RNN units: LSTMs, GRUs
  - Sequence-to-sequence models
  - Attention mechanism
  - Copying mechanism
  - Scheduled sampling

◦ Applications ............................................ March 22
  - Parsing
  - Question answering
  - Entailment
  - Machine reading
  - Dialog systems


3

Outline

◦ Language representation
  - Bag of words
  - Distributional hypothesis
  - word2vec
  - doc2vec

◦ Task-specific representations
  - Convolutional NNs
  - Recurrent NNs
  - Recursive NNs

◦ Language modeling
  - Probabilistic and discriminative LMs
  - NNLM
  - RNNLMs


4

Formal language

(i) Set of sequences over symbols from an alphabet

    sequences:  sentences, utterances, documents, ...
    symbols:    words, morphemes, MWEs, ...
    alphabet:   vocabulary

(ii) Rules for valid sequences

    spelling:   orthography, morphology
    grammar:    syntax
    meaning:    semantics, discourse, pragmatics, ...


4

Natural language

The same definition applies: sequences (sentences, utterances, documents) over symbols (words, morphemes, MWEs) from a vocabulary (the alphabet), with rules (orthography, syntax, semantics, discourse, pragmatics, ...) for valid sequences.

5

Lexical semantics

dog
    hypernymy:      mammal, pet
    hyponymy:       poodle, puppy
    synonymy:       canine
    meronymy:       pack
    holonymy:       paw
    opposition:     cat
    co-occurrence:  bark, leash
    slang:          dawg
    polysemy:       wretch, frankfurter (noun senses); shadow, aggravate (verb senses)
    name:           "Dog the Bounty Hunter"


6

Sparse representations

Word: one-hot (1-of-V) vectors
Document: "bag of words"

Emphasize rare words with inverse document frequency (IDF)

Compare documents with cosine similarity

[Figure: a |V|-dimensional one-hot vector, all zeros except a 1 at the index of "dog"; the vocabulary runs from "a" to "zygote"]

+ Simple and interpretable
− No notion of word order
− No implicit semantics
− Curse of dimensionality with large |V|
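As a concrete sketch (my own construction, not from the slides; the two-document corpus is made up), IDF-weighted bag-of-words vectors and their cosine similarity in plain Python:

```python
# Minimal sketch: bag-of-words with IDF weighting and cosine similarity.
import math
from collections import Counter

docs = [["the", "dog", "barks"], ["the", "cat", "meows"]]      # toy corpus
vocab = sorted({w for d in docs for w in d})

# IDF: emphasize words that appear in few documents
idf = {w: math.log(len(docs) / sum(w in d for d in docs)) for w in vocab}

def bow(doc):
    counts = Counter(doc)
    return [counts[w] * idf[w] for w in vocab]     # |V|-dim document vector

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

print(cosine(bow(docs[0]), bow(docs[1])))  # 0.0: only "the" is shared, and IDF zeroes it out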


7

Distributional approaches

Words that occur in similar contexts have similar meanings

e.g., record word co-occurrence within a context window over a large corpus

Weight associations with pointwise mutual information (PMI), etc.:

    PMI(w_1, w_2) = log_2 [ p(w_1, w_2) / (p(w_1) p(w_2)) ]

[Figure: a |V|-dimensional co-occurrence vector for "dog", with weights such as 0.3 for "bark", 0.1 for "leash", and 0.2 for "wag"]
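A hedged sketch of this pipeline (the corpus and window size are toy assumptions, not from the slides):

```python
# PMI from co-occurrence counts in a fixed context window.
import math
from collections import Counter

corpus = "the dog wags and the dog barks at the cat".split()
window = 2

unigrams, pairs = Counter(corpus), Counter()
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            pairs[(w, corpus[j])] += 1

N, Np = sum(unigrams.values()), sum(pairs.values())

def pmi(w1, w2):
    # PMI(w1, w2) = log2 [ p(w1, w2) / (p(w1) p(w2)) ]
    p12 = pairs[(w1, w2)] / Np
    return math.log2(p12 / ((unigrams[w1] / N) * (unigrams[w2] / N))) if p12 else float("-inf")

print(pmi("dog", "barks"))   # positive: "dog" and "barks" co-occur
print(pmi("the", "cat"))     # near zero: frequent "the" co-occurs with everything
```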


8

Latent Semantic Analysis Deerwester et al. (1990)

Construct a term-document matrix

    M ∈ R^{|V| × |D|},  with entries w_j^{(i)} weighting word j in document i

Apply singular value decomposition

    M ≈ [u_1 u_2 u_3 ⋯] · diag(λ_1, λ_2, λ_3, ⋯) · [v_1 v_2 v_3 ⋯]^T

Select top k singular vectors for k-dim embeddings of words/docs
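A minimal numpy sketch of this recipe (the count matrix values are made up):

```python
# LSA: factorize a toy term-document matrix and keep the top-k
# singular vectors as embeddings.
import numpy as np

# rows = terms, columns = documents (a |V| x |D| count matrix)
M = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
word_emb = U[:, :k] * s[:k]      # k-dim term embeddings
doc_emb = Vt[:k].T * s[:k]       # k-dim document embeddings
print(word_emb.shape, doc_emb.shape)   # (4, 2) (4, 2)
```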


9

word2vec: Continuous Bag-of-Words (CBOW)                 Mikolov et al. (2013)

- Predict target w_t given context w_{t−c}, ..., w_{t−1}, w_{t+1}, ..., w_{t+c}

Loss:
    ℓ(W, U) = −log p(w_t | w_{t−c} ⋯ w_{t+c})

Projection (averaged over the context):
    h_t = (1 / 2c) Σ_{i=−c, i≠0}^{c} W^T w_{t+i}

Softmax output:
    p(w_j | w_{t−c} ⋯ w_{t+c}) = exp(U_j h_t) / Σ_k exp(U_k h_t)

[Figure: inputs w_{t−c} ⋯ w_{t+c} projected through a shared W, averaged, then scored against the label w_t through U and a softmax loss]
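A minimal numpy sketch of one CBOW forward pass in this notation (sizes, word ids, and random parameters are toy):

```python
# One CBOW forward pass and its loss: W embeds context words, U scores the target.
import numpy as np

V, d, c = 10, 4, 2                     # vocab size, embedding dim, half-window
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))            # input embeddings (one row per word)
U = rng.normal(size=(V, d))            # output embeddings

context = [1, 2, 4, 5]                 # ids of w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}
target = 3                             # id of w_t

h = W[context].mean(axis=0)            # h_t = (1/2c) * sum of context embeddings
scores = U @ h                         # U_j . h_t for every word j
p = np.exp(scores - scores.max())
p /= p.sum()                           # softmax over the vocabulary
loss = -np.log(p[target])              # -log p(w_t | context)
print(loss)
```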


10

word2vec: Skip-gram                                      Mikolov et al. (2013)

- Predict context w_{t−c}, ..., w_{t−1}, w_{t+1}, ..., w_{t+c} given target w_t

Loss:
    ℓ(W, U) = −Σ_{i=−c, i≠0}^{c} log p(w_{t+i} | w_t)

Projection:
    h_t = W^T w_t

Softmax output:
    p(w_j | w_t) = exp(U_j h_t) / Σ_k exp(U_k h_t)

[Figure: the single input w_t projected through W, then scored against each context word w_{t−c} ⋯ w_{t+c} through U and a softmax loss]


11

word2vec Mikolov et al. (2013)

Cost of computing ∇p(w_j | ⋯) is proportional to V!

Alternative 1: Hierarchical softmax
- Predict a path in a binary tree representation of the output layer
- Reduces to log_2(V) binary decisions

    p(w_t = "dog" | ⋯) = (1 − σ(U_0 h_t)) × σ(U_1 h_t) × σ(U_4 h_t)

[Figure: balanced binary tree with internal nodes 0–6 and leaves cow, duck, cat, dog, she, he, and, the; "dog" is reached from root node 0 via nodes 1 and 4]
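A sketch of the path product above (the tree layout, root-to-leaf path, and random vectors are toy assumptions matching the figure):

```python
# Hierarchical softmax: each internal node n has a vector U_n, and a word's
# probability is a product of binary decisions along its root-to-leaf path.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4
rng = np.random.default_rng(0)
U = rng.normal(size=(7, d))            # one vector per internal node 0..6
h = rng.normal(size=d)                 # projection vector h_t

# path to "dog", matching the slide: (1 - sigma(U0 h)) * sigma(U1 h) * sigma(U4 h)
path = [(0, 0), (1, 1), (4, 1)]        # (internal node, branch bit)
p = 1.0
for node, bit in path:
    s = sigmoid(U[node] @ h)
    p *= s if bit else (1.0 - s)       # sigma for one branch, 1-sigma for the other
print(p)   # log2(V) = 3 binary decisions instead of a V-way softmax
```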


12

word2vec Mikolov et al. (2013)

Cost of computing ∇p(w_j | ⋯) is proportional to V!

Alternative 2: Negative sampling
- Change the objective to differentiate the target vector from noisy samples with logistic regression:

    max  log σ(u_j^T h_t) + Σ_{k=1}^{K} E_{w_m ∼ Ψ} [ log σ(−u_m^T h_t) ]

  where u_j = U_j, the j'th column of U, and w_j ∈ context(w_t)

- Noise distribution Ψ is typically unigram, uniform, or in between
- Number of samples K is typically 5–20
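A sketch of a single skip-gram-with-negative-sampling update derived from this objective (plain SGD, uniform noise, and the toy ids and sizes are my own assumptions; real implementations use a smoothed unigram Ψ and resample collisions):

```python
# One SGNS step: push the true (target, context) pair together and
# K sampled noise pairs apart.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d, K, lr = 10, 4, 5, 0.1
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))              # target-word embeddings
U = rng.normal(size=(V, d))              # context-word embeddings

t, j = 3, 7                              # target word w_t, true context word w_j
h = W[t]                                 # h_t = W^T w_t
neg = rng.integers(0, V, size=K)         # w_m ~ Psi (uniform here)

# loss = -log sigma(u_j . h) - sum_m log sigma(-u_m . h); gradients by hand:
grad_h = (sigmoid(U[j] @ h) - 1.0) * U[j]
U[j] -= lr * (sigmoid(U[j] @ h) - 1.0) * h       # pull true pair together
for m in neg:
    grad_h += sigmoid(U[m] @ h) * U[m]
    U[m] -= lr * sigmoid(U[m] @ h) * h           # push noise pairs apart
W[t] -= lr * grad_h
```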

13

word2vec                                                 Mikolov et al. (2013)

[Figures (not preserved in extraction): linear relationships between related words; visualizations of lexical relationships; analogical reasoning; phrase analogies; additive compositionality]


14

word2vec Levy & Goldberg (2014)

Skip-gram with negative sampling increases u_j^T h_t for real word-context pairs ⟨w_t, w_j⟩ and decreases it for noise pairs.

Given:
· a matrix of d-dim word vectors W (|V_w| × d)
· a matrix of d-dim context vectors U (|V_u| × d)

Skip-gram is implicitly factorizing the matrix M = W U^T

What is M?
- A word-context matrix where each cell (i, j) contains PMI(w_i, w_j)
- If the number of negative samples K > 1, each cell is shifted by a constant: PMI(w_i, w_j) − log K
- (Assuming large enough d and enough iterations)
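To make the factorization view concrete, a sketch (toy counts, my own construction) that builds the shifted PMI matrix from counts and factorizes it directly with SVD, using the positive/truncated variant commonly paired with SVD in this line of work:

```python
# Build a shifted positive PMI word-context matrix and factorize it,
# instead of training SGNS; K emulates the number of negative samples.
import numpy as np

counts = np.array([[10, 2, 0],     # toy word-context co-occurrence counts
                   [3, 8, 1],
                   [0, 1, 6]], dtype=float)
K = 5

total = counts.sum()
pw = counts.sum(axis=1, keepdims=True) / total     # p(w)
pc = counts.sum(axis=0, keepdims=True) / total     # p(c)
with np.errstate(divide="ignore"):
    pmi = np.log((counts / total) / (pw * pc))     # log 0 -> -inf for unseen pairs
spmi = np.maximum(pmi - np.log(K), 0)              # shifted positive PMI

U, s, Vt = np.linalg.svd(spmi)
word_vecs = U[:, :2] * np.sqrt(s[:2])              # a common symmetric split
```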


15

Beyond words

Can we add word vectors to make sentence/paragraph/doc vectors?

    doc A = a_1 + a_2 + a_3
    doc B = b_1 + b_2 + b_3

    cos(A, B) = A · B / (‖A‖ ‖B‖)

              = (1 / ‖A‖ ‖B‖) (a_1·b_1 + a_1·b_2 + a_1·b_3 +
                                a_2·b_1 + a_2·b_2 + a_2·b_3 +
                                a_3·b_1 + a_3·b_2 + a_3·b_3)

              = a weighted all-pairs similarity over A and B
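A quick numeric check of this identity (random toy vectors):

```python
# The cosine of summed word vectors equals the normalized sum of all
# pairwise dot products between the two documents' word vectors.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 4))            # word vectors a1, a2, a3 for doc A
b = rng.normal(size=(3, 4))            # word vectors b1, b2, b3 for doc B

A, B = a.sum(axis=0), b.sum(axis=0)
lhs = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))
rhs = sum(ai @ bj for ai in a for bj in b) / (np.linalg.norm(A) * np.linalg.norm(B))
print(np.isclose(lhs, rhs))            # True
```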


16

Paragraph vector (a.k.a. doc2vec): Distributed Memory      Le & Mikolov (2014)

- Predict target w_t given context w_{t−c}, ..., w_{t−1}, w_{t+1}, ..., w_{t+c} and doc label d_k
- At test time, hold U and W fixed and back-prop into the expanded D

Loss:
    ℓ(D, W, U) = −log p(w_t | w_{t−c} ⋯ w_{t+c}, d_k)

[Figure: as in CBOW, but the projection concatenates the document vector D d_k with the context projections W w_{t−c} ⋯ W w_{t+c} before the softmax over the label w_t]


17

Paragraph vector (a.k.a. doc2vec): Distributed Bag-of-Words (DBOW)   Le & Mikolov (2014)

- Predict target n-grams w_t, ..., w_{t+c} given doc label d_k
- At test time, hold U fixed and back-prop into the expanded D

Loss:
    ℓ(D, U) = −Σ_{i=0}^{c} log p(w_{t+i} | d_k)

[Figure: the document vector D d_k is projected directly to a softmax over each predicted word w_t ⋯ w_{t+c}]
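For practical use, a usage sketch with gensim's Doc2Vec implementation (assumes gensim 4.x is installed; the corpus and hyperparameters are illustrative, not recommendations):

```python
# dm=1 gives distributed memory, dm=0 gives DBOW; infer_vector() performs
# the test-time back-prop into an expanded D for an unseen document.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["the", "dog", "barks"], tags=["doc0"]),
          TaggedDocument(words=["the", "cat", "meows"], tags=["doc1"])]

model = Doc2Vec(corpus, vector_size=16, window=2, min_count=1, epochs=40, dm=1)
vec = model.infer_vector(["a", "dog", "barked"])   # embed an unseen document
print(model.dv["doc0"].shape, vec.shape)
```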

18

Semantics are elusive

Visitors saw her duck with binoculars .

Two part-of-speech analyses:

    Visitors saw  her   duck with binoculars .
    NNS      VBD  PRP$  NN   IN   NNS        .    ("duck" is a noun: the dobj of "saw")

    Visitors saw  her  duck with binoculars .
    NNS      VBD  PRP  VB   IN   NNS        .    ("duck" is a verb: it heads a ccomp of "saw")

Did she duck or does she have a duck?

Who has the binoculars?

How many pairs of binoculars are there?


19

Task-specific representations

Many NLP tasks reduce to classification or sequence tagging.

Classification
- Given variable-length text w_1 ⋯ w_n (sentence, document, etc.), find a label y
- Standard discriminative approach:
  · Extract features over the input text
  · Train a linear classifier

Tagging
- Given variable-length text w_1 ⋯ w_n, label spans z_1, ..., z_m
- Standard discriminative approach (see the BIO sketch below):
  · Distribute labels over input words (e.g., BIO, BILOU encodings)
  · Extract features over input words
  · Train a linear-chain conditional random field
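A small sketch of the BIO encoding step (the function name and span format are my own, hypothetical choices):

```python
# Distribute span labels over tokens so a tagger predicts one label per word.
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, label) with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label            # B- marks the span's first token
        for i in range(start + 1, end):
            tags[i] = "I-" + label            # I- marks continuation tokens
    return tags

tokens = ["Kapil", "Thadani", "works", "in", "New", "York"]
print(spans_to_bio(tokens, [(0, 2, "PER"), (4, 6, "LOC")]))
# ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'I-LOC']
```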


20

Convolutional NNs: SENNA                            Collobert et al. (2011)

- Part-of-speech tagging
- Chunking
- Named entity recognition
- Semantic role labeling


21

Convolutional NNs Kim (2014)

- Max-over-time pooling
- Two input embedding "channels": one held static, one updated during training
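A numpy sketch of the max-over-time pooling step (toy sizes and random filters; a real model learns the filters and stacks a classifier on the pooled vector):

```python
# Each filter slides over the word sequence and keeps only its max response,
# giving a fixed-size vector for a variable-length sentence.
import numpy as np

n, d, width, filters = 7, 5, 3, 4       # sentence length, emb dim, filter width/count
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))             # sentence as a matrix of word embeddings
F = rng.normal(size=(filters, width, d))

conv = np.array([[np.sum(F[f] * X[i:i + width]) for i in range(n - width + 1)]
                 for f in range(filters)])       # (filters, n-width+1) feature maps
pooled = conv.max(axis=1)               # max over time: one value per filter
print(pooled.shape)                     # (4,) regardless of sentence length
```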


22

Convolutional NNs: R-CNNs                                  Lei et al. (2015)

- Capture interactions between non-consecutive n-grams, e.g., not bad, not so bad, not nearly as bad

- Each n-gram context vector includes a weighted average over prior n-gram states:

    c_{ij} = U^T (W_1 x_i ⊙ W_2 x_j)

    c_t = Σ_{i<t} λ_t c_{it}

    h_t = tanh(c_t + b)

- Exponential computations avoided by dynamic programming

- Averaging weight learned as a recurrent gate:

    λ_t = σ(W_λ x_t + U_λ h_{t−1} + b)
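A naive numpy reading of these equations (quadratic in sentence length, whereas the paper computes the same quantities with dynamic programming; treating λ_t as a scalar gate and the dimensions chosen are assumptions for illustration):

```python
# Bigram-level recurrence: gated sum of interactions between x_t and all
# earlier positions i < t.
import numpy as np

d, h, n = 4, 3, 6
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                          # input word vectors x_1..x_n
W1, W2 = rng.normal(size=(h, d)), rng.normal(size=(h, d))
U = rng.normal(size=(h, h))
Wl, Ul, b = rng.normal(size=d), rng.normal(size=h), 0.1

states, prev_h = [], np.zeros(h)
for t in range(n):
    lam = 1 / (1 + np.exp(-(Wl @ X[t] + Ul @ prev_h + b)))   # recurrent gate lambda_t
    c = np.zeros(h)
    for i in range(t):                               # naive O(t) inner loop
        c += lam * (U.T @ ((W1 @ X[i]) * (W2 @ X[t])))
    prev_h = np.tanh(c + b)                          # h_t = tanh(c_t + b)
    states.append(prev_h)
```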

23

Recurrent NNs

[Figures: three variants of the same network over inputs x_1 ⋯ x_4, with input weights W and recurrent weights U: (1) classification, reading a single label y off the final state; (2) tagging, emitting per-step outputs z_1 ⋯ z_4 through an output matrix O; (3) bidirectional, adding a right-to-left pass with parameters W′, U′, O′]

- Flexible models for classification and/or tagging
- Typically used with LSTM units or GRUs
- Bidirectional RNNs better for long inputs
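A minimal numpy sketch of the vanilla recurrence in these figures (a tanh cell standing in for the LSTM/GRU units mentioned above; sizes and random weights are toy):

```python
# Unidirectional RNN used for tagging (per-step outputs) or
# classification (read the final state).
import numpy as np

d, h, k, n = 4, 3, 2, 5                     # input, state, label dims; length
rng = np.random.default_rng(0)
W = rng.normal(size=(h, d))                 # input weights
U = rng.normal(size=(h, h))                 # recurrent weights
O = rng.normal(size=(k, h))                 # output weights

X = rng.normal(size=(n, d))                 # inputs x_1..x_n
state = np.zeros(h)
tag_scores = []
for t in range(n):
    state = np.tanh(W @ X[t] + U @ state)   # recurrent state update
    tag_scores.append(O @ state)            # per-token scores z_t (tagging)
y = O @ state                               # classification: final state only
```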

24

Recursive NNs (a.k.a. Tree-RNNs)       Socher et al. (2013); Tai et al. (2015)

[Figure: tree-structured network over word inputs x_1 ⋯ x_5, producing a state z_i at each node]

- Computation graph follows a dependency or constituent parse
- Child-sum:
  · Good for arbitrary fan-out or unordered children
  · Suited to dependency trees (input x_i is the head word)
- N-ary:
  · Fixed number of children, each parameterized separately
  · Suited to binarized constituency parses (leaves take word inputs x_i)
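A sketch of the child-sum recursion with plain tanh cells (the cited papers use Tree-LSTM or tensor compositions; the tiny dependency tree and random vectors are made up):

```python
# Child-sum tree recursion: each node sums its children's states and
# combines the sum with its own head-word input.
import numpy as np

d = 4
rng = np.random.default_rng(0)
W, U = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def node_state(x, child_states):
    h_sum = np.sum(child_states, axis=0) if child_states else np.zeros(d)
    return np.tanh(W @ x + U @ h_sum)

# toy dependency tree: "saw" heads "visitors" and "duck"
x = {w: rng.normal(size=d) for w in ["visitors", "saw", "duck"]}
z_visitors = node_state(x["visitors"], [])
z_duck = node_state(x["duck"], [])
z_root = node_state(x["saw"], [z_visitors, z_duck])   # state for the root
```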


25

Text generation

Probabilistic language modeling
- A distribution over sequences of words p(w_1, ..., w_T) in a language
- Typically made tractable via conditional independence assumptions:

    p(w_1, ..., w_T) = Π_{t=1}^{T} p(w_t | w_{t−1}, ..., w_{t−n})

- n-gram counts estimated from large corpora
- Distributions smoothed to tolerate data sparsity, e.g., Laplace (add-one) smoothing or Kneser-Ney smoothing
- Evaluated by perplexity on held-out data:

    2^{−(1/N) Σ_{i=1}^{N} log_2 p(w_1^{(i)} ⋯ w_{T_i}^{(i)})}

Discriminative language modeling
- Estimate n-gram probabilities with a discriminative model:

    p(w_t | w_{t−1}, ..., w_1) ≈ f(w_1, ..., w_t)
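A sketch of a bigram model with add-one smoothing and the perplexity computation above (the corpus and held-out sentence are toy; real LMs prefer Kneser-Ney):

```python
# Bigram LM with Laplace smoothing, evaluated by perplexity on held-out text.
import math
from collections import Counter

train = ["the dog barks", "the cat meows", "the dog runs"]
tokens = [w for s in train for w in ("<s> " + s).split()]
unigrams, bigrams = Counter(tokens), Counter(zip(tokens, tokens[1:]))
V = len(set(tokens))

def p(w, prev):
    # add-one smoothed p(w | prev): unseen bigrams still get some mass
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

held_out = "<s> the dog meows".split()
log2p = sum(math.log2(p(w, prev)) for prev, w in zip(held_out, held_out[1:]))
ppl = 2 ** (-log2p / (len(held_out) - 1))   # per-word perplexity
print(ppl)
```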


26

NNLM Bengio et al (2003)

Model p(w_t | w_{t−1}, ..., w_{t−n}) with a feed-forward neural net
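A numpy sketch in the spirit of that architecture (concatenated context embeddings, one tanh hidden layer, softmax over the vocabulary; the names C, H, Uo and all sizes are my own toy choices):

```python
# Feed-forward NNLM: embed the previous n words, concatenate, and
# predict the next word through a tanh hidden layer and softmax.
import numpy as np

V, d, n, h = 10, 4, 3, 8                 # vocab, emb dim, context length, hidden
rng = np.random.default_rng(0)
C = rng.normal(size=(V, d))              # shared embedding table
H = rng.normal(size=(h, n * d))          # hidden layer weights
Uo = rng.normal(size=(V, h))             # output layer weights

context = [2, 5, 7]                      # ids of w_{t-3}, w_{t-2}, w_{t-1}
x = C[context].reshape(-1)               # concatenated context embeddings
a = np.tanh(H @ x)
scores = Uo @ a
p = np.exp(scores - scores.max())
p /= p.sum()                             # p(w_t | context) over the vocabulary
```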

27

RNNLM                                                   Mikolov et al. (2010)

Model p(w_t | w_{t−1}, ..., w_{t−n}) with an RNN, or with an ensemble of multiple randomly initialized RNNs.

[Figures: results on the Penn Treebank corpus; comparison of a single RNN vs. RNN ensembles]

28

RNNLMs                                               Jozefowicz et al. (2016)

Recent models with character-CNN inputs and softmax alternatives.

[Figures: benchmark results; nearest neighbors in the character-CNN embedding space]