Deep Learning for Sentence Representation
Internship Project Summary
Yonatan Belinkov, IBM Research - Haifa, Summer 2015
Goals
• Develop deep learning methods for representing natural language sentences from text
• Acquire knowledge in deep learning tools and techniques
Background
• Vector representations (embeddings) for words and sentences
• Supervised vs unsupervised approaches
• Neural network architectures: Recursive (RecNN), Convolutional (CNN), Recurrent (RNN)
RecNN: [Figure: parse-tree composition over "The cat sat on the mat" with NP, PP, VP, S nodes]
CNN: [Figure: convolution over the word sequence "The cat sat on the mat"]
RNN: [Figure: recurrent left-to-right processing of "The cat sat on the mat"]
Autoencoder Formulation
• Given a sentence that is a sequence of word vectors w1...wn, each of dimension d:
  § Encode the sentence into a single vector representation
  § Decode the representation back into the sentence
• During training
  § Get feedback from the original sentence and propagate it through the network to learn parameters
• During testing
  § Compare the decoded sentence to the original one
Basic RNN Model
• LSTM encoder-decoder (from Li, 2015)
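As a rough illustration, a minimal Torch sketch of such an encoder-decoder, assuming the Element-Research rnn package; the sizes, variable names, and teacher-forcing setup are illustrative, not the project's exact configuration.

require 'nn'
require 'rnn'  -- Element-Research rnn package (assumed available)

local vocabSize, dim = 20000, 1000  -- 1000-dim vectors, as in the slides

-- Encoder: embed words, run an LSTM, keep the final hidden state.
-- input: seqLen tensor of word indices -> output: dim-sized sentence vector
local encoder = nn.Sequential()
  :add(nn.LookupTable(vocabSize, dim))
  :add(nn.SplitTable(1))                 -- seqLen x dim -> table of vectors
  :add(nn.Sequencer(nn.LSTM(dim, dim)))
  :add(nn.SelectTable(-1))               -- sentence vector = last LSTM output

-- Decoder: an LSTM that re-predicts the words, one softmax per step;
-- at training time it is fed the sentence vector and then the embedded
-- gold previous words (teacher forcing)
local decoder = nn.Sequential()
  :add(nn.Sequencer(nn.LSTM(dim, dim)))
  :add(nn.Sequencer(nn.Sequential()
    :add(nn.Linear(dim, vocabSize))
    :add(nn.LogSoftMax())))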
CNN Encoder
• Sentence = n × d matrix of word vectors (rows are words, columns are embedding dimensions); convolution can run along the time dimension (across words) or the word dimension (within each word):

  w11 w12 … w1d
  w21 w22 … w2d
  …   …   …  …
  wn1 wn2 … wnd

• Time dimension (coarse-grained): each filter spans all embedding dimensions (Kim 2014)
  Torch: nn.TemporalConvolution; #params = embeddingDim * numFilters * filterWidth
• Time dimension (fine-grained): convolve each embedding dimension independently (Kalchbrenner 2014)
  Torch: nn.SpatialConvolution; #params = 1 * numFilters * filterWidth
• Word dimension (fine-grained): convolve each word independently (???)
  Torch: nn.SpatialConvolution; #params = 1 * numFilters * filterWidth
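A minimal sketch of the three variants above in Torch, assuming the input is the n × d matrix of word vectors (reshaped to 1 × n × d for the spatial cases); the filter count and width are illustrative.

require 'nn'

local d, nFilters, width = 1000, 100, 3  -- illustrative sizes

-- Time, coarse-grained (Kim 2014): each filter sees all d embedding dims.
-- input: n x d, output: (n - width + 1) x nFilters
local coarse = nn.TemporalConvolution(d, nFilters, width)

-- Time, fine-grained (Kalchbrenner 2014): 1-dimensional filters slid down
-- each embedding dimension independently.
-- input: 1 x n x d, output: nFilters x (n - width + 1) x d
local fineTime = nn.SpatialConvolution(1, nFilters, 1, width)

-- Word dimension, fine-grained: filters slid across the dims of each word.
-- input: 1 x n x d, output: nFilters x n x (d - width + 1)
local fineWord = nn.SpatialConvolution(1, nFilters, width, 1)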
Loss Functions
• Log-likelihood of predicted words from the decoder
  § Penalizes every wrong word
  § Word order matters
• Cosine distance between bag-of-words representations of gold and predicted sentences
  § Representations are the size of the vocabulary
  § Word order doesn't matter
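A rough Torch sketch of the two losses (assumed, not the project's exact code): per-word negative log-likelihood over the decoder's log-softmax outputs, and a cosine criterion over vocabulary-sized bag-of-words vectors; the bagOfWords helper is hypothetical.

require 'nn'
require 'rnn'  -- for nn.SequencerCriterion (assumed available)

-- (1) Log-likelihood: penalize each predicted word at each time step
local wordLoss = nn.SequencerCriterion(nn.ClassNLLCriterion())

-- (2) Cosine loss between bag-of-words vectors; target y = 1 pulls the
-- gold and predicted bags together, ignoring word order
local bowLoss = nn.CosineEmbeddingCriterion()

-- hypothetical helper: word-index tensor -> |V|-sized count vector
local function bagOfWords(indices, vocabSize)
  local bow = torch.zeros(vocabSize)
  for i = 1, indices:size(1) do
    bow[indices[i]] = bow[indices[i]] + 1
  end
  return bow
end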
Implementation Details
• Torch
• Minimal preprocessing of sentences
• Optimization with AdaGrad
• Dropout
• 1000 dimensions for word and sentence vectors
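A minimal AdaGrad training step in Torch, assuming the optim package and a model/criterion built as above; the learning rate is illustrative. Any nn.Dropout layers in the model are active because the model is put in training mode.

require 'optim'

local params, gradParams = model:getParameters()
local adaConfig = {learningRate = 0.01}  -- illustrative setting

-- one AdaGrad step on a single (input, target) pair
local function trainStep(input, target)
  model:training()  -- enables nn.Dropout layers
  local function feval(p)
    if p ~= params then params:copy(p) end
    gradParams:zero()
    local output = model:forward(input)
    local loss = criterion:forward(output, target)
    model:backward(input, criterion:backward(output, target))
    return loss, gradParams
  end
  local _, fx = optim.adagrad(feval, params, adaConfig)
  return fx[1]  -- loss value
end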
Data
• Hotel reviews
  § “we were the only people on our floor who spoke english”
  § “first rate ! the rooms look like they have been recently renovated .”
  § “recently stayed at the colonnade .”
• Dataset sizes (# sentences):

  Train set   Validation set   Test set
  10K-1M      100              100
Quantitative Evaluation
Machine translation metrics: how well the decoded sentence “translates” the original sentence

Encoder                      Train size   BLEU   Meteor   Val error   Train error
LSTM                         100K         39.3   32.9     27.3         9.3
LSTM (drop 0.1)              1M           55.2   42.6     12.5        10.4
LSTM (drop 0.3)              1M           63.9   45.1     14.7         7.0
CNN (word)                   100K          0.8    5.3     62.9        55.2
CNN (time, fine-grained)     100K          0.6    6.2     53.8        49.1
CNN (time, coarse-grained)   100K         16.6   20.7     39.0        26.0
More Observations
• Bag-of-words based loss did not help
• Preliminary results on Wikipedia are much lower
• Possible explanations
  § Open domain, larger vocabulary, longer sentences

Model       Train size     BLEU   Meteor
LSTM-LSTM   1M sentences   18.5   21.8
Qualitative Evaluation
• Run the trained model on unseen sentences
• Compare original and decoded sentences

1 Gold:      we were the only people on our floor who spoke english
  Predicted: we were only the people who on our floor group seemed on top ,
2 Gold:      which was nice . the place needs updated ,
  Predicted: which was nice . the place needs updating ,
3 Gold:      but it's not horrible .
  Predicted: but it's not horrible .
4 Gold:      recently stayed at the colonnade .
  Predicted: recently stayed at the conrad .
5 Gold:      i must say i was extremely impressed with the staff and overall appearance of the hotel .
  Predicted: i must say i was extremely impressed with the cleanliness and helpfulness of the staff overall .
6 Gold:      i would definitely stay here again and would recommend this hotel to family and friends .
  Predicted: i would definitely stay here again and would recommend this hotel to friends and family
Qualitative Evaluation
• Run the trained model on train sentences
• Create vector representations for the train sentences
• Cluster the vectors with k-means (sketched below)
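A minimal clustering sketch, assuming the Torch unsup package and an N × 1000 tensor sentVecs of encoder outputs (both assumptions); k and the iteration count are illustrative.

require 'unsup'  -- provides unsup.kmeans (assumed available)

local k = 50  -- illustrative number of clusters
local centroids = unsup.kmeans(sentVecs, k, 100)  -- 100 iterations

-- assign each sentence to its nearest centroid
local assignments = torch.LongTensor(sentVecs:size(1))
for i = 1, sentVecs:size(1) do
  local diff = centroids - sentVecs[i]:view(1, -1):expandAs(centroids)
  local _, idx = diff:norm(2, 2):min(1)
  assignments[i] = idx[1][1]
end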
LSTM Encoder Clusters
'i would definitely stay here again . i love it !'
'but i would stay here again . '
'i think i would stay here again '
'i would ( and will ) stay here again . '
'i would 100% stay here again . '
'i would come here again . '
'i would consider staying here again . '
'would i stay here again ? '

'our staff was friendly and very fast to help us . '
'but the staff was very friendly and accommodating . '
'staff in the reception was very friendly . '
'the check in staff was very friendly and helpful . '
'the construction was complete . the staff is very friendly and helpful '
'the hotel staff was very friendly and open to helping make dinner '
'the internet was free and the staff was very friendly . '

'and the hotel is in an excellent location . '
'hotel is in a great location - nothing wrong with the neighbourhood . '
'the back bay hotel is in a great location '
'the edison hotel is in a perfect location '
'the hotel circle is in a good location i think '
'the hotel is a fine hotel in a great area '
'the hotel is huge and in a good downtown location . '

CNN Encoder Clusters
'great location !'
'location !'
'location location location !'
'cute and great location !'
'great stay and fabulous location !'
'staff and location !'
'wonderful location !'

'fresh fruit pastries etc .'
'coffee shops etc . '
'outback steak house etc . '
'dinner walk around etc . '
'french toast pancakes fresh fruit etc .'
'bread toast etc .'
'a whole foods grocery etc . '

'the room was very clean '
'the room was very big '
'the room was very spacious by new york hotel standards '
'the room i recieved was very spacious '
'the decor of the room was very nice and modern '
'cons : room was very small '
'the hotel room was very nice '
'i liked the location and the room was very nice '
'the king room at the back of the hotel was very quiet '
'the room as very modern '
Observations
• Clusters tend to differ by topic (hotel, location, staff)
• Certain bias towards the beginning of the sentence, especially in the pure LSTM model
• Sometimes fails to capture negation
• LSTM prefers full sentences; CNN also forms clusters of words and sentences
Qualitative Evaluation
[Figure: distances in the 10 most dense clusters]
Future Work
• General domain model from Wikipedia
• Improvements to LSTM implementations
  § Attention mechanism?
• Supervised tasks (question similarity, answer selection)
  § Use autoencoder representation as fixed features
  § Add a supervised classification layer
• Better CNN models, also on fine-grained levels
  § Deal with locality of convolution
• Combine LSTM and CNN during encoding
• Decode with CNNs → variable-length sentences?