Page 1:

CSE 291G : Deep Learning for Sequences

Paper presentation

Topic : Named Entity Recognition

Rithesh

Page 2:

Outline

• Named Entity Recognition and its applications.

• Existing methods

• Character level feature extraction

• RNN : BLSTM-CNNs

Page 3:

Named Entity Recognition (NER)

Page 4:

Named Entity Recognition (NER)

WHAT ?

Page 5:

Named Entity Recognition (entity identification, entity chunking & entity extraction)

• Locate and classify named entity mentions in unstructured text into predefined categories : person names, organizations, locations, time expressions etc.

• Ex : Kim bought 500 shares of IBM in 2010.

Page 7:

Named Entity Recognition (entity identification, entity chunking & entity extraction)

• Locate and classify named entity mentions in unstructured text into predefined categories : person names, organizations, locations, time expressions etc.

• Ex : Kim bought 500 shares of IBM in 2010. (Kim : Person name, IBM : Organization, 2010 : Time)

Page 8:

Named Entity Recognition (entity identification, entity chunking & entity extraction)

• Locate and classify named entity mentions in unstructured text into predefined categories : person names, organizations, locations, time expressions etc.

• Ex : Kim bought 500 shares of IBM in 2010. (Kim : Person name, IBM : Organization, 2010 : Time)

• Importance of locating the full named entity span in a sentence : Ex : Kim bought 500 shares of Bank of America in 2010. Here "Bank of America" is a single Organization mention, not the Location "America".
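To make the labeling task concrete, here is a small illustrative sketch (not from the paper) of how such a sentence could be annotated token by token with BIO tags; the tag names below are assumptions, since tag inventories differ between corpora.

```python
# Illustrative only: in the BIO scheme, B- marks the first token of an entity,
# I- a continuation token, and O a token outside any entity.
tokens = ["Kim", "bought", "500", "shares", "of", "Bank", "of", "America", "in", "2010", "."]
tags   = ["B-PER", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-TIME", "O"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```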

Page 9:

Named Entity Recognition (NER)

WHAT ?

WHY ?

Page 10:

Applications of NER

• Content Recommendations

• Customer support

• Classifying content for news providers

• Efficient search algorithms

• QA

• Machine Translation Systems

• Automatic Summarization system

Page 11:

Named Entity Recognition (NER)

WHAT ?

WHY ?

HOW ?

Page 12:

Approaches :

• ML classification techniques (Ex : SVM, Perceptron model, CRF (Conditional Random Fields))

Drawback : Requires hand-crafted features

• Neural Network Model (Collobert – Natural Language Processing (almost) from Scratch)

Drawbacks : (i) Simple feedforward NN with a fixed window size; (ii) depends solely on word embeddings and fails to exploit character level features – prefix, suffix etc.

• RNN : LSTM – variable length input and long term memory

– First applied to NER by Hammerton in 2003

Page 13:

RNN : LSTM

• Overcome drawbacks of existing system

• Account for variable length input and long term memory

• Fails to handle cases in which the ith word of a sentence(S) depends on words at positions greater than ‘i’ in S. Ex : Teddy bears are on sale. Teddy Roosevelt was a great president.

Page 14:

RNN : LSTM

• Overcome drawbacks of existing system

• Account for variable length input and long term memory

• Fails to handle cases in which the ith word of a sentence(S) depends on words at positions greater than ‘i’ in S. Ex : Teddy bears are on sale. Teddy Roosevelt was a great president.

SOLUTION : Bi-directional LSTM (BLSTM) - Captures Information from the past and from the future.

Page 15:

RNN : LSTM

• Overcome drawbacks of existing system

• Account for variable length input and long term memory

• Fails to handle cases in which the ith word of a sentence(S) depends on words at positions greater than ‘i’ in S. Ex : Teddy bears are on sale. Teddy Roosevelt was a great president.

SOLUTION : Bi-directional LSTM (BLSTM) - Captures Information from the past and from the future.

Fails to exploit character level features
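As a concrete picture of such a word-level BLSTM tagger, here is a minimal sketch, assuming PyTorch (the paper's own implementation uses the torch7 library) and illustrative sizes rather than the paper's hyperparameters.

```python
import torch
import torch.nn as nn

# Word embeddings -> bidirectional LSTM -> a score for each tag at each token.
vocab_size, emb_dim, hidden_dim, num_tags = 10_000, 50, 100, 9

embedding = nn.Embedding(vocab_size, emb_dim)
blstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim,
                batch_first=True, bidirectional=True)
to_tag_scores = nn.Linear(2 * hidden_dim, num_tags)   # forward + backward states

word_ids = torch.randint(0, vocab_size, (1, 6))        # one sentence of 6 tokens
lstm_out, _ = blstm(embedding(word_ids))               # (1, 6, 2 * hidden_dim)
scores = to_tag_scores(lstm_out)                       # (1, 6, num_tags)
print(scores.shape)
```

Because the LSTM runs in both directions, the score at position i can depend on words after i, which is what resolves the "Teddy" examples above.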

Page 16:

Techniques to capture character level features

• Santos (2015) and Labeau (2015) proposed models for character level feature extraction using CNNs, for NER and POS respectively.

• Ling (2015) proposed a model for character level feature extraction using BLSTM for POS.

Page 17:

Techniques to capture character level features

• Santos (2015) and Labeau (2015) proposed models for character level feature extraction using CNNs, for NER and POS respectively.

• Ling (2015) proposed a model for character level feature extraction using BLSTM for POS.

• CNN or BLSTM?

Page 18:

Techniques to capture character level features

• Santos (2015) and Labeau (2015) proposed models for character level feature extraction using CNNs, for NER and POS respectively.

• Ling (2015) proposed a model for character level feature extraction using BLSTM for POS.

• CNN or BLSTM?

– BLSTM did not perform significantly better than CNN, and BLSTM is computationally more expensive to train.

Page 19:

Techniques to capture character level features

• Santos (2015) and Labeau (2015) proposed models for character level feature extraction using CNNs, for NER and POS respectively.

• Ling (2015) proposed a model for character level feature extraction using BLSTM for POS.

• CNN or BLSTM?

– BLSTM did not perform significantly better than CNN, and BLSTM is computationally more expensive to train.

BLSTM : Word level feature extraction
CNN : Character level feature extraction

Page 20:

Named Entity Recognition with Bidirectional LSTM-CNNs

Jason P. C. Chiu, Eric Nichols (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357-370.

• Inspired by :

– Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011b. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493-2537.

– Cicero Santos, Victor Guimaraes. 2015. Boosting named entity recognition with neural character embeddings. Proceedings of the Fifth Named Entities Workshop, pages 25-33.

Page 21:

Reference paper : Boosting NER with Neural Character Embeddings

• CharWNN deep neural network – uses word and character level representations (embeddings) to perform sequential classification.

• Corpora : HAREM I (Portuguese) and SPA CoNLL-2002 (Spanish)

• CharWNN extends Collobert et al.'s (2011) neural network architecture for sequential classification by adding a convolutional layer to extract character-level representations.

Page 22:

CharWNN

• Input : Sentence

• Output : For each word in the sentence, a score for each class

Page 23:

CharWNN

• Input : Sentence

• Output : For each word in the sentence, a score for each class

S : <w1, w2, .. wN>

Page 24:

CharWNN

• Input : Sentence

• Output : For each word in the sentence, a score for each class

S : <w1, w2, .. wN>

Each word wn is mapped to a vector un = [r^wrd ; r^wch], the concatenation of its word-level embedding r^wrd and its character-level embedding r^wch.

Page 26:

CNN for character embedding

Page 27:

CNN for character embedding

W : <c1, c2, ..cM>

Page 29:

CNN for character embedding

W : <c1, c2, ..cM>

Matrix vector operation with window size k

Page 31:

CNN for character embedding

W : <c1, c2, ..cM>

Matrix vector operation with window size k

→ r^wch (the character level embedding of the word, obtained by taking the maximum over the convolution outputs)
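A minimal sketch of this character-level convolution, assuming PyTorch and illustrative sizes (the real dimensions and window size are hyperparameters in the paper): character embeddings are convolved with a window of size k and reduced with a max over all character positions, giving a fixed-size vector r^wch.

```python
import torch
import torch.nn as nn

num_chars, char_emb_dim, conv_units, window_k = 100, 10, 50, 3

char_embedding = nn.Embedding(num_chars, char_emb_dim)
conv = nn.Conv1d(char_emb_dim, conv_units, kernel_size=window_k, padding=window_k // 2)

def char_feature(char_ids: torch.Tensor) -> torch.Tensor:
    """char_ids: (1, M) character indices of one word -> r^wch of size conv_units."""
    emb = char_embedding(char_ids).transpose(1, 2)   # (1, char_emb_dim, M)
    conv_out = conv(emb)                             # (1, conv_units, M)
    r_wch, _ = conv_out.max(dim=2)                   # max over character positions
    return r_wch.squeeze(0)

word = torch.randint(0, num_chars, (1, 7))           # a 7-character word
print(char_feature(word).shape)                      # torch.Size([50])
```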

Page 32:

CharWNN

• Input : Sentence

• Output : For each word in the sentence, a score for each class

S : <w1, w2, .. wN>

Each word wn is mapped to un = [r^wrd ; r^wch]; the character-level part r^wch comes from the CNN, and the resulting sequence <u1, u2, .. uN> is passed to the following layers.
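A tiny sketch of the concatenation step, with assumed dimensions (the real sizes are hyperparameters):

```python
import torch

r_wrd = torch.randn(50)   # word-level embedding of w_n (from a lookup table)
r_wch = torch.randn(50)   # character-level embedding of w_n (from the char CNN)

u_n = torch.cat([r_wrd, r_wch])   # joint representation u_n = [r^wrd ; r^wch]
print(u_n.shape)                  # torch.Size([100])
```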

Page 33:

CharWNN

• Input to convolution layer : <u1, u2, .. uN>

Page 34:

CharWNN

• Input to convolution layer : <u1, u2, .. uN>

Two Neural Network layers

Page 35:

CharWNN

• Input to convolution layer : <u1, u2, .. uN>

• For a transition score matrix A : the score of a sentence for a tag path [t]1..N is the sum, over all words n, of the transition score A[t(n-1), t(n)] and the per-class score for word n produced by two neural network layers.

Page 36:

Network Training for CharWNN

• CharWNN is trained by minimizing the negative log-likelihood over the training set D.

• The sentence score is interpreted as a conditional probability over a tag path (the score is exponentiated and normalized with respect to all possible paths).

• Stochastic gradient descent (SGD) is used to minimize the negative log-likelihood with respect to the model parameters θ.
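The sketch below illustrates the kind of computation involved, not the paper's code: a simple dynamic program (forward recursion) gives the log-normalizer over all tag paths, and the negative log-likelihood of the gold path is what SGD minimizes. All tensors and sizes are made up for illustration.

```python
import torch

num_tags, N = 5, 4
A = torch.randn(num_tags, num_tags)      # A[i, j]: transition score from tag i to tag j
start = torch.randn(num_tags)            # score of starting the sentence with each tag
emit = torch.randn(N, num_tags)          # per-word tag scores produced by the network

def gold_score(tags):
    """Score of one tag path: start + transition scores + per-word scores."""
    score = start[tags[0]] + emit[0, tags[0]]
    for n in range(1, N):
        score = score + A[tags[n - 1], tags[n]] + emit[n, tags[n]]
    return score

def log_partition():
    """logsumexp over the scores of every possible tag path (forward recursion)."""
    alpha = start + emit[0]
    for n in range(1, N):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + A, dim=0) + emit[n]
    return torch.logsumexp(alpha, dim=0)

nll = log_partition() - gold_score([1, 3, 0, 2])   # negative log-likelihood of the gold path
print(nll.item())
```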

Page 37:

Embeddings

• Word level embedding : For Portuguese NER, the word level embeddings previously trained by Santos (2014) were used; for Spanish, embeddings were trained on the Spanish Wikipedia.

• Character level embedding : Unsupervised learning of character level embeddings was NOT performed. The character level embeddings are initialized by randomly sampling each value from a uniform distribution.

Page 38:

Corpus : Portuguese & Spanish

Page 39:

Hyperparameters

Page 40:

Comparison of different NNs for the SPA CoNLL-2002 corpus

Page 41:

Comparison of different NNs for the SPA CoNLL-2002 corpus

Comparison with the state-of-the-art for the SPA CoNLL-2002 corpus

Page 42:

Comparison of different NNs for the HAREM I corpus

Comparison with the State-of-the-art for the HAREM I corpus

Page 43:

Chiu, J. P., & Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357-370.

BLSTM : Word level feature extraction
CNN : Character level feature extraction

Page 44:

Character Level feature extraction

Page 45:

Word level feature extraction

Page 47:

Embeddings

• Word embeddings : 50-dimensional word embeddings released by Collobert (2011b), trained on Wikipedia and the Reuters RCV-1 corpus. Stanford's GloVe and Google's word2vec embeddings were also evaluated.

• Character embeddings : a randomly initialized lookup table with values drawn from a uniform distribution with range [-0.5, 0.5], producing character embeddings of 25 dimensions.
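A small sketch of that character lookup-table initialization, assuming PyTorch and an illustrative alphabet size:

```python
import torch
import torch.nn as nn

num_chars, char_dim = 128, 25                        # alphabet size is an assumption
char_table = nn.Embedding(num_chars, char_dim)
nn.init.uniform_(char_table.weight, a=-0.5, b=0.5)   # values drawn from U(-0.5, 0.5)

char_ids = torch.tensor([ord("a"), ord("b")])
print(char_table(char_ids).shape)                    # torch.Size([2, 25])
```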

Page 48:

Additional Features

• Additional word level features :

– Capitalization feature : allCaps, upperInitial, lowercase, mixedCaps, noinfo

– Lexicons : SENNA and DBpedia
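A hedged sketch of the capitalization feature: the paper lists these five categories, but the exact assignment rules below are an assumption.

```python
def capitalization(word: str) -> str:
    """Map a token to one of the five capitalization categories."""
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return "noinfo"
    if all(c.isupper() for c in letters):
        return "allCaps"
    if all(c.islower() for c in letters):
        return "lowercase"
    if word[0].isupper() and all(c.islower() for c in letters[1:]):
        return "upperInitial"
    return "mixedCaps"

for w in ["IBM", "Kim", "shares", "iPhone", "2010"]:
    print(w, "->", capitalization(w))
```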

Page 49:

Training and Inference

• Implementation :

– torch7 library

– Initial state of the LSTM set to zero vectors.

• Objective : Maximize the sentence level log-likelihood

– The objective function and its gradient can be efficiently computed by dynamic programming.

– The Viterbi algorithm is used to find the tag sequence [i]1..T that maximizes the sentence score.

• Learning : Training was done by mini-batch stochastic gradient descent (SGD) with a fixed learning rate; each mini-batch consists of multiple sentences with the same number of tokens.
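A generic sketch of Viterbi decoding over transition and per-word tag scores (illustrative shapes; not the paper's torch7 implementation):

```python
import torch

def viterbi(emit: torch.Tensor, A: torch.Tensor, start: torch.Tensor) -> list:
    """emit: (N, num_tags) per-word tag scores; returns the highest-scoring tag path."""
    N, num_tags = emit.shape
    delta = start + emit[0]                  # best score of a path ending in each tag
    backptr = []
    for n in range(1, N):
        scores = delta.unsqueeze(1) + A      # scores[i, j]: come from tag i, move to tag j
        best, idx = scores.max(dim=0)
        delta = best + emit[n]
        backptr.append(idx)
    path = [int(delta.argmax())]
    for idx in reversed(backptr):
        path.append(int(idx[path[-1]]))
    return list(reversed(path))

emit, A, start = torch.randn(6, 5), torch.randn(5, 5), torch.randn(5)
print(viterbi(emit, A, start))
```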

Page 50:

Results

Page 51:

Results : F1 scores of BLSTM and BLSTM-CNN with various additional features

(emb : Collobert word embeddings, char : character type feature, caps : capitalization feature, lex : lexicon feature)

Page 52:

Results : Word embeddings

Page 53:

Results : Various dropout values

Page 54:

Questions to discuss

• Why is BLSTM-CNN the best choice?

• Is the proposed model language independent?

• Is it a good idea to use additional features (capitalization, prefix, suffix, etc.)?

• Possible future work

Page 55:

Thank you!!