CSE 291G: Deep Learning for Sequences
Paper Presentation
Topic: Named Entity Recognition
Rithesh
Outline
• Named Entity Recognition and its applications
• Existing methods
• Character-level feature extraction
• RNN: BLSTM-CNNs
Named Entity Recognition (NER)
WHAT?
Named Entity Recognition (also called entity identification, entity chunking, and entity extraction)
• Locate and classify named entity mentions in unstructured text into predefined categories: person names, organizations, locations, time expressions, etc.
• Ex: Kim bought 500 shares of IBM in 2010. (Kim: person name; IBM: organization; 2010: time)
• Locating the full entity span matters. Ex: Kim bought 500 shares of Bank of America in 2010. Here the organization is the multi-word span "Bank of America", not just "Bank".
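To make these categories concrete, here is a minimal sketch using the off-the-shelf spaCy library (an illustration only, not the approach discussed in these slides; it assumes spaCy and the en_core_web_sm model are installed):

```python
# Quick illustration of NER output using spaCy
# (assumes `pip install spacy` and the en_core_web_sm model).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Kim bought 500 shares of Bank of America in 2010.")

for ent in doc.ents:
    # ent.text is the mention span, ent.label_ the predicted category,
    # e.g. PERSON, ORG, DATE. Note "Bank of America" is one multi-token span.
    print(ent.text, ent.label_)
```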
Named Entity Recognition (NER)
WHY?
Applications of NER
• Content recommendation
• Customer support
• Classifying content for news providers
• Efficient search algorithms
• Question answering (QA)
• Machine translation systems
• Automatic summarization systems
Named Entity Recognition (NER)
HOW?
Approaches:
• ML classification techniques (e.g., SVM, the perceptron model, CRFs (Conditional Random Fields)). Drawback: requires hand-crafted features.
• Neural network model (Collobert et al., "Natural Language Processing (Almost) from Scratch"). Drawbacks: (i) a simple feedforward NN with a fixed window size; (ii) depends solely on word embeddings and fails to exploit character-level features such as prefixes and suffixes.
• RNN: LSTM. Handles variable-length input and provides long-term memory; first applied to NER by Hammerton (2003).
RNN: LSTM
• Overcomes drawbacks of the existing systems.
• Accounts for variable-length input and long-term memory.
• Fails to handle cases in which the i-th word of a sentence S depends on words at positions greater than i in S. Ex: "Teddy bears are on sale." vs. "Teddy Roosevelt was a great president." Whether "Teddy" is part of a person name depends on the words that follow it.
• SOLUTION: the bi-directional LSTM (BLSTM), which captures information from both the past and the future.
• A BLSTM still fails to exploit character-level features.
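A minimal sketch of a word-level BLSTM encoder in PyTorch (dimensions here are illustrative assumptions; the papers discussed used their own implementations):

```python
# Minimal sketch of a word-level BLSTM encoder in PyTorch.
import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=50, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True lets the hidden state at position i see both
        # the words before i (forward pass) and after i (backward pass),
        # resolving cases like "Teddy bears" vs. "Teddy Roosevelt".
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, token_ids):            # (batch, seq_len)
        x = self.embed(token_ids)            # (batch, seq_len, emb_dim)
        out, _ = self.lstm(x)                # (batch, seq_len, 2*hidden_dim)
        return out

enc = BLSTMEncoder()
out = enc(torch.randint(0, 10000, (2, 7)))  # two sentences of 7 tokens
print(out.shape)  # torch.Size([2, 7, 200])
```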
Techniques to capture character-level features
• Santos (2015) and Labeau (2015) proposed models for character-level feature extraction using CNNs, for NER and POS tagging respectively.
• Ling (2015) proposed a model for character-level feature extraction using a BLSTM, for POS tagging.
• CNN or BLSTM? The BLSTM did not perform significantly better than the CNN, and it is computationally more expensive to train.
BLSTM: word-level feature extraction. CNN: character-level feature extraction.
Named Entity Recognition with Bidirectional LSTM-CNNs
Jason P. C. Chiu and Eric Nichols (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357-370.
• Inspired by:
– Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493-2537.
– Cicero dos Santos and Victor Guimaraes. 2015. Boosting named entity recognition with neural character embeddings. Proceedings of the Fifth Named Entities Workshop, pages 25-33.
Reference paper: Boosting NER with Neural Character Embeddings
• CharWNN: a deep neural network that uses word-level and character-level representations (embeddings) to perform sequential classification.
• Evaluated on two corpora: HAREM I (Portuguese) and SPA CoNLL-2002 (Spanish).
• CharWNN extends Collobert et al.'s (2011) neural network architecture for sequential classification by adding a convolutional layer to extract character-level representations.
CharWNN
• Input: a sentence S = <w1, w2, ..., wN>.
• Output: for each word in the sentence, a score for each class.
• Each word wn is mapped to a vector un = [rwrd; rwch], the concatenation of its word-level embedding rwrd and its character-level embedding rwch.
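A minimal sketch of this concatenation step, with illustrative (assumed) embedding sizes:

```python
# Sketch of forming u_n = [r_wrd; r_wch] for one word.
# Sizes (50-dim word, 25-dim char embedding) are illustrative assumptions.
import torch

r_wrd = torch.randn(50)   # word-level embedding, from a lookup table
r_wch = torch.randn(25)   # character-level embedding, from the char CNN
u_n = torch.cat([r_wrd, r_wch])  # joint representation of the word
print(u_n.shape)  # torch.Size([75])
```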
CNN for character embedding
• A word w is a sequence of characters <c1, c2, ..., cM>.
• A matrix-vector (convolution) operation with window size k is applied over the character embeddings.
• Max pooling over all window positions produces the fixed-size character-level embedding rwch.
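A minimal sketch of this character-level convolution and pooling in PyTorch (layer sizes and the window size k = 3 are assumptions for illustration, not the paper's exact hyperparameters):

```python
# CharWNN-style character feature extraction: convolve over character
# embeddings with window size k, then max-pool over positions to get
# a fixed-size r_wch regardless of word length.
import torch
import torch.nn as nn

char_emb = nn.Embedding(num_embeddings=100, embedding_dim=10)  # char lookup
conv = nn.Conv1d(in_channels=10, out_channels=30, kernel_size=3,
                 padding=1)  # window size k = 3

char_ids = torch.randint(0, 100, (1, 8))   # one word, 8 characters
e = char_emb(char_ids).transpose(1, 2)     # (1, 10, 8): channels first
h = conv(e)                                # (1, 30, 8): one vector per window
r_wch = h.max(dim=2).values.squeeze(0)     # max pooling over positions
print(r_wch.shape)  # torch.Size([30])
```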
CharWNN (continued)
• The character-level embedding rwch of each word is concatenated with its word-level embedding, yielding the sequence <u1, u2, ..., uN> for the sentence.
CharWNN
• The sequence <u1, u2, ..., uN> is fed to two standard neural network layers, which produce, for each word, a score for each tag.
• For a transition score matrix A, where $A_{t_{n-1}, t_n}$ is the score for moving from tag $t_{n-1}$ to tag $t_n$, the sentence-level score of a tag path $[t]_1^N$ is:
$S([w]_1^N, [t]_1^N, \theta) = \sum_{n=1}^{N} \left( A_{t_{n-1}, t_n} + s(u_n)_{t_n} \right)$
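A small sketch of how such a path score could be computed (the handling of the initial transition via an extra start row in A is a simplifying assumption):

```python
# Score of one tag path: sum over words of A[t_{n-1}, t_n] + s[n, t_n].
import torch

def path_score(s, A, tags, start_row=-1):
    """s: (N, n_tags) per-word tag scores from the network.
    A: (n_tags + 1, n_tags) transition scores; the last row (start_row)
       holds assumed scores for the transition into the first tag.
    tags: the tag indices t_1..t_N of the path being scored."""
    prev, total = start_row, torch.tensor(0.0)
    for n, t in enumerate(tags):
        total = total + A[prev, t] + s[n, t]
        prev = t
    return total

s = torch.randn(4, 3)   # 4 words, 3 tags
A = torch.randn(4, 3)   # 3 tags + 1 start row
print(path_score(s, A, [0, 2, 1, 1]))
```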
Network training for CharWNN
• CharWNN is trained by minimizing the negative log-likelihood over the training set D.
• The sentence score is interpreted as a conditional probability over a tag path: the score is exponentiated and normalized with respect to all possible paths.
• Stochastic gradient descent (SGD) is used to minimize the negative log-likelihood with respect to the network parameters θ.
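Concretely, this corresponds to the sentence-level log-likelihood below (a reconstruction following Collobert et al. (2011), using the notation of the previous slide):

```latex
\log p\left([t]_1^N \mid [w]_1^N, \theta\right)
  = S\left([w]_1^N, [t]_1^N, \theta\right)
  - \log \sum_{\forall [u]_1^N} e^{\,S\left([w]_1^N,\,[u]_1^N,\,\theta\right)}
```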
Embeddings
• Word-level embeddings: for Portuguese NER, word-level embeddings previously trained by Santos (2014) were used; for Spanish, embeddings were trained on the Spanish Wikipedia.
• Character-level embeddings: unsupervised pre-training of character-level embeddings was NOT performed; they are initialized by randomly sampling each value from a uniform distribution.
Corpus: Portuguese & Spanish
Hyperparameters
Comparison of different NNs for the SPA CoNLL-2002 corpus
Comparison with the state-of-the-art for the SPA CoNLL-2002 corpus
Comparison of different NNs for the HAREM I corpus
Comparison with the state-of-the-art for the HAREM I corpus
Chiu, J. P., & Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357-370.
BLSTM: word-level feature extraction. CNN: character-level feature extraction.
Character-level feature extraction
Word-level feature extraction
Embeddings
• Word embeddings: the 50-dimensional word embeddings released by Collobert (2011b), trained on Wikipedia and the Reuters RCV-1 corpus; Stanford's GloVe and Google's word2vec embeddings were also evaluated.
• Character embeddings: a randomly initialized lookup table, with values drawn from a uniform distribution with range [-0.5, 0.5], outputs 25-dimensional character embeddings.
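A one-line sketch of that initialization in PyTorch (the alphabet size is an assumption):

```python
# Randomly initialized character lookup table: 25-dimensional embeddings
# drawn uniformly from [-0.5, 0.5].
import torch.nn as nn

n_chars = 100  # assumed alphabet size, for illustration
char_table = nn.Embedding(n_chars, 25)
nn.init.uniform_(char_table.weight, a=-0.5, b=0.5)  # in-place uniform init
```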
Additional features
• Additional word-level features:
– Capitalization feature: allCaps, upperInitial, lowercase, mixedCaps, noinfo (see the sketch after this list).
– Lexicons: SENNA and DBpedia.
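A small sketch of the capitalization feature with the five categories above; the exact decision rules here are an assumption, not the paper's code:

```python
# Map a token to one of the five capitalization categories listed above.
def capitalization(word: str) -> str:
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return "noinfo"        # no alphabetic information (e.g. "2010")
    if all(c.isupper() for c in letters):
        return "allCaps"       # e.g. "IBM"
    if all(c.islower() for c in letters):
        return "lowercase"     # e.g. "shares"
    if word[0].isupper() and all(c.islower() for c in letters[1:]):
        return "upperInitial"  # e.g. "Kim"
    return "mixedCaps"         # e.g. "iPhone"

print([capitalization(w) for w in ["IBM", "Kim", "shares", "iPhone", "2010"]])
```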
Training and inference
• Implementation:
– torch7 library.
– The initial state of the LSTM is set to zero vectors.
• Objective: maximize the sentence-level log-likelihood.
– The objective function and its gradient can be computed efficiently by dynamic programming.
– The Viterbi algorithm is used to find the optimal tag sequence [t]_1^T that maximizes the sentence-level score (see the sketch below).
• Learning: training was done by mini-batch stochastic gradient descent (SGD) with a fixed learning rate; each mini-batch consists of multiple sentences with the same number of tokens.
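A minimal Viterbi decoding sketch in NumPy (simplified to omit dedicated start/stop tags, which is an assumption relative to the paper):

```python
# Viterbi decoding: find the tag sequence maximizing the sum of per-word
# scores s[n, t] and transition scores A[t_prev, t].
import numpy as np

def viterbi(s, A):
    """s: (N, T) per-word tag scores; A: (T, T) transition scores."""
    N, T = s.shape
    best = s[0].copy()                # best score of a path ending in tag t at word 0
    back = np.zeros((N, T), dtype=int)
    for n in range(1, N):
        cand = best[:, None] + A + s[n][None, :]  # (prev_tag, cur_tag)
        back[n] = cand.argmax(axis=0)             # best predecessor per tag
        best = cand.max(axis=0)
    path = [int(best.argmax())]                   # best final tag
    for n in range(N - 1, 0, -1):                 # follow backpointers
        path.append(int(back[n, path[-1]]))
    return path[::-1]

print(viterbi(np.random.randn(5, 4), np.random.randn(4, 4)))
```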
Results
Results: F1 scores of BLSTM and BLSTM-CNN with various additional features
(emb: Collobert word embeddings; char: character-type feature; caps: capitalization feature; lex: lexicon feature)
Results: Word embeddings
Results: Various dropout values
Questions to discuss
• Why is BLSTM-CNN the best choice?
• Is the proposed model language-independent?
• Is it a good idea to use additional features (capitalization, prefix, suffix, etc.)?
• Possible future work.
Thank you!!