Neural Models for Sequence Chunking
Published by Feifei Zhai et al.
(IBM Watson)
Presented by Sagar Dahiwala
CIS 601
Agenda
1. Natural language understanding
2. Problems in current systems
3. Basic neural networks: RNN and LSTM
4. Models 1, 2, and 3
5. Experiments
6. Conclusion
1. Natural language understanding (NLU)
• NLU tasks include:
1. Shallow parsing
• Analysis of a sentence that first identifies the constituent parts of the sentence (nouns, verbs, adjectives, etc.)
• Then links them to higher-order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.)
2. Semantic slot filling
• Requires the assignment of representative labels to the meaningful chunks in a sentence.
2. Problems in current systems
• Most current deep neural network (DNN)-based methods treat this task as a sequence labeling problem.
• In sequence labeling, words are treated as the basic units of labeling, rather than chunks.
IOB-based (Inside-Outside-Beginning) sequence labeling
• B – beginning of a chunk
• I – inside of a chunk (the remaining words within the same semantic chunk)
• O – outside of any chunk (an artificial class)
• NP – noun phrase
• VP – verb phrase
Sentence: "But it could be much worse"
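As an illustrative reconstruction (the "much worse" → ADJP chunk matches the Model 1 example later in these slides; the NP/VP bracketing is assumed), the sentence is labeled word by word:

```python
# IOB labeling of "But it could be much worse", assuming the bracketing
# [NP it] [VP could be] [ADJP much worse], with "But" outside any chunk.
words = ["But", "it",   "could", "be",   "much",   "worse"]
tags  = ["O",   "B-NP", "B-VP",  "I-VP", "B-ADJP", "I-ADJP"]
for w, t in zip(words, tags):
    print(f"{w:6s} -> {t}")
```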
3. Basic neural networks
1. RNN – Recurrent Neural Network
2. LSTM – Long Short-Term Memory
3.1 RNN – Recurrent Neural Network
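The original slides illustrated the recurrence with figures. As a minimal sketch in plain numpy (not the authors' code), one vanilla RNN step computes a new hidden state from the current input and the previous hidden state:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4))   # input (dim 4) -> hidden (dim 3)
W_hh = rng.normal(size=(3, 3))   # hidden -> hidden (the recurrence)
b_h = np.zeros(3)
h = np.zeros(3)                  # initial hidden state
for x_t in rng.normal(size=(5, 4)):   # a toy sequence of 5 inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)                         # final hidden state summarizes the sequence
```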
3.2 LSTM – Long Short-Term Memory
• Element-wise addition (+)
• Element-wise multiplication (×)
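A minimal sketch of the standard LSTM cell the figure on this slide depicts (the gate equations below are the common formulation; the parameter layout is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One standard LSTM step with input (i), forget (f), and output (o)
    gates plus candidate values (g); W/U/b are dicts of parameters."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])
    c_t = f * c_prev + i * g   # element-wise multiplication (×) and addition (+)
    h_t = o * np.tanh(c_t)     # element-wise multiplication (×)
    return h_t, c_t

rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 4)) for k in "ifog"}   # input dim 4, hidden dim 3
U = {k: rng.normal(size=(3, 3)) for k in "ifog"}
b = {k: np.zeros(3) for k in "ifog"}
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W, U, b)
print(h, c)
```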
The IOB schema for the labeling problem has two drawbacks:
• There is no explicit model to learn and identify the scope of chunks in a sentence; instead, it is inferred implicitly.
• Some neural networks (NNs) such as RNNs and LSTMs can encode context information, but they don't treat each chunk as a complete unit.
A natural solution to overcome the two drawbacks above is sequence chunking, which has two subtasks:
• Segmentation – identify the scope of the chunks explicitly
• Labeling – label each chunk as a single unit, based on the segmentation results
• This mirrors how humans remember things:
• Phone numbers are not typically seen or remembered as a long string of digits like 8605554589, but rather as 860-555-4589.
• Birthdates are typically not recalled as 11261995, but rather as 11/26/1995.
4. Model 1
• Segmentation uses a Bi-LSTM with IOB labels (the Bi-LSTM is described on the next slide).
• Average(·) computes the average of the input vectors.
• A softmax layer is used for labeling.
• In Figure 2 of the paper, "much worse" is identified as a chunk of length 2; applying its hidden states to the averaging formula finally yields the "ADJP" label.
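A minimal sketch of this labeling step, assuming a generic softmax classifier over the averaged hidden states (parameter names are illustrative, not from the paper):

```python
import numpy as np

def label_chunk(H_chunk, W_label, b_label):
    """Label one chunk: average its Bi-LSTM hidden states, then softmax.
    H_chunk: (chunk_len, hidden_dim) hidden states of the chunk's words."""
    ch = H_chunk.mean(axis=0)          # Average(.) over the chunk
    logits = W_label @ ch + b_label
    p = np.exp(logits - logits.max())
    return p / p.sum()                 # distribution over labels (e.g. ADJP)

# "much worse": a chunk of length 2 -> two hidden states of dim 6.
rng = np.random.default_rng(1)
H = rng.normal(size=(2, 6))
probs = label_chunk(H, rng.normal(size=(5, 6)), np.zeros(5))
print(probs.argmax())                  # index of the predicted label
```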
Bi-LSTM
• Given an input sentence $x = (x_1, x_2, \ldots, x_T)$
• The forward LSTM reads the input sentence from $x_1$ to $x_T$ and generates the forward hidden states $(\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_T)$
• The backward LSTM reads the input sentence from $x_T$ to $x_1$ and generates the backward hidden states $(\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_T)$
• Then, for each timestep t, the Bi-LSTM hidden state is generated by concatenating $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$: $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$
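A minimal sketch of this concatenation (toy tanh-RNN steps stand in for the two LSTMs to keep the example short):

```python
import numpy as np

def bilstm_states(X, step_fwd, step_bwd, hdim):
    """Concatenate per-timestep forward and backward hidden states:
    h_t = [h_fwd_t ; h_bwd_t]. `step_*` map (x_t, h_prev) -> h_t."""
    T = len(X)
    h_f, h_b = [None] * T, [None] * T
    h = np.zeros(hdim)
    for t in range(T):                # reads x_1 .. x_T
        h = step_fwd(X[t], h); h_f[t] = h
    h = np.zeros(hdim)
    for t in reversed(range(T)):      # reads x_T .. x_1
        h = step_bwd(X[t], h); h_b[t] = h
    return [np.concatenate([h_f[t], h_b[t]]) for t in range(T)]

# Toy demo: input dim 4, hidden dim 3, a sequence of 6 timesteps.
rng = np.random.default_rng(2)
Wf, Wb = rng.normal(size=(3, 7)), rng.normal(size=(3, 7))
step_f = lambda x, h: np.tanh(Wf @ np.concatenate([x, h]))
step_b = lambda x, h: np.tanh(Wb @ np.concatenate([x, h]))
H = bilstm_states(rng.normal(size=(6, 4)), step_f, step_b, hdim=3)
print(len(H), H[0].shape)   # 6 timesteps, each h_t has dim 3 + 3 = 6
```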
Drawbacks of Model 1
• It may not perform well on both the segmentation and labeling subtasks, since a single Bi-LSTM serves both.
4. Model 2
• Follows the encoder-decoder framework.
• Similar to Model 1, a Bi-LSTM is employed for segmentation with IOB labels.
• This Bi-LSTM serves as the encoder and creates a sentence representation $[\overrightarrow{h}_T; \overleftarrow{h}_1]$, which is used to initialize the decoder LSTM.
• The decoder uses chunks as inputs instead of words.
• For example, "much worse" is a chunk in Figure 3 of the paper, and it is taken as a single input to the decoder.
• The chunk representation is computed with g(·), a CNNMax layer (a convolution over the chunk followed by max-pooling).
• Cw_j is the concatenation of context word embeddings; the decoder input for chunk j is the concatenation {Cx_j, Ch_j, Cw_j}.
• The generated hidden states are finally used for labeling by a softmax layer.
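A hedged sketch of a CNNMax chunk encoder as described above (one 1-D convolution over the chunk's vectors followed by max-pooling over positions; the filter sizes are illustrative):

```python
import numpy as np

def cnn_max(H_chunk, W_conv, b_conv, width=2):
    """CNNMax over a chunk: slide a width-`width` convolution across the
    chunk's vectors, then max-pool each feature over all positions.
    H_chunk: (chunk_len, dim); W_conv: (n_filters, width * dim)."""
    L, d = H_chunk.shape
    if L < width:  # pad so even a length-1 chunk yields one window
        H_chunk = np.vstack([H_chunk, np.zeros((width - L, d))])
        L = width
    windows = [H_chunk[i:i + width].reshape(-1) for i in range(L - width + 1)]
    feats = np.tanh(np.stack(windows) @ W_conv.T + b_conv)  # (n_win, n_filters)
    return feats.max(axis=0)                                # max over positions

rng = np.random.default_rng(3)
g = cnn_max(rng.normal(size=(2, 6)), rng.normal(size=(8, 12)), np.zeros(8))
print(g.shape)   # (8,) one pooled feature per filter
```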
Drawbacks of using IOB labels for segmentation
• It is hard to use chunk-level features for segmentation, such as the length of chunks.
• IOB labels cannot compare different chunks directly.
4. Model 3
• Model 3 is similar to Model 2; the only difference is the method of identifying chunks.
• Model 3 is a greedy process of segmentation and labeling: first identify one chunk, then label it.
• Repeat the process until all words are processed. Since all chunks are adjacent to each other, once one chunk is identified, the beginning point of the next is also known, and only its ending point needs to be determined.
• A pointer network is implemented to determine the ending point, where j is the decoder timestep (chunk index).
• The probability of choosing ending-point candidate i is given by the score below.
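A hedged reconstruction of a standard pointer-network score (following the cited Vinyals, Fortunato, and Jaitly 2015; the paper's exact parameterization may differ):

```latex
% h_i : encoder hidden state at ending-point candidate i
% d_j : decoder hidden state at chunk index j
% v, W_1, W_2 : learned parameters
u_j^i = v^{\top} \tanh\!\left( W_1 h_i + W_2 d_j \right), \qquad
p(i \mid j) = \operatorname{softmax}\!\left( u_j \right)_i
```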
5. Experiments
• Text Chunking Results
• Comparison with Published Results
5. Experiments
• Slot Filling Results
• Segmentation Results
• Labeling Results
• Comparison with Published Results
References (author / title – context – link)
• Lample et al. (2016), "Neural Architectures for Named Entity Recognition" – Stack-LSTM and transition-based algorithm – https://arxiv.org/pdf/1603.01360.pdf
• Dyer et al. (2015) – Stack-LSTM – http://www.cs.cmu.edu/~lingwang/papers/acl2015.pdf
• Wikipedia – softmax layer – https://en.wikipedia.org/wiki/Softmax_function
• Cho et al. (2014), "On the Properties of Neural Machine Translation: Encoder–Decoder Approaches" – encoder-decoder framework – https://arxiv.org/pdf/1409.1259.pdf
• CS231n notes, "Convolutional Neural Networks (CNNs / ConvNets)" – CNN – http://cs231n.github.io/convolutional-networks/
• Nallapati et al. (2016), "Abstractive Text Summarization using Sequence-to-Sequence RNNs and Beyond" – encoder-decoder-pointer framework – https://arxiv.org/pdf/1602.06023.pdf
• Vinyals, Fortunato, and Jaitly (2015), "Pointer Networks" – pointer network – https://arxiv.org/pdf/1506.03134.pdf
• Brandon Rohrer, "Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM)" – RNN/LSTM – https://www.youtube.com/watch?v=WCUNPb-5EYI
• "Spoken Language Understanding (SLU) / Slot Filling in Keras" – ATIS (Airline Travel Information System) – https://github.com/chsasank/ATIS.keras
Thank You