Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and
Data Integration Methods
William W. Cohen, Sunita Sarawagi
Presented by: Quoc Le
CS591CXZ – General Web Mining.
Motivation
• Information Extraction
– Deriving structured data from unstructured data.
– Using structured data as guidance to improve extraction from unstructured sources.
• Named Entity Recognition
– Extracting names, locations, times.
– Improving NER systems with external dictionaries.
Approaches
• Look up entities in a (large) dictionary.
– Surface forms differ; prone to noise and errors.
• Take an existing NER system and link it to an external dictionary.
– High-performance NER classifies words into classes vs. matching the similarity of an entire entity to a dictionary entry.
Problem Formulation
• Name finding as word tagging
– E.g.: (Fred)Person (please stop by)Other (my office)Loc (this afternoon)Time
– x: sequence of words maps to y: sequence of labels; training data is (x, y) pairs.
• Conditional distribution of y given x (HMM):
P(y|x) = ∏_{i=1}^{|x|} P(y_i | i, x, y_{i-1})
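A minimal sketch of how the CMM factorization above scores a tag sequence: the probability of y given x is the product of per-token conditional probabilities, each depending on the position, the input, and the previous label. The `toy_prob` model below is purely hypothetical, not from the paper.

```python
# Sketch: scoring a tag sequence y for tokens x under a conditional Markov
# model, P(y|x) = prod_i P(y_i | i, x, y_{i-1}).

def cmm_sequence_prob(x, y, local_prob):
    """local_prob(i, x, prev_label, label) -> P(y_i = label | i, x, y_{i-1})."""
    prob = 1.0
    prev = None  # start-of-sequence marker
    for i, label in enumerate(y):
        prob *= local_prob(i, x, prev, label)
        prev = label
    return prob

# Toy local model (hypothetical): capitalized tokens are Person with prob 0.9.
def toy_prob(i, x, prev, label):
    p_person = 0.9 if x[i][0].isupper() else 0.1
    return p_person if label == "Person" else 1.0 - p_person

x = ["Fred", "please", "stop"]
print(cmm_sequence_prob(x, ["Person", "Other", "Other"], toy_prob))
```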
Semi-Markovian NER
• Segmentation: S = &lt;S1,…,SM&gt;, where each segment Sj has a start position tj, an end position uj, and a label lj.
– E.g.: S = &lt;(1,1,Person), (2,4,Other), (5,6,Loc), (7,8,Time)&gt;
• Conditional semi-Markov Model (CSMM): Inference and Learning problems
P(S|x) = ∏_j P(S_j | x, t_j, l_{j-1})
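The segment representation above can be sketched directly: a segmentation is a list of (start, end, label) triples over 1-based token positions, from which the labeled phrases can be recovered.

```python
# Sketch: a semi-Markov segmentation as (start, end, label) triples
# (1-based inclusive positions, as in the slide's example).

def segments_to_phrases(tokens, segmentation):
    return [(" ".join(tokens[t - 1:u]), label) for (t, u, label) in segmentation]

tokens = ["Fred", "please", "stop", "by", "my", "office", "this", "afternoon"]
S = [(1, 1, "Person"), (2, 4, "Other"), (5, 6, "Loc"), (7, 8, "Time")]
print(segments_to_phrases(tokens, S))
```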
Compare to other approaches
• CMM: CSMM predicts tag + position (segment boundaries).
• Order-L CMM: CSMM uses the corresponding tokens (not just the previous ones).
• Treat the external dictionary as training examples: prone to misspellings, large dictionaries, mismatched dictionaries. (Good when training data is limited.)
• N-gram classification: entities may overlap.
• Use the dictionary to bootstrap the search for extraction patterns: rule-based vs. probabilistic.
Training SMM
• Modified version of Collins’ perceptron-based algorithm for training HMMs.
• Assume a local feature function f that maps a pair (x, S) and an index j to a vector of features f(j, x, S). Define:

F(x, S) = ∑_{j=1}^{|S|} f(j, x, S)

• Let W be a weight vector over the components of F.
– Inference: compute V(W, x), the Viterbi decoding of x with W.
– Training: learn the W that leads to the best performance.
• Viterbi search can be done with a recurrence over V_{x,W}(i, y).
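The Viterbi recurrence mentioned above can be sketched as follows: V(i, y) is the best score of a segmentation of the first i tokens whose last segment ends at i with label y, maximizing over the segment length and the previous label. The `score(t, u, prev, y)` callback stands in for W · f on one segment; its interface, the length cap L, and the back-pointer bookkeeping are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of semi-Markov Viterbi: V(i, y) = max over segment length d <= L and
# previous label y' of V(i - d, y') + score(i - d, i, y', y).

def semi_markov_viterbi(n, labels, L, score):
    NEG = float("-inf")
    V = {(0, None): 0.0}  # score of the empty prefix
    back = {}
    for i in range(1, n + 1):
        for y in labels:
            best, best_arg = NEG, None
            for d in range(1, min(L, i) + 1):
                for prev in ({None} if i - d == 0 else set(labels)):
                    s = V.get((i - d, prev), NEG)
                    if s == NEG:
                        continue
                    cand = s + score(i - d, i, prev, y)
                    if cand > best:
                        best, best_arg = cand, (i - d, prev)
            V[(i, y)] = best
            back[(i, y)] = best_arg
    # trace back from the best final label to recover the segmentation
    end = max(labels, key=lambda y: V[(n, y)])
    segs, i, y = [], n, end
    while i > 0:
        t, prev = back[(i, y)]
        segs.append((t + 1, i, y))  # 1-based (start, end, label)
        i, y = t, prev
    return list(reversed(segs))
```

Unlike token-level Viterbi, the inner loop over d lets features inspect a whole candidate segment at once, which is what makes segment-level dictionary features possible.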
Perceptron-based SMM Learning
• Let SCORE(x, W, S) = W · F(x, S).
• For each example (xt, St):
– Find the K segmentations with the highest scores.
– Let Wt+1 = Wt.
– For each i such that SCORE(xt, Wt, Si) &gt; (1 − β) · SCORE(xt, Wt, St), update Wt+1: Wt+1 = Wt+1 + F(xt, St) − F(xt, Si)
• Return the average of all Wt.
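The update step above can be sketched in a few lines: for every top-K segmentation whose score comes within the β margin of the true segmentation's score, push W toward the true feature vector and away from the competitor's. Feature vectors are plain dicts here, the top-K search is assumed given, and skipping the case where a competitor equals the truth is an assumption for clarity.

```python
# Sketch of the beta-margin perceptron update from the slide.

def perceptron_update(W, F_true, top_k_feats, beta):
    def score(Fv):
        return sum(W.get(k, 0.0) * v for k, v in Fv.items())

    target = (1.0 - beta) * score(F_true)
    W_new = dict(W)
    for F_i in top_k_feats:
        if F_i != F_true and score(F_i) > target:
            # reward the true segmentation's features ...
            for k, v in F_true.items():
                W_new[k] = W_new.get(k, 0.0) + v
            # ... and penalize the competing segmentation's features
            for k, v in F_i.items():
                W_new[k] = W_new.get(k, 0.0) - v
    return W_new
```

Averaging the per-step weight vectors, as the final slide bullet says, is the standard variance-reduction trick from Collins' averaged perceptron.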
Features
• Examples: value of the segment, length of the segment, left window, right window, etc.
• Most can be applied to an HMM NER system.
• Segment-level features are more powerful and meaningful, e.g. “X+ X+” is more indicative of a name than “X+”.
• Distance features: similarity to entries in an external dictionary.
Distance Features
• D: dictionary; d: distance metric; e: entity name in D; e′: candidate segment. Define:
– g_{D/d}(e′) = min_{e∈D} d(e, e′)
• Distance metrics: Jaccard (word-level), Jaro-Winkler (character-level), TFIDF (word-level), SoftTFIDF (hybrid).
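One of the word-level metrics named above, Jaccard distance over token sets, together with the dictionary feature g_{D/d}(e′), can be sketched as follows (lower-casing before comparison is an assumption; the other metrics would plug in the same way):

```python
# Sketch: Jaccard distance between token sets, and the dictionary feature
# g_{D/d}(e') = min over e in D of d(e, e').

def jaccard_distance(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def dictionary_distance(segment, dictionary, d=jaccard_distance):
    return min(d(entry, segment) for entry in dictionary)

D = ["William Cohen", "Sunita Sarawagi"]
print(dictionary_distance("William W. Cohen", D))  # closest entry drives the value
```

Because the feature takes a minimum over the whole dictionary, a segment only needs to resemble one entry well; this is what makes distance features tolerant of the surface-form mismatch mentioned earlier.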
Experiments
• HMM-VP(1): Predicts two labels y: one for tokens inside an entity and one for tokens outside.
• HMM-VP(4): Encoding scheme: use labels with tags unique, begin, end, continue, and other.
– E.g.: (Fred)Personunique, please stop by the (fourth)Locbegin (floor meeting)Loccontinue (room)Locend
• SMM (K = 2, E = 20, β = 0.05): any, first, last, exact.
• Data sets: addresses in India, student emails, jobs.
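The HMM-VP(4) encoding above flattens each segment into per-token tags; a minimal sketch of that conversion (the tag-name concatenation, e.g. "Personunique", mirrors the slide's example and is an assumption about the exact spelling):

```python
# Sketch: flatten (start, end, label) segments into the unique/begin/
# continue/end token tags used by HMM-VP(4).

def encode_bcue(segmentation):
    tags = []
    for (t, u, label) in segmentation:  # 1-based inclusive positions
        length = u - t + 1
        if label == "Other":
            tags += ["Other"] * length
        elif length == 1:
            tags.append(label + "unique")
        else:
            tags += ([label + "begin"]
                     + [label + "continue"] * (length - 2)
                     + [label + "end"])
    return tags

S = [(1, 1, "Person"), (2, 4, "Other"), (5, 8, "Loc")]
print(encode_bcue(S))
```

The four-tag scheme lets a token-level HMM recover segment boundaries that the single inside/outside label of HMM-VP(1) cannot, which is why HMM-VP(4) dominates it in the results below.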
Considerations
• Evaluating exact matching against a dictionary: low recall and errors, but a good indication of dictionary quality.
• Normalizing dictionary entries: yes and no. E.g.: “Will” vs. “will”.
• For HMMs, we could use partial distances between tokens and dictionary entries.
• Segment size is bounded by some number.
Evaluation
• Combinations of NER methods: without an external dictionary, with binary features, with distance features.
• Train with only 10% of the data, test on the rest. Repeat 7 times and record the average.
• Partial extraction gets no credit.
• Use precision, recall, and F1.
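A sketch of the scoring regime described above: an extracted segment counts as correct only on an exact (start, end, label) match, so a partially overlapping extraction earns nothing; precision, recall, and F1 follow from the exact-match counts.

```python
# Sketch: exact-match entity scoring (partial overlap gets no credit).

def prf1(predicted, gold):
    tp = len(set(predicted) & set(gold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f1

gold = [(1, 1, "Person"), (5, 6, "Loc")]
pred = [(1, 1, "Person"), (5, 7, "Loc")]  # second span overlaps but is not exact
print(prf1(pred, gold))
```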
Results
• SMM-VP is best: outperforms HMM-VP(4) on 13 out of 15 cases.
• HMM-VP(1) is worst: HMM-VP(4) outperforms HMM-VP(1) on 13 out of 15 cases.
• Binary dict. features are helpful but distance features are more helpful.
• See Table 1 (details) and 4 (short).
Effects
• Improvements over Collins’ methods – T.5.
• The gap between SMM and HMM-VP(4) decreases as training size increases, but they converge at different speeds. T.2
• Higher-order HMMs don’t improve performance much. T.6
• Alternative (less-related) dictionaries: both methods seem fairly robust.
Conclusion & Questions
• Conclusion
– Incorporates dictionary knowledge nicely.
– Applicable to sequential models.
– Improvement is significant, but it uses more resources and runs 3–5 times slower.
• Questions:
– What if the dictionary is not a superset? Unrelated dictionaries.
– Harder types of data where named entities are not easy to identify.