Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and
Data Integration Methods
William W. Cohen, Sunita Sarawagi
Presented by: Quoc Le
CS591CXZ – General Web Mining.
Motivation
• Information Extraction
– Deriving structured data from unstructured data.
– Using structured data as guidance to improve extraction from unstructured sources.
• Named Entity Recognition
– Extracting names, locations, times.
– Improving NER systems with external dictionaries.
Approaches
• Look up entities in a (large) dictionary.
– Surface forms differ; prone to noise and errors.
• Take an existing NER system and link it to an external dictionary.
– High-performance NER classifies words into classes vs. matching the similarity of an entire entity to a dictionary entry.
Problem Formulation
• Name finding as word tagging
– E.g.: (Fred)Person (please stop by)Other (my office)Loc (this afternoon)Time
– x: sequence of words maps to y: sequence of labels; training data is (x, y) pairs.
• Conditional distribution of y given x (HMM):
P(y|x) = ∏_{i=1}^{|x|} P(y_i | i, x, y_{i-1})
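A minimal sketch of how the CMM factorization above scores a tag sequence: the probability of y given x is the product of per-token conditional probabilities, each depending on the position, the input, and the previous label. The `toy_prob` model below is purely hypothetical, not from the paper.

```python
# Sketch: scoring a tag sequence y for tokens x under a conditional Markov
# model, P(y|x) = prod_i P(y_i | i, x, y_{i-1}).

def cmm_sequence_prob(x, y, local_prob):
    """local_prob(i, x, prev_label, label) -> P(y_i = label | i, x, y_{i-1})."""
    prob = 1.0
    prev = None  # start-of-sequence marker
    for i, label in enumerate(y):
        prob *= local_prob(i, x, prev, label)
        prev = label
    return prob

# Toy local model (hypothetical): capitalized tokens are Person with prob 0.9.
def toy_prob(i, x, prev, label):
    p_person = 0.9 if x[i][0].isupper() else 0.1
    return p_person if label == "Person" else 1.0 - p_person

x = ["Fred", "please", "stop"]
print(cmm_sequence_prob(x, ["Person", "Other", "Other"], toy_prob))
```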
Semi-Markovian NER
• Segmentation: S = &lt;S1,…,SM&gt;, where each segment Sj has a start position tj, an end position uj, and a label lj.
– E.g.: S = &lt;(1,1,Person), (2,4,Other), (5,6,Loc), (7,8,Time)&gt;
• Conditional semi-Markov Model (CSMM): Inference and Learning problems
P(S|x) = ∏_j P(S_j | x, t_j, l_{j-1})
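The segment representation above can be sketched directly: a segmentation is a list of (start, end, label) triples over 1-based token positions, from which the labeled phrases can be recovered.

```python
# Sketch: a semi-Markov segmentation as (start, end, label) triples
# (1-based inclusive positions, as in the slide's example).

def segments_to_phrases(tokens, segmentation):
    return [(" ".join(tokens[t - 1:u]), label) for (t, u, label) in segmentation]

tokens = ["Fred", "please", "stop", "by", "my", "office", "this", "afternoon"]
S = [(1, 1, "Person"), (2, 4, "Other"), (5, 6, "Loc"), (7, 8, "Time")]
print(segments_to_phrases(tokens, S))
```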
Compare to other approaches
• CMM: CSMM predicts tag + position (segment boundaries).
• Order-L CMM: CSMM uses the corresponding tokens (not just the previous ones).
• Treat the external dictionary as training examples: prone to misspellings, large dictionaries, mismatched dictionaries. (Good when training data is limited.)
• N-gram classification: entities may overlap.
• Use the dictionary to bootstrap the search for extraction patterns: rule-based vs. probabilistic.
Training SMM
• Modified version of Collins’ perceptron-based algorithm for training HMMs.
• Assume a local feature function f that maps a pair (x, S) and an index j to a vector of features f(j, x, S). Define:

F(x, S) = ∑_{j=1}^{|S|} f(j, x, S)

• Let W be a weight vector over the components of F.
– Inference: compute V(W, x), the Viterbi decoding of x with W.
– Training: learn the W that leads to the best performance.
• Viterbi search can be done with a recurrence over V_{x,W}(i, y).
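The Viterbi recurrence mentioned above can be sketched as follows: V(i, y) is the best score of a segmentation of the first i tokens whose last segment ends at i with label y, maximizing over the segment length and the previous label. The `score(t, u, prev, y)` callback stands in for W · f on one segment; its interface, the length cap L, and the back-pointer bookkeeping are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of semi-Markov Viterbi: V(i, y) = max over segment length d <= L and
# previous label y' of V(i - d, y') + score(i - d, i, y', y).

def semi_markov_viterbi(n, labels, L, score):
    NEG = float("-inf")
    V = {(0, None): 0.0}  # score of the empty prefix
    back = {}
    for i in range(1, n + 1):
        for y in labels:
            best, best_arg = NEG, None
            for d in range(1, min(L, i) + 1):
                for prev in ({None} if i - d == 0 else set(labels)):
                    s = V.get((i - d, prev), NEG)
                    if s == NEG:
                        continue
                    cand = s + score(i - d, i, prev, y)
                    if cand > best:
                        best, best_arg = cand, (i - d, prev)
            V[(i, y)] = best
            back[(i, y)] = best_arg
    # trace back from the best final label to recover the segmentation
    end = max(labels, key=lambda y: V[(n, y)])
    segs, i, y = [], n, end
    while i > 0:
        t, prev = back[(i, y)]
        segs.append((t + 1, i, y))  # 1-based (start, end, label)
        i, y = t, prev
    return list(reversed(segs))
```

Unlike token-level Viterbi, the inner loop over d lets features inspect a whole candidate segment at once, which is what makes segment-level dictionary features possible.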
Perceptron-based SMM Learning
• Let SCORE(x, W, S) = W · F(x, S).
• For each example (xt, St):
– Find the K segmentations with the highest scores.
– Let Wt+1 = Wt.
– For each i such that SCORE(xt, Wt, Si) &gt; (1 − β) · SCORE(xt, Wt, St), update Wt+1: Wt+1 = Wt+1 + F(xt, St) − F(xt, Si)
• Return the average of all Wt.
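The update step above can be sketched in a few lines: for every top-K segmentation whose score comes within the β margin of the true segmentation's score, push W toward the true feature vector and away from the competitor's. Feature vectors are plain dicts here, the top-K search is assumed given, and skipping the case where a competitor equals the truth is an assumption for clarity.

```python
# Sketch of the beta-margin perceptron update from the slide.

def perceptron_update(W, F_true, top_k_feats, beta):
    def score(Fv):
        return sum(W.get(k, 0.0) * v for k, v in Fv.items())

    target = (1.0 - beta) * score(F_true)
    W_new = dict(W)
    for F_i in top_k_feats:
        if F_i != F_true and score(F_i) > target:
            # reward the true segmentation's features ...
            for k, v in F_true.items():
                W_new[k] = W_new.get(k, 0.0) + v
            # ... and penalize the competing segmentation's features
            for k, v in F_i.items():
                W_new[k] = W_new.get(k, 0.0) - v
    return W_new
```

Averaging the per-step weight vectors, as the final slide bullet says, is the standard variance-reduction trick from Collins' averaged perceptron.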
Features
• Examples: value of the segment, length of the segment, left window, right window, etc.
• Most can be applied to an HMM NER system.
• Segment-level features are more powerful and meaningful, e.g. “X+ X+” is more indicative of a name than “X+”.
• Distance features: similarity to entries in an external dictionary.
Distance Features
• D: dictionary; d: distance metric; e: entity name in D; e′: candidate segment. Define:
– g_{D/d}(e′) = min_{e∈D} d(e, e′)
• Distance metrics: Jaccard (word-level), Jaro-Winkler (character-level), TFIDF (word-level), SoftTFIDF (hybrid).
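One of the word-level metrics named above, Jaccard distance over token sets, together with the dictionary feature g_{D/d}(e′), can be sketched as follows (lower-casing before comparison is an assumption; the other metrics would plug in the same way):

```python
# Sketch: Jaccard distance between token sets, and the dictionary feature
# g_{D/d}(e') = min over e in D of d(e, e').

def jaccard_distance(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def dictionary_distance(segment, dictionary, d=jaccard_distance):
    return min(d(entry, segment) for entry in dictionary)

D = ["William Cohen", "Sunita Sarawagi"]
print(dictionary_distance("William W. Cohen", D))  # closest entry drives the value
```

Because the feature takes a minimum over the whole dictionary, a segment only needs to resemble one entry well; this is what makes distance features tolerant of the surface-form mismatch mentioned earlier.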
Experiments
• HMM-VP(1): Predicts two labels y: one for tokens inside an entity and one for tokens outside.
• HMM-VP(4): Encoding scheme: use labels with tags unique, begin, end, continue, and other.
– E.g.: (Fred)Personunique, please stop by the (fourth)Locbegin (floor meeting)Loccontinue (room)Locend
• SMM (K = 2, E = 20, β = 0.05): any, first, last, exact.
• Data sets: addresses in India, student emails, jobs.
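The HMM-VP(4) encoding above flattens each segment into per-token tags; a minimal sketch of that conversion (the tag-name concatenation, e.g. "Personunique", mirrors the slide's example and is an assumption about the exact spelling):

```python
# Sketch: flatten (start, end, label) segments into the unique/begin/
# continue/end token tags used by HMM-VP(4).

def encode_bcue(segmentation):
    tags = []
    for (t, u, label) in segmentation:  # 1-based inclusive positions
        length = u - t + 1
        if label == "Other":
            tags += ["Other"] * length
        elif length == 1:
            tags.append(label + "unique")
        else:
            tags += ([label + "begin"]
                     + [label + "continue"] * (length - 2)
                     + [label + "end"])
    return tags

S = [(1, 1, "Person"), (2, 4, "Other"), (5, 8, "Loc")]
print(encode_bcue(S))
```

The four-tag scheme lets a token-level HMM recover segment boundaries that the single inside/outside label of HMM-VP(1) cannot, which is why HMM-VP(4) dominates it in the results below.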
Considerations
• Evaluating exact matching against a dictionary: low recall and errors, but a good indication of dictionary quality.
• Normalizing dictionary entries: yes and no. E.g.: “Will” vs. “will”.
• For HMMs, we could use partial distances between tokens and dictionary entries.
• Segment size is bounded by some number.
Evaluation
• Combinations of NER methods: without an external dictionary, with binary features, with distance features.
• Train with only 10% of the data, test on the rest. Repeat 7 times and record the average.
• Partial extraction gets no credit.
• Use precision, recall, and F1.
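A sketch of the scoring regime described above: an extracted segment counts as correct only on an exact (start, end, label) match, so a partially overlapping extraction earns nothing; precision, recall, and F1 follow from the exact-match counts.

```python
# Sketch: exact-match entity scoring (partial overlap gets no credit).

def prf1(predicted, gold):
    tp = len(set(predicted) & set(gold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f1

gold = [(1, 1, "Person"), (5, 6, "Loc")]
pred = [(1, 1, "Person"), (5, 7, "Loc")]  # second span overlaps but is not exact
print(prf1(pred, gold))
```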
Results
• SMM-VP is best: outperforms HMM-VP(4) on 13 out of 15 cases.
• HMM-VP(1) is worst: HMM-VP(4) outperforms HMM-VP(1) on 13 out of 15 cases.
• Binary dict. features are helpful but distance features are more helpful.
• See Table 1 (details) and 4 (short).
Effects
• Improvements over Collins’ methods – T.5.
• The gap between SMM and HMM-VP(4) decreases as training size increases, but they converge at different speeds. T.2
• Higher-order HMMs don’t improve performance much. T.6
• Alternative (less-related) dictionaries: both methods seem fairly robust.
Conclusion & Questions
• Conclusion
– Incorporates dictionary knowledge nicely.
– Applicable to sequential models.
– Improvement is significant, but it uses more resources and runs 3–5 times slower.
• Questions:
– What if the dictionary is not a superset? Unrelated dictionaries.
– Harder types of data where named entities are not easy to identify.