India Research Lab
© Copyright IBM Corporation 2006
Entity Annotation using operations on the Inverted Index
Ganesh Ramakrishnan, with Sreeram Balakrishnan and Sachindra Joshi
IBM India Research Lab
Problem: Entity Annotation
Extract all instances of entities of type E from an unstructured source S.
- Company names, Designation, Person names, Date, Time
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..
Document-at-a-time Based Approach
[Figure: pipeline applied to one document at a time. A single non-annotated document from the document collection goes through Feature Collection (tokenizer, POS lookup, gazetteer lookup, etc.) and then the Instance Extractor (ML / hand-built rules), yielding an annotated document; the output is the annotated document collection.]
A few rule-based annotators exist, e.g. GATE. We have built a rule-based annotator at IRL.
Example: Rules for identifying ORGANIZATIONs
How to identify?
- B.P. Marsh Plc
- The U.S.B. Holding Co.
- U.S.B. Holding Group
Example rule for identifying ORGANIZATION instances
[Figure: an example rule built from regular expression macros, a dictionary attribute, OR, and a part-of-speech tag, matched against the tokens “The”, “U.S.B.”, “Holding”, “Co.”]
Problems with Document-at-a-time Based Approach on large corpora
Repeated computations for multiple occurrences of the same token:
- Dictionary lookups
- Regular expression matches
Large overheads while:
- Re-annotating a corpus after changing dictionary entries
  The user realizes that “Group” is too generic a word to be included as an ORGANIZATION:CLUE and wants to remove its entry
- Re-annotating a corpus after a slight modification of the rules
  The user realizes that the optional “The” at the beginning introduces too many wrong annotations and modifies the rule
The rule with the optional “The” at the beginning removed
- Making incremental annotation updates by adding new rules
  The user wants a new rule that identifies “C.B. Fairlie Holding & Finance Limited”
A new rule to capture an interspersed conjunction
- Making incremental annotation updates by adding new rules
  The user wants a new rule that identifies acquiring organizations: “AT&T Wireless, Inc.” (that purchased Alaska Communications System in 1995)
A new rule to identify acquiring organizations
Post-context specifier
An alternative approach: Operating on the Inverted Index
Inverted Index
- A compact representation of the collection
- Captures redundancies/repetition information
Structure of Index
Example: The company said that it will acquire the other company
Index terms: the, company, said, that, it, will, acquire, other
Posting List: each entry is a triple (sid, first, last)
- sid: a sentence identifier
- first: beginning position of an occurrence
- last: end position of the same occurrence
Basic Entities, Orthographic properties, Dictionary Features
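The index structure above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slide: whitespace tokenization, lower-casing, 0-based token positions, and a hypothetical helper name `build_index`.

```python
from collections import defaultdict


def build_index(sentences):
    """Map each lower-cased token to a posting list of (sid, first, last) triples."""
    index = defaultdict(list)
    for sid, sentence in enumerate(sentences):
        for pos, token in enumerate(sentence.lower().split()):
            # A single token begins and ends at the same position.
            index[token].append((sid, pos, pos))
    return dict(index)


index = build_index(["The company said that it will acquire the other company"])
print(index["the"])      # [(0, 0, 0), (0, 7, 7)]
print(index["company"])  # [(0, 1, 1), (0, 9, 9)]
print(len(index))        # 8 distinct index terms
```

Repeated tokens such as "the" and "company" share one index entry, which is the redundancy the index-based approach exploits.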
Many applications build an inverted index on the annotated corpus anyway
- We directly update the inverted index with annotation entries
Our approach: Index Based Entity Annotation
Complexity Analysis for Document based Approach
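Problem: Find all annotations of length at most a given bound
Solution: Given a regular expression R, convert it into a DFA $D_R$
Complexity (the start state $S_1$ is visited once per token, so $cnt(S_1) = W$):
$$C_D = \sum_{i=1}^{N} cnt(S_i) = W + \sum_{i=2}^{N} cnt(S_i)$$
where
- $S_i$ = a state in $D_R$
- $cnt(S_i)$ = number of times $S_i$ is visited
- $W$ = total number of tokens
- $N$ = number of states in $D_R$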
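The cost $C_D$ can be made concrete by counting state visits while scanning a token stream. The sketch below is an assumption-laden illustration, not the authors' annotator: the DFA is a plain dictionary of transitions, the function name `count_state_visits` and the `max_len` bound are hypothetical, and the toy DFA simply matches single vowels.

```python
def count_state_visits(tokens, transitions, start="S1", max_len=3):
    """Run the DFA from every token position and count visits per state."""
    cnt = {}
    for i in range(len(tokens)):
        state = start
        cnt[state] = cnt.get(state, 0) + 1  # the start state is visited once per token
        for tok in tokens[i:i + max_len]:
            state = transitions.get((state, tok))
            if state is None:
                break  # no outgoing arc for this token: stop this attempt
            cnt[state] = cnt.get(state, 0) + 1
    return cnt


# Toy two-state DFA: S1 --vowel--> S2 (hypothetical example)
transitions = {("S1", v): "S2" for v in "aeiou"}
tokens = list("banana")

cnt = count_state_visits(tokens, transitions)
cost = sum(cnt.values())
print(cnt)   # {'S1': 6, 'S2': 3}
print(cost)  # 9, i.e. W + cnt(S2) with W = 6
```

Every occurrence of a token triggers the same transition lookup again, which is the repeated computation the document-at-a-time approach pays for.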
Operations on Index
merge(L, L'): returns a posting list in which each entry occurs in posting list L, in L', or in both
consint(L, L'): returns a posting list in which each entry points to a token sequence consisting of two consecutive subsequences @sa and @sb, such that L has a pointer to @sa and L' has a pointer to @sb
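Cost in terms of the posting list sizes $|L|$ and $|L'|$:
$$\min\Big(|L| + |L'|,\ 2|L|\big(\log_2\tfrac{|L'|}{|L|} + 1\big),\ 2|L'|\big(\log_2\tfrac{|L|}{|L'|} + 1\big)\Big)$$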
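A minimal sketch of the two operations, assuming postings are (sid, first, last) triples as in the index structure. For brevity it uses sets, hashing, and sorting, so it does not try to meet the cost bound on the slide; the example posting lists are hypothetical.

```python
def merge(l1, l2):
    """Entries that occur in l1, in l2, or in both (sorted, no duplicates)."""
    return sorted(set(l1) | set(l2))


def consint(l1, l2):
    """Entries (sid, a.first, b.last) where b starts right after a ends in the same sentence."""
    ends_by_start = {}
    for sid, first, last in l2:
        ends_by_start.setdefault((sid, first), []).append(last)
    out = set()
    for sid, first, last in l1:
        for end in ends_by_start.get((sid, last + 1), []):
            out.add((sid, first, end))
    return sorted(out)


# Postings for "The company said that it will acquire the other company" (0-based positions)
the = [(0, 0, 0), (0, 7, 7)]
other = [(0, 8, 8)]
company = [(0, 1, 1), (0, 9, 9)]

print(consint(the, company))                  # [(0, 0, 1)]
print(consint(consint(the, other), company))  # [(0, 7, 9)]
print(merge(the, company))                    # [(0, 0, 0), (0, 1, 1), (0, 7, 7), (0, 9, 9)]
```

Chaining `consint` extends a match one subsequence at a time, while `merge` combines alternative matches.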
Implementing a DFA using Index
With each pair of state $s \in S$ and length $k \ge 1$, associate $list_{s,k}$: a posting list of token sequences of length $k$ which end in state $s$
Iteratively compute $list_{s,k}$ from its predecessor states
Example DFA: S1 --a--> S2 --b|c--> S3 --c--> S4, where $L(x)$ is the posting list of token $x$
$$list_{s_3,2} = merge(consint(list_{s_2,1}, L(b)),\ consint(list_{s_2,1}, L(c)))$$
$$list_{s_4,3} = consint(list_{s_3,2}, L(c))$$
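The iterative computation can be sketched for the example DFA as follows. This is an illustration under stated assumptions: postings are (sid, first, last) triples, the start state has no incoming arcs, and the names `run_dfa_on_index`, `arcs`, and the toy token stream are hypothetical.

```python
def merge(l1, l2):
    return sorted(set(l1) | set(l2))


def consint(l1, l2):
    starts = {}
    for sid, first, last in l2:
        starts.setdefault((sid, first), []).append(last)
    return sorted({(sid, first, end)
                   for sid, first, last in l1
                   for end in starts.get((sid, last + 1), [])})


def run_dfa_on_index(index, arcs, start, max_len):
    """lists[(s, k)]: postings of length-k token sequences that end in state s."""
    lists = {}
    for k in range(1, max_len + 1):
        for src, tok, dst in arcs:
            postings = index.get(tok, [])
            if src == start:
                # Length-1 sequences come straight from the token's posting list.
                step = postings if k == 1 else []
            else:
                step = consint(lists.get((src, k - 1), []), postings)
            if step:
                lists[(dst, k)] = merge(lists.get((dst, k), []), step)
    return lists


# Toy single-sentence corpus and its index
tokens = "a b c a c c a a".split()
index = {}
for pos, tok in enumerate(tokens):
    index.setdefault(tok, []).append((0, pos, pos))

arcs = [("S1", "a", "S2"), ("S2", "b", "S3"), ("S2", "c", "S3"), ("S3", "c", "S4")]
lists = run_dfa_on_index(index, arcs, start="S1", max_len=3)
print(lists[("S3", 2)])  # [(0, 0, 1), (0, 3, 4)]
print(lists[("S4", 3)])  # [(0, 0, 2), (0, 3, 5)]  -> "a b c" and "a c c"
```

Each list is computed once per (state, length) pair over the whole collection, instead of once per occurrence in each document.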
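Complexity Analysis for Index based Approach
$$\sum_{k \ge 1} |list_{S_i,k}| = cnt(S_i)$$
$$C_I = \sum_{i=2}^{N} \Big( \log(|prev(S_i)|) + 2 \sum_{s \in dest(S_i)} \log(\rho_{si}) \Big)\, cnt(S_i)$$
where
- $\rho_{si}$ = ratio of the posting list sizes
- $|prev(S_i)|$ = number of incoming arcs into $S_i$
Observation:
$$\frac{C_I}{C_D} = \frac{\sum_{i=2}^{N} \Big( \log(|prev(S_i)|) + 2 \sum_{s \in dest(S_i)} \log(\rho_{si}) \Big)\, cnt(S_i)}{W + \sum_{i=2}^{N} cnt(S_i)}$$
Dividing numerator and denominator by $W$, with $fcnt(S_i) = cnt(S_i)/W$:
$$\frac{C_I}{C_D} = \frac{\sum_{i=2}^{N} \Big( \log(|prev(S_i)|) + 2 \sum_{s \in dest(S_i)} \log(\rho_{si}) \Big)\, fcnt(S_i)}{1 + \sum_{i=2}^{N} fcnt(S_i)}$$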
Example: Simple Dictionary Match
Let tokens in T be drawn from {a, b, …, z}
Let D be a dictionary {a, e, i, o, u}
A simple 2-state DFA that matches D is: S1 --a|e|i|o|u--> S2
Ratio of document based match to index based match:
$$\frac{C_D}{C_I} = \frac{1 + fcnt(S_2)}{\log(5)\, fcnt(S_2)}$$
Desirable: $fcnt(S_2)$ (fraction of $cnt(S_2)$) $< 0.27$
Index based Annotation using Regular Expressions
NFA to DFA conversion may cause an explosion of states
Scan the regular expression from left to right and build an AND/OR graph recursively
Compute the posting list using the AND/OR graph by propagating lists from the leaves to the root node
Handling ? and Kleene Operators
Each node contains two binary properties
- isOpt: 1 if the regular expression is of the form R? (selfRecursion=? or *)
- selfLoop: 1 if the regular expression matched is of the form R+ (one or more times) (selfRecursion=* or +)
- For R* both the properties are set
New Operations
consint(L, L'): the generated list has isOpt set iff both arguments have isOpt set
merge(L, L'): the generated list has isOpt set if any of the arguments has isOpt set
consint(L, +): returns a posting list such that each entry points to at most a bounded number of subsequences in L
Computing Regular Expression using AND/OR Graph
Compute the posting list of each node, bottom up.
For each AND node use consint operation with the posting list of children nodes.
For each OR node use merge operation
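The bottom-up evaluation can be sketched as a recursive walk over the graph. This is an illustration only: the node encoding as `("tok" | "and" | "or", payload)` tuples, the function name `evaluate`, and the toy sentences are assumptions, and the isOpt / selfLoop properties are left out.

```python
from functools import reduce


def merge(l1, l2):
    return sorted(set(l1) | set(l2))


def consint(l1, l2):
    starts = {}
    for sid, first, last in l2:
        starts.setdefault((sid, first), []).append(last)
    return sorted({(sid, first, end)
                   for sid, first, last in l1
                   for end in starts.get((sid, last + 1), [])})


def build_index(sentences):
    index = {}
    for sid, sentence in enumerate(sentences):
        for pos, tok in enumerate(sentence.lower().split()):
            index.setdefault(tok, []).append((sid, pos, pos))
    return index


def evaluate(node, index):
    """Compute the posting list of a node from the posting lists of its children."""
    kind, payload = node
    if kind == "tok":  # leaf: look the token up in the index
        return index.get(payload, [])
    child_lists = [evaluate(child, index) for child in payload]
    if kind == "and":  # children must match consecutively
        return reduce(consint, child_lists)
    if kind == "or":   # any child may match
        return reduce(merge, child_lists)
    raise ValueError(f"unknown node kind: {kind}")


index = build_index(["The U.S.B. Holding Co. said", "U.S.B. Holding Group"])
# Graph for: holding (co. | group)
graph = ("and", [("tok", "holding"), ("or", [("tok", "co."), ("tok", "group")])])
matches = evaluate(graph, index)
print(matches)  # [(0, 2, 3), (1, 1, 2)]
```

Because each node's list is derived only from its children, changing one rule or dictionary entry requires recomputing only the affected part of the graph.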
Experimental Results
Data sets
- Enron email: 2.3 GB
- Reuters+20NG: 93 MB
8 rules for 4 annotations
- Person name, company name, location and date

Data set   GATE      Index based   Speedup Factor
Enron      4974343   374926        13.26
Reuter+    752287    92238         8.15

A greater speedup is achieved on the larger corpus.
Incremental annotations achieve even larger performance gains.

Data set   GATE      Index based   Speedup Factor
Enron      1479954   62227         23.78
Reuter+    661157    17929         36.87
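As a quick consistency check, the speedup factors follow from dividing the GATE figure by the index-based figure. The sketch below assumes the reported factors are truncated (not rounded) to two decimals, which is what reproduces the values in the tables; the dictionary keys are just labels chosen here.

```python
import math

# (GATE, index based) figures copied from the two tables
results = {
    "Enron": (4974343, 374926),
    "Reuter+": (752287, 92238),
    "Enron (incremental)": (1479954, 62227),
    "Reuter+ (incremental)": (661157, 17929),
}

# Truncate the ratio to two decimal places
speedups = {name: math.floor(gate / index_based * 100) / 100
            for name, (gate, index_based) in results.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor}")
```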
THANK YOU