Provenance-based Dictionary Refinement in Information Extraction
Sudeepa Roy (University of Washington), with Laura Chiticariu, Vitaly Feldman, Frederick R. Reiss, and Huaiyu Zhu (IBM Research-Almaden)

  • Slide 1

Sudeepa Roy (University of Washington), with Laura Chiticariu, Vitaly Feldman, Frederick R. Reiss, and Huaiyu Zhu (IBM Research-Almaden)

  • Slide 2

Information Extraction (IE)
SIGMOD 2013, 6/24/2013

Extracting structured information (entity, relation) from unstructured text.
[Figure: output of a Person extractor with the mentions Michael Kelly, Chelsea, Cynthia Rowley, and September, each marked as a true positive or a false positive.]
A mention is a span, e.g. (doc3, offset = [10, 21]).

  • Slide 3

We want to improve the quality of the extractors, i.e. output as many true positives as possible and as few false positives as possible.
This work does so by Dictionary Refinement (which can be used by itself or in addition to other approaches).

  • Slide 4

What are Dictionaries?

A Person extractor consists of a set of dictionaries and a set of extraction rules.
Example dictionaries:
- Names (Italian first names): Tiziana, Daniela, Guido, Luciano, Cesare
- Locations (German cities): Damp, Glandorf, Cobbenrode, Bad Laer, Aach
- Organizations (media): Time Warner, Comcast Corp., Walt Disney Co., News Corp., DirecTV Group

Dictionaries in IE:
- List of entries: person names, locations, organizations, etc.
- Also called gazetteers or lists.
- Example: rule-based IE systems like SystemT from IBM Almaden.
- Our approach can also be useful for machine-learning-based IE systems.

  • Slide 5

A Toy Example

Set of extraction rules, written in a declarative rule language (AQL, the Annotation Query Language in SystemT; details not needed in the talk):
    R1: create view FirstName as Dictionary(first.dict, Document, text);
    R2: create view LastName as Dictionary(last.dict, Document, text);
    R3: create view FullName as Dictionary(full.dict, Document, text);
    R4: create view FirstLast as select Merge(F.match, L.match) as match from FirstName F, LastName L where FollowsTok(F.match, L.match, 0, 0);
    R5: create table Person(match span); insert into Person ( select * from FullName A ) union ( select * from FirstLast A ) ) union ( select * from FirstName A where Not(MatchesRegex([ ]*[A-Z].*, RightContextTok(A.match, 1

Set of dictionaries:
- first.dict: chelsea, john, april, david
- last.dict: smith, lee
- full.dict: april smith, john brown

Person is the union of (1) FirstName, (2) FullName, and (3) FirstName followed by LastName (FirstLast).

  • Slide 6

A Toy Example - Execution

Input document: "This April, mark your calendars for the first derby of the season: Arsenal at Chelsea. ... April Smith and John Lee reporting live from ... David said that ..."

Steps:
- Extract first names (from first.dict).
- Extract last names (from last.dict).
- Extract full names (from full.dict).
- Find FirstLast = first names followed by last names.
- Output the union of FirstName, FullName, and FirstLast as Person.

Views/Results:
- FirstName: April, Chelsea, April, John, David
- LastName: Smith, Lee
- FullName: April Smith
- FirstLast: April Smith, John Lee
- Person: April, Chelsea, April Smith, John Lee, David

There are two different mentions (spans) of April in the document.

  • Slide 7

1. Better dictionaries give better result quality. Entries are collected from diverse sources, or generated automatically by a previous IE step, and are therefore noisy. So refined dictionaries are desirable.
Even more reasons:
2. To create specialized dictionaries based on domains (sports articles) or languages (German newspaper), e.g. from All First or Last Names to German Last Names.
3. To add sophisticated extraction rules: first find ambiguous entries, then treat them differently, e.g. Chelsea, April.

  • Slide 8

What do we mean by dictionary refinement?
Output the entries that are responsible for many false positives and not for too many true positives.
In practice, a human supervisor decides whether to actually delete these entries or to add a new rule to handle them.
[Figure: if the selected entries are deleted, some results get deleted and others do not get deleted.]

  • Slide 9

Next:
- Our Problem Definition
- Why is Provenance useful in IE?
- Measuring Output Quality

  • Slide 10

Provenance in IE

Connection with relational databases and queries:
- Treat dictionaries and views as relations.
- Assign unique ids to entries.
- Interpret extraction rules as an RA query.
- Propagate provenance annotations to the output (ref. Provenance Semirings by Green et al., 2007).

[Figure: the toy example (same dictionaries and views as Slide 6) annotated with provenance. Dictionary entries carry ids w1 to w8. The FullName result April Smith has provenance w5, the FirstLast result April Smith has w3·w4, and the Person result April Smith has w5 + w3·w4. The Person result April has w3.]

The dependency between entries and results is many-to-many. Which entries should be deleted to delete April and retain April Smith?
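The provenance propagation described on Slide 10 can be sketched in a few lines of Python. This is a minimal illustration, not SystemT or AQL: the tokenizer, the helper names (`dictionary_view`, `join_adjacent`, `union`), and the id w8 for "david" are assumptions made for the example. A provenance expression is stored as a set of monomials (each a frozenset of entry ids), so a join multiplies monomials and a union adds them.

```python
import re

# Toy dictionaries; ids w1..w7 follow the slides, w8 for "david" is an assumption.
FIRST = {"chelsea": "w1", "john": "w2", "april": "w3", "david": "w8"}
LAST = {"smith": "w4", "lee": "w6"}
FULL = {"april smith": "w5", "john brown": "w7"}

DOC = ("This April, mark your calendars for the first derby of the season: "
       "Arsenal at Chelsea. ... April Smith and John Lee reporting live from "
       "... David said that")

# Hypothetical tokenizer: words and single punctuation marks are tokens.
TOKENS = re.findall(r"\w+|[^\w\s]", DOC)

def dictionary_view(entries, width):
    """Match width-token phrases against a dictionary; annotate with the entry id."""
    view = {}
    for i in range(len(TOKENS) - width + 1):
        phrase = " ".join(TOKENS[i:i + width]).lower()
        if phrase in entries:
            view[(i, i + width)] = {frozenset([entries[phrase]])}
    return view

def join_adjacent(left, right):
    """Join spans that touch; the provenance of a joined result is the product."""
    out = {}
    for (a, b), p1 in left.items():
        for (c, d), p2 in right.items():
            if b == c:
                out[(a, d)] = {m1 | m2 for m1 in p1 for m2 in p2}
    return out

def union(*views):
    """Union of views; a span produced by several views gets the sum."""
    out = {}
    for view in views:
        for span, prov in view.items():
            out.setdefault(span, set()).update(prov)
    return out

first = dictionary_view(FIRST, 1)
last = dictionary_view(LAST, 1)
full = dictionary_view(FULL, 2)
first_last = join_adjacent(first, last)
# Keep a lone first name only if the next token is not capitalized.
lone_first = {s: p for s, p in first.items()
              if s[1] >= len(TOKENS) or not TOKENS[s[1]][0].isupper()}
person = union(full, first_last, lone_first)

PERSON_PROVENANCE = {" ".join(TOKENS[a:b]): prov
                     for (a, b), prov in sorted(person.items())}

for name, prov in PERSON_PROVENANCE.items():
    print(name, "=", " + ".join(sorted("·".join(sorted(m)) for m in prov)))
```

Deleting a set of entries then amounts to dropping every monomial that mentions a deleted id; a result disappears once it has no monomial left. Deleting w3 removes April but keeps April Smith, which still has the monomial w5.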
  • Slide 11

Measuring Output Quality

Standard measures in IE, on the toy Person output (April, Chelsea, April Smith, John Lee, David):
- Precision (P) = #true positives / (#true positives + #false positives) in output = 3/5
- Recall (R) = #true positives in output / #actual true positives = 3/3
- F-score (F) = 2/(1/P + 1/R) = 3/4

We use the F-score (to balance deletion of false and true positives): a higher F-score means higher extractor quality, and the maximum value is 1.

Provenance and F-score

Entry ids: w1: chelsea, w2: john, w3: april, w4: smith, w5: april smith, w6: lee, w7: john brown.
When entries are deleted, results whose provenance becomes false get deleted.
- Delete w3: April (provenance w3) is deleted, while April Smith (provenance w5 + w3·w4) is retained. P = 3/4, R = 3/3, F = 6/7.
- Delete {w3, w5}: the provenance of April Smith also becomes false, so it is deleted as well. F = 2/3.

  • Slide 12

Dictionary Refinement Problem

Objective: maximize the F-score.
Input: the provenance of each result and the label of each result (e.g. April Smith: w5 + w3·w4, April: w3, Chelsea: w1, John Lee: w2·w6).
Select a set S of entries to remove that maximizes the new F-score subject to |S| ≤ k or new recall ≥ r.
- Size constraint |S| ≤ k: limits the number of deleted entries.
- Recall constraint, new recall ≥ r: limits the number of true positives deleted.
Possible output: S = {w1: chelsea, w3: april}, with new F-score = 1.
We also studied the incomplete labeling case.

  • Slide 13

We have an optimization problem: maximize the F-score. What are the main challenges?
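The F-score arithmetic and the size-constrained objective above can be checked on the toy example with a short Python sketch. It assumes the provenance and labels recovered from the slides, the id w8 for "david", and that the three labeled true positives are all the actual true positives (so recall starts at 3/3). The search is a plain exhaustive enumeration chosen for illustration; it is not the algorithm from the talk, and it is exponential in k.

```python
from fractions import Fraction
from itertools import combinations

# result -> (provenance as a list of monomials, label); w8 is an assumed id.
RESULTS = {
    "April": ([{"w3"}], False),
    "Chelsea": ([{"w1"}], False),
    "April Smith": ([{"w5"}, {"w3", "w4"}], True),
    "John Lee": ([{"w2", "w6"}], True),
    "David": ([{"w8"}], True),
}
ENTRIES = sorted(set().union(*(m for prov, _ in RESULTS.values() for m in prov)))
ACTUAL_TRUE = sum(label for _, label in RESULTS.values())

def f_score(deleted):
    """F-score of the output after deleting the given dictionary entries."""
    deleted = set(deleted)
    # A result survives if at least one monomial uses no deleted entry.
    kept = [label for prov, label in RESULTS.values()
            if any(not (m & deleted) for m in prov)]
    tp = sum(kept)
    if tp == 0:
        return Fraction(0)
    precision = Fraction(tp, len(kept))
    recall = Fraction(tp, ACTUAL_TRUE)
    return 2 / (1 / precision + 1 / recall)

def refine(k):
    """Best set S with |S| <= k by exhaustive search (toy sizes only)."""
    candidates = (set(c) for n in range(k + 1)
                  for c in combinations(ENTRIES, n))
    return max(candidates, key=f_score)

print(f_score(set()))         # no deletion
print(f_score({"w3"}))        # delete april
print(f_score({"w3", "w5"}))  # delete april and april smith
print(sorted(refine(2)))      # best set of at most two entries
```

Using `Fraction` keeps the values exact, so they can be compared directly with the fractions on the slides.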
  • Slide 14

Complex Input-Output Dependency

[Figure: the toy example (same dictionaries and views as Slide 6) annotated with provenance, e.g. w5 + w3·w4 for the Person result April Smith and w3 for April.]
- Multiple inputs, one output.
- One input, multiple outputs (true/false positives).
- The dependency is many-to-many.
Which entries should we delete to retain April Smith and delete April?

  • Slide 15

Actual Extractors in IBM-Almaden SystemT
- Extract Person, Organization, Locations, ...
- > 400 extraction rules
- > 200 dictionaries (in addition to features like regular expressions, parts of speech, capitalization, punctuation)

[AQL source of the Person extractor shown on the slide:]
--------------------------------------- create view ValidLastNameAll as select N.lastname as lastname from LastNameAll N -- do not allow partially all capitalized words where Not(MatchesRegex(/(\p{Lu}\p{M}*)+-.*([\p{Ll}\p{Lo}]\p{M}*).*/, N.lastname)) and Not(MatchesRegex(/.*([\p{Ll}\p{Lo}]\p{M}*).*- (\p{Lu}\p{M}*)+/, N.lastname)); create view LastName as select C.lastname as lastname --from Consolidate(ValidLastNameAll.lastname) C; from ValidLastNameAll C consolidate on C.lastname; -- Find dictionary matches for all first names -- Mostly US first names create view StrictFirstName1 as select D.match as firstname from Dictionary('strictFirst.dict', Doc.text) D -- where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); -- German first names create view StrictFirstName2 as select D.match as firstname from Dictionary('strictFirst_german.dict', Doc.text) D -- where 
MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/\p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); -- nick names for US first names create view StrictFirstName3 as select D.match as firstname from Dictionary('strictNickName.dict', Doc.text) D -- where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/\p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); -- german first name from blue page create view StrictFirstName4 as select D.match as firstname from Dictionary('strictFirst_german_bluePages.dict', Doc.text) D -- where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/\p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); -- Italy first name from blue pages create view StrictFirstName5 as select D.match as firstname from Dictionary('names/strictFirst_italy.dict', Doc.text) D where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); -- France first name from blue pages create view StrictFirstName6 as select D.match as firstname from Dictionary('names/strictFirst_france.dict', Doc.text) D where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); -- Spain first name from blue pages create view StrictFirstName7 as select D.match as firstname from Dictionary('names/strictFirst_spain.dict', Doc.text) D where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); -- Indian first name from blue pages -- TODO: still need to clean up the remaining entries create view StrictFirstName8 as select D.match as firstname from Dictionary('names/strictFirst_india.partial.dict', Doc.text) D where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); -- Israel first name from blue pages create view StrictFirstName9 as select D.match as firstname from Dictionary('names/strictFirst_israel.dict', Doc.text) D where 
MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); -- union all the dictionary matches for first names create view StrictFirstName as (select S.firstname as firstname from StrictFirstName1 S) union all (select S.firstname as firstname from StrictFirstName2 S) union all (select S.firstname as firstname from StrictFirstName3 S) union all (select S.firstname as firstname from StrictFirstName4 S) union all (select S.firstname as firstname from StrictFirstName5 S) union all (select S.firstname as firstname from StrictFirstName6 S) union all (select S.firstname as firstname from StrictFirstName7 S) union all (select S.firstname as firstname from StrictFirstName8 S) union all (select S.firstname as firstname from StrictFirstName9 S); -- Relaxed versions of first name create view RelaxedFirstName1 as select CombineSpans(S.firstname, CP.name) as firstname from StrictFirstName S, StrictCapsPerson CP where FollowsTok(S.firstname, CP.name, 1, 1) and MatchesRegex(/\-/, SpanBetween(S.firstname, CP.name)); create view RelaxedFirstName2 as select CombineSpans(CP.name, S.firstname) as firstname from StrictFirstName S, StrictCapsPerson CP where FollowsTok(CP.name, S.firstname, 1, 1) and MatchesRegex(/\-/, SpanBetween(CP.name, S.firstname)); -- all the first names create view FirstNameAll as (select N.firstname as firstname from StrictFirstName N) union all (select N.firstname as firstname from RelaxedFirstName1 N) union all (select N.firstname as firstname from RelaxedFirstName2 N); create view ValidFirstNameAll as select N.firstname as firstname from FirstNameAll N where Not(MatchesRegex(/(\p{Lu}\p{M}*)+-.*([\p{Ll}\p{Lo}]\p{M}*).*/, N.firstname)) and Not(MatchesRegex(/.*([\p{Ll}\p{Lo}]\p{M}*).*- (\p{Lu}\p{M}*)+/, N.firstname)); create view FirstName as select C.firstname as firstname --from Consolidate(ValidFirstNameAll.firstname) C; from ValidFirstNameAll C consolidate on C.firstname; -- Combine all dictionary matches for both last names and first names create view NameDict as select 
D.match as name from Dictionary('name.dict', Doc.text) D -- where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/\p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); create view NameDict1 as select D.match as name from Dictionary('names/name_italy.dict', Doc.text) D where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); create view NameDict2 as select D.match as name from Dictionary('names/name_france.dict', Doc.text) D where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); create view NameDict3 as select D.match as name from Dictionary('names/name_spain.dict', Doc.text) D where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); create view NameDict4 as select D.match as name from FirstName FN, InitialWord IW, CapsPerson CP where FollowsTok(FN.firstname, IW.word, 0, 0) and FollowsTok(IW.word, CP.name, 0, 0); /** * Translation for Rule 3r2 * * This relaxed version of rule '3' will find person names like Thomas B.M. 
David * But it only insists that the second word is in the person dictionary */ /* CAPSPERSON INITIALWORD CAPSPERSON */ create view Person3r2 as select CombineSpans(CP.name, LN.lastname) as person from LastName LN, InitialWord IW, CapsPerson CP where FollowsTok(CP.name, IW.word, 0, 0) and FollowsTok(IW.word, LN.lastname, 0, 0); /** * Translation for Rule 4 * * This rule will find person names like David Thomas */ /* CAPSPERSON CAPSPERSON */ create view Person4WithNewLine as select CombineSpans(FN.firstname, LN.lastname) as person from FirstName FN, LastName LN where FollowsTok(FN.firstname, LN.lastname, 0, 0); -- Yunyao: 05/20/2008 revised to Person4WrongCandidates due to performance reason -- NOTE: current optimizer execute Equals first thus make Person4Wrong very expensive --create view Person4Wrong as --select CombineSpans(FN.firstname, LN.lastname) as person --from FirstName FN, -- LastName LN --where FollowsTok(FN.firstname, LN.lastname, 0, 0) -- and ContainsRegex(/[\n\r]/, SpanBetween(FN.firstname, LN.lastname)) -- and Equals(GetText(FN.firstname), GetText(LN.lastname)); create view Person4WrongCandidates as select FN.firstname as firstname, LN.lastname as lastname from FirstName FN, LastName LN where FollowsTok(FN.firstname, LN.lastname, 0, 0) and ContainsRegex(/[\n\r]/, SpanBetween(FN.firstname, LN.lastname)); create view Person4 as (select P.person as person from Person4WithNewLine P) minus (select CombineSpans(P.firstname, P.lastname) as person from Person4WrongCandidates P where Equals(GetText(P.firstname), GetText(P.lastname))); /** * Translation for Rule4a * This rule will find person names like Thomas, David */ /* CAPSPERSON \, CAPSPERSON */ create view Person4a as select CombineSpans(LN.lastname, FN.firstname) as person from FirstName FN, LastName LN where FollowsTok(LN.lastname, FN.firstname, 1, 1) and ContainsRegex(/,/,SpanBetween(LN.lastname, FN.firstname)); -- relaxed version of Rule4a -- Yunyao: split the following rules into two to improve 
performance -- TODO: Test case for optimizer -- create view Person4ar1 as -- select CombineSpans(CP.name, FN.firstname) as person --from FirstName FN, -- CapsPerson CP --where FollowsTok(CP.name, FN.firstname, 1, 1) --and ContainsRegex(/,/,SpanBetween(CP.name, FN.firstname)) --and Not(MatchesRegex(/(.|\n|\r)*(\.|\?|!|'|\sat|\sin)( )*/, LeftContext(CP.name, 10))) --and Not(MatchesRegex(/(?i)(.+fully)/, CP.name)) --and GreaterThan(GetBegin(CP.name), 10); create view Person4ar1temp as select FN.firstname as firstname, CP.name as name from FirstName FN, CapsPerson CP where FollowsTok(CP.name, FN.firstname, 1, 1) and ContainsRegex(/,/,SpanBetween(CP.name, FN.firstname)); create view Person4ar1 as select CombineSpans(P.name, P.firstname) as person from Person4ar1temp P where Not(MatchesRegex(/(.|\n|\r)*(\.|\?|!|'|\sat|\sin)( )*/, LeftContext(P.name, 10))) --' and Not(MatchesRegex(/(?i)(.+fully)/, P.name)) and GreaterThan(GetBegin(P.name), 10); create view Person4ar2 as select CombineSpans(LN.lastname, CP.name) as person from CapsPerson CP, LastName LN where FollowsTok(LN.lastname, CP.name, 0, 1) and ContainsRegex(/,/,SpanBetween(LN.lastname, CP.name)); /** * Translation for Rule2 * * This rule will handles names of persons like B.M. Thomas David, where Thomas occurs in some person dictionary */ /* INITIALWORD CAPSPERSON CAPSPERSON */ create view Person2 as select CombineSpans(IW.word, CP.name) as person from InitialWord IW, PersonDict P, CapsPerson CP where FollowsTok(IW.word, P.name, 0, 0) and FollowsTok(P.name, CP.name, 0, 0); /** * Translation for Rule 2a * * The rule handles names of persons like B.M. Thomas David, where David occurs in some person dictionary */ /* INITIALWORD CAPSPERSON NEWLINE ? CAPSPERSON */ create view Person2a as select CombineSpans(IW.word, P.name) as person from InitialWord IW, CapsPerson CP, PersonDict P where FollowsTok(IW.word, CP.name, 0, 0) and FollowsTok(CP.name, P.name, 0, 0); /* CAPSPERSON NEWLINE ? 
CAPSPERSON */ create view Person4r1 as select CombineSpans(FN.firstname, CP.name) as person from FirstName FN, CapsPerson CP where FollowsTok(FN.firstname, CP.name, 0, 0); /** * Translation for Rule 4r2 * * This relaxed version of rule '4' will find person names Thomas, David * But it only insists that the SECOND word is in some person dictionary */ /* ANYWORD CAPSPERSON NEWLINE ? CAPSPERSON */ create view Person4r2 as select CombineSpans(CP.name, LN.lastname) as person from CapsPerson CP, LastName LN where FollowsTok(CP.name, LN.lastname, 0, 0); /** * Translation for Rule 5 * * This rule will find other single token person first names */ /* INITIALWORD ? CAPSPERSON */ create view Person5 as select CombineSpans(IW.word, FN.firstname) as person from InitialWord IW, FirstName FN where FollowsTok(IW.word, FN.firstname, 0, 0); /** * Translation for Rule 6 * * This rule will find other single token person last names */ /* INITIALWORD ? CAPSPERSON */ create view Person6 as select CombineSpans(IW.word, LN.lastname) as person from InitialWord IW, LastName LN where FollowsTok(IW.word, LN.lastname, 0, 0); -- ================================================== ======== -- End of rules -- -- Create final list of names based on all the matches extracted -- -- ================================================== ======== /** * Union all matches found by strong rules, except the ones directly come * from dictionary matches */ create view PersonStrongWithNewLine as (select P.person as person from Person1 P) --union all -- (select P.person as person from Person1a_more P) union all (select P.person as person from Person3 P) union all (select P.person as person from Person4 P) union all (select P.person as person from Person3P1 P); create view PersonStrongSingleTokenOnly as (select P.person as person from Person5 P) union all (select P.person as person from Person6 P) union all (select P.firstname as person from FirstName P) union all (select P.lastname as person from LastName P) union 
all (select P.person as person from Person1a P); -- Yunyao: added 05/09/2008 to expand person names with suffix create view PersonStrongSingleTokenOnlyExpanded1 as select CombineSpans(P.person,S.suffix) as person from PersonStrongSingleTokenOnly P, PersonSuffix S where FollowsTok(P.person, S.suffix, 0, 0); -- Yunyao: added 04/14/2009 to expand single token person name with a single initial -- extend single token person with a single initial create view PersonStrongSingleTokenOnlyExpanded2 as select CombineSpans(R.person, RightContext(R.person,2)) as person from PersonStrongSingleTokenOnly R where MatchesRegex(/ +[\p{Upper}]\b\s*/, RightContext(R.person,3)); create view PersonStrongSingleToken as (select P.person as person from PersonStrongSingleTokenOnly P) union all (select P.person as person from PersonStrongSingleTokenOnlyExpanded1 P) union all (select P.person as person from PersonStrongSingleTokenOnlyExpanded2 P); /** * Union all matches found by weak rules */ create view PersonWeak1WithNewLine as (select P.person as person from Person3r1 P) union all (select P.person as person from Person3r2 P) union all (select P.person as person from Person4r1 P) union all (select P.person as person from Person4r2 P) union all (select P.person as person from Person2 P) union all (select P.person as person from Person2a P) union all (select P.person as person from Person3P2 P) union all (select P.person as person from Person3P3 P); -- weak rules that identify (LastName, FirstName) create view PersonWeak2WithNewLine as (select P.person as person from Person4a P) union all (select P.person as person from Person4ar1 P) union all (select P.person as person from Person4ar2 P); --include 'core/GenericNE/Person-FilterNewLineSingle.aql'; --include 'core/GenericNE/Person-Filter.aql'; create view PersonBase as (select P.person as person from PersonStrongWithNewLine P) union all (select P.person as person from PersonWeak1WithNewLine P) union all (select P.person as person from 
PersonWeak2WithNewLine P); output view PersonBase; from Dictionary('names/name_israel.dict', Doc.text) D where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match); create view NamesAll as (select P.name as name from NameDict P) union all (select P.name as name from NameDict1 P) union all (select P.name as name from NameDict2 P) union all (select P.name as name from NameDict3 P) union all (select P.name as name from NameDict4 P) union all (select P.firstname as name from FirstName P) union all create view PersonDict as select C.name as name --from Consolidate(NamesAll.name) C; from NamesAll C consolidate on C.name; --========================================================== -- Actual Rules --========================================================== -- For 3-part Person names create view Person3P1 as select CombineSpans(F.firstname, L.lastname) as person from StrictFirstName F, StrictCapsPersonR S, StrictLastName L where FollowsTok(F.firstname, S.name, 0, 0) --and FollowsTok(S.name, L.lastname, 0, 0) and FollowsTok(F.firstname, L.lastname, 1, 1) and Not(Equals(GetText(F.firstname), GetText(L.lastname))) and Not(Equals(GetText(F.firstname), GetText(S.name))) and Not(Equals(GetText(S.name), GetText(L.lastname))) and Not(ContainsRegex(/[\n\r\t]/, SpanBetween(F.firstname, L.lastname))); create view Person3P2 as select CombineSpans(P.name, L.lastname) as person from PersonDict P, StrictCapsPersonR S, StrictLastName L where FollowsTok(P.name, S.name, 0, 0) --and FollowsTok(S.name, L.lastname, 0, 0) and FollowsTok(P.name, L.lastname, 1, 1) and Not(Equals(GetText(P.name), GetText(L.lastname))) and Not(Equals(GetText(P.name), GetText(S.name))) and Not(Equals(GetText(S.name), GetText(L.lastname))) and Not(ContainsRegex(/[\n\r\t]/, SpanBetween(P.name, L.lastname))); create view Person3P3 as select CombineSpans(F.firstname, P.name) as person from PersonDict P, StrictCapsPersonR S, StrictFirstName F where FollowsTok(F.firstname, S.name, 0, 0) --and FollowsTok(S.name, P.name, 0, 0) and 
FollowsTok(F.firstname, P.name, 1, 1) and Not(Equals(GetText(P.name), GetText(F.firstname))) and Not(Equals(GetText(P.name), GetText(S.name))) and Not(Equals(GetText(S.name), GetText(F.firstname))) and Not(ContainsRegex(/[\n\r\t]/, SpanBetween(F.firstname, P.name))); /** * Translation for Rule 1 * Handles names of persons like Mr. Vladimir E. Putin */ /* CANYWORD CAPSPERSON INITIALWORD CAPSPERSON */ create view Person1 as select CombineSpans(CP1.name, CP2.name) as person from Initial I, CapsPerson CP1, InitialWord IW, CapsPerson CP2 where FollowsTok(I.initial, CP1.name, 0, 0) and FollowsTok(CP1.name, IW.word, 0, 0) and FollowsTok(IW.word, CP2.name, 0, 0); --and Not(ContainsRegex(/[\n\r]/, SpanBetween(I.initial, CP2.name))); /** * Translation for Rule 1a * Handles names of persons like Mr. Vladimir Putin */ /* CANYWORD CAPSPERSON {1,3} */ -- Split into two rules so that single token annotations are serperated from others -- Single token annotations create view Person1a1 as select CP1.name as person from Initial I, CapsPerson CP1 where FollowsTok(I.initial, CP1.name, 0, 0) --- start changing this block --- disallow allow newline and Not(ContainsRegex(/[\n\t]/,SpanBetween(I.initial,CP1.name))) --- end changing this block ; -- Yunyao: added 05/09/2008 to match patterns such as "Mr. B. B. 
Buy" /* create view Person1a2 as select CombineSpans(name.block, CP1.name) as person from Initial I, BlockTok(0, 1, 2, InitialWord.word) name, CapsPerson CP1 where FollowsTok(I.initial, name.block, 0, 0) and FollowsTok(name.block, CP1.name, 0, 0) and Not(ContainsRegex(/[\n\t]/,CombineSpans(I.initial, CP1.name))); */ create view Person1a as -- ( select P.person as person from Person1a1 P -- ) -- union all -- (select P.person as person from Person1a2 P) ; /* create view Person1a_more as select name.block as person from Initial I, BlockTok(0, 2, 3, CapsPerson.name) name where FollowsTok(I.initial, name.block, 0, 0) and Not(ContainsRegex(/[\n\t]/,name.block)) --- start changing this block -- disallow newline and Not(ContainsRegex(/[\n\t]/,SpanBetween(I.initial,name.block))) --- end changing this block ; */ /** * Translation for Rule 3 * Find person names like Thomas B.M. David */ /* CAPSPERSON INITIALWORD CAPSPERSON */ create view Person3 as select CombineSpans(P1.name, P2.name) as person from PersonDict P1, --InitialWord IW, WeakInitialWord IW, PersonDict P2 where FollowsTok(P1.name, IW.word, 0, 0) and FollowsTok(IW.word, P2.name, 0, 0) and Not(Equals(GetText(P1.name), GetText(P2.name))); /** * Translation for Rule 3r1 * * This relaxed version of rule '3' will find person names like Thomas B.M. 
David * But it only insists that the first word is in the person dictionary */ /* CAPSPERSON INITIALWORD CAPSPERSON */ create view Person3r1 as create view Initial as --'Junior' (Yunyao: comments out to avoid mismatches such as Junior National [team player], -- If we can have large negative dictionary to eliminate such mismatches, -- then this may be recovered --'Name:' ((Yunyao: comments out to avoid mismatches such as 'Name: Last Name') -- for German names -- TODO: need further test,'herr', 'Fraeulein', 'Doktor', 'Herr Doktor', 'Frau Doktor', 'Herr Professor', 'Frau professor', 'Baron', 'graf' ); -- Find dictionary matches for all title initials select D.match as initial --'Name:' ((Yunyao: comments out to avoid mismatches such as 'Name: Last Name') -- for German names -- TODO: need further test,'herr', 'Fraeulein', 'Doktor', 'Herr Doktor', 'Frau Doktor', 'Herr Professor', 'Frau professor', 'Baron', 'graf' ); -- Find dictionary matches for all title initials from Dictionary('InitialDict', Doc.text) D; -- Yunyao: added 05/09/2008 to capture person name suffix create dictionary PersonSuffixDict as ( ',jr.', ',jr', 'III', 'IV', 'V', 'VI' ); create view PersonSuffix as select D.match as suffix from Dictionary('PersonSuffixDict', Doc.text) D; -- Find capitalized words that look like person names and not in the non-name dictionary create view CapsPersonCandidate as select R.match as name --from Regex(/\b\p{Upper}\p{Lower}[\p{Alpha}]{1,20}\b/, Doc.text) R --from Regex(/\b\p{Upper}\p{Lower}[\p{Alpha}]{0,10}(['-][\p{Upper}])?[\p{Alpha}]{1,10}\b/, Doc.text) R -- change to enable unicode match --from Regex(/\b\p{Lu}\p{M}*[\p{Ll}\p{Lo}]\p{M}*[\p{L}\p{M}*]{0,10}(['-][\p{Lu}\p{M}*])?[\p{L}\p{M}*]{1,10}\b/, Doc.text) R --from Regex(/\b\p{Lu}\p{M}*[\p{Ll}\p{Lo}]\p{M}*[\p{L}\p{M}*]{0,10}(['-][\p{Lu}\p{M}*])?(\p{L}\p{M}*){1,10}\b/, Doc.text) R -- Allow fully capitalized words --from Regex(/\b\p{Lu}\p{M}*(\p{L}\p{M}*){0,10}(['-][\p{Lu}\p{M}*])?(\p{L}\p{M}*){1,10}\b/, Doc.text) R 
from RegexTok(/\p{Lu}\p{M}*(\p{L}\p{M}*){0,10}(['-][\p{Lu}\p{M}*])?(\p{L}\p{M}*){1,10}/, 4, Doc.text) R --' where Not(ContainsDicts( 'FilterPersonDict', 'filterPerson_position.dict', 'filterPerson_german.dict', 'InitialDict', 'StrongPhoneVariantDictionary', 'stateList.dict', 'organization_suffix.dict', 'industryType_suffix.dict', 'streetSuffix_forPerson.dict', 'wkday.dict', 'nationality.dict', 'stateListAbbrev.dict', 'stateAbbrv.ChicagoAPStyle.dict', R.match)); create view CapsPerson as select C.name as name from CapsPersonCandidate C where Not(MatchesRegex(/(\p{Lu}\p{M}*)+-.*([\p{Ll}\p{Lo}]\p{M}*).*/, C.name)) and Not(MatchesRegex(/.*([\p{Ll}\p{Lo}]\p{M}*).*-(\p{Lu}\p{M}*)+/, C.name)); -- Find strict capitalized words with two letter or more (relaxed version of StrictCapsPerson) --============================================================ --TODO: need to think through how to deal with hypened name -- one way to do so is to run Regex(pattern, CP.name) and enforce CP.name does not contain ' -- need more testing before confirming the change create view CapsPersonNoP as select CP.name as name from CapsPerson CP where Not(ContainsRegex(/'/, CP.name)); --' create view StrictCapsPersonR as select R.match as name --from Regex(/\b\p{Lu}\p{M}*(\p{L}\p{M}*){1,20}\b/, CapsPersonNoP.name) R; from RegexTok(/\p{Lu}\p{M}*(\p{L}\p{M}*){1,20}/, 1, CapsPersonNoP.name) R; --============================================================ -- Find strict capitalized words --create view StrictCapsPerson as create view StrictCapsPerson as select R.name as name from StrictCapsPersonR R where MatchesRegex(/\b\p{Lu}\p{M}*[\p{Ll}\p{Lo}]\p{M}*(\p{L}\p{M}*){1,20}\b/, R.name); -- Find dictionary matches for all last names create view StrictLastName1 as select D.match as lastname from Dictionary('strictLast.dict', Doc.text) D --where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, 
D.match); create view StrictLastName2 as select D.match as lastname from Dictionary('strictLast_german.dict', Doc.text) D --where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/\p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match); create view StrictLastName3 as select D.match as lastname from Dictionary('strictLast_german_bluePages.dict', Doc.text) D --where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/\p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match); create view StrictLastName4 as select D.match as lastname from Dictionary('uniqMostCommonSurname.dict', Doc.text) D --where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/\p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match); create view StrictLastName5 as select D.match as lastname from Dictionary('names/strictLast_italy.dict', Doc.text) D where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match); create view StrictLastName6 as select D.match as lastname from Dictionary('names/strictLast_france.dict', Doc.text) D where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match); create view StrictLastName7 as select D.match as lastname from Dictionary('names/strictLast_spain.dict', Doc.text) D where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match); create view StrictLastName8 as select D.match as lastname from Dictionary('names/strictLast_india.partial.dict', Doc.text) D where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match); create view StrictLastName9 as select D.match as lastname from Dictionary('names/strictLast_israel.dict', Doc.text) D where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match); 
create view StrictLastName as (select S.lastname as lastname from StrictLastName1 S) union all (select S.lastname as lastname from StrictLastName2 S) union all (select S.lastname as lastname from StrictLastName3 S) union all (select S.lastname as lastname from StrictLastName4 S) union all (select S.lastname as lastname from StrictLastName5 S) union all (select S.lastname as lastname from StrictLastName6 S) union all (select S.lastname as lastname from StrictLastName7 S) union all (select S.lastname as lastname from StrictLastName8 S) union all (select S.lastname as lastname from StrictLastName9 S); -- Relaxed version of last name create view RelaxedLastName1 as select CombineSpans(SL.lastname, CP.name) as lastname from StrictLastName SL, StrictCapsPerson CP where FollowsTok(SL.lastname, CP.name, 1, 1) and MatchesRegex(/\-/, SpanBetween(SL.lastname, CP.name)); create view RelaxedLastName2 as select CombineSpans(CP.name, SL.lastname) as lastname from StrictLastName SL, StrictCapsPerson CP where FollowsTok(CP.name, SL.lastname, 1, 1) and MatchesRegex(/\-/, SpanBetween(CP.name, SL.lastname)); -- all the last names create view LastNameAll as (select N.lastname as lastname from StrictLastName N) union all (select N.lastname as lastname from RelaxedLastName1 N) union all (select N.lastname as lastname from RelaxedLastName2 N); create view ValidLastNameAll as select N.lastname as lastname --------------------------------------- -- Document Preprocessing --------------------------------------- create view Doc as select D.text as text from DocScan D; ---------------------------------------- -- Basic Named Entity Annotators ---------------------------------------- -- Find initial words create view InitialWord1 as select R.match as word --from Regex(/\b([\p{Upper}]\.\s*){1,5}\b/, Doc.text) R from RegexTok(/([\p{Upper}]\.\s*){1,5}/, 10, Doc.text) R -- added on 04/18/2008 where Not(MatchesRegex(/M\.D\./, R.match)); -- Yunyao: added on 11/21/2008 to capture names with prefix (we 
use it as initial -- to avoid adding too many complex rules) create view InitialWord2 as select D.match as word from Dictionary('specialNamePrefix.dict', Doc.text) D; create view InitialWord as (select I.word as word from InitialWord1 I) union all (select I.word as word from InitialWord2 I); -- Find weak initial words create view WeakInitialWord as select R.match as word --from Regex(/\b([\p{Upper}]\.?\s*){1,5}\b/, Doc.text) R; from RegexTok(/([\p{Upper}]\.?\s*){1,5}/, 10, Doc.text) R -- added on 05/12/2008 -- Do not allow weak initial word to be a word longer than three characters where Not(ContainsRegex(/[\p{Upper}]{3}/, R.match)) -- added on 04/14/2009 -- Do not allow weak initial words to match the timezone and Not(ContainsDict('timeZone.dict', R.match)); ----------------------------------------------- -- Strong Phone Numbers ----------------------------------------------- create dictionary StrongPhoneVariantDictionary as ( 'phone', 'cell', 'contact', 'direct', 'office', -- Yunyao: Added new strong clues for phone numbers 'tel', 'dial', 'Telefon', 'mobile', 'Ph', 'Phone Number', 'Direct Line', 'Telephone No', 'TTY', 'Toll Free', 'Toll-free', -- German 'Fon', 'Telefon Geschaeftsstelle', 'Telefon Geschäftsstelle', 'Telefon Zweigstelle', 'Telefon Hauptsitz', 'Telefon (Geschaeftsstelle)', 'Telefon (Geschäftsstelle)', 'Telefon (Zweigstelle)', 'Telefon (Hauptsitz)', 'Telefonnummer', 'Telefon Geschaeftssitz', 'Telefon Geschäftssitz', 'Telefon (Geschaeftssitz)', 'Telefon (Geschäftssitz)', 'Telefon Persönlich', 'Telefon persoenlich', 'Telefon (Persönlich)', 'Telefon (persoenlich)', 'Handy', 'Handy-Nummer', 'Telefon arbeit', 'Telefon (arbeit)' ); --include 'core/GenericNE/Person.aql'; create dictionary FilterPersonDict as ( 'Travel', 'Fellow', 'Sir', 'IBMer', 'Researcher', 'All','Tell', 'Friends', 'Friend', 'Colleague', 'Colleagues', 'Managers','If', 'Customer', 'Users', 'User', 'Valued', 'Executive', 'Chairs', 'New', 'Owner', 'Conference', 'Please', 'Outlook', 'Lotus', 
'Notes', 'This', 'That', 'There', 'Here', 'Subscribers', 'What', 'When', 'Where', 'Which', 'With', 'While', 'Thanks', 'Thanksgiving','Senator', 'Platinum', 'Perspective', 'Manager', 'Ambassador', 'Professor', 'Dear', 'Contact', 'Cheers', 'Athelet', 'And', 'Act', 'But', 'Hello', 'Call', 'From', 'Center', 'The', 'Take', 'Junior', 'Both', 'Communities', 'Greetings', 'Hope', 'Restaurants', 'Properties', 'Let', 'Corp', 'Memorial', 'You', 'Your', 'Our', 'My', 'His','Her', 'Their','Popcorn', 'Name', 'July', 'June','Join', 'Business', 'Administrative', 'South', 'Members', 'Address', 'Please', 'List', 'Public', 'Inc', 'Parkway', 'Brother', 'Buy', 'Then', 'Services', 'Statements', 'President', 'Governor', 'Commissioner', 'Commitment', 'Commits', 'Hey', 'Director', 'End', 'Exit', 'Experiences', 'Finance', 'Elementary', 'Wednesday', 'Nov', 'Infrastructure', 'Inside', 'Convention', 'Judge', 'Lady', 'Friday', 'Project', 'Projected', 'Recalls', 'Regards', 'Recently', 'Administration', 'Independence', 'Denied', 'Unfortunately', 'Under', 'Uncle', 'Utility', 'Unlike', 'Was', 'Were', 'Secretary', 'Speaker', 'Chairman', 'Consider', 'Consultant', 'County', 'Court', 'Defensive', 'Northwestern', 'Place', 'Hi', 'Futures', 'Athlete', 'Invitational', 'System', 'International', 'Main', 'Online', 'Ideally' -- more entries,'If','Our', 'About', 'Analyst', 'On', 'Of', 'By', 'HR', 'Mkt', 'Pre', 'Post', 'Condominium', 'Ice', 'Surname', 'Lastname', 'firstname', 'Name', 'familyname', -- Italian greeting 'Ciao', -- Spanish greeting 'Hola', -- French greeting 'Bonjour', -- new entries 'Pro','Bono','Enterprises','Group','Said','Says','Assistant',' Vice','Warden','Contribution', 'Research', 'Development', 'Product', 'Sales', 'Support', 'Manager', 'Telephone', 'Phone', 'Contact', 'Information', 'Electronics','Managed','West','East','North','South', 'Teaches','Ministry', 'Church', 'Association', 'Laboratories', 'Living', 'Community', 'Visiting', 'Officer', 'After', 'Pls', 'FYI', 'Only', 'Additionally', 
'Adding', 'Acquire', 'Addition', 'America', -- short phrases that are likely to be at the start of a sentence 'Yes', 'No', 'Ja', 'Nein','Kein', 'Keine', 'Gegenstimme', -- TODO: to be double checked 'Another', 'Anyway','Associate', 'At', 'Athletes', 'It', 'Enron', 'EnronXGate', 'Have', 'However', 'Company', 'Companies', 'IBM','Annual', -- common verbs appear with person names in financial reports -- ideally we want to have a general comprehensive verb list to use as a filter dictionary 'Joins', 'Downgrades', 'Upgrades', 'Reports', 'Sees', 'Warns', 'Announces', 'Reviews' -- Laura 06/02/2009: new filter dict for title for SEC domain in filterPerson_title.dict ); create dictionary GreetingsDict as ( 'Hey', 'Hi', 'Hello', 'Dear', -- German greetings 'Liebe', 'Lieber', 'Herr', 'Frau', 'Hallo', -- Italian 'Ciao', -- Spanish 'Hola', -- French 'Bonjour' ); create dictionary InitialDict as ( 'rev.', 'col.', 'reverend', 'prof.', 'professor.', 'lady', 'miss.', 'mrs.', 'mrs', 'mr.', 'pt.', 'ms.', 'messrs.', 'dr.', 'master.', 'marquis', 'monsieur', 'ds', 'di' --'Dear' (Yunyao: comments out to avoid mismatches such as Dear Member), --'Junior' (Yunyao: comments out to avoid mismatches such as Junior National [team player], -- If we can have large negative dictionary to eliminate such mismatches, -- then this may be recovered
Extracts Person, Organization, Locations: > 400 extraction rules, > 200 dictionaries (in addition to features like regular expressions, parts of speech, capitalization, punctuation)
Actual Extractors in IBM-Almaden SystemT
Slide 16 Complex Objective Function
New F-score after deleting S: $F_{-S} = \frac{2 G_{-S}}{G_o + G_{-S} + B_{-S}}$
$G_o$ = original #true positives
$G_{-S}$ = remaining #true positives after deleting S
$B_{-S}$ = remaining #false positives after deleting S
Both numerator
and denominator depend on S (even if we try to rewrite the expression)
Slide 17 Rest of the Talk
Optimization Problem
A Simple Class of Extractors (Simple-Rules)
General Case (Complex Rules)
Handling Incomplete Labeling
Experimental Evaluation
Related Work and Conclusions
Slide 18 Simple Rules
Find matches of dictionary entries from text
name.dict: w1: chelsea, w2: john, w3: april, w8: david
Input document: This April, mark your calendars for the first derby of the season: Arsenal at Chelsea. ... April Smith and John Lee reporting live from ... David said that ...
Person: April, Chelsea, April, John, David (each result is annotated with the dictionary entry that produced it, e.g. w3, w1, w8)
1. Provenance has a simple form
2. One-to-many (not many-to-many)
Has independent applications
Optimization is not that simple!
Slide 19 Results: Simple Rules
Greedy is not optimal
NP-hard (reduction from the subset-sum problem)
Near optimal Algorithm (simple, provably close to optimal)
Optimal Algorithm
Some details next
Slide 20 Sketch of Optimal Algorithm for Simple Rules, Size Constraint $|S| \le k$
1. Guess the optimal F-score $\theta$
2. Verify if there exists a subset S, $|S| \le k$, giving this F-score
3. Repeat by binary search in [0, 1] until the optimal is found
We want to handle the F-score ratio: $F_{-S} = \frac{2 G_{-S}}{G_o + G_{-S} + B_{-S}} \ge \theta$
$\Leftrightarrow G_{-S}(2 - \theta) - \theta B_{-S} - \theta G_o \ge 0$
with $G_{-S} = G_o - \sum_{w \in S} G_w$ and $B_{-S} = B_o - \sum_{w \in S} B_w$
$\Leftrightarrow \sum_{w \in S} f(G_w, B_w) \ge \mathrm{Const}$, where $|S| \le k$
Top-k problem, poly-time!
Binary search on real numbers in [0, 1] (still poly-time)
Does not work for general case (many-to-many)
Slide 21 What about General Case (Complex Rules)?
Arbitrary extraction rules
Arbitrary provenance
Many-to-many dependency
[Figure: results April Smith, John Lee, David, annotated with Boolean provenance expressions such as w5 + w3 w4 and w2 w6]
Slide 22 Results: Complex Rules
Simple Rules (Provenance: w): Optimal Algorithm, NP-hard, Near optimal Algorithm
NP-hard even for two dictionaries (reduction from the k-densest subgraph problem)
Requires evaluation of Boolean provenance
Efficient Heuristics Sketch: Find an initial solution, then improve the solution by hill-climbing
Slide 23 What if not all the results are labeled?
So far we assumed all results are labeled as true positive / false positive
Labeling is expensive
Ignoring unlabeled results may lead to over-fitting
We estimate missing labels
[Figure: results April, Chelsea, April Smith, John Lee, David, annotated with provenance such as w5 + w3 w4, w3, w7, w1, w2 w6]
Slide 24 Estimating Missing Labels
Simple Rules
Possible approach: Label of an entry = Empirical fraction of true positives
w3 april: 0.33, w1 chelsea: 0.50, w7 david: 1.00
Complex Rules
Empirical estimation does not work!
Arbitrary monotone Boolean expressions
Very few or no labels available!
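For simple rules, the empirical-fraction estimate on this slide can be sketched in Python as below. This is only an illustration: the function name and the labeled sample are hypothetical, and the sample was chosen so that the fractions match the values shown on the slide (april 0.33, chelsea 0.50, david 1.00).

```python
from collections import defaultdict


def empirical_entry_scores(labeled_results):
    """Estimate each entry's label as its empirical fraction of true positives.

    labeled_results: iterable of (entry, is_true_positive) pairs, one per
    labeled result. With simple rules each result comes from exactly one entry.
    """
    good = defaultdict(int)
    total = defaultdict(int)
    for entry, is_true_positive in labeled_results:
        total[entry] += 1
        if is_true_positive:
            good[entry] += 1
    # Entries with no labeled result never appear here: they get no estimate.
    return {entry: good[entry] / total[entry] for entry in total}


# Hypothetical labeled sample (not taken from the talk's datasets).
labeled = [
    ("april", False), ("april", False), ("april", True),
    ("chelsea", True), ("chelsea", False),
    ("david", True), ("david", True),
]
scores = empirical_entry_scores(labeled)
```

An entry without any labeled result gets no score at all in this sketch, which is one way to see why the estimate breaks down when very few labels are available.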
We assume a statistical model and estimate labels using the Expectation-Maximization (EM) algorithm
[Figure: scale from 0 to 1 with entries w7, w3, w1; caption "Reduces to"; results Chelsea, April, David; value 0.33]
Slide 25 Experimental Evaluation
Slide 26 Evaluation
Experimental Setting
IE system: SystemT (IBM Almaden)
Rule language: AQL (Annotation Query Language)
Dataset: CoNLL, ACE
Person/Organization extractors (1 rule to more than 400 rules)
Purpose of the experiments
Optimization: Compare the final F-scores and running time of the algorithms, for all cases of Simple/Complex rules and Size/Recall constraints
Label Estimation: Compare our label estimation approach with optimization only on the labeled data
Qualitative Evaluation: whether the returned entries are meaningful
Slide 27 Sample Experiment: Optimization
Simple rules, size constraint
Optimal algorithm performs better than other approaches (and is also efficient; see the paper)
Slide 28 Sample Experiment: Label Estimation
Simple rules, size constraint
- F-score improves when the number of labeled results increases
- Label estimation gives better results
[Plot: x-axis "Fraction of Labeled Results": 100%, 50%, 25%, 12.5%, 6.25%, 3.125%]
Slide 29 Sample Experiment: Qualitative Evaluation
Top-7 output by our optimal algorithm (Simple rules, size constraint), on a person-name dictionary with 13k entries
Entry | 10% Labeled (test) dataset | Actual dataset
china | (0, 12) | (0, 100)
kong | (0, 11) | (0, 70)
june | (0, 9) | (0, 97)
hong | (1, 10) | (2, 71)
september | (0, 8) | (0, 101)
king | (0, 5) | (6, 20)
louis | (1, 6) | (4, 33)
Incomplete 10% Labeled Data
Bad/Ambiguous entries as person names!
More experiments can be found in the paper
Slide 30 Related Work
Entity Extraction: Agichtein et al. '00, Elmeleegy et al. '09, Riloff et al. '93, Yates et al. '07, etc.
Named Entity Disambiguation: Hoffart et al. '11, etc.
Rule Refinement in IE: Liu et al. '10, Chai et al. '09, Shen et al. '08, etc.
Causality: Meliou et al. '11, etc.
Deletion Propagation: Buneman et al. '02, Kimelfeld et al. '11, etc.
Our work: Orthogonal to these approaches and can be used in addition to them
Alternative Objective
Special Case
Slide 31 Conclusions
Summary of our contributions:
Dictionary refinement as an optimization problem
Theoretical analysis and Experimental Evaluation
Handling incomplete labels
Future Work
Better model for label estimation that considers correlation
Adaptively labeling results
Slide 32 Thank you
Questions?
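The optimal algorithm for simple rules under a size constraint (Slide 20) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation. Here theta denotes the guessed F-score. The per-entry score theta * B_w - (2 - theta) * G_w and the constant theta * B_o - 2 * (1 - theta) * G_o are obtained by expanding the inequality "F-score after deleting S >= theta" with additive per-entry counts. The binary search runs for a fixed number of iterations rather than the exact poly-time search mentioned in the talk, and all function names and counts are hypothetical.

```python
def f_score_after_deletion(counts, deleted):
    """F-score 2*G_-S / (G_o + G_-S + B_-S) after deleting the entries in `deleted`.

    counts maps entry -> (G_w, B_w), the true and false positives produced by
    that entry. Assumes at least one true positive overall (G_o > 0).
    """
    g_o = sum(g for g, _ in counts.values())
    g = sum(g for w, (g, _) in counts.items() if w not in deleted)
    b = sum(b for w, (_, b) in counts.items() if w not in deleted)
    return 2 * g / (g_o + g + b)


def find_subset(counts, k, theta):
    """Return a set S with |S| <= k whose deletion gives F-score >= theta, else None."""
    g_o = sum(g for g, _ in counts.values())
    b_o = sum(b for _, b in counts.values())
    const = theta * b_o - 2 * (1 - theta) * g_o
    # Score of deleting entry w alone; deletions add up because each result
    # depends on exactly one entry (simple rules).
    scored = sorted(
        ((theta * b - (2 - theta) * g, w) for w, (g, b) in counts.items()),
        reverse=True,
    )
    # Top-k step: keep at most k entries, and only those with a positive score.
    top = [(s, w) for s, w in scored[:k] if s > 0]
    if sum(s for s, _ in top) >= const - 1e-12:
        return {w for _, w in top}
    return None


def refine_dictionary(counts, k, iterations=50):
    """Binary search on theta in [0, 1]; returns (entries to delete, resulting F-score)."""
    lo, hi = 0.0, 1.0
    best = set()
    for _ in range(iterations):
        theta = (lo + hi) / 2
        subset = find_subset(counts, k, theta)
        if subset is not None:
            lo, best = theta, subset
        else:
            hi = theta
    return best, f_score_after_deletion(counts, best)


# Hypothetical counts (G_w, B_w) per dictionary entry.
counts = {"april": (1, 2), "chelsea": (1, 1), "john": (1, 0), "david": (2, 0)}
best, score = refine_dictionary(counts, k=2)
```

In this toy input, deleting a second entry would remove a true positive for too little gain, so the search settles on a single deletion even though two are allowed. The additive scores rely on the one-to-many provenance of simple rules, which is why the same check does not carry over to the many-to-many case.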