Adaptive Text Extraction and Mining
Nicholas Kushmerick
Department of Computer Science
University College Dublin
[email protected] · /staff/nick

Fabio Ciravegna
Department of Computer Science
University of Sheffield
[email protected] · /~fabio
ECML-2003 Tutorial
Ciravegna & Kushmerick: ECML-2003 Tutorial
What is IE? What can we extract from the Web, and why?

• Introduction (20 min)
  • What is IE?
  • What can we extract from the Web?
  • Why?
• Algorithms and methodologies (100 min)
• IE in practice (30 min)
• Conclusion, future work (10 min)
• Discussion
The ‘canonical’ IE task
• Input:
  • Document: newspaper article, Web page, email message, …
  • Pre-defined "information need": frame slots, template fillers, database tuples, …
• Output:
  • The specific substrings/fragments of the document (or labels) that satisfy the stated information need, possibly organised in a template
• DARPA's 'Message Understanding Conferences/Competitions' since the late 1980s; most recent: MUC-7, 1998.
• Recent interest in the machine learning and Web communities.
IE Standard Tasks
• Preprocessing: tokenization, morphological analysis, part-of-speech tagging
• Information identification: named entity recognition, template filling (from the MUC):
  • Template elements
  • Template relations
  • Scenario template
NE Recognition & Coreference

[Slide figure: the Moody's/Saskatchewan news text of the next slide, annotated with named-entity labels (Organisation, Money, Percentage, Date) and coreference links]
19:16 Moody's rates Province of Saskatchewan A3
Moody's Investors Service Inc said it assigned an A3 rating to the Province of Saskatchewan's C$115 million bond offering that was priced today. The sale is a reopening of the province's 9.6 percent bonds due February 4, 2022. Proceeds will be used for government purposes, mainly Saskatchewan Power Corp.
Template Filling
amount C$115 million
issuer Province of Saskatchewan
placement-date today
maturity February 4, 2022
rate 9.6 percent
The Big Picture

[Slide diagram: IE feeds a database with information extracted from the Web and intranets, guided by an ontology; users access the database through a query processor]
NYU Architecture: a MUC architecture

[Slide diagram of the pipeline]
Local text analysis: Lexical Analysis → Name Recognition → Partial Parsing → Scenario Pattern Matching
Discourse analysis: Coreference Analysis → Inference
Finally: Template Generation
Semantic Web
• A brain for humankind
• From information-based to knowledge-based
• Processable knowledge means:
  • Better retrieval
  • Reasoning
• Where can IE contribute?
Building the SW
• Document annotation
  • Manually associate documents (or parts of them) with ontological descriptions
  • Document classification for retrieval
    • "Where can I buy a hamster?"
    • Pet shop web page -> pet shop concept -> hamster
• Knowledge annotation
  • "Where can I find a hotel in Berlin where single rooms cost less than 400€?"
  • "The hotel is located in central Berlin and the cost for a single room is 300€"
• Editors are currently available for manual annotation of texts
IE for Annotating Documents
• Manual annotation is
  • Expensive
  • Error prone
• IE can be used for annotating documents
  • Automatically
  • Semi-automatically
  • As user support
• Advantages
  • Speed
  • Low cost
  • Consistency
  • Can provide automatic annotation different from the one provided by the author(!)
SW for Knowledge Management
• SW is important for everyday Internet users
• SW is necessary for large companies
  • Millions of documents where knowledge is interspersed
• Most documents are now
  • Web-based
  • Available over an intranet
• Companies are valued for their
  • Tangible assets (e.g. plants)
  • Intangible assets (e.g. knowledge)
• Knowledge is stored in
  • Minds of employees
  • Documentation
• Companies spend 7-10% of revenues on KM
Why Adaptive Systems?
• Writing IE systems by hand is difficult and error prone
  • Extraction languages can be quite complex
  • Tedious write-test-debug-rewrite cycle
• Adaptive systems learn from user annotations
  • The person tells the learning algorithm what to extract; the learner figures out how
• Advantages
  • Annotating text is simpler & faster than writing rules
  • Domain independent
  • Domain experts don't need to be linguists or programmers
  • Learning algorithms ensure full coverage of the examples
Algorithms and Methodologies
• Introduction (20 min)
• Algorithms and methodologies (100 min)
  • Wrapper induction
  • Boosted wrapper induction
  • Hidden Markov models
  • Exploiting linguistic constraints
• IE in practice (30 min)
• Conclusion, future work (10 min)
• Discussion
A dip into the details of IE for the Web
Algorithms: Outline
• Wrappers
  • Hand-coded wrappers
  • Wrapper induction
  • Learning highly expressive wrappers
• Boosted wrapper induction
• Hidden Markov models
• Exploiting linguistic constraints

[Slide figure: these techniques span a spectrum from structured data to natural text]
Wrapper induction
Highly regular source documents
⇓
Relatively simple extraction patterns
⇓
Efficient learning algorithms
⟨ (Congo, 242) (Egypt, 20) (Belize, 501) (Spain, 34) ⟩
Wrappers: Example and Toolkits
• Wrapper toolkits: specialized programming environments for writing & debugging wrappers by hand
• Examples
  • World Wide Web Wrapper Factory [db.cis.upenn.edu/W4F]
  • Java Extraction & Dissemination of Information [www.darmstadt.gmd.de/oasys/projects/jedi]
Wrappers: Delimiter-based extraction

Use <B>, </B>, <I>, </I> for extraction:

<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
"Left-Right" wrappers  [Kushmerick et al, IJCAI-97; Kushmerick AIJ-2000]

procedure ExtractCountryCodes
  while there are more occurrences of <B>
    1. extract Country between <B> and </B>
    2. extract Code between <I> and </I>

procedure ExtractAttributes
  while there are more occurrences of l1
    1. extract 1st attribute between l1 and r1
    …
    K. extract Kth attribute between lK and rK

A Left-Right wrapper ≡ 2K strings ⟨l1, r1, …, lK, rK⟩ (the li are left delimiters, the ri are right delimiters).
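The ExtractAttributes procedure can be sketched directly in code. A minimal sketch (the function name `lr_extract` and the delimiter-list format are ours, for illustration only):

```python
def lr_extract(page, delimiters):
    """Execute a Left-Right wrapper.
    delimiters is [(l1, r1), ..., (lK, rK)]; each record yields one tuple."""
    records = []
    pos = 0
    while page.find(delimiters[0][0], pos) != -1:  # another record starts at the next l1
        record = []
        for left, right in delimiters:
            i = page.find(left, pos)
            if i == -1:
                return records
            i += len(left)
            j = page.find(right, i)     # attribute value lies between left and right
            if j == -1:
                return records
            record.append(page[i:j])
            pos = j + len(right)        # continue scanning after the right delimiter
        records.append(tuple(record))
    return records

page = ("<HTML><TITLE>Some Country Codes</TITLE>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<B>Belize</B> <I>501</I><BR>"
        "<B>Spain</B> <I>34</I><BR></BODY></HTML>")
result = lr_extract(page, [("<B>", "</B>"), ("<I>", "</I>")])
# result == [('Congo', '242'), ('Egypt', '20'), ('Belize', '501'), ('Spain', '34')]
```

Note that executing a wrapper is trivial; the interesting problem, addressed next, is learning the delimiter strings from labeled pages.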
Wrapper induction

Thai food is spicy. Vietnamese food is spicy. German food isn't spicy.  ⇒  Asian food is spicy.

examples ⇒ hypothesis, just as labeled pages ⇒ wrapper

[Slide figure: four copies of the labelled country-codes page, each with its Country/Code values highlighted, from which the wrapper is induced]
Learning LR wrappers

Find the 2K strings ⟨l1, r1, …, lK, rK⟩. Example: find 4 strings ⟨l1, r1, l2, r2⟩ = ⟨<B>, </B>, <I>, </I>⟩

labeled pages ⇒ wrapper

[Slide figure: the labelled country-codes pages again, with the induced delimiters]
LR: Finding r1

<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

r1 can be any prefix of the text that follows each labelled Country value, e.g. </B>
LR: Finding l1, l2 and r2

(Same page as above.)

l1 can be any suffix of the text preceding each Country value, e.g. <B>
l2 can be any suffix, e.g. <I>
r2 can be any prefix, e.g. </I>
A problem with LR wrappers

Distracting text in the head and tail:

<HTML><TITLE>Some Country Codes</TITLE><BODY><B>Some Country Codes</B><P>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
<HR><B>End</B></BODY></HTML>

(The <B>…</B> fragments in the head and tail would be wrongly extracted as countries.)
One (of many) solutions: HLRT

Ignore the page's head and tail:

[Slide figure: the page is split into head, body, and tail; extraction runs only between the end of the head and the start of the tail]

Head-Left-Right-Tail (HLRT) wrappers
Expressiveness

[Slide figure: Venn diagram relating all sites, the sites wrappable by LR, and the sites wrappable by HLRT]

Theorem: (the relative expressiveness of the LR and HLRT classes, as shown in the diagram)
Coverage
Fraction of randomly-selected "data-heavy" Web sites (search engines, retail, weather, news, finance, …) for which a wrapper in a given class was learned.
Sample complexity
• The key problem with machine learning: training data is expensive and tedious to generate
• In practice, active learning and specialized algorithms have reduced training requirements considerably
• But this isn't theoretically satisfying
• Computational learning theory:
  • Time complexity: time required by an algorithm to terminate, as a function of problem parameters
  • Sample complexity: training data required by a learning algorithm to converge to the correct hypothesis, as a function of problem parameters
A Model of Sample Complexity

P[correct wrapper] = f(size of documents, number of documents, number of attributes per record)

Analyze the wrapper learning task to derive this function. (Actually, we can compute only a bound on this probability.)

Just like time/space complexity:

time[learn wrapper] = g(size of documents, number of documents, number of attributes per record)
PAC results - LR wrappers
Theorem: Suppose we learn LR wrapper W from training set E, where the longest document has length R and each record contains K attributes. If |E| exceeds a bound (given on the slide) that grows with K, R, 1/ε and 1/δ, then W is probably approximately correct: error(W) < ε with probability at least 1-δ.
More sophisticated wrappers
• LR & HLRT wrappers are extremely simple (though useful for ~2/3 of real Web sites!)
• Recent wrapper induction research has explored more expressive wrapper classes [Muslea et al, Agents-98; Hsu et al, JIS-98; Thomas et al, JIS-00, …]:
  • Disjunctive delimiters
  • Sequential/landmark-based delimiters
  • Multiple attribute orderings
  • Missing attributes
  • Multiple-valued attributes
  • Hierarchically nested data
• Wrapper verification/maintenance [Kushmerick, AAAI-1999; Kushmerick WWWJ-00; Cohen, AAAI-1999; Minton et al, AAAI-00]
One of my favorites
• RoadRunner [Valter Crescenzi et al; Univ Roma 3]
• Unsupervised wrapper induction
  • They research databases, not machine learning, so they didn't realize training data was needed :-)
• Intuition:
  • Pose two different queries
  • The common bits of the resulting documents come from the template and can be ignored
  • The bits that differ are the data we're looking for
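The template/data split can be illustrated with a plain sequence diff over two result pages. This is only a toy sketch of the intuition (the actual RoadRunner system aligns pages and infers a wrapper grammar; the function name and token-level diff are ours):

```python
import difflib

def template_vs_data(page_a, page_b):
    """Tokens shared by both pages are assumed to come from the template;
    differing runs are the data we are looking for."""
    a, b = page_a.split(), page_b.split()
    template, data_a, data_b = [], [], []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])          # common content: template
        else:
            data_a.append(" ".join(a[i1:i2]))  # varying content: data
            data_b.append(" ".join(b[j1:j2]))
    return template, data_a, data_b

tmpl, da, db = template_vs_data(
    "Title: Database Primer Author: Smith Price: 30",
    "Title: Logic Handbook Author: Jones Price: 25")
# tmpl == ['Title:', 'Author:', 'Price:']
# da == ['Database Primer', 'Smith', '30'], db == ['Logic Handbook', 'Jones', '25']
```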
Roadrunner - Example
• Common content = part of the template. Varying content = the data!
• Complications: dynamic but unwanted content, e.g. advertisements or timestamps
Algorithms: Outline
✓ Wrappers
✓ Hand-coded wrappers
✓ Wrapper induction
✓ Learning highly expressive wrappers
• Boosted wrapper induction
• Hidden Markov models
• Exploiting linguistic constraints

[Slide figure: the natural text ↔ structured data spectrum]
Boosted wrapper induction [Freitag & Kushmerick, AAAI-00]

• Wrapper induction is suitable only for rigidly-structured machine-generated HTML…
• … or is it?!
• Can we use simple patterns to extract from natural language documents?

… Name: Dr. Jeffrey D. Hermes …
… Who: Professor Manfred Paul …
… will be given by Dr. R. J. Pangborn …
… Ms. Scott will be speaking …
… Karen Shriver, Dept. of …
… Maria Klawe, University of …
BWI: The basic idea
• Learn "wrapper-like" patterns for natural texts (pattern = exact token sequence)
• Learn many such "weak" patterns
• Combine them with boosting to build a "strong" ensemble pattern
• Of course, not all natural text is sufficiently regular!
• Demo: www.smi.ucd.ie/bwi
Covering Algorithms
• Generalization of the covering algorithm for learning disjunctive rules

[Slide figure: a scatter of positive (+) and negative (-) examples; the first rule covers a cluster of positives]
Covering Algorithms (continued)

[Slide animation: at each step a new rule covers some of the remaining positive examples, which are then removed from the training set]

Learned Rule = rule
Learned Rule = rule or rule
Learned Rule = rule or rule or rule
Boosting = Generalized Covering
• When learning rules on iteration t, give less weight to (but don't entirely discard) the training examples successfully handled in iterations 1, 2, …, t-1
• Equivalently: give more weight to the training data that has not yet been covered

[Slide figure: the positive and negative examples again, now drawn with per-example weights]
Boosting [Schapire & Singer, 1998]

D1(i) = uniform distribution over training examples
for t = 1, …, T
  train: use distribution Dt to learn weak hypothesis ht: X → R
  reweight: choose αt, and modify distribution Dt to emphasize examples missed by ht:
    Dt+1(i) = Dt(i) · exp(−αt · yi · ht(xi))   (then renormalise)
return: H(x) = sign(Σt αt · ht(x))

[Slide figure: successive distributions D1, D2, D3, … and weak hypotheses h1, h2, h3, …]
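The reweighting step can be made concrete. A small sketch of the update, with the renormalisation made explicit (the function name is ours, and `margins[i]` stands for yi·ht(xi)):

```python
import math

def reweight(D, alpha, margins):
    """One boosting update: D_{t+1}(i) is proportional to
    D_t(i) * exp(-alpha * y_i * h_t(x_i))."""
    D2 = [d * math.exp(-alpha * m) for d, m in zip(D, margins)]
    Z = sum(D2)  # normalisation constant so D_{t+1} is again a distribution
    return [d / Z for d in D2]

# 4 examples; the weak hypothesis gets the last one wrong (margin -1),
# so its weight grows relative to the correctly handled examples
D = reweight([0.25] * 4, alpha=0.5, margins=[1, 1, 1, -1])
```

After the update the missed example carries the largest weight, which is exactly the "emphasize what is not yet covered" behaviour described above.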
Weak hypotheses: Boundary Detectors

Boundary detector = a prefix pattern and a suffix pattern, e.g. [who :][dr . <Capitalized>]
  matches (e.g.) "… Who: Dr. Richard Nixon …"

Weak learning algorithm:
  Greedy growth from the null detector
  Pick the best prefix/suffix extension at each step
  Stop when no further extension improves accuracy

Weighting [Cohen & Singer, 1999]:
  αt = ½ ln[(W+ + e) / (W− + e)]
Ciravegna & Kushmerick: ECML-2003 Tutorial44
input: labeled documents
Fore = Adaboost fore detectorsAft = Adaboost aft detectorsLengths = length histogram
output: Extractor =<Fore, Aft, Lengths>
Traininginput: Document, Extractor, t
F = {⟨i, ci⟩ | token i matches Forewith confidence ci}
A = {⟨j, cj⟩| token j matches Aftwith confidence cj}
output:{⟨i,j⟩ | ⟨i, ci ⟩∈F, ⟨j, cj ⟩∈A,
ci·cj·L(j-i) > t }
Execution
Boosted Wrapper Induction
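The execution step combines fore confidence, aft confidence and the length histogram into a single score per candidate span. A sketch (the function and argument names are ours):

```python
def bwi_extract(fore, aft, lengths, t):
    """fore/aft: {token index: detector confidence};
    lengths: {field length in tokens: probability}.
    Accept a span (i, j) when ci * cj * L(j - i) exceeds the threshold t."""
    return [(i, j)
            for i, ci in fore.items()
            for j, cj in aft.items()
            if ci * cj * lengths.get(j - i, 0.0) > t]

# toy numbers in the spirit of the next slide's example:
# fore fires at token 10 (2.1), aft at token 16 (0.7), L(6) = 0.05
spans = bwi_extract({10: 2.1}, {16: 0.7}, {6: 0.05}, t=0.05)
# spans == [(10, 16)], since 2.1 * 0.7 * 0.05 = 0.0735 > 0.05
```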
BWI execution example

<[email protected]>
Type: cmu.andrew.official.cmu-news
Topic: Chem. Eng. Seminar
Dates: 2-May-95
Time: 10:45 AM
PostedBy: Bruce Gerson on 26-Apr-95 at 09:31 from andrew.cmu.edu
Abstract:

The Chemical Engineering Department will offer a seminar entitled "Creating Value in the Chemical Industry," at 10:45 a.m., Tuesday, May 2 in Doherty Hall 1112. The seminar will be given by Dr. R. J. (Bob) Pangborn, Director, Central Research and Development, The Dow Chemical Company.

[Slide figure: fore detectors (e.g. [<Alph>][Dr . <Cap>], [<Alph> by][<Cap>]) and aft detectors (e.g. [<Cap>][( University], [<Alph>][, Director]) fire around the target with weights 1.3, 0.8, 2.1, 0.7, …; the length histogram contributes L = 0.05]

Confidence of "Dr. R. J. (Bob) Pangborn" = 2.1 · 0.7 · 0.05 ≈ 0.074
Samples of learned patterns

[speaker :][<Alph>]  and  [speaker <Any>][<FName>]
  Speaker: Reid Simmons, School of …

[<Cap>][<FName> <Any> <Punc> ibm]
  Presentation Abstract Joe Cascio, IBM
  Set Constraints Alex Aiken (IBM, Almaden)

[. <Any>][is <ANum> <Cap>]
  John C. Akbari is a Masters student at …
  Michael A. Cusumano is an Associate Professor of …
  Lawrence C. Stewart is a Consultant Engineer at …
Evaluation
• Wrappers are usually 100% accurate, but perfection is generally impossible with natural text
• The ML/IE community has a well-developed evaluation methodology
  • Cross-validation: repeat many times, randomly selecting 2/3 of the data for training and testing on the remaining 1/3
  • Precision: fraction of extracted items that are correct
  • Recall: fraction of actual items extracted
  • F1 = 2 / (1/P + 1/R)
• 16 IE tasks from 8 document collections: seminar announcements, job listings, Reuters corporate acquisitions, CS department faculty lists, Zagat's restaurant reviews, LA Times restaurant reviews, Internet Address Finder, stock quote server
• Competitors: SRV, Rapier, HMM
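The three measures are easy to state in code. A sketch over sets of extracted vs. actual items (assumes both sets are non-empty, so the divisions are defined):

```python
def precision_recall_f1(extracted, actual):
    tp = len(extracted & actual)          # correctly extracted items
    p = tp / len(extracted)               # precision
    r = tp / len(actual)                  # recall
    return p, r, 2 / (1 / p + 1 / r)      # F1: harmonic mean of P and R

# 4 items extracted, 2 of them correct; 3 actual items in the gold standard
p, r, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
# p == 0.5, r == 2/3, f1 == 4/7
```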
Results: 16 tasks x 4 algorithms
[Slide chart: head-to-head comparison of the algorithms over the 16 tasks; one side is better in 21 cases, the other in 7]
Boosted Wrapper Induction: Controversial(?) Conclusion

• Is the great Web -vs- natural text chasm more apparent than real?
• IE is possible if the documents contain regularities that can be exploited
• But the "reason" (e.g. linguistic -vs- markup) for these regularities doesn't much matter
• See also Soderland's WHISK & Webfoot
Algorithms: Outline
✓ Wrappers
✓ Hand-coded wrappers
✓ Wrapper induction
✓ Learning highly expressive wrappers
✓ Boosted wrapper induction
• Hidden Markov models
• Exploiting linguistic constraints

[Slide figure: the natural text ↔ structured data spectrum]
Hidden Markov models
• The previous sections examined systems that use explicit extraction patterns/rules
• HMMs are a powerful alternative based on statistical token models rather than explicit extraction patterns

[Leek, UC San Diego, 1997; Bikel et al, ANLP-97, MLJ 99; Freitag & McCallum, AAAI-99 MLIE Workshop; Seymore, McCallum & Rosenfeld, AAAI-99 MLIE Workshop; Freitag & McCallum, AAAI-2000]
HMM formalism
HMM =
  states s1, s2, …
  special start state s1, special end state sn
  token alphabet a1, a2, …
  state transition probs P(si|sj)
  token emission probs P(ai|sj)

Widely used in many language processing tasks, e.g. speech recognition [Lee, 1989], POS tagging [Kupiec, 1992], topic detection [Yamron et al, 1998].
Applying HMMs to IE
• Document ⇒ generated by a stochastic process modelled by an HMM
• Token ⇒ word
• State ⇒ "reason/explanation" for a given token
  • A 'Background' state emits tokens like 'the', 'said', …
  • A 'Money' state emits tokens like 'million', 'euro', …
  • An 'Organization' state emits tokens like 'university', 'company', …
• Extraction: the Viterbi algorithm is a dynamic programming technique for efficiently computing the most likely sequence of states that generated a document
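The Viterbi step can be sketched for toy 'Background'/'Money' states like those above (the state names, the probabilities and the unknown-token floor `unk` are invented for illustration):

```python
def viterbi(tokens, states, start_p, trans_p, emit_p, unk=1e-6):
    """Most likely state sequence for a token sequence.
    unk is a small floor probability for emissions unseen in training."""
    # delta[s] = probability of the best path ending in state s so far
    delta = {s: start_p[s] * emit_p[s].get(tokens[0], unk) for s in states}
    back = []                              # backpointers, one dict per position
    for tok in tokens[1:]:
        d, b = {}, {}
        for s in states:
            prev = max(states, key=lambda r: delta[r] * trans_p[r][s])
            d[s] = delta[prev] * trans_p[prev][s] * emit_p[s].get(tok, unk)
            b[s] = prev
        back.append(b)
        delta = d
    best = max(states, key=lambda s: delta[s])
    path = [best]                          # recover the path by backtracking
    for b in reversed(back):
        path.append(b[path[-1]])
    return path[::-1]

states = ["bg", "money"]
start_p = {"bg": 0.9, "money": 0.1}
trans_p = {"bg": {"bg": 0.8, "money": 0.2}, "money": {"bg": 0.3, "money": 0.7}}
emit_p = {"bg": {"the": 0.3, "said": 0.2, "company": 0.1},
          "money": {"5": 0.2, "million": 0.4, "euro": 0.3}}
path = viterbi("the company said 5 million euro".split(),
               states, start_p, trans_p, emit_p)
# path == ['bg', 'bg', 'bg', 'money', 'money', 'money']
```

The state sequence directly yields the extraction: the tokens labelled 'money' are the field to be extracted.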
HMM for research papers [Seymore et al, 99]
Learning HMMs
• Good news:
  • If the training-data tokens are tagged with their generating states, then simple frequency ratios are a maximum-likelihood estimate of the transition/emission probabilities. (Use smoothing to avoid zero probs for emissions/transitions absent from the training data.)
• Great news:
  • The Baum-Welch algorithm trains an HMM using unlabelled training data!
• Bad news:
  • How many states should the HMM contain?
  • How are transitions constrained?
    • Insufficiently expressive ⇒ unable to model important distinctions
    • Overly expressive ⇒ sparse training data, overfitting
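The "good news" case (frequency ratios plus smoothing) is a few lines of code. A sketch with add-k smoothing (the function names and the START pseudo-state are ours):

```python
from collections import Counter

def train_hmm(tagged_docs, k=0.1):
    """Supervised estimation: tagged_docs is a list of [(token, state), ...].
    Frequency ratios, with add-k smoothing for unseen transitions/emissions."""
    trans, emit = Counter(), Counter()
    out_count, emit_count = Counter(), Counter()
    states, vocab = set(), set()
    for doc in tagged_docs:
        prev = "START"
        for tok, s in doc:
            trans[(prev, s)] += 1
            out_count[prev] += 1
            emit[(s, tok)] += 1
            emit_count[s] += 1
            states.add(s)
            vocab.add(tok)
            prev = s
    def p_trans(a, b):
        return (trans[(a, b)] + k) / (out_count[a] + k * len(states))
    def p_emit(s, w):
        # the +1 in the denominator reserves probability mass for unknown words
        return (emit[(s, w)] + k) / (emit_count[s] + k * (len(vocab) + 1))
    return p_trans, p_emit

p_trans, p_emit = train_hmm([[("the", "bg"), ("5", "money")]])
# p_trans("bg", "money") == 1.1 / 1.2; p_emit("money", "euro") > 0 despite never being seen
```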
HMM example
"Seminar announcements" task

<[email protected]>
Type: cmu.andrew.assocs.UEA
Topic: Re: entreprenuership speaker
Dates: 17-Apr-95
Time: 7:00 PM
PostedBy: Colin S Osburn on 15-Apr-95 at 15:11 from CMU.EDU
Abstract:

hello again
to reiterate
there will be a speaker on the law and startup business
this monday evening the 17th
it will be at 7pm in room 261 of GSIA in the new building, ie upstairs.
please attend if you have any interest in starting your own business or are even curious.
Colin
HMM example, continued
Fixed topology that captures limited context: 4 "prefix" states before and 4 "suffix" states after the target state, plus a background state:

pre1 → pre2 → pre3 → pre4 → speaker → suf1 → suf2 → suf3 → suf4

[Slide figure: the 5 most-probable tokens emitted by each state; e.g. the prefix states favour tokens such as "\n", ":", "who", "speaker", the target state favours "dr", "professor" and capitalized names, and the background state mostly emits "\n", ".", "-", ":" and unknown words]
[Freitag, 99]
Evaluation
[Slide chart: comparison over the same tasks, 21 cases vs 7 cases (no learning!)]
Learning HMM structure [Seymore et al, 1999]

start with a maximally-specific HMM (one state per observed word)
repeat
  (a) merge adjacent identical states:  note auth auth title ⇒ note auth title
  (b) eliminate redundant fan-out/in:  parallel duplicate auth branches ⇒ a single title-auth path
until a good tradeoff between HMM accuracy and complexity is reached

[Slide figure: a lattice of note/title/auth/abst states between start and end, progressively merged]
Evaluation
[Slide chart: extraction accuracy of a hand-crafted HMM vs a simple HMM vs the learned HMM (155 states)]
Algorithms: Outline
✓ Wrappers
✓ Hand-coded wrappers
✓ Wrapper induction
✓ Learning highly expressive wrappers
✓ Boosted wrapper induction
✓ Hidden Markov models
• Exploiting linguistic constraints

[Slide figure: the natural text ↔ structured data spectrum]
Exploiting linguistic constraints
• IE research has its roots in the NLP community
  • Many extraction tasks require non-trivial linguistic processing
• Web document types can range from free texts to rigid HTML documents (e.g. tables)
  • Even a mixture of them!
• Is NLP robust enough to cope with such situations?

[Slide figure: a web page mixing a free-text headline ("Showers in the NW by Wednesday") with blocks of rigidly formatted filler text]
Current Approaches
• NLP approaches (MUC-like approaches)
  • Ineffective on most Web-related texts:
    • Web pages/emails
    • Stereotypical but ungrammatical texts
  • Extra-linguistic structures convey information: HTML tags, document formatting, regular stereotypical language
• Wrapper induction systems
  • Designed for rigidly structured HTML texts
  • Ineffective on unstructured texts
• Both kinds of approach avoid generalization over the flat word sequence
  • Data sparseness on free texts
Lazy NLP based Algorithm
• Learns the best level of language analysis for a specific IE task, mixing deep linguistic and shallow strategies
  1. Initial rules: shallow wrapper-like rules
  2. Linguistic information (LI) is progressively added to the rules
  3. Addition stops when LI becomes unreliable or ineffective
• Lazy NLP learns the best strategy for each piece of information/context separately
• Example:
  • Use parsing for recognising the speaker in seminar announcements
  • Use shallow approaches to spot the seminar location
(LP)2 [Ciravegna 2001 – IJCAI 01- ATEM01]
• Covering algorithm based on lazy NLP
• Single tag learning (e.g. </speaker>)
• Tagging rules
  • Insert annotations in texts
• Correction rules
  • Correct imprecision in information identification by shifting tags to the correct position
• TBL-like, with some fundamental differences
Tagging and Correction Rules: examples
C o n d itio n o n W o r d s
A c tio n : I n se r t Tag
t h e s e m in a r
a t < tim e > 4
p m
the seminar at <time> 4 pm </time> will
Initial rules= window of conditions on words
The seminar at 4 </time> PM will be held in Room 201
Condition Action word wrong tag correct
tag at 4 </time>
pm </time>
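A tagging rule of this shape is straightforward to apply. A sketch (the function, its token-window format and the offset convention are ours, not (LP)2's actual rule language):

```python
def insert_tag(tokens, condition, offset, tag):
    """Tagging rule: wherever the token window matches `condition`,
    insert `tag` at position (match start + offset) within the window."""
    out, i, n = [], 0, len(condition)
    while i < len(tokens):
        if tokens[i:i + n] == condition:
            out.extend(tokens[i:i + offset])    # tokens before the insertion point
            out.append(tag)                     # the inserted annotation
            out.extend(tokens[i + offset:i + n])
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

tagged = insert_tag("the seminar at 4 pm will".split(),
                    ["seminar", "at", "4"], offset=2, tag="<time>")
# tagged == ['the', 'seminar', 'at', '<time>', '4', 'pm', 'will']
```

A correction rule would be analogous: match a window containing a misplaced tag and move the tag to the position its condition prescribes.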
Rule Generalisation

• Each instance is generalised by reducing its pattern in length
• Generalizations are tested on the training corpus
• The best k rules generated from each instance are retained, i.e. those with:
  • Smallest error rate (wrong/matches)
  • Greatest number of matches
  • Coverage of different examples
• Conditions on words are replaced by information from NLP modules:
  • Capitalisation
  • Morphological analysis (generalizes over gender/number)
  • POS tagging (generalizes over lexical categories)
  • User-defined dictionary or gazetteer
  • Named entity recognizer
• Implemented as a general-to-specific beam search with pruning (AQ-like)
Example of generalization: "the seminar at <time> 4 pm will"

Details of the algorithm in [Ciravegna 2001 - ATEM01]

Fully instantiated rule (a condition on each word, plus the knowledge added by the NLP modules):

  Word     Lemma    LexCat  Case  SemCat  Action
  the      the      det     low
  seminar  seminar  noun    low
  at       at       prep    low
  4                 digit   low           insert <time>
  pm       pm       noun    low   timeid
  will     will     verb    low

Generalised rule (most word conditions replaced by more general knowledge):

  Word  LexCat  SemCat  Action
  at
        digit           insert <time>
                timeid
CMU: detailed results
            (LP)2   BWI    HMM    SRV    Rapier  Whisk
  speaker   77.6    67.7   76.6   56.3   53.0    18.3
  location  75.0    76.7   78.6   72.3   72.7    66.4
  stime     99.0    99.6   98.5   98.5   93.4    92.6
  etime     95.5    93.9   62.1   77.9   96.2    86.0
  All slots 86.0    83.9   82.0   77.1   77.3    64.9

1. Best overall accuracy
2. Best result on the speaker field
3. No results below 75%
Effect of Generalization (1): effectiveness and reduction in data sparseness

  Slot       (LP)2 G   (LP)2 NG
  speaker    72.1      14.5
  location   74.1      58.2
  stime      100       97.4
  etime      96.4      87.1
  All slots  89.7      78.2

With comparable effectiveness on the training corpus! The speaker and location slots are the most interesting cases.

[Slide chart: number of rules vs number of cases covered, with and without generalisation]

NLP-based generalisation: 14% of rules cover 1 case; 42% cover up to 2 cases
Non-NLP: 50% of rules cover 1 case; 71% cover up to 2 cases
Best level of Generalization
• ITC seminar announcements (mixed Italian/English)
  • Date, time, location generally in Italian
  • Speaker, title and abstract generally in English
• English POS tagging is also used for the Italian part
• The NLP-based version outperforms the other versions:

             Words   POS    NE
  speaker    74.1    75.4   84.3
  title      62.8    62.4   62.8
  date       90.8    93.4   93.9
  time       100     100    100
  location   95.0    95.0   95.5
Linguistic constraints: Conclusions
• Linguistic phenomena can't be handled by simple wrapper-like extraction patterns
• Even shallow linguistic processing (e.g. POS tagging) can improve performance dramatically
  • NOTE: linguistic processing must be regular, not necessarily correct!
  • Example: for the rule (LexCat:NNP + <BR> + <BR>)<SPEAKER>(NER:<person>), none of the 32 covered examples actually starts with an NNP
• What about more sophisticated NLP techniques?
  • Extension to parsing and coreference resolution?
Putting IE into Practice
Enabling non-experts to port IE systems
• Introduction (20 min)
  • What is IE, what can we extract from the Web, and why?
• Algorithms and methodologies (100 min)
• IE in practice (30 min)
  • The adaptation problem (20 min)
  • Web + IE: examples of systems (10 min)
• Conclusion, future work (10 min)
• Discussion
Motivation
• Impact on the Web community will come only if:
  • IE systems are portable by non-IE-experts
  • Porting is low cost
• Non-experts need specific, easy-to-use tools to:
  • Design the application
  • Tune the application
  • Deliver the application
• They need support during the whole IE application definition process
In summarising the summary of the summary: people are a problem.
Douglas Adams, The Restaurant at the End of the Universe
Application Development Cycle

[Slide diagram: a cycle driven by user needs: scenario design → adapting the IE system → result validation → application delivery]
Scenario design
• Task: mapping user wishes into templates
• Necessity: supporting users in relevant information identification and scenario organization
• Relevant information identification covers different situations:
  • User with a developed scenario: the system takes no action, but…
  • User with a preliminary scenario to be refined: the system helps in refining it
  • User with no scenario: the system helps in identifying relevant information and organising it into a scenario
Training
• Users can select unrepresentative corpora (e.g. unbalanced wrt genres)
  • The system validates the corpus wrt a large reference corpus by comparing formal features
• Unwanted regularities (use of keywords for selection)
  • The system looks for unusual regularities
• Irrelevant texts (sensitive information)
  • No solution to stupidity
Tagging Corpora
• Problems: tagging texts can be difficult and boring, and can take a long time
• Effects: mistakes in tagging, high cost
• The system should reduce/eliminate the need for annotated data:
  • Bootstrapping: from user-defined "seed examples" to system-retrieved similar examples (helps in discovering new relations)
  • Active learning: selection of examples to annotate from the unlabeled corpus (helps in focusing on unusual information shapes)
Result Validation
• How well does the system perform?
  • Solution: facilities for inspecting the tagged corpus and showing details on correctness
    • Statistics on the corpus
    • Details on errors (highlight correct/incorrect/missing)
    • (e.g. the MUC scorer is an excellent tool)
• Influencing system behavior
  • Solution: an interface bridging the user's qualitative vision ("Try to be more accurate!") and the system's numerical vision ("OK. Please modify an error threshold")
Application Delivery
• Problem: incoming texts deviate from the training data
  • The training corpus was not representative
  • Document features change over time
• Solution: monitor the application
  • Warn the user if the incoming texts' features are statistically different from the training corpus:
    • Formal features: text length, distribution of nouns
    • Semantic features: distribution of template fillers
Putting IE into Practice (2)
Some examples of Adaptive User-driven IE for real world applications
Learning Pinocchio
- Commercial tool for adaptive IE
- Based on the (LP)2 algorithm
- Adaptable to new scenarios/applications by:
  - Corpus tagging via SGML
  - A user with analyst's knowledge
- Applications:
  - "Tombstone" data from résumés (Canadian company) (E)
  - IE from financial news (Kataweb) (I)
  - IE from classified ads (Kataweb) (I)
  - Information highlighting (intelligence)
  - (Many others I have lost track of…)
- A number of licenses released around the world for application development

[Ciravegna 2001, IJCAI]  http://tcc.itc.it/research/textec/tools-resources/learningpinocchio/
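The SGML-style corpus tagging mentioned above marks slot fillers inline, and a trivial parser can recover (slot, filler) training pairs. The tag names and text below are invented for illustration, in the style of seminar-announcement slots.

```python
# Hypothetical example of SGML-style inline annotation for adaptive
# IE training, plus a tiny parser recovering (slot, filler) pairs.
import re

tagged = ("<speaker>Dr. J. Smith</speaker> will give a seminar at "
          "<stime>3:00 pm</stime> in <location>Room 1112</location>.")

def parse_sgml(text):
    # \1 back-reference matches the same tag name in open and close.
    return re.findall(r"<(\w+)>(.*?)</\1>", text)

print(parse_sgml(tagged))
```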
Application development time
Résumés:
- Scenario definition: 10 person-hours
- Tagging 250 texts: 14 person-hours
- Rule induction: 72 hours on a 450 MHz computer
- Result validation: 4 hours

Contact: Alberto, [email protected]
http://tcc.itc.it/research/textec/tools-resources/learningpinocchio/
Amilcare: active annotation for the Semantic Web
- Tool for adaptive IE from Web-related texts
- Based on (LP)2
- Uses Gate and Annie for preprocessing
- Effective on different text types
  - From free texts to rigid documents (XML, HTML, etc.)
- Integrated with:
  - MnM (Open University), Ontomat (University of Karlsruhe)
  - Gate (U Sheffield)
- Adapting Amilcare:
  - Define a scenario (ontology)
  - Define a corpus of documents
  - Annotate texts (via MnM, Gate, Ontomat)
  - Train the system
  - Tune the application (*)
  - Deliver the application

[Ciravegna 2002, SIGIR]  www.dcs.shef.ac.uk/~fabio/Amilcare.html
Non-Intrusive Active Learning

- Amilcare is specifically designed as a companion for text annotation
  - It can be inserted in the usual tagging environment
  - It works in the background
  - At some point it will start helping the user in tagging
Bootstrapping Annotation
[Diagram: learning to annotate]
1. The user annotates bare text; Amilcare learns in the background.
2. On new bare text, both the user and Amilcare annotate; the two annotations are compared, and missing cases and mistakes are used to trigger further learning.
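The bootstrapping loop above can be sketched with a deliberately trivial "learner" that just memorises fillers. This is a toy illustration of the control flow, not the (LP)2 algorithm; all names and data are invented.

```python
# Toy sketch of the bootstrapping loop: the user annotates, a trivial
# memorising learner trains in the background, and comparing its
# output with the user's tags yields the misses that drive retraining.
class ToyAnnotator:
    def __init__(self):
        self.known = {}  # filler string -> slot name

    def train(self, annotations):
        # annotations: list of (slot, filler) pairs from the user
        self.known.update({filler: slot for slot, filler in annotations})

    def annotate(self, text):
        return [(slot, filler) for filler, slot in self.known.items()
                if filler in text]

learner = ToyAnnotator()
# Phase 1: the user annotates; the learner trains in the background.
learner.train([("speaker", "Dr. Smith"), ("stime", "3 pm")])
# Phase 2: on new text, compare learner vs. user; retrain on misses.
user_tags = [("speaker", "Dr. Smith"), ("location", "Room 12")]
system_tags = learner.annotate("Dr. Smith will be in Room 12")
missed = [a for a in user_tags if a not in system_tags]
learner.train(missed)   # the missing case triggers learning
print(missed)
```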
Active Annotation
- When Amilcare's rules reach a user-defined accuracy, it starts annotating and the user only corrects its output; the corrections are used to retrain
- WHY active annotation?
  - Focuses the slow and expensive user activity on uncovered cases
  - Avoids annotating covered cases
  - Validating extracted information is simpler and less error-prone than annotating bare texts, speeding up the process of corpus annotation considerably

[Diagram: bare text → Amilcare annotates → user corrects → corrections retrain the system]
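The switch-over point can be expressed as a one-line policy: stay silent until estimated accuracy on the annotated corpus passes the user-defined threshold, then start pre-annotating. The threshold value below is an invented default.

```python
# Hypothetical sketch of the active-annotation trigger: the system
# only starts suggesting annotations once its estimated accuracy on
# the corpus annotated so far exceeds a user-defined threshold.
def should_suggest(correct, total, user_threshold=0.8):
    """True once estimated accuracy (correct/total) passes threshold."""
    return total > 0 and correct / total >= user_threshold

print(should_suggest(3, 10))  # still learning quietly
print(should_suggest(9, 10))  # starts pre-annotating
```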
Is IE useful as Help for Tagging?

[Figure: four panels (speaker, stime, etime, location) plotting Precision, Recall and F-measure (0-100) against the number of training examples (0-150)]
Conclusions on IE and Tagging
- Integration of IE (Amilcare+Gate) and ontology-based annotation tools (MnM and Ontomat)
- First step towards a new generation of ontology editors (OEs)
- Active learning can provide an interesting interaction modality
  - User friendly
  - Adaptable

Amount of texts needed for training:

  Tag        Texts   Prec   Rec
  stime         30     91    78
  etime         20     96    72
  location      30     82    61
  speaker      100     75    70
Summary and Conclusions
The summary of the summary
Where do we go from now?
Summary
- Information extraction: a core enabling technology for a variety of next-generation information services
  - Data integration agents
  - Semantic Web
  - Knowledge Management
- Scalable IE systems must be adaptive
  - Automatically learn extraction rules from examples
- Dozens of algorithms to choose from
  - State of the art is 70-100% extraction accuracy (after hand-tuning!) across numerous domains
  - Is this good enough? Depends on your application
- Yeah, but does it really work?!
  - Several companies sell IE products
  - Semantic Web ontology editors are starting to include IE
Open issues, Future directions
- Knob-tuning will continue to deliver incremental performance improvements
- A Grand Unified Theory of text "structuredness", to automatically select the optimal IE algorithm for a given task

[Diagram: a space of text types along three axes: natural vs. machine-generated, formal vs. spontaneous, restricted vs. open topic]
Open Issues, Future directions
- Resource Discovery
- Cross-Document Extraction

[Diagram: resource discovery — spidering heuristics and a form classifier crawl the Web, starting from example services, producing candidate and then discovered services]

[Diagram: a 3-level Bayesian network linking a service category (from a service-category taxonomy) to input types (from an input data-type taxonomy) via p[input|service], and input types to terms via p[term|input]]
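The intuition behind such a network can be shown with a toy scorer: a category's likelihood for an unseen form chains p(input|service) and p(term|input) over the form's field terms. All categories, input types and probability tables below are invented for illustration.

```python
# Toy sketch of scoring service categories for an unseen Web form by
# chaining p(input|service) and p(term|input), naive-Bayes style.
# All probability tables are invented.
import math

p_input_given_service = {
    "flights": {"airport": 0.6, "date": 0.4},
    "books":   {"title": 0.7, "author": 0.3},
}
p_term_given_input = {
    "airport": {"from": 0.5, "to": 0.5},
    "date":    {"depart": 0.6, "return": 0.4},
    "title":   {"book": 0.7, "name": 0.3},
    "author":  {"writer": 1.0},
}

def score(category, terms, smoothing=1e-6):
    total = 0.0
    for term in terms:
        # Marginalise over the input types of this category.
        p = sum(pi * p_term_given_input[inp].get(term, smoothing)
                for inp, pi in p_input_given_service[category].items())
        total += math.log(p)
    return total

form_terms = ["from", "depart"]
best = max(p_input_given_service, key=lambda c: score(c, form_terms))
print(best)  # → flights
```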
Figure 3: Calendar management as inter-document information extraction.
[Four emails to Mary drive successive calendar updates: John's "Can we meet Tue at 3, and also Fri at noon?" creates two meetings; his "Sorry, I'll be an hour late Tue" shifts the Tuesday meeting an hour later; Alice's "John asked me to also come to your Tuesday meeting" adds her to it; John's "Oops, I need to cancel on Fri" deletes the Friday meeting.]
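The calendar example above amounts to merging extraction results across documents: each email yields an operation, and the calendar is the result of replaying them. The operation vocabulary below is invented to mirror the figure.

```python
# Toy sketch of inter-document extraction for calendar management:
# each email contributes an extracted operation (add/shift/join/cancel)
# and the calendar is built by merging them in order.
def merge(ops):
    cal = {}
    for kind, *args in ops:
        if kind == "add":
            when, who = args
            cal[when] = set(who)
        elif kind == "shift":          # "I'll be an hour late"
            old, new = args
            cal[new] = cal.pop(old)
        elif kind == "join":           # "Alice will also come"
            when, person = args
            cal[when].add(person)
        elif kind == "cancel":         # "I need to cancel"
            (when,) = args
            del cal[when]
    return cal

ops = [
    ("add", "Tue 15:00", ["John", "Mary"]),
    ("add", "Fri 12:00", ["John", "Mary"]),
    ("shift", "Tue 15:00", "Tue 16:00"),
    ("join", "Tue 16:00", "Alice"),
    ("cancel", "Fri 12:00"),
]
print(merge(ops))  # one Tue 16:00 meeting with John, Mary and Alice
```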
Open issues, Future directions
- Adaptive only?
  - The systems mentioned are designed for non-experts
    - E.g. they do not require users to revise or contribute rules
  - Is this a limitation? What about experts, or even the whole spectrum of skills?
  - Future direction: making the best use of the user's knowledge
- Expressive enough? What about filling templates?
  - Coreference ("ACME is producing parts for YMB Inc. The company will deliver…")
  - Reasoning (if X retires then X leaves his/her company)
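The reasoning step mentioned above can be sketched as a single forward-chaining rule over extracted facts. This is a toy illustration; the fact representation is invented.

```python
# Toy sketch of reasoning over extracted facts: one forward-chaining
# rule turns "X retires from C" into the implied "X leaves C".
def infer(facts):
    derived = set(facts)
    for (rel, person, company) in facts:
        if rel == "retires_from":
            derived.add(("leaves", person, company))
    return derived

facts = {("retires_from", "X", "ACME")}
print(sorted(infer(facts)))
```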