Integration of Friendly Data Islands on the Web.
Information Extraction.
Roadmap
• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions
The theory
• A wrapper is a building block that provides an ad-hoc, message-based API to an app
• Wrappers interface apps at one or more layers but, more often than not, they must deal with the user interface or the data layer
User Interface
Controller
Business Logic
Data Access Layer
Data Layer
The problem
The Da Vinci Code
Buy
Dan Brown
Doubleday, 2006
15.95 €
Robert Langdon is a Harvard Professor of Symbology…
Features of current web documents
• Trillions of documents
• Generated on demand by software applications
• Change continuously
• Require navigation from search forms
• Written in telegraphic language
• Formatted according to HTML templates
The solution
Wrapping in a nutshell
• Goals
– Endow data islands with APIs
– Ease implementing software applications
• Implications
– Form filling
– Navigation
– Info extraction
– “Ontologisation”
Look out!
• Information extraction has driven most research efforts
• Few wrapping systems are complete
• Wrapping is usually mistaken for information extraction
• This talk is about engineering information extraction for enabling information integration
How IE works
Information extractor
Document
Extraction rules
Attributes
The Da Vinci Code
Dan Brown
15.95 €
2006
Robert Langdon…
Doubleday
Templates
Message ID: MUC-0001
Message Template: Court resolution
Date of Event: April 30, 2007
Charge: Terrorist attack
Perpetrator: Salahuddin Amin
Perpetrator: Anthony Garcia
Perpetrator: Waheed Mahmood
Perpetrator: Omar Khyam
…
Templating/ontologisation rules
Figure: templating/ontologisation rules turn the extracted attributes (The Da Vinci Code, Dan Brown, 15.95 €, 2006, Robert Langdon…, Doubleday) into the ontology instances P1, A1 and B1.
Roadmap
• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Side by side comparison
• Conclusions
Running example
<!-- Sample #1 -->
<html><body>
 <b>Book name:</b> Ontologies <br/>
 <b>Reviews:</b> <br/>
 <ul>
 <li>
 <b>Reviewer:</b> John Doe <br/>
 <b>Rating:</b> 7 <br/>
 <b>Text:</b> blah, blah
 </li>
 <li>
 <b>Reviewer:</b> Alan Wohl <br/>
 <b>Rating:</b> 8 <br/>
 <b>Text:</b> yeah, yeah
 </li>
 </ul>
</body></html>

<!-- Sample #2 -->
<html><body>
 <b>Book name:</b> SPARQL in action <br/>
 <b>Reviews:</b> <br/>
 <ul>
 <li>
 <b>Reviewer:</b> Dan Smith <br/>
 <b>Rating:</b> 9 <br/>
 <b>Text:</b> cough, cough
 </li>
 </ul>
</body></html>

<!-- Sample #3 -->
<html><body>
 <b>Book name:</b> W4F explained <br/>
 <b>Reviews:</b> <br/>
 <ul>
 </ul>
</body></html>
Kinds of extraction rules
• Regular expressions
• First-order logic rules
• Pointers into DOM tree
• Context-free grammars
• Tag trees
Regular expressions
TSIMMIS
[Root, get("page.html"), "#"]
[BookReview, Root, "<body>#</body>"]
[BookName, BookReview, "</b>#<br/>"]
[Tmp, Root, "<ul>#</ul>"]
[Reviews, Tmp, "split(Tmp, '<li>')"]
[ReviewerNames, Reviews, "Reviewer:</b>#<br/>"]
[Ratings, Reviews, "Rating:</b>#<br/>"]
[Text, Reviews, "Text:</b>#</li>"]
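The TSIMMIS-style rules above can be imitated with ordinary regular expressions. The following is only a minimal sketch, not TSIMMIS itself: `extract` is a hypothetical helper that treats `#` as the capture point of a `prefix#suffix` pattern, and the page is inlined as a string instead of being fetched. It assumes the review text ends at `</li>`, which is what follows it in the running example.

```python
import re

# Sample #1 of the running example, inlined for the sake of the sketch.
PAGE = """<html><body>
 <b>Book name:</b> Ontologies <br/>
 <b>Reviews:</b> <br/>
 <ul>
 <li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li>
 <li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li>
 </ul>
</body></html>"""


def extract(source, pattern):
    """Return the text that '#' stands for in a 'prefix#suffix' pattern, or None.

    Hypothetical helper: prefix and suffix are matched literally and the
    shortest text between them is returned, stripped of whitespace.
    """
    prefix, suffix = pattern.split("#", 1)
    match = re.search(re.escape(prefix) + r"(.*?)" + re.escape(suffix), source, re.S)
    return match.group(1).strip() if match else None


# Each step mirrors one rule: [Variable, Source, Pattern].
book_review = extract(PAGE, "<body>#</body>")
book_name = extract(book_review, "</b>#<br/>")
tmp = extract(book_review, "<ul>#</ul>")
reviews = [chunk for chunk in tmp.split("<li>") if chunk.strip()]
reviewer_names = [extract(r, "Reviewer:</b>#<br/>") for r in reviews]
ratings = [extract(r, "Rating:</b>#<br/>") for r in reviews]
texts = [extract(r, "Text:</b>#</li>") for r in reviews]

print(book_name, reviewer_names, ratings, texts)
```

Each rule depends only on the variable named as its source, which is why the rules can be evaluated top to bottom in a single pass.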
RoadRunner
$FileName
<html><body>
 <b>Book name:</b> $BookTitle <br/>
 <b>Reviews:</b> <br/>
 <ul>
 ((
 <li>
 <b>Reviewer:</b> $ReviewerName <br/>
 <b>Rating:</b> $Rating <br/>
 <b>Text:</b> $Text
 </li>
 )+)?
 </ul>
</body></html>
First-order logic rules
SRV
bookTitle(X) :- prev(X, "Book name:</b>"), next(X, "<br/>").
reviewerName(X) :- prev(X, "name:</b>"), next(X, "<br/>"), !bookTitle(X).
rating(X) :- isNatural(X), length(X, 1), inList(X).
text(X) :- prev(X, "Text:</b>"), next(X, "</li>").
Pointer into the DOM tree
WebOQL
select x’.Text, y’.Text, y’’’’.Text, y’’’’’’’.Text
from x, y in browse("page.html")
where x.Text = "Book name:" and y.Text = "Reviewer:"
Context-free grammars
Minerva
Page ::= $FileName <html><body> Review </body></html>
Review ::= <b>Book name:</b> $BookName <br/> <b>Reviews:</b> <br/> <ul> (<li> Reviewer Rating Text </li>)* </ul>
Reviewer ::= <b>Reviewer:</b> $Reviewer <br/>
Rating ::= <b>Rating:</b> $Rating <br/>
Text ::= <b>Text:</b> $Text
Tag trees
DEPTA
Figure: tag tree rooted at li, with three b nodes and two br nodes as children.
Roadmap
• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions
Classification
• Hand-crafted
• Supervised induction
• Little-supervised induction
• Unsupervised induction
Hand-crafted
The pattern to extract the title is “…”
• Techniques
– Natural intelligence
• Systems
– TSIMMIS
– Minerva
– WebOQL
– W4F
– XWrap
Supervised induction
• Techniques
– Bottom-up ILP
– Top-down ILP
– Ad-hoc algorithms
• Systems
– SRV
– RAPIER
– WIEN
– WHISK
– NoDoSE
– SoftMealy
– STALKER
– DEByE
Raw documents
Labelled documents
Automated induction
Little-supervised induction
• Techniques
– String alignment
– Tree alignment
• Systems
– OLERA
– Thresher
Raw document
Record and attribute labelling
Automated induction
Unsupervised induction
• Techniques
– String alignment
– Tree alignment
– Statistical roles
• Systems
– DeLa
– RoadRunner
– EXALG
– DEPTA
– IEPAD
Raw documents
Automated induction
Pattern interpretation
Roadmap
• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
– RoadRunner
– SRV
• Conclusions
Token matching
<!-- Sample #1 -->
<html><body>
 <b>Book name:</b> Ontologies <br/>
 <b>Reviews:</b> <br/>
 <ul>
 <li>
 <b>Reviewer:</b> John Doe <br/>
 <b>Rating:</b> 7 <br/>
 …
 </li>
 <li>
 <b>Reviewer:</b> Alan Wohl <br/>
 <b>Rating:</b> 8 <br/>
 …
 </li>
 </ul>
</body></html>

<!-- Sample #2 -->
<html><body>
 <b>Book name:</b> SPARQL in action <br/>
 <b>Reviews:</b> <br/>
 <ul>
 <li>
 <b>Reviewer:</b> Dan Smith <br/>
 <b>Rating:</b> 9 <br/>
 …
 </li>
 </ul>
</body></html>

String mismatch
$1
...and matching…
Tag match
$1<html>
...and matching…
Tag match
$1<html><body>
...and matching…
Tag match, string match, …
$1<html><body> <b>Book name:</b>
...and matching…
String mismatch, tag match
$1<html><body> <b>Book name:</b> $2 <br/>
...and matching…
…
$1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>
Stop: lists and optionals
Tag mismatch
$1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>
Stop: lists and optionals
$1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> (<li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>)+
…and matching finishes
$1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> (<li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>)+ </ul></body></html>
Just union-free grammars!
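The walkthrough above can be sketched in a few lines of Python. This is a simplified illustration in the spirit of the slides, not the RoadRunner implementation: it handles string mismatches (turned into variables) and a single level of repeats (turned into `( … )+`), and it assumes no optionals, no nested lists and no backtracking. The names `tokenize`, `same_shape`, `find_repeat` and `align` are hypothetical.

```python
import re

S1 = """<!-- Sample #1 --><html><body> <b>Book name:</b> Ontologies <br/>
<b>Reviews:</b> <br/> <ul>
<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li>
<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li>
</ul></body></html>"""
S2 = """<!-- Sample #2 --><html><body> <b>Book name:</b> SPARQL in action <br/>
<b>Reviews:</b> <br/> <ul>
<li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> <b>Text:</b> cough, cough </li>
</ul></body></html>"""


def tokenize(html):
    """Split a page into tag tokens and trimmed text tokens."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def is_tag(tok):
    # Assumption: comments count as strings, so they can become variables.
    return tok.startswith("<") and not tok.startswith("<!--")


def same_shape(xs, ys):
    """Tags must be identical; a string or variable may face any string."""
    return len(xs) == len(ys) and all(
        x == y if (is_tag(x) or is_tag(y)) else True for x, y in zip(xs, ys))


def find_repeat(tokens, start, terminal, out):
    """End index of a repeat tokens[start:end] that mirrors the tail of out."""
    if terminal not in tokens[start:]:
        return None
    end = tokens.index(terminal, start) + 1
    size = end - start
    return end if same_shape(tokens[start:end], out[-size:]) else None


def align(sample_a, sample_b):
    a, b = tokenize(sample_a), tokenize(sample_b)
    out, i, j, nvars = [], 0, 0, 0
    while i < len(a) and j < len(b):
        x, y = a[i], b[j]
        if is_tag(x) and is_tag(y) and x == y:        # tag match
            out.append(x)
            i, j = i + 1, j + 1
        elif not is_tag(x) and not is_tag(y):         # string match or mismatch
            if x != y:
                nvars += 1
                x = f"${nvars}"
            out.append(x)
            i, j = i + 1, j + 1
        else:                                         # tag mismatch: look for a list
            terminal = out[-1]
            for toks, pos in ((a, i), (b, j)):
                end = find_repeat(toks, pos, terminal, out)
                if end is not None:
                    break
            else:
                raise ValueError("no repeat found; optionals are not handled here")
            size = end - pos
            pattern = out[-size:]
            out[-size:] = ["("] + pattern + [")+"]
            while same_shape(toks[pos:pos + size], pattern):  # skip every repetition
                pos += size
            if toks is a:
                i = pos
            else:
                j = pos
    return " ".join(out)


wrapper = align(S1, S2)
print(wrapper)
```

The result contains only concatenation, variables and `( … )+`, with no alternation, which is what the union-free remark refers to.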
Roadmap
• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
– RoadRunner
– SRV
• Conclusions
Exercise
• Support predicates: next(x,y), previous(x,y)
• Try to explain isCorD(X)
abcabdabbbcaabda
Exercise
• Support predicates: next(x,y), previous(x,y)
• Now, try to explain isCorDorE(X)
abcabdabeebbcaabdaee
Define target predicates
Target Predicates
title: #PCDATA.
reviewer: #PCDATA.
rating: #PCDATA.
text: #PCDATA.
Instantiate target predicates
Positive Samples
title("Ontologies").
title("SPARQL in action").
title("W4F explained").
reviewer("John Doe").
reviewer("Alan Wohl").
reviewer("Dan Smith").
rating("7").
rating("8").
rating("9").
text("blah, blah").
text("yeah, yeah").
text("cough, cough").
Negative Samples
!title("Book name:").
!reviewer("Book name:").
!rating("Book name:").
!text("Book name:").
!title("Reviews:").
!reviewer("Reviews:").
!rating("Reviews:").
!text("Reviews:").
!title("Reviewer:").
!reviewer("Reviewer:").
!rating("Reviewer:").
!text("Reviewer:").
!title("Rating:").
!reviewer("Rating:").
!rating("Rating:").
…
Define support predicates
Support Predicates
prev: #PCDATA, #PCDATA.
next: #PCDATA, #PCDATA.
length: #PCDATA, #PCDATA.
isNatural: #PCDATA.
Instantiate support predicates
On Positive Samples
prev("Ontologies", "</b>").
next("Ontologies", "<br/>").
length("Ontologies", 10).
!isNatural("Ontologies").
prev("SPARQL in action", "</b>").
next("SPARQL in action", "<br/>").
length("SPARQL in action", 16).
!isNatural("SPARQL in action").
prev("W4F explained", "</b>").
next("W4F explained", "<br/>").
length("W4F explained", 13).
!isNatural("W4F explained").
…
On Negative Samples
prev("Book name:", "<b>").
next("Book name:", "</b>").
length("Book name:", 10).
!isNatural("Book name:").
prev("Reviews:", "<b>").
next("Reviews:", "</b>").
!isNatural("Reviews:").
prev("Reviewer:", "<b>").
next("Reviewer:", "</b>").
!isNatural("Reviewer:").
prev("Rating:", "<b>").
next("Rating:", "</b>").
!isNatural("Rating:").
…
…
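Facts like the ones listed above can be derived mechanically from a tokenised page. The sketch below is an illustration under assumptions, not SRV code: `tokenize` and `support_facts` are hypothetical names, `prev` and `next` are taken to be the neighbouring tokens, and `isNatural` is approximated with `str.isdigit`.

```python
import re

# Sample #3 of the running example.
SAMPLE_3 = """<html><body>
 <b>Book name:</b> W4F explained <br/>
 <b>Reviews:</b> <br/>
 <ul> </ul>
</body></html>"""


def tokenize(html):
    """Split a page into tag tokens and trimmed text tokens."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def support_facts(html):
    """Map every text token to the values of its support predicates."""
    tokens = tokenize(html)
    facts = {}
    for k, tok in enumerate(tokens):
        if tok.startswith("<"):          # tags are context, not candidates
            continue
        facts[tok] = {
            "prev": tokens[k - 1] if k > 0 else None,
            "next": tokens[k + 1] if k + 1 < len(tokens) else None,
            "length": len(tok),
            "isNatural": tok.isdigit(),  # assumption: naturals are digit-only strings
        }
    return facts


facts = support_facts(SAMPLE_3)
for token, values in facts.items():
    print(token, values)
```

Positive and negative samples go through the same function; only the labels attached to the tokens differ.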
Top-down induction
title(X) :- . (3, 14)
title(X) :- prev(X, X). (0, 0)
title(X) :- !prev(X, X). (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- !prev(X, Y). (?, ?)
title(X) :- next(X, X). (0, 0)
title(X) :- !next(X, X). (3, 14)
title(X) :- next(X, Y). (3, 14)
title(X) :- !next(X, Y). (?, ?)
title(X) :- length(X, X). (0, 0)
title(X) :- prev(X, "<b>"). (0, 5)
title(X) :- !prev(X, "<b>"). (3, 9)
title(X) :- prev(X, "</b>"). (3, 9)
title(X) :- !prev(X, "</b>"). (0, 5)
…
Rule selection
\[ \mathrm{Gain} = t \cdot \left( \ln\frac{p_1}{p_1 + n_1} - \ln\frac{p_0}{p_0 + n_0} \right) \]
p_0 = # positive bindings of R
n_0 = # negative bindings of R
p_1 = # positive bindings of R&A
n_1 = # negative bindings of R&A
t = # positive bindings of both R and R&A
Combined covering: t. New covering: ln(p_1 / (p_1 + n_1)). Old covering: ln(p_0 / (p_0 + n_0)).
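As a worked example, the gain can be computed for some of the candidate literals and coverage pairs shown on the top-down induction slide. This is a sketch under two assumptions that the slides do not state: t is taken to equal the positive coverage of the refined rule (adding a literal can only narrow what the rule covers), and a candidate with no positive coverage is scored as minus infinity. `foil_gain` is a hypothetical name.

```python
import math


def foil_gain(p0, n0, p1, n1, t):
    """Gain of refining rule R (p0, n0) into R&A (p1, n1), with t shared positives."""
    if p0 == 0:
        raise ValueError("the rule being refined covers no positive bindings")
    if p1 == 0 or t == 0:
        return float("-inf")  # assumption: useless candidates rank last
    return t * (math.log(p1 / (p1 + n1)) - math.log(p0 / (p0 + n0)))


# Coverage of the empty rule title(X) :- . and of four refinements of it.
base = (3, 14)
candidates = {
    'prev(X, "<b>")': (0, 5),
    '!prev(X, "<b>")': (3, 9),
    'prev(X, "</b>")': (3, 9),
    "prev(X, Y)": (3, 14),
}
gains = {lit: foil_gain(*base, p1, n1, t=p1) for lit, (p1, n1) in candidates.items()}

for literal, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{literal:20} {gain:.3f}")
```

A literal that leaves the coverage unchanged, such as prev(X, Y), scores zero, while literals that discard negatives and keep all positives score highest.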
Induction goes on…
title(X) :- . (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- prev(X, Y), X = Y. (?, ?)
title(X) :- prev(X, Y), X != Y. (?, ?)
title(X) :- prev(X, Y), prev(X, X). (?, ?)
title(X) :- prev(X, Y), !prev(X, X). (?, ?)
title(X) :- prev(X, Y), prev(X, Z). (?, ?)
title(X) :- prev(X, Y), !prev(X, Z). (?, ?)
title(X) :- prev(X, Y), prev(Y, X). (?, ?)
…
…and on…
title(X) :- . (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- prev(X, Y), Y = "</b>". (?, ?)
title(X) :- prev(X, Y), Y = "</b>", prev(X, X). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", !prev(X, X). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", prev(Y, Y). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", !prev(Y, Y). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", prev(X, Z). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", !prev(X, Z). (?, ?)
…
…and eventually finishes
title(X) :- . (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- prev(X, Y), Y = "</b>". (?, ?)
title(X) :- prev(X, Y), Y = "</b>", prev(Y, "Book name:"). (3, 0)
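The coverage of the final rule can be checked against the three running samples. The sketch below is an illustration, not SRV: it assumes prev means "the token immediately before", counts every non-title text token as a potential negative binding, and uses the hypothetical names `tokenize`, `rule_title` and `coverage`.

```python
import re

SAMPLES = [
    "<html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul>"
    " <li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li>"
    " <li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li>"
    " </ul></body></html>",
    "<html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> <ul>"
    " <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> <b>Text:</b> cough, cough </li>"
    " </ul></body></html>",
    "<html><body> <b>Book name:</b> W4F explained <br/> <b>Reviews:</b> <br/>"
    " <ul> </ul></body></html>",
]
TITLES = {"Ontologies", "SPARQL in action", "W4F explained"}


def tokenize(html):
    """Split a page into tag tokens and trimmed text tokens."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def rule_title(tokens, k):
    """title(X) :- prev(X, Y), Y = "</b>", prev(Y, "Book name:")."""
    return k >= 2 and tokens[k - 1] == "</b>" and tokens[k - 2] == "Book name:"


def coverage(samples, positives):
    """Count the positive and negative bindings covered by rule_title."""
    p = n = 0
    for html in samples:
        tokens = tokenize(html)
        for k, tok in enumerate(tokens):
            if tok.startswith("<") or not rule_title(tokens, k):
                continue
            if tok in positives:
                p += 1
            else:
                n += 1
    return p, n


print(coverage(SAMPLES, TITLES))
```

A rule that covers every positive binding and no negative one needs no further refinement, which is why induction stops here.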
Optimisations
• Intelligent predicates
– Non-sense atoms
– Non-sense atom combinations
– Non-bindable variables
• Instantiated target predicates
• Statistical analysis of constants
• Keep track of non-instantiable predicates
Roadmap
• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions
That's quite clear!
• Information extraction enables information integration
Research challenges
• Information extraction
– Efficient rule generation
– Maintaining rules automatically
– Non-union-free grammars (unsupervised)
• Ontologisation rules
– Everything is a challenge
Thanks!
Drop by our web site at http://www.tdg-seville.info