1
Schema-Guided Wrapper Maintenance for Web-Data Extraction
Xiaofeng Meng, Dongdong Hu
Renmin University of China, Beijing, China
Chen Li
University of California, Irvine, CA, USA
2
Wrappers for Web Sources
Extract information from Web pages
Used in many Web-based applications
[Figure: wrappers extract data from HTML documents into XML and an RDBMS for application programs (e.g., data integration)]
3
Problem
The Web is very dynamic: both contents and page structures change
Original wrappers can stop working because they rely on Web page structures
Re-generating wrappers is not easy: it puts a heavy workload on system developers
[Figure: on changed documents, the original wrappers extract nothing, incomplete results, or incorrect results]
4
Example
The original wrapper fails due to the structure change.
5
Problems
Wrapper verification: is a wrapper operating correctly?
  Several studies have been conducted on the verification problem
  E.g., computing the similarity between a wrapper's expected and observed output ("regression test")
Wrapper maintenance: how to automatically modify a wrapper when the pages have changed?
  Focus of this work
6
Outline
Motivation
System Overview
Schema-Guided Wrapper Maintenance
Experiments
Related Work and Conclusion
7
The SG-WRAM System
[Figure: SG-WRAM architecture. Components: Wrapper Generator, Wrapper Executor, and Wrapper Maintainer (Data Feature Discovery, Data Item Recovery, Block Configuration, Rule Re-induction). Inputs and stores: Documents, Changed Documents, Schema, Rule, Wrapper, XML Repository]
8
User-Defined Schema
<!ELEMENT VideoList (Video+)>
<!ELEMENT Video (Name, Director, Actors, Price)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Director (#PCDATA)>
<!ELEMENT Actors (#PCDATA)>
<!ELEMENT Price (VHSPrice, DVDPrice)>
<!ELEMENT VHSPrice (#PCDATA)>
<!ELEMENT DVDPrice (#PCDATA)>
The user provides a schema (DTD) for the target data
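A minimal sketch of how such a DTD could be read into a tree and flattened into schema paths, which is the form later used for the DTD side of a mapping (e.g., VideoList/Video/Name). The helper names `parse_dtd` and `leaf_paths` are hypothetical and not part of SG-WRAM; the parser only handles comma-separated element lists like the ones above.

```python
import re

DTD = """
<!ELEMENT VideoList (Video+)>
<!ELEMENT Video (Name, Director, Actors, Price)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Director (#PCDATA)>
<!ELEMENT Actors (#PCDATA)>
<!ELEMENT Price (VHSPrice, DVDPrice)>
<!ELEMENT VHSPrice (#PCDATA)>
<!ELEMENT DVDPrice (#PCDATA)>
"""


def parse_dtd(dtd_text):
    """Map each element name to the list of its child element names."""
    children = {}
    for name, content in re.findall(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)>", dtd_text):
        # Strip occurrence markers (+, *, ?) and skip #PCDATA, which marks a leaf.
        parts = [part.strip().rstrip("+*?") for part in content.split(",")]
        children[name] = [part for part in parts if part and part != "#PCDATA"]
    return children


def leaf_paths(children, root):
    """Return the schema paths from root to every text-only element."""
    if not children[root]:
        return [root]
    return [
        f"{root}/{path}"
        for child in children[root]
        for path in leaf_paths(children, child)
    ]


SCHEMA = parse_dtd(DTD)
LEAVES = leaf_paths(SCHEMA, "VideoList")
for path in LEAVES:
    print(path)
```

The five leaf paths are the elements a wrapper has to fill for each Video record.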
9
Schema-Guided Wrapper Generation
Using a GUI toolkit, users can map data items in HTML pages to elements in the DTD
[Figure: HTML page shown next to the DTD tree]
10
Schema-Guided Wrapper Generation
Internally, the system computes the mappings from the HTML tree to the corresponding DTD tree
It then generates the extraction rule
[Figure: HTML tree mapped to the DTD tree]
11
Expressing Extraction Rule in XQuery
Each rule is an FLWR XQuery expression
Example:
FOR $vedio IN $vedioList/body/div[0]/table[4]/tr[0]/td[2]/table/tr[0]/td[1]
RETURN <vedio> {
  LET $name = $vedio/span[0]/b[0]/a[0]/text()[0]
  RETURN <name> $name </name>
} </vedio>
[Callouts in the figure: paths to the data items; value of the data item]
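To make the path notation concrete, here is a small sketch that follows one such path through a parsed page. It is an illustration only, not the XQuery engine used by the system. Assumptions: the helper `resolve` and the fragment `CELL` are hypothetical, indices are 0-based as in the rule above, and a step without an index is treated as index 0.

```python
import re
import xml.etree.ElementTree as ET


def resolve(node, path):
    """Follow steps such as 'span[0]/b[0]/a[0]/text()[0]' using 0-based indices."""
    for step in path.strip("/").split("/"):
        match = re.fullmatch(r"([\w()]+)(?:\[(\d+)\])?", step)
        if match is None:
            raise ValueError(f"unsupported step: {step}")
        name, index = match.group(1), int(match.group(2) or 0)
        if name == "text()":
            # Simplification: only the text directly inside the element is returned.
            return (node.text or "").strip()
        same_name = [child for child in node if child.tag == name]
        if index >= len(same_name):
            return None  # the path no longer exists, e.g. after a page change
        node = same_name[index]
    return node


# Hypothetical, well-formed fragment standing in for one video cell.
CELL = ET.fromstring(
    "<td><span><b><a href='#'>May Morning</a></b></span>"
    "<span><b>Directed by</b> Ugo Liberatore</span></td>"
)
NAME = resolve(CELL, "span[0]/b[0]/a[0]/text()[0]")
print(NAME)
```

A path that returns `None` here corresponds to the failure mode discussed in this talk: the rule depends on the exact page structure.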
12
Annotations for data items
Describe the semantic meaning of a data item
Indicate the location of the data item
Specified by the user using the GUI
Recorded in the XPath function "contains(pathToAnnotation, annotationValue)"

Data values in HTML page                    Annotations
May Morning                                 -
Ugo Liberatore                              directed by
Jane Birkin; John Steiner; Rosella Falk     Featuring
15.38-23.26                                 DVD
14.98-18.99                                 VHS

/body/div[0]/table[4]/tr[0]/td[2]/table[1]/tr[0]/td[1]/text()[0][contains(null,"directed by")]
13
Outline
Motivation
System Overview
Wrapper Maintenance (four steps):
  Data-Feature Discovery
  Item Recovery
  Block Configuration
  Rule Re-induction
Experiments
Related Work and Conclusion
14
Intuition of the approach
The page structure could change
Observation: many "features" of data items are more static, e.g.:
  Hyperlink
  Annotation
  Pattern
These features can help us find the new locations of the old data items
15
Step 1: Data-feature discovery
Compute features of the data items in the original page

ID   DTD Element   L (hyperlink)   A (annotation)   P (data pattern)
1    Name          True            NULL             [A-Z][a-z]{0,}
2    Director      False           Directed by      [A-Z][a-z]{0,}
3    Actors        False           Featuring        [A-Z][a-z]{0,}(.)*
4    VHSPrice      False           VHS              [$][0-9]{0,}[0-9](.)[0-9]{2}
5    DVDPrice      False           DVD              [$][0-9]{0,}[0-9](.)[0-9]{2}
16
Data-Pattern Feature
A syntactic feature
Represented as a regular expression
  E.g., $15.38 is described by [$][0-9]{0,}[0-9](.)[0-9]{2}
Can be extracted using existing technologies, e.g., [Brin98], [GHQR98], [LM00]
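A quick sketch of checking a candidate string against a data pattern, using two patterns copied from the Step 1 feature table. The helper `fits_pattern` is hypothetical, and treating a match as a prefix match (rather than a match of the whole string) is an assumption made here, not something the slides state.

```python
import re

# Patterns copied from the feature table; the parenthesised dot matches any character.
PRICE_PATTERN = re.compile(r"[$][0-9]{0,}[0-9](.)[0-9]{2}")
NAME_PATTERN = re.compile(r"[A-Z][a-z]{0,}")


def fits_pattern(pattern, text):
    """True when the text starts with a string the pattern accepts (prefix match)."""
    return pattern.match(text.strip()) is not None


CHECKS = {
    "$15.38": fits_pattern(PRICE_PATTERN, "$15.38"),
    "15.38": fits_pattern(PRICE_PATTERN, "15.38"),
    "May Morning": fits_pattern(NAME_PATTERN, "May Morning"),
    "directed by": fits_pattern(NAME_PATTERN, "directed by"),
}
for text, accepted in CHECKS.items():
    print(text, accepted)
```

The name pattern accepts any capitalized word, which shows why a data pattern alone is a weak signal and is combined with the hyperlink and annotation features.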
17
Annotations and Hyperlinks
Get annotation and hyperlink information from the original page by checking the XQuery-based extraction rule
  Hyperlink: a step of "…/a/…" in the path
  Annotation: the function "contains()"

{ LET $name = $vedio/span[0]/b[0]/a[0]/text()[0] RETURN <name> $name </name> }
  Hyperlink indication: the a[0] step

{ LET $actors = $vedio/text()[contains(/preceding-sibling::b[0],"Featuring")] RETURN <actors> $actors </actors> }
  Annotation value: "Featuring"
  Path from data item to annotation: /preceding-sibling::b[0]
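The two checks can be sketched as simple string processing over a rule path. This is an illustrative approximation, not the system's rule analyzer; `features_from_path` and the `ANNOTATION` regular expression are hypothetical names, and only a single contains() predicate with a double-quoted value is handled.

```python
import re

# Matches a predicate of the form [contains(<path>,"<value>")].
ANNOTATION = re.compile(r'\[contains\(\s*([^,]*?)\s*,\s*"([^"]*)"\s*\)\]')


def features_from_path(path):
    """Derive the hyperlink flag and the annotation value from an extraction path."""
    found = ANNOTATION.search(path)
    annotation = found.group(2) if found else None
    # Remove the contains() predicate, then look for an <a> step in what is left.
    plain = ANNOTATION.sub("", path)
    names = [re.sub(r"\[\d+\]", "", step) for step in plain.split("/")]
    return {"hyperlink": "a" in names, "annotation": annotation}


NAME_FEATURES = features_from_path("$vedio/span[0]/b[0]/a[0]/text()[0]")
ACTORS_FEATURES = features_from_path(
    '$vedio/text()[contains(/preceding-sibling::b[0],"Featuring")]'
)
print(NAME_FEATURES)
print(ACTORS_FEATURES)
```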
18
Step 2: Data-Item Recovery
Traverse the new HTML tree in depth-first order
Use the old features to identify potential data items using 3 matching conditions:
  Hyperlink
  Annotation
  Data pattern
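A rough sketch of this step under simplifying assumptions: the changed page is well-formed, an annotation is the text fragment immediately preceding the value, and a pattern check is a prefix match. The names `FEATURES`, `text_fragments`, `recover_items`, and `NEW_PAGE` are hypothetical, and the page fragment is invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Features (L, A, P) for three of the elements in the Step 1 table.
FEATURES = [
    {"element": "Name", "link": True, "annotation": None,
     "pattern": r"[A-Z][a-z]{0,}"},
    {"element": "Director", "link": False, "annotation": "Directed by",
     "pattern": r"[A-Z][a-z]{0,}"},
    {"element": "VHSPrice", "link": False, "annotation": "VHS",
     "pattern": r"[$][0-9]{0,}[0-9](.)[0-9]{2}"},
]


def text_fragments(root):
    """Yield (text, inside_link) pairs in depth-first document order."""
    def walk(node, inside_link):
        inside = inside_link or node.tag == "a"
        if node.text and node.text.strip():
            yield node.text.strip(), inside
        for child in node:
            yield from walk(child, inside)
            # Tail text follows the child but belongs to the current element.
            if child.tail and child.tail.strip():
                yield child.tail.strip(), inside
    yield from walk(root, False)


def recover_items(root, features):
    """Return (element, value) pairs recognised in the changed page."""
    recovered = []
    previous = None  # text fragment seen just before the current one
    for text, inside_link in text_fragments(root):
        for feature in features:
            if feature["link"] != inside_link:
                continue  # hyperlink condition fails
            if feature["annotation"] is not None and previous != feature["annotation"]:
                continue  # annotation condition fails
            if re.match(feature["pattern"], text):
                recovered.append((feature["element"], text))
                break
        previous = text
    return recovered


# Hypothetical changed page: the table layout was replaced by paragraphs.
NEW_PAGE = ET.fromstring(
    "<div><p><a href='#'>May Morning</a></p>"
    "<p><i>Directed by</i> Ugo Liberatore</p>"
    "<p><i>VHS</i> $14.98</p></div>"
)
ITEMS = recover_items(NEW_PAGE, FEATURES)
print(ITEMS)
```

Because the Name feature has no annotation, any capitalized link text would be accepted as a Name, so this step alone can over-recognize items.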
19
Example
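[Figure: two recognition flows on the new page. (1) Check hyperlink, then check data pattern; if both are ok, recognize a data item. (2) Find annotation; if found, find the value starting from the annotation, then check data pattern and recognize a data item. Patterns shown in the figure: [$][0-9]{0,}[0-9](.)[0-9]{2} and [A-Z][a-z]{0,}]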
20
Results of Data Item Recovery
A mapping list including all the recognized data items
Each mapping contains:
  Value of the data item
  Path to it in the HTML tree
  Path of the corresponding DTD element
A sample mapping:
M1' (D: "May", HP: …/table[0]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0], SP: VideoList/Video/Name)
21
Step 3: Block Configuration
Observation: data items are located in semantic blocks
  Each block conforms to the user-defined schema
  Data items are grouped in semantic blocks
[Figure: candidate blocks labeled Over-Match, Full-Match, and Partial-Match]
22
Computing “Full Match” Blocks
Identify the level in a top-down manner
Check the level by recursively considering the matches between candidate blocks and the schema
[Figure: "Full match" blocks]
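One possible reading of the top-down search, sketched below. The slides do not define the three match kinds, so the definitions here are assumptions: a block is a full match when every schema element occurs exactly once, an over-match when some element occurs more than once, and a partial match otherwise. `classify_block`, `full_match_blocks`, and the `PAGE` structure are hypothetical.

```python
from collections import Counter

SCHEMA_ELEMENTS = ["Name", "Director", "Actors", "VHSPrice", "DVDPrice"]


def classify_block(found_elements, schema_elements=SCHEMA_ELEMENTS):
    """Compare the items recognised inside a block with the schema elements."""
    counts = Counter(found_elements)
    if any(counts[name] > 1 for name in schema_elements):
        return "over-match"     # assumed: the block spans more than one record
    if all(counts[name] == 1 for name in schema_elements):
        return "full-match"
    return "partial-match"      # assumed: the block covers only part of a record


def full_match_blocks(block):
    """Descend from an over-matching block until full-match blocks are reached."""
    kind = classify_block(block["items"])
    if kind == "full-match":
        return [block["name"]]
    if kind == "over-match":
        return [
            name
            for child in block["children"]
            for name in full_match_blocks(child)
        ]
    return []


RECORD = ["Name", "Director", "Actors", "VHSPrice", "DVDPrice"]
PAGE = {
    "name": "table",
    "items": RECORD * 2,  # the whole table holds two records
    "children": [
        {"name": "tr[0]", "items": RECORD, "children": []},
        {"name": "tr[1]", "items": RECORD, "children": []},
    ],
}
BLOCKS = full_match_blocks(PAGE)
print(BLOCKS)
```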
23
Results of Block Configuration
A set of blocks that can fully match the DTD
Each of them is represented as a list of mappings
Examples:

No.  Element    Path
1    Title      …table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
2    Director   …table[1]/tr[0]/td[1]/span[1]/text()[contains(/preceding-sibling::b[0],"Directed by")]
3    Actors     …table[1]/tr[0]/td[1]/span[2]/text()[contains(/preceding-sibling::b[0],"Featuring")]
4    Title      …table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
5    Director   …table[2]/tr[0]/td[1]/span[1]/text()[contains(/preceding-sibling::b[0],"Directed by")]
6    Actors     …table[2]/tr[0]/td[1]/span[2]/text()[contains(/preceding-sibling::b[0],"Featuring")]
24
Step 4: Rule Re-Induction
Semantic blocks contain mappings from data items in HTML to DTD elements
Induce a new extraction rule by calling the induction algorithm of the wrapper generator
Refine the rule so that it covers all the other semantic blocks
  Generalization is necessary
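The slides do not spell out the generalization procedure. As one simple illustration of the idea, the sketch below merges paths to the same element in different blocks by dropping the index of any step where they disagree. `generalize` is a hypothetical helper, and it only handles paths without contains() predicates.

```python
import re


def generalize(paths):
    """Merge paths that differ only in step indices by dropping those indices."""
    split_paths = [path.split("/") for path in paths]
    if len({len(steps) for steps in split_paths}) != 1:
        raise ValueError("paths must have the same number of steps")
    merged = []
    for steps in zip(*split_paths):
        if len(set(steps)) == 1:
            merged.append(steps[0])  # identical in every block: keep as-is
            continue
        names = {re.sub(r"\[\d+\]$", "", step) for step in steps}
        if len(names) != 1:
            raise ValueError(f"steps differ in more than the index: {steps}")
        merged.append(names.pop())   # same tag, different index: drop the index
    return "/".join(merged)


RULE_PATH = generalize([
    "table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]",
    "table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]",
])
print(RULE_PATH)
```

The result has an unindexed table step, so a FOR clause over it would iterate over every block rather than a single one.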
25
Outline
Motivation
System Overview
Wrapper Maintenance (four steps):
  Data-Feature Discovery
  Item Recovery
  Block Configuration
  Rule Re-induction
Experiments
Related Work and Conclusion
26
Web Sources
Collected Web page changes from October 2002 to May 2003
  From 16 data-intensive sites
  Using the site search engine or from the same URL
All the pages have complex table structures
Observed changes:
  Data items (add, delete, modify)
  Table structure to non-table structure
  Complex table structure re-arrangement
1Bookstreet Book
Allbooks4less Book
Amazon Book (search)
Amazon Magazine
Barnesandnoble Book
CIA Factbook
CNN Currency
Excite Currency
Hotels Hotel
Yahoo Shopping Video
Yahoo Quotes
Yahoo People Email
27
Experiment Procedures
[Figure: experiment procedure in three steps (step1, step2, step3), involving Original Web Docs, Wrapper Generator, Original Wrappers, Wrapper Repository, New Web Docs, Check Extraction Results, Changed pages, Wrapper Maintainer, and Repaired Wrappers]
28
Experiment Metrics
Recall (R): proportion of the correctly extracted data items among all the data items that should be extracted
Precision (P): proportion of the correctly extracted data items among all the data items that have been extracted
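A small sketch of the two metrics as defined above, computed over sets of (element, value) pairs. The function name and the sample data are hypothetical; the sketch assumes the expected set is non-empty and returns `None` for precision when nothing was extracted.

```python
def recall_precision(extracted, expected):
    """Return (recall, precision) as percentages for sets of extracted items."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    recall = 100.0 * correct / len(expected)
    # Precision is undefined when the wrapper extracted nothing.
    precision = 100.0 * correct / len(extracted) if extracted else None
    return recall, precision


EXPECTED = {
    ("Name", "May Morning"),
    ("Director", "Ugo Liberatore"),
    ("VHSPrice", "$14.98"),
    ("DVDPrice", "$15.38"),
}
EXTRACTED = {
    ("Name", "May Morning"),
    ("Director", "Ugo Liberatore"),
    ("VHSPrice", "$14.98"),
    ("Name", "Featuring"),  # an incorrectly recognised item
}
RESULT = recall_precision(EXTRACTED, EXPECTED)
print(RESULT)
```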
29
Original wrappers after changes
Name                    # of changed pages   Item Number   Avg Recall (%)   Avg Precision (%)
1Bookstreet Book        12                   6             82.54            100
Allbooks4less Book      15                   4             0                -
Amazon Book (search)    15                   6             40.49            100
Amazon Magazine         15                   5             20.01            100
Barnesandnoble Book     15                   5             0                100
CIA Factbook            5                    10            0                100
CNN Currency            15                   6             50.00            100
Excite Currency         18                   11            42.86            100
Hotels Hotel            15                   4             0                -
Yahoo Shopping Video    15                   6             0                -
Yahoo Quotes            10                   6             0                -
Yahoo People Email      10                   3             0                -
30
New wrappers (after item recovery)

Web site                Avg Recall (%)   Avg Precision (%)
1Bookstreet Book        98.67            71.26
Allbooks4less Book      75               32.69
Amazon Book (search)    83.05            36.3
Amazon Magazine         100              60.15
Barnesandnoble          78.72            43.13
CIA Factbook            100              100
CNN Currency            100              100
Excite Currency         100              100
Hotels Hotel            50               35.61
Yahoo Shopping          100              51.49
Yahoo Quotes            100              100
Yahoo People            100              53.54
31
New Wrappers (final)

Web site                Avg Recall (%)   Avg Precision (%)
1Bookstreet Book        100              100
Allbooks4less Book      75               51.34
Amazon Book (search)    83.05            90.74
Amazon Magazine         100              100
Barnesandnoble          78.72            100
CIA Factbook            100              100
CNN Currency            100              100
Excite Currency         100              100
Hotels Hotel            50               41.87
Yahoo Shopping          100              92.86
Yahoo Quotes            100              100
Yahoo People            100              100
32
Related Work on Wrapper Maintenance
[Kushmerick 99]: using simple numeric features of the extracted strings
[Lerman K., Minton S. 00]: using the starting and ending strings as the description of the data fields
[Chidlovskii B. 01]: syntactic features of the data items to be extracted, and semantic features: URL, time strings, entities…
33
Comparisons

Title               Our Price   List Price
Data on Web         $23.00      $29.00
Java Programming    $49.00      $59.00

These approaches rely heavily on the syntactic features of the data items and may not recognize data items precisely. In the two layouts shown here the price columns are swapped, yet both prices have the same syntactic form.

Title               List Price  Our Price
Data on Web         $29.00      $23.00
Java Programming    $59.00      $49.00
34
Conclusion
SG-WRAM: a wrapper-maintenance system
Intuition: use features that are more stable
  Pattern
  Hyperlink
  Annotation
Four steps of the approach:
  Data-Feature Discovery
  Item Recovery
  Block Configuration
  Rule Re-induction
Experiments showed that it is effective
35
Thank you!