Comparison of IE Approaches Chia-Hui Chang National Central University Jan. 4, 2005

Introduction
• Abundant information on the Web
  – Static Web pages
  – Searchable databases: Deep Web
• Information Integration
  – Information for life
    • e.g. shopping agents, travel agents
  – Data for research purposes
    • e.g. bioinformatics, auction economy

Various IE Surveys
• Muslea
• Hsu and Dung
• Chang
• Kushmerick
• Laender
• Sarawagi
• Kuhlins and Tredwell

Related Work: Time
• MUC Approaches
  – AutoSlog [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995]
• Post-MUC Approaches
  – WHISK [Soderland, 1999], RAPIER [Califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998], and STALKER [Muslea, 1999]

Related Work: Automation Degree
• Hsu and Dung [1998]
  – hand-crafted wrappers using general programming languages
  – specially designed programming languages or tools
  – heuristic-based wrappers, and
  – wrapper induction (WI) approaches

Related Work: Automation Degree
• Chang and Kuo [2003]
  – systems that need programmers,
  – systems that need annotation examples,
  – annotation-free systems, and
  – semi-supervised systems

Related Work: Input and Extraction Rules
• Muslea [1999]
  – The first class is IE from free text, using extraction patterns that are mainly based on syntactic/semantic constraints.
  – The second class is wrapper induction systems, which rely on delimiter-based rules.
  – The third class also processes IE from online documents; however, the patterns of these tools are based on both delimiters and syntactic/semantic constraints.
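To make "delimiter-based rules" concrete, here is a minimal Python sketch of a rule that locates each attribute by a left and a right delimiter string. The function name `extract_lr`, the sample page, and the delimiter pairs are all hypothetical, chosen only for illustration; real wrapper induction systems learn such delimiters from labeled pages rather than taking them as input.

```python
def extract_lr(page, delimiters):
    """Extract tuples using one (left, right) delimiter pair per attribute.

    Illustrative only: the delimiters are supplied by hand here, whereas a
    wrapper induction system would learn them from examples.
    """
    if not delimiters:
        return []
    records = []
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records          # no further attribute: stop scanning
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records          # unterminated attribute: stop scanning
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))


# Hypothetical sample page and rule.
page = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"
rules = [("<b>", "</b>"), ("<i>", "</i>")]
print(extract_lr(page, rules))
```

A rule of this kind needs no syntactic or semantic knowledge of the text, which is why it suits template-generated pages but not free text.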

Related Work: Extraction Rules
• Kushmerick [2003]
  – Finite-state tools (regular expressions)
  – Relational learning tools (logic rules)
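The contrast between the two rule families can be sketched in Python as follows. Everything here is an assumption made for illustration: the "Price:" field, the sample text, and the function names are invented, and the logic-style rule is hand-written, whereas relational learners induce such rules from examples.

```python
import re

# Finite-state style: one regular expression describes the field and its context.
PRICE_RE = re.compile(r"Price:\s*\$(\d+(?:\.\d{2})?)")


def extract_price_regex(text):
    match = PRICE_RE.search(text)
    return match.group(1) if match else None


# Relational style: a rule is a conjunction of predicates over a candidate
# token and its neighbours (hand-written here, not learned).
def is_price(tokens, i):
    return (
        i >= 2
        and tokens[i - 2] == "Price:"
        and tokens[i - 1] == "$"
        and tokens[i].replace(".", "", 1).isdigit()
    )


def extract_price_logic(tokens):
    return [tokens[i] for i in range(len(tokens)) if is_price(tokens, i)]


# Hypothetical input.
text = "Title: Web Mining Price: $42.50 In stock"
tokens = ["Title:", "Web", "Mining", "Price:", "$", "42.50", "In", "stock"]
print(extract_price_regex(text))
print(extract_price_logic(tokens))
```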

Related Work: Techniques
• Laender [2002]
  – languages for wrapper development
  – HTML-aware tools
  – NLP-based tools
  – Wrapper induction tools (e.g., WIEN, SoftMealy and STALKER)
  – Modeling-based tools
  – Ontology-based tools
• New Criteria:
  – degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, resilience and adaptiveness.

Related Work: Output Targets
• Sarawagi [2002]
  – Record-level
  – Page-level
  – Site-level

Related Work: Usability
• Kuhlins and Tredwell [2002]
  – Commercial
  – Noncommercial

Three Dimensions
• Task Domain
  – Input (unstructured, semi-structured)
  – Output Targets (record-level, page-level, site-level)
• Automation Degree
  – Programmer-involved, learning-based, or annotation-free approaches
• Techniques
  – Regular expression rules vs. Prolog-like logic rules
  – Deterministic finite-state transducers vs. probabilistic hidden Markov models

Classification by Automation Degree
• Manually
  – TSIMMIS, Minerva, WebOQL, W4F, XWrap
• Supervised
  – WIEN, STALKER, SoftMealy
• Semi-supervised
  – IEPAD, OLERA
• Unsupervised
  – DeLa, RoadRunner, EXALG

Task Domain: Input

Task Domain: Output
• Missing Attributes
• Multi-valued Attributes
• Multiple Permutations
• Nested Data Objects
• Various Templates for an Attribute
• Common Templates for Various Attributes
• Untokenized Attributes
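Several of these output variations can be pictured with a small data model. The Python sketch below is an illustration under stated assumptions, not a format used by any surveyed tool: the `Book` and `Author` classes, the "Label: value" line syntax, and the sample lines are all invented. It covers missing attributes (optional fields), multi-valued attributes (lists), multiple permutations (order-independent parsing), and nested data objects (authors inside a book).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Author:
    # Nested data object: an author is itself a small record.
    name: str


@dataclass
class Book:
    title: Optional[str] = None                           # may be missing
    authors: List[Author] = field(default_factory=list)   # multi-valued, nested
    price: Optional[str] = None                           # may be missing


def parse_record(line):
    """Parse 'Label: value' pairs separated by ';', in any order."""
    book = Book()
    for part in line.split(";"):
        label, _, value = part.partition(":")
        label, value = label.strip().lower(), value.strip()
        if label == "title":
            book.title = value
        elif label == "author":
            book.authors.append(Author(value))
        elif label == "price":
            book.price = value
    return book


# Hypothetical input lines, invented for illustration.
lines = [
    "Title: Web Mining; Author: Lee; Author: Kim; Price: 30",
    "Price: 12; Title: Data Cleaning; Author: Wu",
    "Title: Old Notes",
]
books = [parse_record(line) for line in lines]
```

In this toy input every value carries an explicit label, so permutations are easy to handle; on real pages the attributes are usually unlabeled, which is what makes missing and reordered attributes hard for an extractor.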

Tools       PT      NHS  CP       EL            Nested     MA   MVA  MOA      FVF  UDA  UTA  SPA

Manual
Minerva     Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  Yes      Yes  No   Yes  Yes
TSIMMIS     Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  No       Yes  No   Yes  No
WebOQL      Semi-S  No   Yes      Record Level  Yes        Yes  Yes  Yes      Yes  No   No   No
W4F         Semi-S  No   Yes      Record Level  Yes        Yes  Yes  Yes      No   No   No   Yes
XWRAP       Semi-S  No   Yes      Record Level  Yes        Yes  Yes  No       No   No   No   Yes

Supervised
RAPIER      Free    Yes  Yes      Field Level   No         Yes  Yes  Yes      Yes  Yes  Yes  No
SRV         Free    Yes  Yes      Field Level   No         Yes  Yes  Yes      Yes  Yes  Yes  No
WHISK       Free    Yes  Yes      Record Level  No         Yes  Yes  Yes      Yes  Yes  Yes  No
NoDoSE      Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  Yes      No   No   No   No
DEByE       Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  Yes      No   No   No   No
WIEN        Semi-S  Yes  Yes      Record Level  No         No   No   No       No   No   No   No
STALKER     Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  Yes      Yes  No   Yes  Yes
SoftMealy   Semi-S  Yes  Yes      Record Level  MultiPass  Yes  Yes  Limited  Yes  No   Yes  Yes

Semi-Supervised
IEPAD       Semi-S  No   Limited  Record Level  Limited    Yes  Yes  Limited  Yes  No   Yes  Yes
OLERA       Semi-S  No   Limited  Record Level  Limited    Yes  Yes  Limited  Yes  No   Yes  Yes

Unsupervised
RoadRunner  Semi-S  No   Limited  Page Level    Yes        Yes  Yes  No       No   No   No   Yes
EXALG       Semi-S  Yes  Limited  Page Level    Yes        Yes  Yes  No       Yes  No   No   Yes
DeLa        Semi-S  No   Limited  Record Level  Yes        Yes  Yes  Limited  Yes  No   No   Yes

Automation Degree
• Page-fetching Support
• Annotation Requirement
• Output Support
• API Support

Tools       GUI Support  Page-Fetching Support  Output Support  Training Examples  API Support
Minerva     No           No                     XML             No                 Yes
TSIMMIS     No           No                     Text            No                 Yes
WebOQL      No           No                     Text            No                 Yes
W4F         Yes          Yes                    XML             Labeled            Yes
XWRAP       Yes          Yes                    XML             Labeled            Yes
RAPIER      No           No                     Text            Labeled            No
SRV         No           No                     Text            Labeled            No
WHISK       No           No                     Text            Labeled            No
NoDoSE      Yes          No                     XML, OEM        Labeled            Yes
DEByE       Yes          Yes                    XML, SQL DB     Labeled            Yes
WIEN        Yes          No                     Text            Labeled            Yes
STALKER     Yes          No                     Text            Labeled            Yes
SoftMealy   Yes          Yes                    XML, SQL DB     Labeled            Yes
IEPAD       Yes          No                     Text            Unlabeled          No
OLERA       Yes          No                     XML             Unlabeled          No
RoadRunner  No           Yes                    XML             Unlabeled          Yes
EXALG       No           No                     Text            Unlabeled          No
DeLa        No           Yes                    Text            Unlabeled          Yes

Technologies
• Scan passes
• Extraction rule types
• Learning algorithms
• Tokenization schemes
• Features used
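As an illustration of what different tokenization schemes can mean, the Python sketch below encodes the same HTML fragment at word level and at tag level. The slides do not specify how each tool tokenizes its input, so the regular expressions, the `TEXT` placeholder token, and the function names are assumptions made for this example.

```python
import re

# A token is either a whole HTML tag or a run of non-space, non-'<' characters.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def word_level(html):
    """Every tag and every whitespace-separated word becomes a token."""
    return TOKEN_RE.findall(html)


def tag_level(html):
    """Tags are kept; each run of text between tags collapses to one TEXT token."""
    tokens = []
    for part in re.split(r"(<[^>]+>)", html):
        if not part.strip():
            continue                     # skip empty or whitespace-only pieces
        tokens.append(part if part.startswith("<") else "TEXT")
    return tokens


# Hypothetical fragment.
html = "<li><b>New Congo</b> 242</li>"
print(word_level(html))
print(tag_level(html))
```

The coarser tag-level encoding hides differences in the data values, which tends to make repeated template structure easier to detect; the word-level encoding keeps literal words available as delimiters.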

Tools       Scan Pass  Extraction Rule Type  Features Used             Learning Algorithm                         Tokenization Schemes
Minerva     Single     Regular exp.          HTML tags/Literal words   None                                       Manually
TSIMMIS     Single     Regular exp.          HTML tags/Literal words   None                                       Manually
WebOQL      Single     Regular exp.          Hypertree                 None                                       Manually
W4F         Single     Regular exp.          DOM tree path addressing  None                                       Tag Level
XWRAP       Single     Context-Free          DOM tree                  None                                       Tag Level
RAPIER      Multiple   Logic rules           Syntactic/Semantic        ILP (bottom-up)                            Word Level
SRV         Multiple   Logic rules           Syntactic/Semantic        ILP (top-down)                             Word Level
WHISK       Single     Regular exp.          Syntactic/Semantic        Set covering (top-down)                    Word Level
NoDoSE      Single     Regular exp.          HTML tags/Literal words   Data Modeling                              Word Level
DEByE       Single     Regular exp.          HTML tags/Literal words   Data Modeling                              Word Level
WIEN        Single     Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                         Word Level
STALKER     Multiple   Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                         Word Level
SoftMealy   Both       Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                         Word Level
IEPAD       Single     Regular exp.          HTML tags                 Pattern Mining, String Alignment           Multi-Level
OLERA       Single     Regular exp.          HTML tags                 String Alignment                           Multi-Level
RoadRunner  Single     Regular exp.          HTML tags                 String Alignment                           Tag Level
EXALG       Single     Regular exp.          HTML tags/Literal words   Equivalent Class and Role Differentiation  Word Level
DeLa        Single     Regular exp.          HTML tags                 Pattern Mining                             Tag Level
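The "String Alignment" entries in the table can be illustrated with a generic sketch: aligning the token sequences of two records so that shared tokens form a template and mismatches become data slots. This uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner; it is not the algorithm of IEPAD, OLERA, or RoadRunner, and the `#DATA` marker, function name, and sample records are assumptions for illustration.

```python
from difflib import SequenceMatcher


def align_template(tokens_a, tokens_b):
    """Return (template, slots) from aligning two token sequences.

    Tokens common to both sequences form the template; each differing
    region is replaced by a '#DATA' marker and recorded as a slot.
    """
    template, slots = [], []
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            template.extend(tokens_a[a0:a1])
        else:
            template.append("#DATA")
            slots.append((tokens_a[a0:a1], tokens_b[b0:b1]))
    return template, slots


# Hypothetical records, already tokenized.
rec_a = ["<li>", "<b>", "Congo", "</b>", "<i>", "242", "</i>", "</li>"]
rec_b = ["<li>", "<b>", "Egypt", "</b>", "<i>", "20", "</i>", "</li>"]
template, slots = align_template(rec_a, rec_b)
print(template)
print(slots)
```

Because no labeled examples are involved, a comparison of this kind fits the "Unlabeled" training-example setting of the semi-supervised and unsupervised tools.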

Conclusion
• Criteria for evaluating IE systems from the task domain
• Comparison of IE systems across various degrees of automation
• The use of various techniques in IE systems

Future Work
• Page Fetching
  – XWrap, W4F, WNDL
• Schema Mapping
  – Full information
  – Partial information
• Query Interface Integration
  – [He, Chang and Han, 2004]

References
• C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan, Criteria for Evaluating Web Information Extraction Systems.