Comparison of IE Approaches Chia-Hui Chang National Central University Jan. 4, 2005


Introduction
• Abundant information on the Web
– Static Web pages
– Searchable databases: Deep Web
• Information Integration
– Information for life
  • e.g. shopping agents, travel agents
– Data for research purposes
  • e.g. bioinformatics, auction economy

Various IE Surveys
• Muslea
• Hsu and Dung
• Chang
• Kushmerick
• Laender
• Sarawagi
• Kuhlins and Tredwell

Related Work: Time
• MUC Approaches
– AutoSlog [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995]
• Post-MUC Approaches
– WHISK [Soderland, 1999], RAPIER [Califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998], and STALKER [Muslea, 1999]

Related Work: Automation Degree
• Hsu and Dung [1998]
– hand-crafted wrappers using general programming languages
– specially designed programming languages or tools
– heuristic-based wrappers, and
– WI approaches

Related Work: Automation Degree
• Chang and Kuo [2003]
– systems that need programmers,
– systems that need annotation examples,
– annotation-free systems, and
– semi-supervised systems

Related Work: Input and Extraction Rules
• Muslea [1999]
– The first class performs IE from free text using extraction patterns that are mainly based on syntactic/semantic constraints.
– The second class is wrapper induction systems, which rely on delimiter-based rules.
– The third class also performs IE from online documents; however, the patterns of these tools are based on both delimiters and syntactic/semantic constraints.
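
As a rough illustration of the second class, a delimiter-based rule can be sketched as one pair of left and right delimiter strings per attribute, loosely in the spirit of wrapper induction systems such as WIEN. The function name `extract_lr`, the sample page, and the delimiters are hypothetical and are not taken from any of the surveyed systems.

```python
def extract_lr(page, delimiters):
    """Extract tuples using one (left, right) delimiter pair per attribute.

    Hypothetical sketch of a delimiter-based rule, not the code of any
    surveyed system.
    """
    records = []
    if not delimiters:
        return records
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records  # no further left delimiter: stop scanning
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records  # unterminated attribute: stop scanning
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))


# Invented sample page: a country name in <b>...</b>, a code in <i>...</i>.
page = "<b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
rules = [("<b>", "</b>"), ("<i>", "</i>")]
print(extract_lr(page, rules))
```

A rule of this kind uses no linguistic knowledge at all, which is why it suits template-generated pages but not free text.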

Related Work: Extraction Rules
• Kushmerick [2003]
– Finite-state tools (regular expressions)
– Relational learning tools (logic rules)

Related Work: Techniques
• Laender [2002]
– Languages for wrapper development
– HTML-aware tools
– NLP-based tools
– Wrapper induction tools (e.g., WIEN, SoftMealy and STALKER)
– Modeling-based tools
– Ontology-based tools
• New Criteria:
– degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, resilience and adaptiveness.

Related Work: Output Targets
• Sarawagi [2002]
– Record-level
– Page-level
– Site-level

Related Work: Usability
• Kuhlins and Tredwell [2002]
– Commercial
– Noncommercial

Three Dimensions
• Task Domain
– Input (unstructured, semi-structured)
– Output Targets (record-level, page-level, site-level)
• Automation Degree
– Programmer-involved, learning-based or annotation-free approaches
• Techniques
– Regular expression rules vs. Prolog-like logic rules
– Deterministic finite-state transducers vs. probabilistic hidden Markov models

Classification by Automation Degree
• Manual
– TSIMMIS, Minerva, WebOQL, W4F, XWrap
• Supervised
– WIEN, STALKER, SoftMealy
• Semi-supervised
– IEPAD, OLERA
• Unsupervised
– DeLa, RoadRunner, EXALG

Task Domain: Input

Task Domain: Output
• Missing Attributes
• Multi-valued Attributes
• Multiple Permutations
• Nested Data Objects
• Various Templates for an attribute
• Common Templates for various attributes
• Untokenized Attributes
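
To make the first two complications concrete, the sketch below parses records in which one attribute may be missing and another may carry several values. The record layout, the `parse_record` helper, and the regular expressions are all invented for illustration.

```python
import re

# Hypothetical record layout: a title, an optional price, zero or more authors.
TITLE = re.compile(r"<h3>(.*?)</h3>")
PRICE = re.compile(r"<span class=price>(.*?)</span>")
AUTHOR = re.compile(r"<li>(.*?)</li>")


def parse_record(fragment):
    """Return a dict for one record; a missing attribute becomes None."""
    title = TITLE.search(fragment)
    price = PRICE.search(fragment)
    return {
        "title": title.group(1) if title else None,
        "price": price.group(1) if price else None,  # missing attribute
        "authors": AUTHOR.findall(fragment),         # multi-valued attribute
    }


records = [
    "<h3>Book A</h3><span class=price>$10</span><li>Ann</li><li>Bob</li>",
    "<h3>Book B</h3><li>Cy</li>",
]
for fragment in records:
    print(parse_record(fragment))
```

The second record has no price and a single author, so a rule that assumes a fixed number of fields per record would misalign it.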

Tools       PT      NHS  CP       EL            Nested     MA   MVA  MOA      FVF  UDA  UTA  SPA

Manual
Minerva     Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  Yes      Yes  No   Yes  Yes
TSIMMIS     Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  No       Yes  No   Yes  No
WebOQL      Semi-S  No   Yes      Record Level  Yes        Yes  Yes  Yes      Yes  No   No   No
W4F         Semi-S  No   Yes      Record Level  Yes        Yes  Yes  Yes      No   No   No   Yes
XWRAP       Semi-S  No   Yes      Record Level  Yes        Yes  Yes  No       No   No   No   Yes

Supervised
RAPIER      Free    Yes  Yes      Field Level   No         Yes  Yes  Yes      Yes  Yes  Yes  No
SRV         Free    Yes  Yes      Field Level   No         Yes  Yes  Yes      Yes  Yes  Yes  No
WHISK       Free    Yes  Yes      Record Level  No         Yes  Yes  Yes      Yes  Yes  Yes  No
NoDoSE      Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  Yes      No   No   No   No
DEByE       Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  Yes      No   No   No   No
WIEN        Semi-S  Yes  Yes      Record Level  No         No   No   No       No   No   No   No
STALKER     Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  Yes      Yes  No   Yes  Yes
SoftMealy   Semi-S  Yes  Yes      Record Level  MultiPass  Yes  Yes  Limited  Yes  No   Yes  Yes

Semi-Supervised
IEPAD       Semi-S  No   Limited  Record Level  Limited    Yes  Yes  Limited  Yes  No   Yes  Yes
OLERA       Semi-S  No   Limited  Record Level  Limited    Yes  Yes  Limited  Yes  No   Yes  Yes

Un-Supervised
RoadRunner  Semi-S  No   Limited  Page Level    Yes        Yes  Yes  No       No   No   No   Yes
EXALG       Semi-S  Yes  Limited  Page Level    Yes        Yes  Yes  No       Yes  No   No   Yes
DeLa        Semi-S  No   Limited  Record Level  Yes        Yes  Yes  Limited  Yes  No   No   Yes

Automation Degree
• Page-fetching Support
• Annotation Requirement
• Output Support
• API Support

Tools       GUI Support  Page-Fetching Support  Output Support  Training Examples  API Support

Minerva     No           No                     XML             No                 Yes
TSIMMIS     No           No                     Text            No                 Yes
WebOQL      No           No                     Text            No                 Yes
W4F         Yes          Yes                    XML             Labeled            Yes
XWRAP       Yes          Yes                    XML             Labeled            Yes
RAPIER      No           No                     Text            Labeled            No
SRV         No           No                     Text            Labeled            No
WHISK       No           No                     Text            Labeled            No
NoDoSE      Yes          No                     XML, OEM        Labeled            Yes
DEByE       Yes          Yes                    XML, SQL DB     Labeled            Yes
WIEN        Yes          No                     Text            Labeled            Yes
STALKER     Yes          No                     Text            Labeled            Yes
SoftMealy   Yes          Yes                    XML, SQL DB     Labeled            Yes
IEPAD       Yes          No                     Text            Unlabeled          No
OLERA       Yes          No                     XML             Unlabeled          No
RoadRunner  No           Yes                    XML             Unlabeled          Yes
EXALG       No           No                     Text            Unlabeled          No
DeLa        No           Yes                    Text            Unlabeled          Yes

Technologies
• Scan passes
• Extraction rule types
• Learning algorithms
• Tokenization schemes
• Features used
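
Tokenization schemes differ mainly in granularity. The following sketch contrasts a tag-level scheme, where each run of text between tags collapses into a single token, with a word-level scheme that keeps individual words. The token name `TEXT` and both function names are assumptions made for this example, not the encoding used by any particular system.

```python
import re

# Matches either an HTML tag or a run of non-tag text.
TAG_OR_TEXT = re.compile(r"<[^>]+>|[^<]+")


def tag_level_tokens(html):
    """Keep tags; collapse each non-blank text run into a generic TEXT token."""
    tokens = []
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            tokens.append(piece)
        elif piece.strip():
            tokens.append("TEXT")
    return tokens


def word_level_tokens(html):
    """Keep tags; split each text run into individual words."""
    tokens = []
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            tokens.append(piece)
        else:
            tokens.extend(piece.split())
    return tokens


html = "<b>Deep Web</b> <i>2005</i>"
print(tag_level_tokens(html))
print(word_level_tokens(html))
```

A coarser scheme yields shorter sequences that are easier to align or mine for patterns, while a finer one allows extraction boundaries inside a text run.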

Tools       Scan Pass  Extraction Rule Type  Features Used             Learning Algorithm                         Tokenization Schemes

Minerva     Single     Regular exp.          HTML tags/Literal words   None                                       Manually
TSIMMIS     Single     Regular exp.          HTML tags/Literal words   None                                       Manually
WebOQL      Single     Regular exp.          Hypertree                 None                                       Manually
W4F         Single     Regular exp.          DOM tree path addressing  None                                       Tag Level
XWRAP       Single     Context-Free          DOM tree                  None                                       Tag Level
RAPIER      Multiple   Logic rules           Syntactic/Semantic        ILP (bottom-up)                            Word Level
SRV         Multiple   Logic rules           Syntactic/Semantic        ILP (top-down)                             Word Level
WHISK       Single     Regular exp.          Syntactic/Semantic        Set covering (top-down)                    Word Level
NoDoSE      Single     Regular exp.          HTML tags/Literal words   Data Modeling                              Word Level
DEByE       Single     Regular exp.          HTML tags/Literal words   Data Modeling                              Word Level
WIEN        Single     Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                         Word Level
STALKER     Multiple   Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                         Word Level
SoftMealy   Both       Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                         Word Level
IEPAD       Single     Regular exp.          HTML tags                 Pattern Mining, String Alignment           Multi-Level
OLERA       Single     Regular exp.          HTML tags                 String Alignment                           Multi-Level
RoadRunner  Single     Regular exp.          HTML tags                 String Alignment                           Tag Level
EXALG       Single     Regular exp.          HTML tags/Literal words   Equivalent Class and Role Differentiation  Word Level
DeLa        Single     Regular exp.          HTML tags                 Pattern Mining                             Tag Level
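
The idea behind string alignment as a learning step can be sketched by aligning the token sequences of two similar records: tokens shared by both form the template, and the positions where they differ become data slots. The sketch uses Python's standard `difflib.SequenceMatcher` as a stand-in; it is not the alignment algorithm of OLERA, RoadRunner, or IEPAD, and `align_template` and the sample tokens are hypothetical.

```python
from difflib import SequenceMatcher


def align_template(tokens_a, tokens_b):
    """Align two token lists; shared tokens form the template, differences become slots."""
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    template, slots = [], []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(tokens_a[a1:a2])
        else:
            template.append("*")  # marks a data slot in the inferred template
            slots.append((tokens_a[a1:a2], tokens_b[b1:b2]))
    return template, slots


# Invented sample records, already tokenized.
rec1 = ["<li>", "<b>", "Congo", "</b>", "242", "</li>"]
rec2 = ["<li>", "<b>", "Egypt", "</b>", "20", "</li>"]
template, slots = align_template(rec1, rec2)
print(template)
print(slots)
```

No labeled examples are needed here, which matches the "Unlabeled" training requirement listed for the alignment-based tools.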

Conclusion
• Criteria for evaluating IE systems from the task domain
• Comparison of IE systems across various degrees of automation
• The use of various techniques in IE systems

Future Work
• Page Fetching
– XWrap, W4F, WNDL
• Schema Mapping
– Full information
– Partial information
• Query Interface Integration
– [He, Chang and Han, 2004]

References
• C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan, Criteria for Evaluating Web Information Extraction Systems.
