Project Overview
Vangelis Karkaletsis NCSR “Demokritos”
Frascati, July 17, 2002
(IST-2000-25366)
http://www.iit.demokritos.gr/skel/crossmarc/
© CROSSMARC, Frascati, July 17, 2002
Objectives
CROSSMARC develops commercial-strength technology for information extraction from web pages that
- employs state-of-the-art language engineering tools and techniques
- can be used to process pages written in several languages
- can be adapted semi-automatically to new product types
CROSSMARC Consortium
C  National Centre for Scientific Research “Demokritos” (EL)
P  VeltiNet A.E. (EL)
P  University of Edinburgh (UK)
P  Universita di Roma Tor Vergata (I)
P  Informatique CDC (F)
P  Lingway (F)
Start Date: March 1, 2001
End Date: August 31, 2003
CROSSMARC big picture
[Figure: architecture diagram. Labels, in extraction order: Domain-specific Web sites; Domain-specific Spidering; Domain Ontology; XHTML pages; WEB; Focused Crawling; Web Pages Collection with NE annotations; NERC-FE; Multilingual NERC and Name Matching; Multilingual and Multimedia Fact Extraction; XHTML pages; XML pages; Insertion into the data base; Products Database; User Interface; End user]
Other CROSSMARC Tools
- Corpus Formation Tool
- Corpus collection and annotation methodology
- Web Annotator
- Ontology maintenance tools
- NERC-based Demarcation tool
Adding an IE system for a new language ...
- add the language-specific lexicon for the domain, according to the lexicon schema and the domain ontology
- train the web spidering tool using the language-specific lexicon and corpus
- use the web annotator to create a corpus for training and testing the IE system
- configure the IE system to accept as input the XHTML pages provided by the spidering tool
- configure the IE system to output an XML document for the product description(s) found in the page, according to the FE schema
- connect to the IE remote invocation system (IERI)
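The output step above (one XML document per page, holding the product descriptions found in it) can be illustrated with a minimal sketch. The FE schema itself is not shown in these slides, so the element and attribute names (`page`, `product`, `fact`, `url`, `lang`, `name`), the function `products_to_xml`, and the sample data are all hypothetical placeholders, not the CROSSMARC schema.

```python
import xml.etree.ElementTree as ET


def products_to_xml(page_url, language, products):
    """Serialize extracted product descriptions to an XML string.

    Element and attribute names are hypothetical placeholders;
    they do not reproduce the CROSSMARC FE schema.
    """
    root = ET.Element("page", {"url": page_url, "lang": language})
    for facts in products:
        # One <product> element per product description found in the page.
        product = ET.SubElement(root, "product")
        for name, value in facts.items():
            fact = ET.SubElement(product, "fact", {"name": name})
            fact.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up example data for a single page with one product description.
xml_doc = products_to_xml(
    "http://example.org/products.html",
    "en",
    [{"manufacturer": "ExampleCorp", "processor_speed": "1.2 GHz"}],
)
print(xml_doc)
```

A real IE system would validate its output against the actual FE schema before handing it on for insertion into the products database.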
Adding a new domain ...
- add the new domain-specific ontology and lexicons for the languages supported, according to the ontology and lexicon schema
- add the domain-specific FE schema
- train and test the web spidering tool (page filtering, link scoring) using the provided tools
- create the domain-specific training and testing corpus, following the corpus collection methodology and using the web annotator tool
- train the monolingual IE systems in the new domain (combining linguistic information with wrapper induction)
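The two spidering subtasks named above, page filtering and link scoring, can be pictured with a toy heuristic. This is only a sketch under stated assumptions: it uses plain lexicon overlap in place of the trained components the slides refer to, and the function names, the threshold value, and the sample lexicon and links are all hypothetical.

```python
import re


def score_text(text, lexicon):
    """Return the fraction of tokens in `text` that appear in `lexicon`."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


def is_relevant_page(page_text, lexicon, threshold=0.05):
    """Page filtering: keep a page if enough of its tokens are domain terms."""
    return score_text(page_text, lexicon) >= threshold


def rank_links(links, lexicon):
    """Link scoring: order (url, anchor_text) pairs best-first by anchor score."""
    ranked = sorted(links, key=lambda link: score_text(link[1], lexicon), reverse=True)
    return [url for url, _anchor in ranked]


# Hypothetical domain lexicon and links, for illustration only.
lexicon = {"laptop", "notebook", "processor", "ram"}
links = [
    ("http://example.org/about", "About us"),
    ("http://example.org/laptops", "Laptop and notebook offers"),
    ("http://example.org/news", "Processor news today"),
]
ranked = rank_links(links, lexicon)
print(ranked)
```

A focused crawler would follow the highest-scoring links first and pass only the pages that survive filtering on to the IE systems.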
Evaluation Task
- Web Spidering
- Information Extraction
- End-user interface
CROSSMARC Evaluation Questionnaire