Project Overview Vangelis Karkaletsis NCSR “Demokritos” Frascati, July 17, 2002 (IST-2000-25366) http://www.iit.demokritos.gr/skel/crossmarc/


Page 1

Project Overview

Vangelis Karkaletsis NCSR “Demokritos”

Frascati, July 17, 2002

(IST-2000-25366)

http://www.iit.demokritos.gr/skel/crossmarc/

Page 2

© CROSSMARC, Frascati, July 17, 2002

Objectives

CROSSMARC develops commercial-strength technology for information extraction from web pages that

employs state-of-the-art language engineering tools and techniques

can be used to process pages written in several languages

can be adapted semi-automatically to new product types

Page 3

CROSSMARC Consortium

C   National Centre for Scientific Research “Demokritos”   EL

P   VeltiNet A.E.   EL

P   University of Edinburgh   UK

P   Universita di Roma Tor Vergata   I

P   Informatique CDC   F

P   Lingway   F

Start Date: March 1, 2001

End Date: August 31, 2003

Page 4

CROSSMARC big picture

[Diagram: CROSSMARC architecture. Labels recovered from the figure: WEB; Focused Crawling; Domain-specific Web sites; Domain-specific Spidering; Domain Ontology; Web Pages Collection; XHTML pages; "with NE annotations"; NERC-FE; Multilingual NERC and Name Matching; Multilingual and Multimedia Fact Extraction; XML pages; Insertion into the data base; Products Database; User Interface; End user.]
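The flow suggested by the diagram labels can be sketched as a chain of small stages. This is only an illustrative toy, not the CROSSMARC implementation: the ordering of the stages is read off the recovered labels and is an assumption, and every name below (`Page`, `spider`, `nerc`, `fact_extraction`, `run_pipeline`, the example lexicon) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Page:
    url: str
    text: str
    entities: List[str] = field(default_factory=list)  # filled by the NERC stage
    facts: Dict[str, object] = field(default_factory=dict)  # filled by the FE stage


def spider(site_pages: List[Page], lexicon: Set[str]) -> List[Page]:
    """Toy spidering step: keep pages that mention at least one lexicon term."""
    return [p for p in site_pages if any(t in p.text.lower() for t in lexicon)]


def nerc(page: Page, lexicon: Set[str]) -> Page:
    """Toy entity recognition: record the lexicon terms found in the page."""
    page.entities = sorted(t for t in lexicon if t in page.text.lower())
    return page


def fact_extraction(page: Page) -> Page:
    """Toy fact extraction: bundle the recognised entities into one record."""
    page.facts = {"url": page.url, "entities": page.entities}
    return page


def run_pipeline(site_pages: List[Page], lexicon: Set[str]) -> List[Dict[str, object]]:
    """Spider, annotate, extract, then 'insert' records into a list standing in for the database."""
    database: List[Dict[str, object]] = []
    for page in spider(site_pages, lexicon):
        database.append(fact_extraction(nerc(page, lexicon)).facts)
    return database
```

Calling `run_pipeline` with a handful of `Page` objects and a small set of lowercase domain terms returns one record per page that passed the spidering filter.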

Page 5

Other CROSSMARC Tools

Corpus Formation Tool

Corpus collection and annotation methodology

Web Annotator

Ontology maintenance tools

NERC-based Demarcation tool

Page 6

Adding an IE system for a new language ...

add the language-specific lexicon for the domain, according to the lexicon schema and the domain ontology

train the web spidering tool using the language-specific lexicon and corpus

use the web annotator to create a corpus for training and testing the IE system

configure the IE system to accept as input the XHTML pages provided by the spidering tool

configure the IE system to output an XML document for the product description(s) found in the page, according to the FE schema

connect to the IE remote invocation system (IERI)
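The output step above (one XML document per page, holding the product descriptions found in it) might look roughly like the following. This is a minimal sketch under stated assumptions: the FE schema is not shown in these slides, so the element names `page` and `product`, the `url` attribute, and the function `product_to_xml` are all hypothetical placeholders rather than the project's actual schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def product_to_xml(page_url: str, products: List[Dict[str, str]]) -> str:
    """Serialise the product descriptions found in one page as an XML string.

    Element and attribute names are illustrative only; a real system would
    follow the FE schema, which is not reproduced here.
    """
    root = ET.Element("page", attrib={"url": page_url})
    for product in products:
        node = ET.SubElement(root, "product")
        for name, value in product.items():
            # Each extracted feature becomes a child element of <product>.
            child = ET.SubElement(node, name)
            child.text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Each dictionary passed in `products` yields one `product` element, so a page offering several products produces several sibling elements under the same root.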

Page 7

Adding a new domain ...

add the new domain-specific ontology and lexicons for the languages supported according to the ontology and lexicon schema

add the domain-specific FE schema

train and test the web spidering tool (page filtering, link scoring) using the provided tools

create the domain-specific training and testing corpus following the corpus collection methodology and using the web annotator tool

train the monolingual IE systems in the new domain (combining linguistic information with wrapper induction)
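To make the "link scoring" part of the spidering step concrete, here is a toy scorer. It is not the trained CROSSMARC component: a hand-written lexicon-overlap ratio stands in for whatever the trained tool computes, and the names `score_link` and `rank_links` as well as the example lexicon are assumptions made for illustration.

```python
import re
from typing import List, Set, Tuple


def score_link(anchor_text: str, url: str, lexicon: Set[str]) -> float:
    """Return the fraction of tokens in the anchor text and URL that are lexicon terms."""
    tokens = re.findall(r"[a-z0-9]+", (anchor_text + " " + url).lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


def rank_links(links: List[Tuple[str, str]], lexicon: Set[str]) -> List[Tuple[str, str]]:
    """Order (anchor_text, url) pairs from most to least promising."""
    return sorted(links, key=lambda link: score_link(link[0], link[1], lexicon), reverse=True)
```

A spider could follow the top-ranked links first; swapping in a new domain would then amount to supplying a different lexicon, which mirrors the role the domain-specific lexicons play in the steps above.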

Page 8

Evaluation Task

Web Spidering

Information Extraction

End-user interface

Page 9

CROSSMARC Evaluation Questionnaire