Machine Learning for Information Integration on the Web

Georgios Paliouras
Software & Knowledge Engineering Lab
Inst. of Informatics & Telecommunications, NCSR “Demokritos”
http://www.iit.demokritos.gr/skel

Dagstuhl, February 15, 2005


SKEL Introduction

SKEL’s research objective: innovative knowledge technologies for reducing the information overload on the Web.

• Areas of research activity:
– Information gathering (retrieval, crawling, spidering)
– Information filtering (text and multimedia classification)
– Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)
– Personalization (user stereotypes and communities)

Structure of the talk

• Web Information integration in CROSSMARC
• Learning Context Free Grammars
• Meta-learning for Web Information Extraction
• Machine Learning for Ontology Maintenance
• Conclusions

CROSSMARC consortium

• National Centre for Scientific Research "Demokritos" (GR)
• University of Edinburgh (UK)
• Università di Roma Tor Vergata (IT)
• VeltiNet A.E. (GR)
• Lingway (FR)

CROSSMARC Objectives

Develop technology for Information Integration that can:

• crawl the Web for interesting Web pages,
• extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
• process Web pages written in several languages,
• be customized semi-automatically to new domains and languages,
• deliver integrated information according to personalized profiles.

CROSSMARC Architecture

[Figure: CROSSMARC architecture diagram. The only label recoverable from the slide is "Ontology".]

CROSSMARC Ontology

Ontology:

…
<description>Laptops</description>
<features>
  <feature id="OF-d0e5">
    <description>Processor</description>
    <attribute type="basic" id="OA-d0e7">
      <description>Processor Name</description>
      <discrete_set type="open">
        <value id="OV-d0e1041">
          <description>Intel Pentium 3</description>
        </value>
        …

Lexicon:

<node idref="OV-d0e1041">
  <synonym>Intel Pentium III</synonym>
  <synonym>Pentium III</synonym>
  <synonym>P3</synonym>
  <synonym>PIII</synonym>
</node>

Greek Lexicon (the synonym is Greek for "Processor Name"):

<node idref="OA-d0e7">
  <synonym>Όνομα Επεξεργαστή</synonym>
</node>
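The lexicon fragments above link surface forms to ontology ids through the idref attribute. A minimal sketch of how such a lexicon could be turned into a lookup table is given below. The <lexicon> wrapper element, the function name build_synonym_index and the lower-casing of synonyms are assumptions made for illustration; they are not taken from CROSSMARC.

```python
import xml.etree.ElementTree as ET

# Lexicon fragment copied from the slide. The <lexicon> wrapper is an
# assumption, added only so that the fragment parses as one XML document.
LEXICON_XML = """
<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
</lexicon>
"""

def build_synonym_index(lexicon_xml):
    """Map each lower-cased synonym to the ontology id it refers to."""
    index = {}
    for node in ET.fromstring(lexicon_xml).iter("node"):
        for synonym in node.iter("synonym"):
            index[synonym.text.strip().lower()] = node.get("idref")
    return index

index = build_synonym_index(LEXICON_XML)
print(index.get("piii"))       # ontology id for this surface form
print(index.get("pentium 4"))  # None: surface form not in the lexicon
```

Keeping one lexicon per language, as on the slide, lets the same ontology id be reached from English and Greek surface forms.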


Learning Context Free Grammars

Introducing eg-GRIDS

• Infers context-free grammars.
• Learns from positive examples only.
• Overgeneralisation is controlled through a heuristic based on MDL.
• Two basic/three auxiliary learning operators.
• Two search strategies:
– Beam search.
– Genetic search.

Learning Context Free Grammars

Minimum Description Length (MDL)

Model Length (ML) = GDL + DDL

• Grammar Description Length (GDL): bits required to encode the grammar G.
• Derivations Description Length (DDL): bits required to encode all training examples, as encoded by the grammar G.

[Figure: the space of hypotheses, ranging from an overly specific grammar to an overly general grammar, with GDL and DDL marked.]
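The trade-off in ML = GDL + DDL can be illustrated with a toy calculation. The sketch below is not the encoding used by eg-GRIDS; it assumes a deliberately simple scheme in which every symbol written in the grammar costs the same number of bits, and every rule choice in a derivation costs log2 of the number of alternatives. The function names and the example grammars are hypothetical.

```python
import math

def grammar_bits(grammar):
    """Toy GDL: each symbol occurrence (rule head, body symbols and an
    end-of-rule marker) costs log2(number of distinct symbols + 1) bits."""
    symbols = set(grammar)
    for bodies in grammar.values():
        for body in bodies:
            symbols.update(body)
    bits_per_symbol = math.log2(len(symbols) + 1)  # +1 for the end marker
    occurrences = sum(1 + len(body) + 1
                      for bodies in grammar.values() for body in bodies)
    return occurrences * bits_per_symbol

def derivation_bits(grammar, derivations):
    """Toy DDL: expanding a non-terminal costs log2(number of its rules)."""
    return sum(math.log2(len(grammar[nt]))
               for derivation in derivations for nt in derivation)

def model_length(grammar, derivations):
    return grammar_bits(grammar) + derivation_bits(grammar, derivations)

# Training examples: "a b", "a a b", "a a a b".
# Overly specific grammar: one rule per example.
specific = {"S": [("a", "b"), ("a", "a", "b"), ("a", "a", "a", "b")]}
specific_derivs = [["S"], ["S"], ["S"]]

# More general grammar: S -> A b, A -> a | a A.
general = {"S": [("A", "b")], "A": [("a",), ("a", "A")]}
# Each derivation lists the non-terminals expanded to produce one example.
general_derivs = [["S", "A"], ["S", "A", "A"], ["S", "A", "A", "A"]]

print(model_length(specific, specific_derivs))
print(model_length(general, general_derivs))
```

Under these assumptions the general grammar has a smaller GDL and a slightly larger DDL, and its total model length is lower, which is the kind of comparison the MDL heuristic relies on.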

Learning Context Free Grammars

eg-GRIDS Architecture

[Figure: eg-GRIDS architecture. Training examples are turned into an overly specific grammar, which initialises the beam of grammars. Depending on the operator mode, new grammars are produced either by the learning operators (Merge NT, Create NT, Create Optional NT, Detect Center Embedding, Body Substitution) or by the evolutionary algorithm (Mutation), followed by search organisation and selection. The decision "Any inferred grammar better than those in beam?" has a YES branch and a NO branch; the output is the final grammar.]
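The loop in the architecture figure (apply operators to the grammars in the beam, keep the best, stop when nothing better is inferred) can be sketched as follows. This is a simplified illustration, not the eg-GRIDS implementation: it uses only a MergeNT-style operator, and it scores grammars by their size alone instead of the full MDL model length, so on real data it would overgeneralise. All function names are hypothetical.

```python
def merge_nt(rules, keep, drop):
    """MergeNT-style step: rename non-terminal `drop` to `keep` in every rule."""
    def rename(symbol):
        return keep if symbol == drop else symbol
    return frozenset((rename(head), tuple(rename(s) for s in body))
                     for head, body in rules)

def size(rules):
    """Crude stand-in for the MDL score: symbols needed to write the rules."""
    return sum(1 + len(body) for _, body in rules)

def neighbours(rules, start="S"):
    """All grammars reachable by merging one pair of non-terminals."""
    nts = sorted({head for head, _ in rules})
    for i, a in enumerate(nts):
        for b in nts[i + 1:]:
            # Never rename the start symbol away.
            keep, drop = (b, a) if b == start else (a, b)
            yield merge_nt(rules, keep, drop)

def beam_search(initial, beam_width=3):
    beam = [initial]
    while True:
        candidates = {g for rules in beam for g in neighbours(rules)} - set(beam)
        best = min(size(g) for g in beam)
        better = [g for g in candidates if size(g) < best]
        if not better:  # no inferred grammar beats the beam: stop
            return min(beam, key=size)
        beam = sorted(set(beam) | set(better), key=size)[:beam_width]

# A grammar is a set of (head, body) rules; A and B are redundant here.
overly_specific = frozenset({
    ("S", ("A", "b")), ("S", ("B", "b")),
    ("A", ("a",)), ("B", ("a",)),
})
final = beam_search(overly_specific)
print(sorted(final))
```

Requiring a strict improvement over the best grammar in the beam guarantees termination in this sketch; a genetic search would replace the neighbours step with mutation and selection.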


Meta-learning for Web IE

Stacked generalization

[Figure: stacked generalization. Training: for each fold j, the base-level learners L1…LN are trained on D \ Dj, giving classifiers C1(j)…CN(j); their predictions on Dj form the meta-level dataset MDj. The meta-level learner LM is trained on the meta-level dataset MD, giving the meta-level classifier CM. Run-time: the classifiers C1...CN, trained by L1…LN on the whole base-level dataset D, map a new vector x to a meta-level vector, from which CM produces the class value y(x).]
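The stacked generalization scheme in the figure can be written down in a few lines. The sketch below uses the figure's symbols (D, Dj, MD, C1…CN, CM) in comments; the two toy learners and the name stack are assumptions chosen to keep the example self-contained, not components of the system presented in the talk.

```python
from collections import Counter

def threshold_learner(feature):
    """Base learner: predict 1 when x[feature] >= the training mean."""
    def learn(train):
        mean = sum(x[feature] for x, _ in train) / len(train)
        return lambda x: int(x[feature] >= mean)
    return learn

def lookup_learner(train):
    """Meta-level learner: majority class per meta-level vector."""
    table = {}
    for vector, y in train:
        table.setdefault(vector, Counter())[y] += 1
    default = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda v: table[v].most_common(1)[0][0] if v in table else default

def stack(data, base_learners, meta_learner, folds=3):
    meta_data = []                                        # MD
    for j in range(folds):
        held_out = data[j::folds]                         # Dj
        rest = [ex for i, ex in enumerate(data) if i % folds != j]  # D \ Dj
        classifiers = [learn(rest) for learn in base_learners]      # C1(j)..CN(j)
        meta_data += [(tuple(c(x) for c in classifiers), y)
                      for x, y in held_out]               # MDj
    meta_classifier = meta_learner(meta_data)             # CM
    final = [learn(data) for learn in base_learners]      # C1..CN on all of D
    return lambda x: meta_classifier(tuple(c(x) for c in final))

# Toy data: the class depends on feature 0 only; feature 1 is noise.
D = [((i, (i * 7) % 10), int(i >= 5)) for i in range(10)]
predict = stack(D, [threshold_learner(0), threshold_learner(1)], lookup_learner)
print(predict((8, 0)), predict((2, 9)))
```

The meta-level classifier learns from held-out predictions which base classifier to trust, which is why the base classifiers used to build MD are never evaluated on their own training data.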

Meta-learning for Web IE

Information Extraction is not naturally a classification task. In IE we deal with text documents, paired with templates.

Example document:

…TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB SDRAM up to 1GB…

Template T:

t(s,e)                  s, e     Field f
Transport ZX            47, 49   model
15”                     56, 58   screenSize
TFT                     59, 60   screenType
Intel <b> Pentium III   63, 67   procName
600 MHz                 67, 69   procSpeed
256 MB                  76, 78   ram

Each template is filled with instances <t(s,e), f>, where t(s,e) is the text fragment between positions s and e of the document and f is the field it fills.

Meta-learning for Web IE

Combining Information Extraction systems

T1 filled by the IE system E1:

t(s, e)                 s, e     f
Transport ZX            47, 49   model
15”                     56, 58   screenSize
TFT                     59, 60   screenType
Intel <b> Pentium III   63, 67   procName
600 MHz                 67, 69   procSpeed
256 MB                  76, 78   ram
1 GB                    81, 83   ram

T2 filled by the IE system E2:

t(s, e)                 s, e     f
Transport ZX            47, 49   manuf
TFT                     59, 60   screenType
Intel <b> Pentium       63, 66   procName
600 MHz                 67, 69   procSpeed
256 MB                  76, 78   ram
1 GB                    81, 83   HDcapacity

Meta-learning for Web IE

Creating a stacked template

Stacked template (ST):

s, e     t(s, e)                 Field by E1   Field by E2   Correct field
47, 49   Transport ZX            model         manuf         model
56, 58   15”                     screenSize    -             screenSize
59, 60   TFT                     screenType    screenType    screenType
63, 66   Intel <b> Pentium       -             procName      -
63, 67   Intel <b> Pentium III   procName      -             procName
67, 69   600 MHz                 procSpeed     procSpeed     procSpeed
76, 78   256 MB                  ram           ram           ram
81, 83   1 GB                    ram           HDcapacity    -
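The stacked template can be built by taking the union of the spans found by the IE systems and lining up, for each span, the field proposed by each system and the field in the hand-filled template. The sketch below reproduces the rows of the stacked template from the slide data; the dictionary representation and the function name stacked_template are assumptions made for illustration.

```python
def stacked_template(system_templates, correct, missing="-"):
    """One row per span: (span, field by E1, ..., field by EN, correct field)."""
    spans = sorted(set().union(*system_templates))
    return [(span,
             *(t.get(span, missing) for t in system_templates),
             correct.get(span, missing))
            for span in spans]

# Templates from the slides, keyed by (s, e).
T1 = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
      (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram",
      (81, 83): "ram"}
T2 = {(47, 49): "manuf", (59, 60): "screenType", (63, 66): "procName",
      (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "HDcapacity"}
# Hand-filled template T.
T = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
     (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram"}

rows = stacked_template([T1, T2], T)
for row in rows:
    print(row)
```

Each row is one meta-level training instance: the fields proposed by E1 and E2 are the features, and the correct field (or "-" when the span should be rejected) is the class.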

Meta-learning for Web IE

Training in the new stacking framework

D = set of documents, paired with hand-filled templates
MD = set of meta-level feature vectors

[Figure: training in the new stacking framework. For each fold j, the learners L1…LN are trained on D \ Dj, giving IE systems E1(j)…EN(j); applied to Dj, their output is combined into stacked templates ST1, ST2, …, which give MDj. The meta-level learner LM is trained on MD, giving CM. The learners L1…LN are also trained on the whole of D, giving E1…EN.]

Meta-learning for Web IE

Stacking at run-time

[Figure: stacking at run-time. A new document d is processed by the IE systems E1, E2, …, EN, which fill the templates T1, T2, …, TN. These are combined into a stacked template, from which CM produces the final template T of instances <t(s,e), f>.]
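At run-time the same stacking step is applied without the correct-field column: the templates of the individual systems are stacked and CM decides, span by span, which field to keep or whether to reject the span. In the sketch below CM is replaced by a hypothetical lookup table (CM_TABLE) derived from the stacked template shown earlier; a real meta-level classifier would be learned from MD. The function name final_template is also an assumption.

```python
def final_template(templates, meta_classifier, reject="-"):
    """Stack the templates T1..TN and let the meta-level classifier decide."""
    final = {}
    for span in sorted(set().union(*templates)):
        vector = tuple(t.get(span, reject) for t in templates)
        field = meta_classifier(vector)
        if field != reject:
            final[span] = field
    return final

# Output of E1 and E2 on the example document, keyed by (s, e).
T1 = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
      (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram",
      (81, 83): "ram"}
T2 = {(47, 49): "manuf", (59, 60): "screenType", (63, 66): "procName",
      (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "HDcapacity"}

# Hypothetical stand-in for CM: unseen combinations are rejected.
CM_TABLE = {("model", "manuf"): "model", ("screenSize", "-"): "screenSize",
            ("screenType", "screenType"): "screenType",
            ("procName", "-"): "procName",
            ("procSpeed", "procSpeed"): "procSpeed", ("ram", "ram"): "ram"}

def toy_cm(vector):
    return CM_TABLE.get(vector, "-")

result = final_template([T1, T2], toy_cm)
for span, field in result.items():
    print(span, field)
```

With this stand-in, the spans 63, 66 and 81, 83 are dropped, matching the "-" entries in the correct-field column of the stacked template.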


Ontology Enrichment

• Highly evolving domain (e.g. laptop descriptions)
– New instances characterize new concepts. E.g. ‘Pentium 2’ is an instance that denotes a new concept if it doesn’t exist in the ontology.
– New surface appearance of an instance. E.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.
• We concentrate on instances.
• The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain they cover.

Ontology Enrichment

[Figure: ontology enrichment cycle. Labels on the slide: Corpus; Annotating Corpus Using Domain Ontology; machine learning; Information extraction; Additional annotations; Validation (Domain Expert); Ontology Enrichment / Population; Multi-Lingual Domain Ontology.]

Enrichment with synonyms

• The number of instances for validation increases with the size of the corpus and the ontology.
• There is a need for supporting the enrichment of the ‘synonymy’ relationship.
• Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).
• Issues to be handled:
– Synonym: ‘Intel pentium 3’ - ‘Intel pIII’
– Orthographical: ‘Intel p3’ - ‘intell p3’
– Lexicographical: ‘Hewlett Packard’ - ‘HP’
– Combination: ‘Intell Pentium 3’ - ‘P III’

Compression-based Clustering

• COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements, i.e. letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.

• CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when adding a candidate string. Huffman trees are used as models of the clusters.

• COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search. The new string is added to the closest cluster or, if its CCDiff exceeds a threshold, a new cluster is created.
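The CCDiff score and the assignment step can be sketched as follows. This is an illustration under stated assumptions, not the published COCLU implementation: the code length is taken to be the number of bits needed to encode the characters of a cluster with a Huffman code built from their frequencies, the cost of storing the Huffman tree itself is ignored, and strings are processed in a single pass. The names code_length, ccdiff and coclu are hypothetical.

```python
import heapq
from collections import Counter

def code_length(strings):
    """Bits needed to encode all characters of `strings` with a Huffman
    code built from their character frequencies (tree cost ignored)."""
    freq = Counter("".join(strings))
    if len(freq) <= 1:
        return sum(freq.values())  # assumption: 1 bit per character
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b  # a merge adds one bit to every character below it
        heapq.heappush(heap, a + b)
    return total

def ccdiff(cluster, candidate):
    """Increase in the cluster's code length when the candidate is added."""
    return code_length(cluster + [candidate]) - code_length(cluster)

def coclu(strings, threshold):
    """Assign each string to the closest cluster, or open a new one."""
    clusters = []
    for s in strings:
        scored = [(ccdiff(c, s), i) for i, c in enumerate(clusters)]
        if scored and min(scored)[0] <= threshold:
            clusters[min(scored)[1]].append(s)
        else:
            clusters.append([s])
    return clusters

print(ccdiff(["aab"], "ab"), ccdiff(["aab"], "cd"))
print(coclu(["aab", "ab", "cd"], threshold=3))
```

A string that reuses the characters already frequent in a cluster lengthens the code only slightly, while a string over different characters changes the Huffman tree and yields a large CCDiff, which is what makes the score sensitive to typographic similarity.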


Conclusions

• Information integration can benefit from machine learning.
• Grammar learning methods have become efficient.
• Combining IE systems improves performance.
• Ontologies can be used to annotate examples to learn IE systems and enrich ontologies.
• Grammar learning in parallel/combination to ontology learning?

Acknowledgements

• This is research of many current and past members of SKEL.
• CROSSMARC is joint work of the project consortium.

Announcement IJCAI workshop

Workshop on Grammatical Inference Applications: Successes and Future Challenges

IJCAI-05, Edinburgh, Scotland

July 31, 2005

Paper submission deadline: March 19, 2005

URL: http://www.ics.mq.edu.au/~menno/IJCAI05/