
Page 1: Knowledge Extraction Semantic Web

Language Technology I
2005/06

Paul Buitelaar
German Research Center for Artificial Intelligence (DFKI)

Knowledge Extraction/Semantic Web

© 2006 Paul Buitelaar

Page 2: Knowledge Extraction Semantic Web

Overview

Semantic Web
• Introduction
• Semantic Web Representation and Query Languages
• Semantic Web Tools

Ontologies and Knowledge Markup
• Ontologies and other Knowledge Organization Systems
• Knowledge Markup for Ontology Population
• Ontology Life-Cycle

Knowledge Extraction
• Ontology Population
• Ontology Learning

Page 3: Knowledge Extraction Semantic Web

Semantic Web

Page 4: Knowledge Extraction Semantic Web

Web

[Diagram: Web (Docs, Data)]

Page 5: Knowledge Extraction Semantic Web

Web > Semantic Web

[Diagram: Web (Docs, Data); Knowledge Markup]

Page 6: Knowledge Extraction Semantic Web

Web > Semantic Web

[Diagram: Web (Docs, Data); Knowledge Markup; Ontologies]

Page 7: Knowledge Extraction Semantic Web

Web > Semantic Web

[Diagram: Knowledge Markup; Ontologies]

Page 8: Knowledge Extraction Semantic Web

Accessing the Semantic Web - Machines

[Diagram: Knowledge Markup; Ontologies; Semantic Web Services]

Page 9: Knowledge Extraction Semantic Web

Accessing the Semantic Web - Humans

[Diagram: Knowledge Markup; Ontologies; Semantic Web Services; Intelligent Man-Machine Interface]

Page 10: Knowledge Extraction Semantic Web

Semantic Web Layer Cake

• Introduced by Tim Berners-Lee in 2001
• Built upon existing WWW standards

Page 11: Knowledge Extraction Semantic Web

Resource Description Framework (RDF)

• RDF is an extensible language for expressing graph structures
• Serializes to XML

[Graph: node1 --name--> "DFKI GmbH"; node1 --location--> "Kaiserslautern"; node1 --www--> http://www.dfki.de]

<?xml version='1.0' ?>
<rdf:RDF
    xmlns:rdf="… rdf-syntax-ns#"
    xmlns:rdfs="… rdf-schema#"
    xmlns="http://example.org">
  <rdf:Description rdf:nodeID="node1">
    <name>DFKI GmbH</name>
    <location>Kaiserslautern</location>
    <www rdf:resource="http://www.dfki.de" />
  </rdf:Description>
</rdf:RDF>
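To make the graph reading of the RDF/XML concrete, here is a minimal sketch that parses a copy of the slide's example with Python's standard library and lists its triples. It is not a full RDF parser. Two things are assumptions: the elided `rdf` namespace is taken to be the standard W3C one, and the default namespace is written with a trailing slash (`http://example.org/`) so that property names expand cleanly.

```python
import xml.etree.ElementTree as ET

# Assumption: the elided "… rdf-syntax-ns#" prefix is the standard W3C RDF namespace.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: trailing slash added to the slide's default namespace.
EX_NS = "http://example.org/"

RDF_XML = f"""<?xml version='1.0'?>
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns="{EX_NS}">
  <rdf:Description rdf:nodeID="node1">
    <name>DFKI GmbH</name>
    <location>Kaiserslautern</location>
    <www rdf:resource="http://www.dfki.de"/>
  </rdf:Description>
</rdf:RDF>"""


def triples(xml_text):
    """Return (subject, predicate, object) tuples for each rdf:Description."""
    root = ET.fromstring(xml_text)
    result = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        # Blank nodes are written with the conventional "_:" prefix.
        subject = "_:" + desc.get(f"{{{RDF_NS}}}nodeID")
        for prop in desc:
            predicate = prop.tag.split("}", 1)[1]  # strip the namespace part
            resource = prop.get(f"{{{RDF_NS}}}resource")
            # A property either points at a resource or carries a literal.
            obj = resource if resource is not None else (prop.text or "").strip()
            result.append((subject, predicate, obj))
    return result


for triple in triples(RDF_XML):
    print(triple)
```

Each child element of `rdf:Description` becomes one edge of the graph, which is the correspondence the slide's picture illustrates.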

Page 12: Knowledge Extraction Semantic Web

RDF Schema (RDFS)

• Adds a vocabulary for representing classes and properties to RDF

[Diagram: Teacher is-a Person; Student is-a Person; Person name rdf:Literal; Teacher teaches Course; Student enrolledIn Course]
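As a rough illustration of what the class vocabulary buys, the sketch below propagates instance types up an is-a hierarchy. It assumes the two is-a edges in the diagram link Teacher and Student to Person, assumes single inheritance for brevity, and uses a hypothetical instance name (`alice`) that is not part of the slide.

```python
# Schema edges read off the slide's diagram (assumed: Teacher and Student are
# both subclasses of Person). Single inheritance is assumed for simplicity.
SUBCLASS_OF = {"Teacher": "Person", "Student": "Person"}


def superclasses(cls, subclass_of=SUBCLASS_OF):
    """Walk the is-a chain upwards and return all ancestors of cls."""
    chain = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        chain.append(cls)
    return chain


def infer_types(instance_types, subclass_of=SUBCLASS_OF):
    """Map each instance to its asserted class plus all inherited classes."""
    return {
        instance: [cls] + superclasses(cls, subclass_of)
        for instance, cls in instance_types.items()
    }


# Hypothetical instance data, for illustration only.
print(infer_types({"alice": "Teacher"}))
```

A query for all instances of Person would then also return `alice`, although only the Teacher type was asserted.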

Page 13: Knowledge Extraction Semantic Web

Web Ontology Language (OWL)

• OWL is based on Description Logics
• Adds further modelling vocabulary on top of RDFS

[Diagram: language stack, ranging from syntax to semantics]
XML: XML Schema, Namespaces, Data Types
RDF: Interpretation Context
RDF Schema: Formalization: Classes (Inheritance), Properties
OWL: Formalization: Classes, Class Definitions, Properties, Property Types (e.g. Transitivity)
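Transitivity is the slide's example of a property type. A minimal sketch of what declaring a property transitive entails is a closure computation over its asserted pairs; the `locatedIn` facts below are hypothetical illustration data, and the naive fixed-point loop is chosen for readability rather than efficiency.

```python
def transitive_closure(pairs):
    """Return all pairs entailed by treating the relation as transitive."""
    closure = set(pairs)
    changed = True
    while changed:  # repeat until no new pair can be derived
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical assertions for a transitive property "locatedIn".
located_in = {
    ("Kaiserslautern", "Rhineland-Palatinate"),
    ("Rhineland-Palatinate", "Germany"),
}
print(sorted(transitive_closure(located_in)))
```

The derived pair (Kaiserslautern, Germany) is never stated explicitly; a reasoner obtains it from the property type alone.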

Page 14: Knowledge Extraction Semantic Web

Semantic Web Query Languages - SPARQL

• SPARQL - query language developed by W3C
• Syntactically based on SQL:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?foafName
WHERE {
  ?x foaf:name ?foafName .
  OPTIONAL { ?x foaf:mbox ?mbox } .
}

• Results available as XML Documents
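The effect of OPTIONAL in the query can be sketched in plain Python over a small in-memory triple list. This is only an illustration of the matching semantics, not a SPARQL engine, and the sample triples (Alice, Bob) are hypothetical.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical data: Alice has a mailbox, Bob does not.
TRIPLES = [
    ("_:a", FOAF + "name", "Alice"),
    ("_:a", FOAF + "mbox", "mailto:alice@example.org"),
    ("_:b", FOAF + "name", "Bob"),
]


def select_names(triples):
    """Mimic: SELECT ?foafName WHERE { ?x foaf:name ?foafName .
    OPTIONAL { ?x foaf:mbox ?mbox } }"""
    rows = []
    for subject, predicate, obj in triples:
        if predicate != FOAF + "name":
            continue
        mboxes = [o for s, p, o in triples if s == subject and p == FOAF + "mbox"]
        # OPTIONAL behaves like a left join: keep the row even without a mailbox.
        for mbox in mboxes or [None]:
            rows.append({"foafName": obj, "mbox": mbox})
    # Only ?foafName is projected by the SELECT clause.
    return [row["foafName"] for row in rows]


print(select_names(TRIPLES))
```

Without OPTIONAL, the mailbox pattern would be mandatory and Bob would be dropped from the result.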

Page 15: Knowledge Extraction Semantic Web

Semantic Web Tools

Programming APIs
• Jena - Java
• Redland - Python, …
• RAP - PHP

Editors
• Protégé
• OntoStudio
• Triple20 - Prolog

Storage
• Sesame
• OntoBroker

Page 16: Knowledge Extraction Semantic Web

Ontologies and Knowledge Markup

Page 17: Knowledge Extraction Semantic Web

Ontologies in Philosophy

• Ontology is a branch of philosophy that deals with the nature and the organization of reality

• Science of Being (Aristotle, Metaphysics)
  – What characterizes being?
  – Ultimately, what is being?

Page 18: Knowledge Extraction Semantic Web

Ontologies in Computer Science

Ontology refers to an engineering artifact:
• a specific vocabulary used to describe a certain reality
• a set of explicit assumptions regarding the intended meaning of the vocabulary

An Ontology is
• an explicit specification of a conceptualization [Gruber 93]
• a shared understanding of a domain of interest [Uschold/Gruninger 96]

Page 19: Knowledge Extraction Semantic Web

Why Develop an Ontology?

• Make domain assumptions explicit
  – Easier to change domain assumptions
  – Easier to understand and update legacy data

• Separate domain knowledge from operational knowledge
  – Re-use domain and operational knowledge separately

• A community reference for applications

• Shared understanding of what information means

Page 20: Knowledge Extraction Semantic Web

Types of Ontologies [Guarino, 98]

• Top-level ontologies describe very general concepts like space, time, event, which are independent of a particular problem or domain. It seems reasonable to have unified top-level ontologies for large communities of users.

• Domain ontologies describe the vocabulary related to a generic domain by specializing the concepts introduced in the top-level ontology.

• Task ontologies describe the vocabulary related to a generic task or activity by specializing the top-level ontologies.

• Application ontologies are the most specific ontologies. Concepts in application ontologies often correspond to roles played by domain entities while performing a certain activity.

Page 21: Knowledge Extraction Semantic Web

Ontologies and Their Relatives

[Diagram: spectrum of increasingly formal knowledge organization]
• Catalog / ID
• Terms / Glossary
• Thesauri
• Informal Is-a
• Formal Is-a
• Formal Instance
• Frames
• Value Restrictions
• General logical constraints
• Axioms: Disjoint, Inverse Relations, ...

Page 22: Knowledge Extraction Semantic Web

Knowledge Organization Systems

• Semantic Lexicons (e.g. WordNet) group together words according to lexical semantic relations like synonymy, hyponymy, meronymy, antonymy, etc.

• Thesauri group together domain terms according to a set of taxonomic relations, including broader term, narrower term, sibling, etc.

• Semantic Networks and Ontologies group together classes of objects according to a set of relations that originate in the nature of the domain of application.

Ontologies are defined by a formal semantics, but semantic networks may be informally defined. Therefore all ontologies are semantic networks, but not all semantic networks are ontologies.

Page 23: Knowledge Extraction Semantic Web

Thesauri - Examples

MeSH (Medical Subject Headings) is organized by terms (currently over 250,000) that correspond to a specific medical subject. For each such term a list of syntactic, morphological or semantic variants is given.

MeSH Heading   Databases, Genetic
Entry Term     Genetic Databases
Entry Term     Genetic Sequence Databases
Entry Term     OMIM
Entry Term     Online Mendelian Inheritance in Man
Entry Term     Genetic Data Banks
Entry Term     Genetic Data Bases
Entry Term     Genetic Databanks
Entry Term     Genetic Information Databases
See Also       Genetic Screening

EuroVoc covers terminology in all of the official EU languages for all fields that concern the EU institutions, e.g., politics, trade, law, science, energy, agriculture; there are 27 such fields in total.

MT    3606 natural and applied sciences
UF    gene pool
      genetic resource
      genetic stock
      genotype
      heredity
BT1   biology
BT2   life sciences
NT1   DNA
NT1   eugenics
RT    genetic engineering (6411)

Page 24: Knowledge Extraction Semantic Web

Semantic Networks - Examples

UMLS (Unified Medical Language System) integrates linguistic, terminological and semantic information. The Semantic Network consists of 134 semantic types and 54 relations between types.

Pharmacologic Substance affects Pathologic Function
Pharmacologic Substance causes Pathologic Function
Pharmacologic Substance complicates Pathologic Function
Pharmacologic Substance diagnoses Pathologic Function
Pharmacologic Substance prevents Pathologic Function
Pharmacologic Substance treats Pathologic Function

GO (Gene Ontology) allows for "consistent descriptions of gene products in different databases, including several of the world's major repositories for plant, animal and microbial genomes…" Organizing principles are molecular function, biological process and cellular component.

Accession: GO:0009292
Ontology: biological process
Synonyms: broad: genetic exchange
Definition: In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual.
Term Lineage:
  all : all (164142)
    GO:0008150 : biological process (115947)
      GO:0007275 : development (11892)
        GO:0009292 : genetic transfer (69)

Page 25: Knowledge Extraction Semantic Web

Example Ontology

Consider an Example Ontology for the Newspaper Domain

Page 26: Knowledge Extraction Semantic Web

Knowledge Markup

• Ontologies are used to semantically organize and retrieve data (structured, textual, multimedia) through knowledge markup

• Knowledge Markup from Text is based on Named-Entity Recognition, Semantic Tagging (Term to Class Mapping) and Relation Extraction

Consider the following example:

<news:story xmlns:jobs="http://www.jobs.org/owl-jobs#" xmlns:com="http://www.companies.org/owl-companies#" xmlns:it="http://www.it.net/owl-it#">

“We were surprised by several of the results, particularly the order of finish,” said <jobs:SystemsAnalyst>Dan Olds</jobs:SystemsAnalyst>. <com:Company>IBM</com:Company> finished first with very strong results, and <com:Company>HP</com:Company> scored a solid number two; we expected to see <com:Company>Sun Microsystems</com:Company> challenging for first place or at least a strong second place. As the largest <it:operatingsystem>UNIX</it:operatingsystem> vendor in terms of number of installed systems, a third place finish should put their management on notice that their installed base may be vulnerable.

</news:story>
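Once text carries this kind of markup, retrieving instances by ontology class becomes simple. The sketch below pulls (prefix, class, surface text) annotations out of a shortened copy of the example with a regular expression. The pattern only handles flat, non-nested tags and is an illustrative shortcut, not a namespace-aware XML parser.

```python
import re

# Shortened copy of the annotated news text from the slide.
MARKUP = (
    '"We were surprised by several of the results," said '
    "<jobs:SystemsAnalyst>Dan Olds</jobs:SystemsAnalyst>. "
    "<com:Company>IBM</com:Company> finished first, and "
    "<com:Company>HP</com:Company> scored a solid number two."
)

# Matches <prefix:Class>text</prefix:Class>; nested tags are not supported.
TAG = re.compile(r"<(\w+):(\w+)>(.*?)</\1:\2>")


def extract_annotations(text):
    """Return (prefix, class, surface form) for every annotated span."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TAG.finditer(text)]


def group_by_class(annotations):
    """Index surface forms by their qualified ontology class."""
    index = {}
    for prefix, cls, surface in annotations:
        index.setdefault(f"{prefix}:{cls}", []).append(surface)
    return index


print(group_by_class(extract_annotations(MARKUP)))
```

A lookup such as `index["com:Company"]` then answers "which companies does this story mention?" without any further text analysis.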

Page 27: Knowledge Extraction Semantic Web

Knowledge Markup - Images

Semantic Annotation of Medical Images

(miAKT Project - UK)

Page 28: Knowledge Extraction Semantic Web

Knowledge Markup - Images

Semantic Annotation of Video

(SmartMedia – DFKI KM)

Page 29: Knowledge Extraction Semantic Web

Ontology Life-Cycle

[Diagram: cycle of phases]
• Create/Select: Development and/or Selection
• Populate: Knowledge Base Generation
• Validate: Consistency Checks
• Evolve: Extension, Modification
• Maintain: Usability Tests
• Deploy: Knowledge Retrieval

Page 30: Knowledge Extraction Semantic Web

Knowledge Extraction

Ontology Population & Ontology Learning

Page 31: Knowledge Extraction Semantic Web

Ontology Life-Cycle – Ontology Population

[Diagram: cycle of phases]
• Create/Select: Development and/or Selection
• Populate: Knowledge Base Generation
• Validate: Consistency Checks
• Evolve: Extension, Modification
• Maintain: Usability Tests
• Deploy: Knowledge Retrieval

Page 32: Knowledge Extraction Semantic Web

Ontology Population with SOBA

SOBA: SmartWeb Ontology-based Annotation

Application Context
• SmartWeb (http://www.smartweb-projekt.de/) – German Project around World Cup 2006
• Integrates
  – Multimodal Dialog Processing
  – IR-based Question Answering
  – Ontology-Based Information Extraction
  – Semantic Web Services

Ontology-Based Information Extraction …
• Combines:
  – Semantic Wrapping of Semi-Structured Data
  – Semantic and Linguistic Annotation of Free Text
  – Inference Rules for Instantiation and Integration of Annotated Entities and Events

… and Display
• Ontology-driven Hyperlink Generation for Display of Extracted Information

Page 33: Knowledge Extraction Semantic Web

SOBA – Processing and Data Flow

[Diagram components: Documents; Ontologies; Wrapping of Semi-Structured Data; PDF Analysis; Image Extraction; Linguistic Annotation; Named Entity Recognition & Semantic Tagging; Inference Rules for Instantiation & Integration; Knowledge Base]

Page 34: Knowledge Extraction Semantic Web

SWIntO: SmartWeb Integrated Ontology

SmartDOLCE:Entity

SmartSUMO:Attribute

SmartSUMO:SocialRole

SmartSUMO:Proposition

SportEvent:FootballPlayer

SportEvent:Goalkeeper

SportEvent:FootballOrganizationPerson

SportEvent:FootballClubPresident

… …

SWIntO (by AIFB, DFKI KM/IUI, EML) covers
• Foundational (DOLCE) and General (SUMO) Knowledge
• Domain- and Task-Specific Knowledge
  – Football / Sport Events
  – Navigation, Discourse, Multimedia
  – other

Page 35: Knowledge Extraction Semantic Web

SmartWeb Integrated Ontology (by AIFB, DFKI KM/IUI, EML)

Page 36: Knowledge Extraction Semantic Web

Page 37: Knowledge Extraction Semantic Web

SmartWeb Corpus

(Growing) Web Corpus through Monitor on
• http://fifaworldcup.yahoo.com/
• http://www.uefa.com/competitions/worldcup

Semi-Structured Data
• Tabular: Match Reports, Teams, etc.

Free Text
• Match Reports
• Image Captions

Page 38: Knowledge Extraction Semantic Web

Semi-Structured Data - HTML

Page 39: Knowledge Extraction Semantic Web

Semi-Structured Data - XML

Page 40: Knowledge Extraction Semantic Web

Semi-Structured Data – F-Logic

Page 41: Knowledge Extraction Semantic Web

Information Extraction from Free Text

[Screenshot annotations: MatchEvent [Score, Team1, Team2]; FootballPlayer]

Page 42: Knowledge Extraction Semantic Web

Information Extraction from Image Captions

[Screenshot annotations: FoulEvent [FootballPlayer]; FootballPlayer]

Page 43: Knowledge Extraction Semantic Web

Linguistic and Semantic Annotation

Mark Crossley saved twice with his legs from Huckerby.

Named Entity Recognition & Semantic Tagging

[Mark Crossley GOALKEEPER] [saved GOALKEEPER_ACTION] twice with his legs from [Huckerby PLAYER].

Linguistic Annotation

[Mark Crossley GOALKEEPER : SUBJ] [saved PRED : GOALKEEPER_ACTION] twice [with his legs PP_OBJ] [from [Huckerby PLAYER] PP_ADJUNCT].

[GOALKEEPER_ACTION = 'save', GOALKEEPER = 'Mark Crossley', PLAYER = 'Huckerby', MANNER = 'legs']
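The step from tagged sentence to filled template can be sketched with a small gazetteer lookup. The gazetteer, the lemma table and the `with his ...` pattern for MANNER are hypothetical stand-ins for SOBA's actual lexicon and grammar, which the slides do not show.

```python
import re

# Hypothetical gazetteer: surface form -> semantic class.
GAZETTEER = {
    "Mark Crossley": "GOALKEEPER",
    "Huckerby": "PLAYER",
    "saved": "GOALKEEPER_ACTION",
}
# Hypothetical lemma table used to normalize inflected verbs.
LEMMAS = {"saved": "save"}


def tag(sentence):
    """Return (surface form, class) pairs in order of appearance."""
    # Longest entries first so multi-word names win over their parts.
    entries = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(entry) for entry in entries))
    return [(m.group(0), GAZETTEER[m.group(0)]) for m in pattern.finditer(sentence)]


def to_template(sentence):
    """Fill a flat template keyed by semantic class."""
    template = {cls: LEMMAS.get(surface, surface) for surface, cls in tag(sentence)}
    # Crude stand-in for the PP_OBJ analysis: "with his legs" -> MANNER = legs.
    manner = re.search(r"with (?:his|her|their) (\w+)", sentence)
    if manner:
        template["MANNER"] = manner.group(1)
    return template


print(to_template("Mark Crossley saved twice with his legs from Huckerby."))
```

In the real pipeline the grammatical roles (SUBJ, PRED, PP_OBJ) come from linguistic annotation rather than from a fixed pattern, which is what lets the same template be filled from differently worded sentences.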

Page 44: Knowledge Extraction Semantic Web

Annotation/Extraction Example

Example Sentence from Match Report

Allerdings ist Petrow fuer die Partie gegen Schweden gesperrt und kann erst gegen Ungarn eingesetzt werden.

“However, Petrow is suspended for the match against Sweden and cannot be fielded until the match against Hungary.”

Annotated/Extracted Information (with SProUT IE Tool - DFKI-LT)

player_action & [GAME_EVENT "Ban",
                 AGENT player & [SURNAME "PETROW"],
                 IN_MATCH game & [TEAM2 "SWE", TOURNAMENT "Match"]]

team & [NAME "HUN"]

Page 45: Knowledge Extraction Semantic Web

Knowledge Base Generation

Transformation of SProUT Output to F-Logic via Declarative Mappings, e.g.:

<type orig="player" target="dolce#natural-person-denomination">
  <link type="dolce#natural-person" method="dolce#HAS-DENOMINATION" id=""/>
  <map>
    <simple-mapping>
      <input>
        <arg orig="GIVEN_NAME" target="VAR1"/>
      </input>
      <output method="dolce#FIRSTNAME" value="VAR1"/>
    </simple-mapping>
    <simple-mapping>
      <input>
        <arg orig="SURNAME" target="VAR1"/>
      </input>
      <output method="dolce#LASTNAME" value="VAR1"/>
    </simple-mapping>
  </map>
</type>
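How such a declarative mapping might be applied can be sketched as follows: read the `simple-mapping` rules, then rewrite a flat feature dictionary into an F-Logic-style statement. The function names, the flat input format and the unquoted output syntax are assumptions made for illustration. SOBA's actual transformation code is not shown in the slides, and the `link` element is ignored here.

```python
import xml.etree.ElementTree as ET

# Copy of the mapping from the slide (closing quote on "target" restored).
MAPPING_XML = """
<type orig="player" target="dolce#natural-person-denomination">
  <link type="dolce#natural-person" method="dolce#HAS-DENOMINATION" id=""/>
  <map>
    <simple-mapping>
      <input><arg orig="GIVEN_NAME" target="VAR1"/></input>
      <output method="dolce#FIRSTNAME" value="VAR1"/>
    </simple-mapping>
    <simple-mapping>
      <input><arg orig="SURNAME" target="VAR1"/></input>
      <output method="dolce#LASTNAME" value="VAR1"/>
    </simple-mapping>
  </map>
</type>
"""


def load_mapping(xml_text):
    """Return (source type, target class, [(feature, method), ...])."""
    root = ET.fromstring(xml_text)
    rules = []
    for simple in root.iter("simple-mapping"):
        feature = simple.find("input/arg").get("orig")
        method = simple.find("output").get("method")
        rules.append((feature, method))
    return root.get("orig"), root.get("target"), rules


def to_flogic(instance_id, features, mapping_xml=MAPPING_XML):
    """Render one instance as an F-Logic-style statement (hypothetical helper)."""
    _, target, rules = load_mapping(mapping_xml)
    slots = [
        f'{method} -> "{features[feature]}"'
        for feature, method in rules
        if feature in features  # skip rules whose input feature is absent
    ]
    return f'{instance_id}:{target}[{"; ".join(slots)}].'


print(to_flogic("soba#Guido_BUCHWALD_Denomination",
                {"SURNAME": "Buchwald", "GIVEN_NAME": "Guido"}))
```

Keeping the rules in XML rather than in code means a new extraction type can be mapped to the ontology without touching the transformation logic.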

Page 46: Knowledge Extraction Semantic Web

SProUT to F-Logic

SProUT

<FS type="player_action">
  <F name="GAME_EVENT"><FS type="world champion"/></F>
  <F name="ACTION_TIME"><FS type="1990"/></F>
  <F name="ACTION_LOCATION"><FS type="Italy"/></F>
  <F name="AGENT">
    <FS type="player">
      <F name="SURNAME"><FS type="Buchwald"/></F>
      <F name="GIVEN_NAME"><FS type="Guido"/></F>
    </FS>
  </F>
</FS>

F-Logic

soba#player124:sportevent#FootballPlayer
  [sportevent#impersonatedBy -> soba#Guido_BUCHWALD].

soba#Guido_BUCHWALD:dolce#"natural-person"
  [dolce#"HAS-DENOMINATION" -> soba#Guido_BUCHWALD_Denomination].

soba#Guido_BUCHWALD_Denomination:dolce#"natural-person-denomination"
  [dolce#LASTNAME -> "Buchwald"; dolce#FIRSTNAME -> "Guido"].

Page 47: Knowledge Extraction Semantic Web

A Complex Example

semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00_Luis_CRISTALDO":sportevent#FieldMatchFootballPlayer [
  externalRepresentation@(de) ->> "Luis CRISTALDO (7)";
  sportevent#number -> 7;
  sportevent#impersonatedBy -> semistruct#"Luis_CRISTALDO"
].

semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00" [
  sportevent#matchEvents -> soba#ID25
].

soba#ID25:sportevent#Foul [
  sportevent#commitedBy -> semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00_Luis_CRISTALDO"
].

mediainst#ID67:media#Picture [
  media#URL -> "http://fifaworldcup.yahoo.com/06/de/photos/index.html?aid=124155&d=1";
  media#shows -> ID25
].

Page 48: Knowledge Extraction Semantic Web

Display of Extracted Information

Page 49: Knowledge Extraction Semantic Web

Ontology Life-Cycle – Ontology Learning

[Diagram: cycle of phases]
• Create/Select: Development and/or Selection
• Populate: Knowledge Base Generation
• Validate: Consistency Checks
• Evolve: Extension, Modification
• Maintain: Usability Tests
• Deploy: Knowledge Retrieval

Page 50: Knowledge Extraction Semantic Web

Ontology Learning Layer Cake

[Diagram: layers from bottom to top, each with an example]
• Terms: disease, doctor, hospital
• (Multilingual) Synonyms: {disease, illness, Krankheit}
• Concepts: DISEASE:=<Int, Ext, Lex>
• Taxonomy: is_a(DOCTOR, PERSON)
• Relations: cure(dom:DOCTOR, range:DISEASE)
• Rules & Axioms: $\forall x, y\,(\mathit{sufferFrom}(x, y) \rightarrow \mathit{ill}(x))$

Introduced in: Philipp Cimiano, PhD Thesis, University of Karlsruhe, forthcoming

Page 51: Knowledge Extraction Semantic Web

Some Current Work on Ontology Learning from Text

Term Extraction
• Statistical Analysis
• Patterns
• (Shallow) Linguistic Parsing
• Term Disambiguation & Compositional Interpretation
• Combinations

Taxonomy Extraction
• Statistical Analysis & Clustering (e.g. FCA)
• Patterns
• (Shallow) Linguistic Parsing
• WordNet
• Combinations

Relation Extraction
• Anonymous Relations (e.g. with Association Rules)
• Named Relations (Linguistic Parsing)
• (Linguistic) Compound Analysis
• Web Mining, Social Network Analysis
• Combinations

Definition Extraction
• (Linguistic) Compound Analysis (incl. WordNet)

Overview of Current Work: Paul Buitelaar, Philipp Cimiano, Bernardo Magnini. Ontology Learning from Text: Methods, Evaluation and Applications. Frontiers in Artificial Intelligence and Applications Series, Vol. 123, IOS Press, July 2005.

Page 52: Knowledge Extraction Semantic Web

RelExt - Relation Extraction for Ontology Learning

[Diagram: Ontology Learning Layer Cake with the layers Terms, (Multilingual) Synonyms, Concepts, Taxonomy, Relations, Rules & Axioms]

Page 53: Knowledge Extraction Semantic Web

RelExt - Motivation

Extend Ontology with Relations
• Currently ~ 60 Relations in the Sport Events Ontology
  – Mostly Properties, e.g. hasName, atMinute, …
• Representation of (Verbal) Relations Enables Better Modeling of Events for Information Extraction Purposes

Example

“Ballack shoots the ball in the net.”

Relation: Shoot (Domain: FootballPlayer, Range: BallObject)

Page 54: Knowledge Extraction Semantic Web

RelExt – System Architecture

[Diagram: three processing stages]
• Linguistic Annotation: Corpus; Named-Entity Rec. & Semantic Tagging; Shallow Parsing; Annotated Corpus
• Statistical Processing: Relevance Measure (Frequencies in BNC, NZZ); Relevance Scores (Heads, Preds); Co-occurrence Measure; Co-occurrence Scores (Heads <> Preds)
• Relation Extraction and Evaluation: Triple Generation; Triples (Head : Pred : Head); Evaluation
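The statistical stage can be sketched under explicit assumptions. The slides do not name the relevance or co-occurrence measure, so the code below substitutes a smoothed relative-frequency ratio against a reference corpus and a plain count threshold for triples. The function names and toy counts are hypothetical.

```python
from collections import Counter


def relevance(domain_counts, reference_counts, smoothing=1.0):
    """Score each domain word by how much more frequent it is in the domain
    corpus than in a reference corpus (stand-in measure, add-one smoothed)."""
    domain_total = sum(domain_counts.values())
    reference_total = sum(reference_counts.values())
    vocabulary = len(set(domain_counts) | set(reference_counts))
    scores = {}
    for word, count in domain_counts.items():
        p_domain = count / domain_total
        p_reference = (reference_counts.get(word, 0) + smoothing) / (
            reference_total + smoothing * vocabulary
        )
        scores[word] = p_domain / p_reference
    return scores


def generate_triples(observations, min_count=2):
    """Keep Head : Pred : Head triples seen at least min_count times,
    most frequent first."""
    counts = Counter(observations)
    return [triple for triple, count in counts.most_common() if count >= min_count]


# Toy counts: "Ball" and "Tor" are domain-specific, "Jahr" is general vocabulary.
domain = Counter({"Ball": 50, "Tor": 40, "Jahr": 10})
reference = Counter({"Ball": 5, "Tor": 5, "Jahr": 90})
print(sorted(relevance(domain, reference).items(), key=lambda kv: -kv[1]))

observed = (
    [("FOOTBALLPLAYER", "schiessen", "BALLOBJECT")] * 3
    + [("FOOTBALLPLAYER", "flanken", "BALLOBJECT")] * 2
    + [("FOOTBALLPLAYER", "beschimpfen", "FOOTBALLPLAYER")]
)
print(generate_triples(observed))
```

Comparing against a general-language reference corpus is what pushes domain terms to the top of the ranking, while the count threshold filters out one-off combinations before triples are proposed to the ontology engineer.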

Page 55: Knowledge Extraction Semantic Web

Linguistic Annotation

Named-Entity Recognition
• “Michael Ballack” : FootballPlayer

Semantic Tagging
• “Ball” (ball), “Leder” (leather) : BallObject

Shallow Parsing
• Part-of-Speech Tagging
  – Fussballspieler (soccer player): Noun
• Morphological Analysis
  – Fussballspieler: Fussball – Spieler
• Dependency Structure Analysis
  – “The team won the second match.” SUBJECT PREDICATE DIRECT_OBJECT

Page 56: Knowledge Extraction Semantic Web

Relevance Ranking

Top-10 Head-Nouns before and after mapping to Ontology Classes

Before mapping:

Rank  Score      Headnoun                  Frequency
1     125245.24  Ball (ball)               6849
2     121888.52  Tor (goal)                7767
3     95003.21   Meter (meters)            5967
4     64157.18   Schuss (shot / drive)     3575
5     57185.76   Ecke (corner)             3132
6     45474.96   Strafraum (penalty area)  2298
7     34668.11   Freistoss (freekick)      1752
8     30017.75   Leder (leather / ball)    1561
9     27989.09   Flanke (cross)            1479
10    27414.66   Pfosten (post)            1457

After mapping:

Rank  Score      Concept Label   Frequency
1     565510.99  FOOTBALLPLAYER  28494
2     162137.82  GOALOBJECT      8188
3     143528.88  BALLOBJECT      7249
4     138535.44  GOALKEEPER      6887
5     70814.86   SHOT            3578
6     49018.16   TEAM            2477
7     45474.96   PENALTYAREA     2298
8     34668.11   FREEKICK        1752
9     29324.54   WING            1482
10    28829.78   POST            1457

Top-10 Predicates

Rank  Score     Predicate                      Frequency
1     27167.41  flanken (to cross)             1373
2     22045.39  klaeren (to clear)             1435
3     21908.37  schiessen (to shoot)           1503
4     20439.09  koepfen (to head)              1033
5     16342.99  lassen (to let / to leave)     826
6     9563.41   ziehen (to pull / to drag)     1548
7     9468.57   passen (to pass / to play)     814
8     7752.84   spielen (to play / to pass)    1559
9     7653.68   lenken (to divert)             537
10    7637.45   parieren (to parry / to save)  405

Page 57: Knowledge Extraction Semantic Web

Co-Occurrence Analysis

Rank  Score      Concept Label   Frequency
1     565510.99  FOOTBALLPLAYER  28494
2     162137.82  GOALOBJECT      8188
3     143528.88  BALLOBJECT      7249
4     138535.44  GOALKEEPER      6887
...

Rank  Score     Predicate             Frequency
1     27167.41  flanken (to cross)    1373
2     22045.39  klaeren (to clear)    1435
3     21908.37  schiessen (to shoot)  1503
4     20439.09  koepfen (to head)     1033
...

flanken SUBJ:FOOTBALLPLAYER “Klasnic”
flanken DOBJ:FOOTBALLPLAYER “Klose”
flanken_in PP_ADJ “Zuschauer” (audience)
...
beschimpfen (to insult) SUBJ:FOOTBALLPLAYER “Klasnic”
...

Page 58: Knowledge Extraction Semantic Web

Integration into Ontology Development

Page 59: Knowledge Extraction Semantic Web

OntoLT – Protégé Plug-In for Ontology Extraction from Text

[Diagram: Ontology Learning Layer Cake with the layers Terms, (Multilingual) Synonyms, Concepts, Taxonomy, Relations, Rules & Axioms]

Page 60: Knowledge Extraction Semantic Web

OntoLT – Basic Idea

Middleware Solution in Ontology Development
• Supports the Ontology Engineer through Semi-Automatic Extraction of Ontology Fragments from Domain-Relevant Document Collections
• Download: http://olp.dfki.de/OntoLT/OntoLT.htm

Based on
• Automatic Linguistic Annotation
• Manual Definition of Mapping Rules
• Statistical Preprocessing (Option)
• Interactive Validation of Candidates
• Generation in Protégé of Ontology Fragments

Page 61: Knowledge Extraction Semantic Web

OntoLT – System Architecture

[Diagram components: Corpus; Linguistic Annotation; Annotated Corpus (XML); Definition of Mappings; Mappings: XML (Linguistic Structure) <=> Protégé (Classes, Slots); OntoLT Extraction; Extracted Ontology; Edit Extracted Ontology in Protégé]

Page 62: Knowledge Extraction Semantic Web

Corpus Example – KMI News

Page 63: Knowledge Extraction Semantic Web

Mapping Rules

Page 64: Knowledge Extraction Semantic Web

Statistical Relevance

Page 65: Knowledge Extraction Semantic Web

Extract Candidates

Page 66: Knowledge Extraction Semantic Web

Generate Ontology Fragments

Page 67: Knowledge Extraction Semantic Web

Exercises

Knowledge Extraction
• Ontology Modeling (from Text)
• Ontology Population
• Ontology Learning (Extension)
• Ontology Mapping