Expressing Lexical Complexity in SKOS(XL) Thomas Bandholtz 5th ECOTERM MEETING at FAO, Rome, Italy 05-06 October 2009 innoQ Deutschland GmbH D-40880 Ratingen www.innoq.com [email protected]

Page 1: Expressing Lexical Complexity  in SKOS(XL)

Expressing Lexical Complexity in SKOS(XL)

Thomas Bandholtz

5th ECOTERM MEETING at FAO, Rome, Italy

05-06 October 2009

innoQ Deutschland GmbH

D-40880 Ratingen

www.innoq.com [email protected]

Page 2: Expressing Lexical Complexity  in SKOS(XL)

Content

Expressing Lexical Complexity in SKOS(XL)

Motivation

Thesaurus Models with regard to lexical complexity

UMTHES extensions of SKOSXL

Examples using RDF Turtle syntax

5/6 October 2009 2Ecoterm 2009: Lexical Complexity SKOS(XL)

Page 3: Expressing Lexical Complexity  in SKOS(XL)

Motivation

What is “lexical complexity”?

Why should we care?

The case: UMTHES in SKOS

Umweltbundesamt (DE) & innoQ develop iQvoc

Page 4: Expressing Lexical Complexity  in SKOS(XL)

What is “lexical complexity”?

Each Concept may be represented by multiple terms

Preferred / non-preferred term, multilingualism, etc.

Each term may have many lexical representations

inflection

abbreviation

“legal” variants in orthography

historical versions of “legal” orthography (in German: 1880 - 2006)

common misspellings

regional variants in the same language

Each term may be a compound term

a compound term may contain term delimiters (spaces or hyphens)

the components may appear dispersed within a sentence

the components may designate different concepts by themselves.

Page 5: Expressing Lexical Complexity  in SKOS(XL)

(a side note about orthography)

“Before compulsory education was established, it was something to be able to write.”

tb: just like Cervantes, Dante, Goethe, Shakespeare, Whitman, etc.

“Since then, you have to be a proper speller.”

(Peter Bichsel, Der Leser. Das Erzählen. Frankfurter Poetik-Vorlesungen. 1982)

Page 6: Expressing Lexical Complexity  in SKOS(XL)

Why should we care?

Traditional (nice-to-have):

Alphabetic lists of subject indices show some lexical variants.

Contemporary (prerequisite):

automatic (machine-made) detection of Concepts covered by a natural language document (“Named Entity Recognition”)

must capture each covered Concept as concisely as possible

considering all possible lexical appearances, including term composition

Language-dependent:

English is comparatively simple in this regard.

German is awful!

(add your language here)

Page 7: Expressing Lexical Complexity  in SKOS(XL)

The case: UMTHES in SKOS

The German Environmental Thesaurus UMTHES

~ 12,000 preferred + 25,000 non-preferred terms + 11,000 'multiple-composition' (spelling) forms

needs to be serialized in SKOS for migration into the iQvoc vocabulary management tool

includes sophisticated knowledge about lexical complexity

we don't want to lose this when moving to SKOS(XL)

Page 8: Expressing Lexical Complexity  in SKOS(XL)

UBA(de) & innoQ develop …

iQvoc - Open Source Vocabulary Management Tool

Totally Web-based, supports distributed editorial teams

Safe and comfortable, schema driven editing features

Simple but powerful workflow implementation

Conformance

W3C “Cool URI” design and deployment

W3C SKOS Recommendation

Availability

GNU General Public License (GPL)

iQvoc version 1 demo (GEMET) at: http://apps.innoq.com/iqvoc/about.html

iQvoc 2 availability planned for Q1 2010

Page 9: Expressing Lexical Complexity  in SKOS(XL)

Thesaurus models with regard to lexical complexity

– Traditional - ISO 2788:1986

– ISO Model revised (Draft 2008-11-18)

– SKOS W3C Recommendation 2009-08-18

Page 10: Expressing Lexical Complexity  in SKOS(XL)

Traditional - ISO 2788:1986

“Guidelines for the establishment and development of monolingual thesauri”

indexing language: “A controlled set of terms selected from natural language and used to represent, in summary form, the subjects of documents.”

thesaurus: “The vocabulary of a controlled indexing language, formally organized …”

preferred term: “A term used consistently when indexing to represent a given concept … sometimes known as descriptor.”

non-preferred term: “The synonym or quasi-synonym of a preferred term. A non-preferred term is not assigned to documents but is provided as an entry point … sometimes known as a non-descriptor.”

Page 11: Expressing Lexical Complexity  in SKOS(XL)

ISO 2788:1986 Model (1)

(hierarchical and associative relations between preferred terms are not in focus here)

term equivalence

see next slide

Page 12: Expressing Lexical Complexity  in SKOS(XL)

ISO 2788:1986 Model (2)

compound term: “An indexing term which can be factored morphologically into separate components, each of which could be expressed, or re-expressed, as a noun that is capable of serving independently as an indexing term.

a) the focus or head, i.e. the noun component which identifies the general class of concepts to which the term as a whole refers. Examples: ‘printed indexes’, ‘hospitals for children’.

b) The difference or modifier, i.e. one or more further components which serve to narrow the extension of the focus and so specify one of its subclasses. Examples: ‘printed indexes’, ‘hospitals for children’.

The focus and its difference(s) may be written as separate words, as in ‘dining rooms’ and ‘soup spoons’, or they may be concatenated into single words, as in ‘bedrooms’ and ‘teaspoons’”.

Page 13: Expressing Lexical Complexity  in SKOS(XL)

ISO Model revised (Draft 2008-11-18)

Leonard Will 2009-02-13 in the public SKOS mailing list:

“I write as Chair of the ‘Data Modeling, Exchange Formats and Protocols’ subgroup of the ISO working group SC9WG8/Project 25964, currently revising the ISO standard for thesauri for information retrieval, but as these standards are still in draft form anything I say here is my own interpretation of the way we are going, and is not authoritative”. …

“The ISO model is firmly based on relationships between concepts, not terms. Terms are used as labels for concepts, as in SKOS”.

http://lists.w3.org/Archives/Public/public-esw-thes/2009Feb/0033.html

(see diagram on next slide)

Page 14: Expressing Lexical Complexity  in SKOS(XL)

ISO Model revised (Draft 2008-11-18)

Page 15: Expressing Lexical Complexity  in SKOS(XL)

W3C SKOS Recommendation

Simple Knowledge Organization System

“SKOS is an area of work developing specifications and standards to support the use of knowledge organization systems (KOS) such as thesauri, classification schemes, subject heading systems and taxonomies within the framework of the Semantic Web …”

Started in 2004: http://www.w3.org/2004/02/skos/

2009-08-18: W3C Recommendation status

SKOS Reference: http://www.w3.org/TR/2009/REC-skos-reference-20090818/

SKOS Primer: http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/

SKOS Use Cases and Requirements: http://www.w3.org/TR/2009/NOTE-skos-ucr-20090818/

Page 16: Expressing Lexical Complexity  in SKOS(XL)

SKOS Model

about Concepts not terms

“anything” can have these labels (~terms) and notes

includes relations known from ISO “preferred term”: hierarchical, associative, but not equivalence

~ ISO node label

Page 17: Expressing Lexical Complexity  in SKOS(XL)

ISO 2788:1986 mapped to SKOS

ISO 2788:1986 | SKOS (without XL)
document | out of scope
indexing language | n/a (may be described as the set of all values assigned to prefLabel or altLabel properties of Concept instances in a ConceptScheme)
thesaurus | ConceptScheme (any kind of “controlled structured vocabulary”)
(mentioned but not defined) | Concept: “An idea or notion; a unit of thought.”
indexing term | n/a, indexing should use Concept references
• preferred term | value of prefLabel assigned to a Concept instance
• non-preferred term | value of altLabel assigned to a Concept instance
• compound term | n/a
node label | Collection
term hierarchy | broader/narrower, not between terms but between Concept instances
term association | related, not between terms but between Concept instances
term equivalence | n/a (may be seen between values assigned to prefLabel / altLabel of the same Concept instance)
scope note, definition | note (changeNote, definition, editorialNote, example, scopeNote, …)
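The mapping can be illustrated with a small Turtle sketch. The namespace http://example.org/thes/, all identifiers in it, and the label texts are hypothetical; they only show where each ISO 2788 element could end up in plain SKOS.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix :     <http://example.org/thes/> .   # hypothetical namespace

:scheme rdf:type skos:ConceptScheme .        # ISO "thesaurus"

:c1 rdf:type skos:Concept ;
    skos:inScheme :scheme ;
    skos:prefLabel "waste"@en ;              # ISO "preferred term"
    skos:altLabel "garbage"@en ;             # ISO "non-preferred term"
    skos:scopeNote "Illustrative scope note."@en ;
    skos:narrower :c2 ;                      # hierarchy between Concepts, not terms
    skos:related :c3 .                       # association between Concepts, not terms

:c2 rdf:type skos:Concept ;
    skos:inScheme :scheme ;
    skos:prefLabel "household waste"@en ;
    skos:broader :c1 .

:c3 rdf:type skos:Concept ;
    skos:inScheme :scheme ;
    skos:prefLabel "recycling"@en .

:byOrigin rdf:type skos:Collection ;         # ISO "node label"
    skos:prefLabel "waste by origin"@en ;
    skos:member :c2 .
```

Note that nothing in this sketch says how "waste" and "garbage" relate to each other as terms; both are simply labels of the same Concept, which is the "term equivalence n/a" row of the mapping.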

Page 18: Expressing Lexical Complexity  in SKOS(XL)

What is added by SKOSXL?

skosxl:Label is a Class not a literal

skosxl:Label has (exactly one) literalForm

skosxl:Label can have labelRelation to another Label

What you don’t see in the diagram:

skos:prefLabel etc. are extended by a “property chain” (seen from an rdfs:Resource): the value of an assigned skos:prefLabel is equivalent to the value of the skosxl:literalForm of an assigned skosxl:Label.
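A minimal sketch of what the property chain means in practice. The SKOS-XL specification states these chains in prose; the owl:propertyChainAxiom triples below are only an illustrative rendering and fall outside OWL 2 DL, because skos:prefLabel is an annotation property with literal values. The local namespace http://example.org/umthes/ is an assumption.

```turtle
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix :       <http://example.org/umthes/> .   # hypothetical local namespace

# Read as a rule: following skosxl:prefLabel and then skosxl:literalForm
# yields a skos:prefLabel value (likewise for altLabel and hiddenLabel).
skos:prefLabel   owl:propertyChainAxiom ( skosxl:prefLabel   skosxl:literalForm ) .
skos:altLabel    owl:propertyChainAxiom ( skosxl:altLabel    skosxl:literalForm ) .
skos:hiddenLabel owl:propertyChainAxiom ( skosxl:hiddenLabel skosxl:literalForm ) .

# Asserted data ...
:4711  skosxl:prefLabel :waste .
:waste skosxl:literalForm "waste" .

# ... from which a reasoner applying the chain can derive:
# :4711 skos:prefLabel "waste" .
```

The chain runs in one direction only: a plain skos:prefLabel literal does not imply that a skosxl:Label instance exists.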

Page 19: Expressing Lexical Complexity  in SKOS(XL)

Extensions of SKOSXL by UMTHES

properties of skosxl:Label complementing skosxl:literalForm

baseForm inflectional “root” of the term (add suffixes to this)

inflectionalCode encoding of a regular inflectional pattern

lexicalVariant any lexical variant that may appear in a written document

inflectional - derived by inflection

acronym - any kind of abbreviation

cultural - any (sub) cultural variation

misspelled - common spelling errors

subProperties of skosxl:labelRelation

homograph homograph part of a qualified name

hasQualifier qualifier part of a qualified name

lexicalExtension may point to historical orthography, or verb form, etc.

compoundFrom composition (value is a rdf:List)
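As a sketch, the extensions listed above might be declared along the following lines. This is an assumption about the shape of the schema, not a copy of the actual UMTHES scheme file; in particular, treating inflectional, acronym, cultural and misspelled as sub-properties of lexicalVariant is one reading of the slide.

```turtle
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix ext:    <http://www.uba.de/2009/08/UmThesScheme#> .

# Literal-valued properties of skosxl:Label
ext:baseForm         rdf:type owl:DatatypeProperty ; rdfs:domain skosxl:Label .
ext:inflectionalCode rdf:type owl:DatatypeProperty ; rdfs:domain skosxl:Label .
ext:lexicalVariant   rdf:type owl:DatatypeProperty ; rdfs:domain skosxl:Label .

# Assumed: the four kinds of variant specialize ext:lexicalVariant
ext:inflectional rdfs:subPropertyOf ext:lexicalVariant .
ext:acronym      rdfs:subPropertyOf ext:lexicalVariant .
ext:cultural     rdfs:subPropertyOf ext:lexicalVariant .
ext:misspelled   rdfs:subPropertyOf ext:lexicalVariant .

# Relations between Labels, as listed on the slide
ext:homograph        rdfs:subPropertyOf skosxl:labelRelation .
ext:hasQualifier     rdfs:subPropertyOf skosxl:labelRelation .
ext:lexicalExtension rdfs:subPropertyOf skosxl:labelRelation .

# The value of compoundFrom is an rdf:List of Labels, not a single Label
ext:compoundFrom     rdfs:subPropertyOf skosxl:labelRelation .
```

The rdf: prefix is omitted above for brevity; add `@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .` before parsing.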

Page 20: Expressing Lexical Complexity  in SKOS(XL)

Examples using SKOS(XL) (mostly stripped down to a topic)

Page 21: Expressing Lexical Complexity  in SKOS(XL)

Switching to Turtle Syntax

Terse RDF Triple Language

W3C Team Submission 14 January 2008

http://www.w3.org/TeamSubmission/turtle/ by David Beckett and Tim Berners-Lee

Used in W3C SKOS Recommendation as well as in OWL 2 Draft

Everything can be expressed in XML as well.

Turtle syntax makes more sense for human reading.

See for yourself …

Page 22: Expressing Lexical Complexity  in SKOS(XL)

UMTHES in SKOS(XL) examples

Namespace prefixes used in the following:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.

@prefix owl: <http://www.w3.org/2002/07/owl#>.

@prefix skos: <http://www.w3.org/2004/02/skos/core#>.

@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#>.

@prefix ext: <http://www.uba.de/2009/08/UmThesScheme#>.

# no prefix means: defined in the local namespace

Page 23: Expressing Lexical Complexity  in SKOS(XL)

waste & garbage

# SKOS only
:4711 rdf:type skos:Concept;
    skos:prefLabel "waste";
    skos:altLabel "garbage".

# exactly the same in SKOSXL
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :waste;
    skosxl:altLabel :garbage.

:waste rdf:type skosxl:Label;
    skosxl:literalForm "waste".

:garbage rdf:type skosxl:Label;
    skosxl:literalForm "garbage".

NOTE: Local instance identifiers (:4711, :waste, :garbage, etc.) in these examples follow a local naming convention intended for human readers only.

“4711” used to be the brand name of a Cologne-based perfume manufacturer (“Eau de Cologne”). In the 1980s and 1990s it became a generic placeholder ID in computing. So :4711 stands for “any kind of unique, but by itself meaningless ID”.

The only functional requirements for IDs in this place are:

• being unique within the assigned namespace;

• being part of a working http URI.

Page 24: Expressing Lexical Complexity  in SKOS(XL)

waste & garbage

# this looks like saying the same stuff in a more complicated way

# but wait ...

Page 25: Expressing Lexical Complexity  in SKOS(XL)

“waste water” composition

:4711 rdf:type skos:Concept;
    skosxl:prefLabel :wasteWater.

:wasteWater rdf:type skosxl:Label;
    skosxl:literalForm "waste water";
    ext:lexicalVariant "wastewater";
    ext:compoundFrom (:waste :water).

# already defined in the previous slide, could skip it here:
:waste rdf:type skosxl:Label;
    skosxl:literalForm "waste".
# only the noun, "wasted water" is NOT "waste water"!

:water rdf:type skosxl:Label;
    skosxl:literalForm "water";
    ext:inflectional "waters".

Page 26: Expressing Lexical Complexity  in SKOS(XL)

Multiple Composition in German

# @en: technique of facilities for the recycling of waste water
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :abwasserAufbereitungsAnlagenTechnik.

:abwasserAufbereitungsAnlagenTechnik rdf:type skosxl:Label;
    skosxl:literalForm "Abwasseraufbereitungsanlagentechnik";
    ext:compoundFrom (:abwasser :aufbereitung :anlage :technik);
    ext:compoundFrom (:abwasserAufbereitung :anlage :technik);
    ext:compoundFrom (:abwasserAufbereitungsAnlage :technik);
    ext:compoundFrom (:abwasser :aufbereitungsAnlage :technik);
    ext:compoundFrom (:abwasserAufbereitung :anlagenTechnik);
    ext:compoundFrom (:abwasser :aufbereitung :anlagenTechnik);
    ext:compoundFrom (:abwasser :aufbereitungsAnlagenTechnik).

# maybe I missed some composition variant?

Not joking!
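For the record: a word built from four components can be split into two or more contiguous parts in 2^3 − 1 = 7 ways, which matches the seven lists above. Each intermediate compound referenced in those lists is itself a Label with its own composition. A minimal sketch for one of them follows; the local namespace http://example.org/umthes/ and the literal forms are assumptions made for illustration.

```turtle
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix ext:    <http://www.uba.de/2009/08/UmThesScheme#> .
@prefix :       <http://example.org/umthes/> .   # hypothetical local namespace

# atomic components
:abwasser rdf:type skosxl:Label ;
    skosxl:literalForm "Abwasser"@de .

:aufbereitung rdf:type skosxl:Label ;
    skosxl:literalForm "Aufbereitung"@de .

# an intermediate compound carries its own composition
:abwasserAufbereitung rdf:type skosxl:Label ;
    skosxl:literalForm "Abwasseraufbereitung"@de ;
    ext:compoundFrom (:abwasser :aufbereitung) .
```

The remaining intermediate Labels (:anlagenTechnik, :aufbereitungsAnlage, and so on) would follow the same pattern.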

Page 27: Expressing Lexical Complexity  in SKOS(XL)

Lexical extension example in German

# in English: "cleaning"
:reinigung rdf:type skosxl:Label;
    skosxl:literalForm "Reinigung"@de;
    ext:lexicalExtension :reinigen .

# extended by the verb form, English "to clean"
# Caution: see "wasted water"
:reinigen rdf:type skosxl:Label;
    skosxl:literalForm "reinigen"@de;
    ext:baseForm "reinig";
    ext:inflectionalCode "007";
    ext:inflectional "reinige";
    ext:inflectional "reinigen";
    ext:inflectional "reinigte";
    ext:inflectional "gereinigt";
    ext:inflectional "gereinigte";
    ext:inflectional "gereinigter";
    ext:inflectional "gereinigtes";
    ext:inflectional "reinigend";
    ext:inflectional "reinigende";
    ext:inflectional "reinigender";
    ext:inflectional "reinigendes".
# to be continued …

Page 28: Expressing Lexical Complexity  in SKOS(XL)

Homograph & qualifier

:4711 rdf:type skos:Concept;
    skosxl:prefLabel :bass--fish .    # [ˈbas]

:4712 rdf:type skos:Concept;
    skosxl:prefLabel :bass--music .   # [ˈbās]

:bass rdf:type skosxl:Label;
    skosxl:literalForm "bass".

:fish rdf:type skosxl:Label;
    skosxl:literalForm "fish".

:bass--fish rdf:type skosxl:Label;
    skosxl:literalForm "bass (fish)";
    ext:homograph :bass;
    ext:hasQualifier :fish.

# add Labels :music and :bass--music using the same pattern
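Spelled out, the two remaining Labels could look like this. The sketch simply mirrors the :bass--fish pattern; the local namespace http://example.org/umthes/ is an assumption.

```turtle
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skosxl: <http://www.w3.org/2008/05/skos-xl#> .
@prefix ext:    <http://www.uba.de/2009/08/UmThesScheme#> .
@prefix :       <http://example.org/umthes/> .   # hypothetical local namespace

:music rdf:type skosxl:Label ;
    skosxl:literalForm "music" .

:bass--music rdf:type skosxl:Label ;
    skosxl:literalForm "bass (music)" ;
    ext:homograph :bass ;          # the same homograph Label that :bass--fish points to
    ext:hasQualifier :music .
```

Both qualified Labels point to the single :bass Label, so a text-matching tool can see that the bare string "bass" is ambiguous between the two Concepts.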

Page 29: Expressing Lexical Complexity  in SKOS(XL)

Multilingualism (symmetric)

# symmetric (in SKOS, can be expressed in SKOSXL likewise)
:4711 rdf:type skos:Concept;
    skos:prefLabel "organisation"@en;
    skos:prefLabel "organization"@en-US;
    # add your language here ... (GEMET has more than 20)
    skos:prefLabel "Organisation"@de.

SKOS integrity condition S14: “A resource has no more than one value of skos:prefLabel per language tag.”

NOTE: this does not mean it must have prefLabel values in multiple languages
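A short sketch of what S14 allows and what it rules out. The concept IDs and the local namespace http://example.org/umthes/ are hypothetical.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix :     <http://example.org/umthes/> .   # hypothetical local namespace

# Consistent with S14: "en" and "en-US" are two different language tags
:4711 skos:prefLabel "organisation"@en ,
                     "organization"@en-US .

# Consistent with S14: prefLabel in a single language only
:4712 skos:prefLabel "Organisation"@de .

# NOT consistent with S14: two prefLabel values carrying the same tag "en"
:4713 skos:prefLabel "organisation"@en ,
                     "organization"@en .
```

In the last case one of the two values would have to become a skos:altLabel, or receive a more specific language tag.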

Page 30: Expressing Lexical Complexity  in SKOS(XL)

Multilingualism (language-centric)

# UMTHES is German-centric with altLabel values also in English
:4711 rdf:type skos:Concept;
    skos:prefLabel "Organisation"@de;
    skos:altLabel "organisation"@en;
    skos:altLabel "organization"@en-US.

# or use skosxl: in the above to refer to:
:Organisation rdf:type skosxl:Label;
    skosxl:literalForm "Organisation"@de;
    ext:inflectional "Organisationen";
    ext:inflectional "Organisations-".

:organisation rdf:type skosxl:Label;
    skosxl:literalForm "organisation"@en;
    ext:inflectional "organisations".

:organization rdf:type skosxl:Label;
    skosxl:literalForm "organization"@en-US;
    ext:inflectional "organizations".

Page 31: Expressing Lexical Complexity  in SKOS(XL)

Multilingualism (asymmetric)

# full asymmetric pattern (currently not used by UMTHES)
:4711 rdf:type skos:Concept;
    skosxl:prefLabel :Organisation;
    ext:hasTranslation :4712.

:4712 rdf:type skos:Concept;
    skosxl:prefLabel :organisation;
    ext:hasTranslation :4711.

# :Organisation & :organisation already known from previous slide

Page 32: Expressing Lexical Complexity  in SKOS(XL)

About Federation

UMTHES has been one of the 8 sources of GEMET

UMTHES extends GEMET with more detailed German Concepts and their lexical complexity.

@prefix gemet: <http://www.eionet.europa.eu/gemet/concept/>.

# GEMET URIs do resolve in SKOS since 2009-09 !!!
:14452 rdf:type skos:Concept;
    skosxl:prefLabel :klimaAenderung;
    skosxl:altLabel :klimaWandel;
    skosxl:altLabel :climateChange;
    # referencing GEMET "climatic change" from here
    skos:closeMatch gemet:1471.

:klimaAenderung rdf:type skosxl:Label;
    ext:compoundFrom (:klima :aenderung).
# ... etc., as exemplified before

Page 33: Expressing Lexical Complexity  in SKOS(XL)

preferred, non-preferred term again

# you may define such classes in SKOS (OWL) at any time
# but they will never be exactly equivalent to ISO 2788 (why?)
:isPrefLabelOf owl:inverseOf skosxl:prefLabel.
:isAltLabelOf owl:inverseOf skosxl:altLabel.

# a preferred term is the prefLabel of some Concept
:PreferredTerm owl:equivalentClass [
    rdf:type owl:Restriction ;
    owl:onProperty :isPrefLabelOf ;
    owl:someValuesFrom skos:Concept ].

# a non-preferred term is the altLabel of some Concept and not a preferred term
:NonPreferredTerm owl:equivalentClass [
    rdf:type owl:Class ;
    owl:intersectionOf (
        [ rdf:type owl:Class ;
          owl:complementOf :PreferredTerm ]
        [ rdf:type owl:Restriction ;
          owl:onProperty :isAltLabelOf ;
          owl:someValuesFrom skos:Concept ]
    ) ].

Page 34: Expressing Lexical Complexity  in SKOS(XL)

Finally …

# you may express anything in RDF / Turtle …
@prefix foaf: <http://xmlns.com/foaf/0.1/>.

:ecoTerm2009 rdf:type :meeting;
    :hasOnAgenda :theseSlides.

:theseSlides rdf:type :presentation;
    skos:prefLabel "Expressing Lexical Complexity in SKOS(XL)";
    :hasPresenter :tb.

:tb rdf:type foaf:Person;
    foaf:mbox <mailto:[email protected]>;
    foaf:isPrimaryTopicOf <http://www.bandholtz.eu/foaf.rdf>;
    foaf:workplaceHomepage <http://www.innoq.com>;
    foaf:currentProject <http://apps.innoq.com/iqvoc/about.html>;
    # add your assertions here ...
    :says "Good Buy!".
