2013-09-03 - Copenhagen - 3rd DARIAH-EU General VCC meeting Matej Ďurčo, ICLTT, Vienna Hennie Brugman, Meertens Institute, Amsterdam Controlled Vocabularies for Digital Humanities

Page 1

2013-09-03 - Copenhagen - 3rd DARIAH-EU General VCC meeting

Matej Ďurčo, ICLTT, Vienna

Hennie Brugman, Meertens Institute, Amsterdam

Controlled Vocabularies

for Digital Humanities

Page 2

Outline

• Overview of related activities in different contexts

• Controlled Vocabularies - potential usages and topic areas

• SKOS

• OpenSKOS - Vocabulary Repository

• CLAVAS schema/ontology

• Next steps – Where do we want to go from here?

Page 3

„Vocabulary“ - disambiguation

• concept vs. term: semantic vs. lexical level; a concept is referred to by one or more terms, but is not identified by them (“has a life of its own”)

• term list: flat

• concept list: also flat, but distinguishes between the semantic and the lexical level

• taxonomy: has (mostly hierarchical) relations between concepts/terms

• schema/ontology: both have concepts/entities with properties and different types of relations between them

- schema – XML and database world

- ontology – knowledge management, Semantic Web

Page 4

„Vocabulary“ - disambiguation

(same list as on Page 3, here annotated with the labels “discussed here”, “can use” and “can grow into”)

Page 5

Activities in DARIAH

• VCC1/Task 5: Data federation and interoperability

• VCC3/Task 3: Reference Data Registries – outcomes?

• => joint task force (started in Vienna, November 2012)

goal: establish a service providing controlled vocabularies and reference data for the DARIAH community.

• Schema Registry + Crosswalks: related, but does not seem to belong here, as it operates at the schema level

M. Hoogerwerf, P. Gietz: VCC 1, Task 2: Core Infrastructure Services

Page 6

Activities in CLARIN

• ISOcat – Data Category Registry (www.isocat.org)

- registry for defining (linguistic) concepts (“flat” = (almost) no relations)

- implementation of the ISO standard ISO 12620:2009

- a cornerstone of CMDI – semantic grounding for metadata schemas

• Relation Registry – companion to ISOcat to express relations between data categories

- early-stage service operational: lux13.mpi.nl/relcat/

• task force on metadata curation – within the SCCTC (Standing Committee for CLARIN Technical Centres)

• CLAVAS – Vocabulary Alignment Service for CLARIN

- initiative originating within CLARIN-NL

- goal: adopt the Vocabulary Repository OpenSKOS for CLARIN needs

• OpenSKOS and controlled vocabularies meeting in Utrecht, 2013-05-17 (www.clarin.eu/node/3780)

Page 7

Activities elsewhere

A number of datasets/vocabularies (and tools/services) already exist, especially in the library world:

• VIAF – Virtual International Authority File

- carried by national libraries + OCLC

- goal: harmonize/cluster (national) authority files

- provides services, search interface, data dumps

- vocabularies for: Personal Names, Corporate Names, Geographic Names, etc.

• The European Library (48 national libraries)

- vocabulary-based data enrichment

- MACS – Multilingual Access to Subjects (semi-automatic alignment)

- alignment of DDC and UDC via CERIF carried out

- alignment to other ontologies (Geonames, VIAF)

- search service: http://www.theeuropeanlibrary.org/tel4/apisearch

• Library of Congress – LCSH, MADS, ...

• Getty Thesauri

• Geonames – search interface, service, dumps

• LT-World @DFKI – full-blown ontology, rather a candidate for LOD-linking

• many more …

• CoNE – Control of Named Entities @MPDL/eSciDoc: http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities

• EATS – Entity Authority Tool Set @New Zealand Electronic Text Centre (NZETC): http://eats.readthedocs.org/en/latest/

Page 8

Potential usages for CV

• Metadata Authoring, Curation

• Data Enrichment, Annotation

• Search query expansion, autocomplete, facets etc.

• Data Analysis / Exploration

• an indispensable building block for moving data to the Semantic Web, because they allow strings to be resolved to entities

• can provide equivalences between concepts/entities from different vocabularies, cf. the links in Wikipedia (page for J. W. Goethe): GND: 118540238 | LCCN: n79003362 | NDL: 00441109 | VIAF: 24602065

=> Linked Data
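The Goethe example can be read as a small equivalence table: one entity, several identifiers, many spellings. The following is a minimal sketch of such a lookup, assuming a hand-built in-memory index. The names `EQUIVALENTS`, `LABELS`, `resolve` and `same_entity` as well as the label variants are hypothetical; only the four identifiers are taken from the slide.

```python
from typing import Dict, Optional

# Hypothetical in-memory index; only the four identifiers come from the slide.
EQUIVALENTS: Dict[str, Dict[str, str]] = {
    "goethe": {
        "GND": "118540238",
        "LCCN": "n79003362",
        "NDL": "00441109",
        "VIAF": "24602065",
    }
}

# Label variants (assumed for illustration) that all point to the same entity key.
LABELS: Dict[str, str] = {
    "j. w. goethe": "goethe",
    "johann wolfgang von goethe": "goethe",
    "goethe, johann wolfgang von": "goethe",
}


def resolve(label: str) -> Optional[Dict[str, str]]:
    """Resolve a free-text string to the identifiers of one entity, or None."""
    key = LABELS.get(label.strip().lower())
    return EQUIVALENTS.get(key) if key is not None else None


def same_entity(scheme_a: str, id_a: str, scheme_b: str, id_b: str) -> bool:
    """True if both identifiers are recorded for the same entity."""
    return any(
        ids.get(scheme_a) == id_a and ids.get(scheme_b) == id_b
        for ids in EQUIVALENTS.values()
    )
```

In a real service the two dictionaries would be replaced by queries against the vocabulary repository; the point here is only that string matching happens once, at resolution time, and everything downstream works with identifiers.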

Page 9

Topic areas + candidate vocabularies

• Data Categories / Concepts – ISOcat, (dublincore)

• Languages - ISO-639-*

• Countries - country codes

• Organizations - GND, VIAF, dbpedia?

• Persons - GND, VIAF, dbpedia?

• Subject headings (Schlagwörter) - GND, LCSH, DDC, UDC, MACS, …

• Resource Typology - many attempts

• many other, more specialized ones

Abbreviations:

AAT – Getty Art & Architecture Thesaurus

DDC – Dewey Decimal Classification

GND – Gemeinsame Normdatei (Deutsche Nationalbibliothek)

GTAA – Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)

VIAF – Virtual International Authority File

Page 10

Requirements / Approach

• concept/entity identified by a PID (a Cool URI would do)

• support multilinguality / localization

• plurality of conceptualizations: allow multiple (conflicting) vocabularies for the same topic

• vocabulary management / curation as a collaborative, ongoing process

• base on Semantic Web compliant formats

• share created vocabularies

• reuse existing sources/services

• thematic sub-communities with selected profiles: different sub-communities and groups need different vocabularies

• but with harmonized access = a „one stop shop“ for controlled vocabularies: good for the providers, the users and the developers

Page 11

Benefits (and risks) of a harmonized system

• for providers:

- simplifies publication of vocabularies

- simplifies reuse of (own) vocabularies in somebody else's tools

- easy to align concepts between vocabularies

• for users

- easy to discover, evaluate and use vocabularies (less need to construct them yourself)

- new browsing and searching possibilities

- online vocabularies are always up to date

• for tool builders

- no customization for individual vocabularies needed

- reuse of existing tools, modules

• risks

- Babylon scenario – conceptual domains that are too different clash

- overloading the system – the system becomes a single point of failure

- overwhelming the user – too much information (too many vocabularies available)

Page 12

SKOS – ultra short primer

Simple Knowledge Organization System

http://www.w3.org/TR/skos-reference/

• SKOS knowledge structures consist of Concepts grouped in ConceptSchemes

• Concepts are identified by a URI

• Concepts have labels in 1 or more languages

skos:prefLabel@lang, skos:altLabel@lang => multilinguality

• Concepts can be documented with ‘notes’

• Concepts have mutual semantic relations

broader, narrower, related => taxonomy construction

• Concepts in different ConceptSchemes can have matching relations (e.g. skos:exactMatch, skos:closeMatch)

• Concepts can be part of multiple ConceptSchemes
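The points of this primer can be illustrated with a small in-memory model. This is a sketch only, not an RDF implementation: the `Concept` class, the helper functions and all URIs and labels are invented for illustration. It relies on one fact from the SKOS reference, namely that skos:narrower is the inverse of skos:broader, so only one direction needs to be stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Concept:
    uri: str                                                        # a concept is identified by a URI
    pref_label: Dict[str, str] = field(default_factory=dict)        # lang -> preferred label
    alt_labels: Dict[str, List[str]] = field(default_factory=dict)  # lang -> alternative labels
    broader: Set[str] = field(default_factory=set)                  # URIs of broader concepts
    schemes: Set[str] = field(default_factory=set)                  # may belong to several schemes


def narrower(concepts: Dict[str, Concept], uri: str) -> Set[str]:
    """Derive skos:narrower as the inverse of skos:broader."""
    return {c.uri for c in concepts.values() if uri in c.broader}


def label(concept: Concept, lang: str, fallback: str = "en") -> str:
    """Pick the preferred label for a language, falling back to another one."""
    return concept.pref_label.get(lang) or concept.pref_label.get(fallback, concept.uri)


# Hypothetical example data (URIs and labels are invented).
BASE = "http://example.org/vocab/"
animals = Concept(BASE + "animals", {"en": "animals", "de": "Tiere"},
                  schemes={BASE + "scheme"})
cats = Concept(BASE + "cats", {"en": "cats", "de": "Katzen"}, {"en": ["felines"]},
               broader={animals.uri}, schemes={BASE + "scheme"})
concepts = {c.uri: c for c in (animals, cats)}
```

With this data, `narrower(concepts, animals.uri)` yields the URI of `cats`, and `label(cats, "fr")` falls back to the English label because no French one is recorded.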

Page 13

SKOS – example

Page 14

OpenSKOS

Vocabulary repository and service (openskos.org)

• data in SKOS format

• Peer to peer architecture

• RESTful API

• Linked Data

• Publication with upload and OAI-PMH

• Management using Interactive Dashboard

• Support for alignment

• Promotion of open database licenses

• And lately, vocabulary curation with built-in editor
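Publication via OAI-PMH means that another instance can harvest records with plain HTTP requests. The sketch below only builds a standard OAI-PMH `ListRecords` request URL and does not contact any server. The endpoint, the set name and the function name are hypothetical, and `oai_dc` is used because every OAI-PMH repository must support it; which additional metadata formats an OpenSKOS instance offers is not stated in the slides.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None,
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec       # restrict harvesting to one set
    if from_date:
        params["from"] = from_date     # selective harvesting: records changed since this date
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and set name, for illustration only.
url = list_records_url("https://example.org/oai-pmh", set_spec="organizations",
                       from_date="2013-05-17")
```

Passing a `from` date on each run is what makes periodic synchronization between instances incremental rather than a full re-download.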

Page 15

OpenSKOS

• developed within the Dutch cultural heritage project CATCHplus

• by a commercial company (Picturae), but open source

• currently 3 instances running: Meertens Institute, NISL, ICLTT (test phase)

(Picturae has another 7 instances running for their customers)

Page 16

CLAVAS – Vocabulary Service for CLARIN

Adaptation of OpenSKOS for CLARIN purposes = mainly a separate instance with specific data sets

currently > 2,500 entries

bootstrap

Page 17

ISOcat and CLAVAS

• automatically export all closed+simple data categories

- perhaps even better to select manually

- not all data categories!

• Third-party applications would use

- ISOcat for the explain() function

- CLAVAS for value(/entity) lists (autocomplete)
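The autocomplete use of a value list can be sketched as a prefix search over a sorted index. This is an assumption about how a client might use such a list locally, not a description of the CLAVAS or OpenSKOS API. The function names are hypothetical, and the sample values are simply organization names that occur in these slides.

```python
from bisect import bisect_left
from typing import List, Tuple

Index = List[Tuple[str, str]]  # (lower-cased key, original label), sorted by key


def build_index(labels: List[str]) -> Index:
    """Sort labels case-insensitively once, so lookups can use binary search."""
    return sorted((label.lower(), label) for label in labels)


def autocomplete(index: Index, prefix: str, limit: int = 5) -> List[str]:
    """Return up to `limit` labels that start with `prefix` (case-insensitive)."""
    p = prefix.lower()
    start = bisect_left(index, (p,))   # first entry whose key is >= the prefix
    matches = []
    for key, label in index[start:start + limit]:
        if not key.startswith(p):
            break                      # matches are contiguous, so stop at the first miss
        matches.append(label)
    return matches


# Sample value list (names taken from the slides, used here only as test data).
index = build_index(["Meertens Institute", "ICLTT", "MPDL", "DFKI", "BBAW", "IEG Mainz"])
```

For example, `autocomplete(index, "i")` returns the two entries beginning with "I", and an unknown prefix returns an empty list.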

Page 18

Open issues – next steps

• CLARIN/CLAVAS:

short term:

- update OpenSKOS instance at the Austrian Centre

- test synchronization of datasets via OAI-PMH with sister instance at Meertens

- continue curation work on Organization names

long term:

- use the Vocabulary Service with other infrastructure components (e.g. metadata editor)

- adopt further vocabularies

- especially work out how to integrate existing large ones / services => proxy?

• DARIAH

- collect candidate vocabularies/topics and people/groups in need of those

- decide if we try to use/adopt OpenSKOS - perhaps a pilot => be bold, step up!

- pin down concrete scenarios (+ outcome) where given vocabularies would be employed

- get rid of the NIH (“not invented here”) attitude

• Work out relation to Semantic Web activities

- transforming data to Linked Data (RDF)

- interlinking vocabularies/ontologies (dbpedia as the LOD-pivot)

Page 19

What do we want/need? Who does what?

| Topic / Area | Existing vocabularies | People |
|---|---|---|
| Language | | |
| Organization | CLAVAS-organizations, VIAF, GND, dbpedia | |
| Resource Type, Format | ? -> Taxonomy of DH Research Activities and Objects? | |
| Genre / Topic / Subject | LCSH, UDC, DDC, … | |
| Geographica | Geonames, Getty, dbpedia | |
| Persons | GND, Getty AAT, dbpedia | |

Page 20

Controlled Vocabularies - Outline

• Overview of related activities in different contexts (DARIAH, CLARIN, Digital Libraries)

• Controlled Vocabularies - potential usages and topic areas

• SKOS – a widely used W3C-standard for “vocabularies”

• OpenSKOS - Vocabulary Repository

• CLAVAS – Vocabulary System for CLARIN

• Next steps – Where do we want to go from here?

Page 21

Summary discussion

• concentrate on content generation rather than technical development

in adherence to general DARIAH strategy

• but also don't reinvent the wheel (data)

=> reuse / mediate / proxy existing vocabularies

however, they are often too broad/general and never complete (VIAF, GND)

we need the possibility to add concepts or, more generally, to edit/curate vocabularies

• hence a dedicated Vocabulary Repository

allowing collaborative curation of vocabularies

one such system would be OpenSKOS (tried out in CLARIN)

• try to feed back/contribute back to the authority bodies (National Libraries)

this is possible in principle, but a slow process

DARIAH could/should try to mediate / push, but this needs to be a separate/parallel track

Page 22

Summary discussion

• many different topics/areas – but consider the fundamental distinction between concepts/taxonomies and entities (organizations, persons)

- Vocabulary Repository only for SKOS data

- dedicated tools for entities -> e.g. PDR Persons Data Repository (@BBAW)

• very closely related to Linked Data and Scholarly Methods Ontology tracks

two (interconnected) levels to work on:

1. Inventory + harmonization

- bring existing vocabularies onto common technical ground

2. Vocabulary (/ontology) alignment

- create links between concepts/entities in different vocabularies/ontologies

- proposition by IEG Mainz (using the example of place types): align vocabularies based on features of concepts

• Vocabularies most asked for

- Names (Persons and Organizations)

- Places (Geographica)
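The idea of aligning vocabularies based on features of concepts is not spelled out in the slides. One possible reading, shown here purely as an assumption, is to describe each concept by a set of features and to propose a link to the concept in the other vocabulary with the highest feature overlap (Jaccard similarity). The place types, features, threshold and function names below are all invented for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

Vocabulary = Dict[str, FrozenSet[str]]  # concept label -> set of features


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Overlap of two feature sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align(source: Vocabulary, target: Vocabulary,
          threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """Propose (source, target, score) links whose best overlap reaches the threshold."""
    links = []
    for s_label, s_feats in source.items():
        best = max(target.items(), key=lambda item: jaccard(s_feats, item[1]), default=None)
        if best is not None:
            score = jaccard(s_feats, best[1])
            if score >= threshold:
                links.append((s_label, best[0], score))
    return links


# Invented place-type vocabularies and features, for illustration only.
vocab_a = {
    "town": frozenset({"settlement", "inhabited", "urban"}),
    "river": frozenset({"water", "flowing", "natural"}),
}
vocab_b = {
    "city": frozenset({"settlement", "inhabited", "urban", "large"}),
    "stream": frozenset({"water", "flowing", "natural"}),
    "lake": frozenset({"water", "standing", "natural"}),
}
links = align(vocab_a, vocab_b)
```

Links proposed this way would still need review by a curator; the score only ranks candidates and says nothing about whether the match is exact or merely close.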

Page 23

Vision
