Innovative methods for data integration: Linked Data and NLP

ARIADNE is funded by the European Commission's Seventh Framework Programme


Douglas Tudhope Hypermedia Research Group

University of South Wales (USW)

Linked Data and NLP

Linked Data (LD) + Natural Language Processing (NLP)

Two technologies that open up new possibilities for semantic integration of archaeological datasets and fieldwork reports.

E.g.

• cross searching

• meta research

• reinterpretation of previous work

This presentation

• Overview

• Illustrative early examples

- a flavour of progress and challenges to date

• NLP of grey literature (English – Dutch)

• Mapping between multilingual vocabularies

What is Linked Data?

“The Web enables us to link related documents. Similarly it enables us to link related data. The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. Key technologies that support Linked Data are URIs (a generic means to identify entities or concepts in the world), HTTP (a simple yet universal mechanism for retrieving resources, or descriptions of resources), and RDF (a generic graph-based data model with which to structure and link data that describes things in the world).”

http://linkeddata.org/faq

http://data.archaeologydataservice.ac.uk/page/10.5284/1000389





Linked Data

• Making RDF format data available via the web

• Data expressed in RDF

• Using (HTTP) URIs as identifiers for things

• When someone looks up a URI, provide useful information (including links to other things)

• Will it work for cultural heritage...? Yes

– http://data.ordnancesurvey.co.uk/

– http://collection.britishmuseum.org/

– http://data.archaeologydataservice.ac.uk/
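A minimal sketch of the idea behind these bullets, in Python: things are identified by HTTP URIs, data is held as subject-predicate-object triples, and looking up a URI returns its description, including links to other things. The example.org URIs and the `describe` and `linked_things` helpers are invented for illustration; a real service would serve RDF over HTTP from a triple store.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# All example.org URIs below are hypothetical, for illustration only.
SITE = "http://example.org/id/site/1"
REPORT = "http://example.org/id/report/42"

TRIPLES = [
    Triple(SITE, "http://www.w3.org/2000/01/rdf-schema#label", "Example hillfort"),
    Triple(REPORT, "http://purl.org/dc/terms/title", "Excavation report"),
    Triple(REPORT, "http://purl.org/dc/terms/spatial", SITE),
]


def describe(uri):
    """Return every triple whose subject is the given URI."""
    return [t for t in TRIPLES if t.subject == uri]


def linked_things(uri):
    """Return the URIs that the description of `uri` links to."""
    return [t.obj for t in describe(uri) if t.obj.startswith("http")]
```

Looking up the report returns a link to the site, which can in turn be looked up: this "follow your nose" behaviour is what the fourth bullet asks for.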

http://linkeddata.org/faq and Bizer, C., Heath, T., & Berners-Lee, T. (2009). Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems, 5(3), 1-22. Also see http://data.gov.uk/linked-data




Standards are key

• Linked data rests upon layers of technological standards.

• Within archaeology, vocabulary standards are seen as a potential solution to the current fieldwork situation where isolated silos of data impede sharing, cross search, comparison and reinterpretation.

• Standard representations of thesauri and other vocabularies

• Standard ways of mapping between vocabularies and using vocabularies within ontology frameworks

ARIADNE Linked Data aims

Support provision, management and use of LD in the Integrated Infrastructure

• Operational LD management service (based on a triple store) working with the ARIADNE Registry

• Supporting tools for the linking of infrastructure, such as dictionaries, glossaries and thesauri. Tools for semantic enrichment

• Exemplar applications exploiting the LD

• Advise ARIADNE data providers in creation and publishing of LD

Semantic enrichment

• Semantic enrichment requires an infrastructure for linking

- dictionaries, glossaries, thesauri, ontologies.

• Some linking by hand may be possible between closely associated datasets, but this is not scalable

• Critical for enrichment are concepts from major vocabularies and ontological classes that can act as hubs in a web of archaeological data

NLP methods

a) Rule-based systems - a pipeline of cascaded software elements using domain knowledge and vocabularies together with domain-independent linguistic syntax

b) Machine learning (supervised) has less domain-dependency but depends on the existence of a training set.

ARIADNE currently exploring complementary use of both methods, either in combination or sequentially in a pipeline. Looking to make use of relevant CLARIN and other resources.

NLP ongoing case study

Rule-based pilot system

• Archaeological vocabularies from English Heritage (EH) and Rijksdienst Cultureel Erfgoed (RCE).

• Building upon previous work producing semantic enrichment of ADS grey literature via archaeological thesauri and corresponding CIDOC CRM ontology

• Next slides show examples from pilot English and Dutch pipelines. Three entity types shown: Green (Physical Object), Purple (Material) and Red (Time Appellation).

NLP on English grey literature

NLP on Dutch grey literature

Pilot rule-based system

Early stages but promising semantic enrichment for a range of archaeological concepts, e.g.

• artefacts (“vessel/vaatwerk”)

• materials (“charcoal/houtskool”)

• monuments (“Castle/Kasteel”)

• contexts (“grave/graf”)

• temporal entities, both numerical (“3025 BC/3025 v.Chr”) and time appellations (“Neolithic/Neolithicum”)
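A rough sketch of the gazetteer step in such a rule-based pipeline, in Python. It is not the pilot system itself: the gazetteer holds only example terms from this slide, the entity types are the three named earlier, and the "Numeric Date" label and the date pattern are assumptions made for illustration.

```python
import re

# Tiny gazetteer built from example terms on this slide. The pilot
# pipelines use full EH and RCE vocabularies, not this toy list.
GAZETTEER = {
    "vessel": "Physical Object",
    "vaatwerk": "Physical Object",
    "charcoal": "Material",
    "houtskool": "Material",
    "neolithic": "Time Appellation",
    "neolithicum": "Time Appellation",
}

WORD = re.compile(r"[^\W\d_]+")                    # runs of letters
DATE = re.compile(r"\b\d{1,4}\s?(?:BC|v\.Chr)")    # assumed date pattern


def annotate(text):
    """Return (start, matched_text, entity_type) tuples in text order."""
    found = []
    for m in WORD.finditer(text):
        entity_type = GAZETTEER.get(m.group().lower())
        if entity_type:
            found.append((m.start(), m.group(), entity_type))
    for m in DATE.finditer(text):
        found.append((m.start(), m.group(), "Numeric Date"))
    return sorted(found)
```

For example, `annotate("houtskool uit 3025 v.Chr")` yields one Material and one Numeric Date annotation. Whole-word lookup like this is exactly what breaks down on Dutch compound nouns.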

NLP challenges

Generalisation of the English rule-based techniques to Dutch (in this case) faces various challenges:

• different set of vocabularies (archaeology has very specific terminology and it is important to use it)

• differences in language characteristics such as compound noun forms, e.g.

– “beslagplaat” - both “beslag” and “plaat” known vocabulary

– “aardewerkmagering” - “aardewerk” (pottery) known but “magering” not known

• Current work investigating gazetteers operating on part matching, in order to overcome the ‘whole word’ restriction.

• Mapping between vocabularies essential to actually use the results!
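One way part matching on compound nouns could work, sketched in Python under stated assumptions: a greedy longest-first search for known vocabulary terms inside a word. The `split_compound` function and the three-term vocabulary are hypothetical, and this is not necessarily the approach being investigated in the project.

```python
def split_compound(word, vocabulary, min_len=3):
    """Greedy left-to-right search for known terms inside a compound.

    Returns (known_parts, unmatched_remainder).
    """
    word = word.lower()
    parts = []
    pos = 0
    while pos < len(word):
        # Try the longest candidate starting at pos first.
        for end in range(len(word), pos + min_len - 1, -1):
            if word[pos:end] in vocabulary:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            break  # nothing known starts here; stop and report the rest
    return parts, word[pos:]


# Toy vocabulary taken from the two examples on this slide.
VOCAB = {"beslag", "plaat", "aardewerk"}
```

With this vocabulary, “beslagplaat” splits fully into two known parts, while “aardewerkmagering” yields the known part “aardewerk” and the unmatched remainder “magering”.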

You say potato, I say tomato…

• Multiple datasets, multiple organisations, multiple languages

• Unification of data structures may be possible, BUT…

– Incompatible terminology hinders cross search and prevents greater interoperability

– Indexing using text is ambiguous, leading to incorrect search results

– Applications attempting to reuse data must all individually tackle the same problems

• E.g. Find all the iron age post holes…

• The problem here is in the use of text to convey meaning – whereas the underlying concepts are actually the same

• use of concept-based controlled vocabularies and mapping between them

Feature                 Period

Post-hole               IRON AGE
Posthole                |ron age
POST HOLE               Iron age?
POSTHLOLE               EARLY IRON AGE
POST HOLE (POSSIBLE)    250 BC
POSTHOLES               C 500-200 B.C.
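To illustrate how the Feature variants above collapse to one concept, here is a small Python sketch. The concept identifier `concept:posthole`, the normalisation rules and the similarity cutoff are all assumptions made for this example; a real workflow would map to identifiers from a published controlled vocabulary.

```python
import difflib
import re

# Hypothetical concept identifier, invented for this example.
CONCEPTS = {"posthole": "concept:posthole"}


def normalise(label):
    """Lower-case, drop parenthetical qualifiers, keep letters only."""
    label = re.sub(r"\(.*?\)", "", label.lower())
    return re.sub(r"[^a-z]", "", label)


def to_concept(label, cutoff=0.85):
    """Map a free-text label to a concept id, tolerating small typos."""
    match = difflib.get_close_matches(
        normalise(label), list(CONCEPTS), n=1, cutoff=cutoff
    )
    return CONCEPTS[match[0]] if match else None
```

All six Feature labels in the table resolve to the same identifier. Note that this sketch discards the “(POSSIBLE)” qualifier, which a real mapping would probably want to keep as a separate certainty value, and that the Period column would need its own treatment.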

Mapping links: many-to-many vs. hub architecture

Number of bidirectional links when linking between multiple thesauri
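The arithmetic behind this comparison can be sketched in Python, assuming n partner vocabularies and, in the hub case, one additional hub vocabulary that every partner maps to. Function names are illustrative.

```python
def many_to_many_links(n):
    """Bidirectional links needed when every vocabulary maps to every other."""
    return n * (n - 1) // 2


def hub_links(n):
    """Bidirectional links needed when each vocabulary maps only to a hub."""
    return n


for n in (2, 5, 10):
    print(n, many_to_many_links(n), hub_links(n))
```

Many-to-many linking grows quadratically with the number of vocabularies, while the hub architecture grows linearly, which is the case for a mapping spine.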

Multilingual Mapping Experiment

• Explored the potential of a mediating structure (a ‘mapping spine’) to support search in the ARIADNE Registry across metadata expressed via partner vocabularies in different languages.

• The mapping spine was expressed as a poly-hierarchical structure using RDF (SKOS).

• Experimental mappings from partner vocabulary resources (DAI, DANS/RCE, FASTI, EH, ICCD) to the concept identifiers of the central spine were expressed in RDF using standard SKOS mapping relationships.

Example mappings from ICCD (Italian) vocabulary to mapping spine

    @prefix iccd: <http://www.iccd.beniculturali.it/monuments/> .
    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix aat: <http://vocab.getty.edu/aat/> .

    # NOTE: iccd URIs have been invented for this example
    iccd:catacomba skos:prefLabel "catacomba"@it ;
        skos:closeMatch aat:300000367 .
    iccd:cenotafio skos:prefLabel "cenotafio"@it ;
        skos:closeMatch aat:300007027 .
    iccd:cimitero skos:prefLabel "cimitero"@it ;
        skos:closeMatch aat:300266755 .
    iccd:colombario skos:prefLabel "colombario"@it ;
        skos:closeMatch aat:300000370 .
    iccd:dolmen skos:prefLabel "dolmen"@it ;
        skos:closeMatch aat:300005934 .
    iccd:mausoleo skos:prefLabel "mausoleo"@it ;
        skos:closeMatch aat:300005891 .
    iccd:menhir skos:prefLabel "menhir"@it ;
        skos:closeMatch aat:300006985 .
    iccd:necropoli skos:prefLabel "necropoli"@it ;
        skos:closeMatch aat:300000372 .
    iccd:sepolcreto-rupestre skos:prefLabel "sepolcreto rupestre"@it ;
        skos:closeMatch aat:300387008 .
    iccd:tomba skos:prefLabel "tomba"@it ;
        skos:closeMatch aat:300005926 .

Google Translate (https://translate.google.com/) was used to determine English translations of the ICCD terminology; these terms were then also manually mapped to AAT concepts.


Multilingual Mapping Experiment

• Results from an example query using a concept identifier for “cemetery” from a partner vocabulary are shown, where the search is programmed to locate vocabulary concepts from any partner vocabulary mapped into the mapping spine at that level or below (expanded to more specific concepts).

• The different partner vocabularies can be seen in the prefix to each concept (e.g. iccd is the archaeological site type vocabulary of the Italian ICCD, Istituto Centrale per il Catalogo e la Documentazione).

Multilingual Mapping Results

Cross searching and expanding the mapped vocabularies

• The results show that a query on a concept from one partner (Fasti) vocabulary has located (multilingual) concepts originating from five different controlled vocabularies, all related via the mapping spine (AAT) structure.

• The query has also included semantic expansion to more specific concepts.
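The query logic described here can be sketched in Python with plain dictionaries. Everything in the data is a toy assumption: the `spine:` identifiers, the hierarchy and the partner concepts are invented for illustration and do not reproduce AAT or any partner vocabulary. The experiment itself expressed this in RDF (SKOS), not in Python.

```python
# Toy spine hierarchy (hypothetical): spine concept -> narrower concepts.
NARROWER = {
    "spine:funerary-site": ["spine:cemetery", "spine:barrow"],
    "spine:cemetery": ["spine:catacomb", "spine:necropolis"],
}

# Toy mappings (hypothetical): partner concept -> spine concept.
MAPPINGS = {
    "fasti:cemetery": "spine:cemetery",
    "iccd:cimitero": "spine:cemetery",
    "iccd:catacomba": "spine:catacomb",
    "iccd:necropoli": "spine:necropolis",
    "eh:cemetery": "spine:cemetery",
    "eh:barrow": "spine:barrow",
}


def expand(spine_concept):
    """Return the spine concept plus all of its narrower descendants."""
    result = {spine_concept}
    for child in NARROWER.get(spine_concept, []):
        result |= expand(child)
    return result


def cross_search(partner_concept):
    """Find partner concepts mapped at or below the query's spine concept."""
    targets = expand(MAPPINGS[partner_concept])
    return sorted(p for p, s in MAPPINGS.items() if s in targets)
```

A query on `fasti:cemetery` returns concepts from all three toy vocabularies, including the more specific `iccd:catacomba` and `iccd:necropoli`, but not `eh:barrow`, which sits outside the cemetery branch.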

Standards again

• The experiment is only possible because of the standards-based approach that has been followed by ARIADNE and which underpins Linked Data.

• In the next phase of the Registry development, it would be a straightforward query to find all collection items indexed using any of these multilingual, multi-vocabulary concepts.

Contact Information

Douglas Tudhope
Hypermedia Research Group
Faculty of Computing, Engineering and Science
University of South Wales
Pontypridd CF37 1DL
Wales, UK
[email protected]

Related links

http://www.heritagedata.org

http://data.archaeologydataservice.ac.uk

http://hypermedia.research.glam.ac.uk/kos/STELLAR/

http://hypermedia.research.southwales.ac.uk/kos/


Disclaimer

ARIADNE is a project funded by the European Commission under the Community’s Seventh Framework Programme, contract no. FP7-INFRASTRUCTURES-2012-1-313193.

The views and opinions expressed in this presentation are the sole responsibility of the authors and do not necessarily reflect the views of the European Commission.
