
Europeana Linked Open Data – data.europeana.eu

Antoine Isaac a, Bernhard Haslhofer b
a Europeana, The Hague, The Netherlands
b Cornell Information Science, USA

Abstract. Europeana is a single access point to millions of books, paintings, films, museum objects and archival records that have been digitized throughout Europe. The data.europeana.eu Linked Open Data pilot dataset contains open metadata on approximately 2.4 million texts, images, videos and sounds gathered by Europeana. All metadata are released under Creative Commons CC0 and therefore dedicated to the public domain. The metadata follow the Europeana Data Model, and clients can access data by dereferencing URIs, downloading data dumps, or executing SPARQL queries against the dataset. They can also follow the links to external linked data sources, such as the Swedish cultural heritage aggregator (SOCH), GeoNames, the GEMET thesaurus, or DBpedia. The latest dataset release was published in February 2012.

Keywords: Europeana, Linked Data, Libraries, Cultural Heritage

1. Introduction

Europeana is a single access point to millions of books, paintings, films, museum objects and archival records that have been digitized throughout Europe, gathered from hundreds of individual cultural institutions,1 with the help of dozens of data aggregators and providers. The Europeana Linked Open Data pilot dataset contains open metadata on approximately 2.4 million texts, images, videos and sounds. These collections encompass more than 200 cultural institutions from 15 countries. They cover a great variety of heritage objects, such as a Slovenian version of O Sole Mio from the National Library of Slovenia,2 or memories on the herring business from the Tyne and Wear Archives & Museums in Newcastle.3

1 Around 1500 institutions have contributed to Europeana, including renowned names such as the British Library in London, the Rijksmuseum in Amsterdam and the Louvre in Paris, but also many smaller cultural heritage organizations and libraries across Europe.

2 http://data.europeana.eu/item/92056/BD9D5C6C6B02248F187238E9D7CC09EAF17BEA59

3 http://data.europeana.eu/item/09405f/533F9A826CB038D02C05A9814CF97E5D1B49BBEE

Version 1.1 of the dataset, which is now available at http://data.europeana.eu, was released in February 2012. The data is represented in the Europeana Data Model (EDM), as we explain in more detail in Section 4. It is served according to the Linked Data principles: the described resources are addressable and dereferenceable by their URIs. In particular, depending on its Accept header, an HTTP GET request against a data.europeana.eu URI leads either to an HTML page on the Europeana portal for the object it identifies or to raw, machine-processable data on this object. See http://pro.europeana.eu/tech-details for examples. The data is also available for bulk download at http://data.europeana.eu/download/, where the metadata are organized by dataset version, data provider, and RDF serialization format (RDF/XML, N-Triples). Clients can also execute structured queries against the publicly available SPARQL endpoint: http://europeana-triplestore.isti.cnr.it/sparql.
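The content negotiation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Europeana's own client code: the helper names build_request and fetch are hypothetical, the media type application/rdf+xml is an assumption (chosen because RDF/XML is one of the dump formats), and the service may no longer respond as it did at the time of writing.

```python
from urllib.request import Request, urlopen

# Example item URI taken from footnote 2 of this paper.
ITEM_URI = ("http://data.europeana.eu/item/92056/"
            "BD9D5C6C6B02248F187238E9D7CC09EAF17BEA59")


def build_request(uri: str, accept: str) -> Request:
    """Prepare an HTTP GET request whose Accept header selects the representation."""
    return Request(uri, headers={"Accept": accept}, method="GET")


def fetch(uri: str, accept: str) -> bytes:
    """Send the request and return the response body (performs network I/O)."""
    with urlopen(build_request(uri, accept), timeout=30) as response:
        return response.read()


# Asking for HTML should lead to the portal page for the object; asking for
# RDF/XML (an assumed media type) should lead to machine-processable data.
html_request = build_request(ITEM_URI, "text/html")
rdf_request = build_request(ITEM_URI, "application/rdf+xml")
```

Only the two request objects are built here; calling fetch(ITEM_URI, "application/rdf+xml") would actually contact the server.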

© IOS Press and the authors. All rights reserved.


2. Opening Cultural Data

data.europeana.eu is one of the results of more than one year of campaigning by Europeana to convince its community to open up their metadata.4 Currently it serves metadata coming from 8 data aggregators who reacted early and positively to these efforts and agreed to publish their metadata under the Creative Commons CC0 Public Domain Dedication,5 which means that “[Anyone] can copy, modify, distribute and perform the [data], even for commercial purposes, all without asking permission”.

Including only a subset of the total Europeana collection, which encompasses more than 20M objects at the time of writing, is deliberate. In fact, the first version of our dataset contained metadata for approximately 3.5M objects, but the licensing was not explicit. With 2.4M objects in version 1.1, we clearly favored openness of metadata over quantity.

At the moment, data.europeana.eu serves as a prototype for unlocking metadata, and rights on metadata, on a massive scale. In so-called hackathons (Hack4Europe6) developers can learn about this prototype and other access mechanisms to cultural data: Europeana also has an API and semantic mark-up on pages. We hope they will be used by third parties to develop innovative applications and services. This would in turn help to convince our partners to release more open data, next to other actions such as the release of an animation that bridges Linked Data technology with Open Data policies.7

3. Data Anatomy

3.1. Coverage

As mentioned above, Europeana aggregates metadata about more than 20 million books, paintings, films, museum objects, archival records and other types of cultural objects. data.europeana.eu represents the “public domain” subset of the collections that can be accessed through Europeana. It currently holds metadata about 2,381,745 digitized objects, which were aggregated from 8 aggregators representing 221 individual institutions from 15 countries across Europe. Please note that the following statistics apply to this “open” subset of the total Europeana collection. We also excluded data about 4 objects, which were added to the dataset for illustrative purposes.

Table 1. Open data contribution by country.

Country           Number of objects
Spain             1,468,460
Norway            248,987
Austria           224,147
Sweden            102,850
Belgium           68,516
Denmark           45,041
Germany           40,729
Slovenia          40,281
United Kingdom    39,243
Ireland           33,651
Luxembourg        24,890
Serbia            16,852
Czech Republic    10,849
Italy             9,088
Portugal          8,161

4 See Europeana’s new Data Exchange Agreement and actions in support for open data at http://pro.europeana.eu/support-for-open-data

5 http://creativecommons.org/publicdomain/zero/1.0/

6 http://pro.europeana.eu/hackathons

7 http://vimeo.com/36752317

In Table 1, which shows the “public domain” metadata contribution by country, we can clearly see that institutions from Spain, with 1.47M objects, are currently the major data contributors.

While the 10 largest data providers (see Table 2) contribute 80% of all data (1,902,380 objects), the remaining 20% (479,365 objects) are contributed by the 211 smaller institutions or come from collections for which we do not have explicit information on individual data providers, as is currently the case for the majority of Swedish objects. Two data providers even contribute only a single object to the current dataset.
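The shares quoted above can be re-derived from the per-provider counts listed in Table 2. The following short check is only an illustration; the variable names are ours.

```python
# Object counts of the ten largest data providers, as listed in Table 2.
TOP_TEN = [956_496, 248_368, 223_847, 136_473, 100_775,
           65_567, 45_041, 44_825, 40_729, 40_259]
TOTAL_OBJECTS = 2_381_745  # size of the open subset reported in the text

top_ten_total = sum(TOP_TEN)                    # objects from the ten largest providers
remainder = TOTAL_OBJECTS - top_ten_total       # objects from all other sources
top_ten_share = top_ten_total / TOTAL_OBJECTS   # fraction of the open subset

print(top_ten_total, remainder, f"{top_ten_share:.1%}")
```

The script prints 1902380 479365 79.9%, which matches the rounded 80% and 20% figures given in the text.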

These statistics show the importance of Europeana and of the intermediate data aggregators that contribute to it, such as Hispana (http://hispana.mcu.es) or The European Film Gateway. Distributing the data aggregation effort makes it possible to unify access to objects from a huge diversity of institutions with limited effort. The resources it takes to consume data available at an aggregator are much lower than the effort of setting up a solution at each data provider’s side.

3.2. Data gathering, linkage, and processing

The process of preparing the data for data.europeana.eu has been described in a separate technical paper [1]. The prototype is deployed directly on top of metadata that has already been gathered by Europeana, either via OAI-PMH servers or from batch files. These metadata are formatted according to the Europeana Semantic Elements (ESE) XML Schema,8 which is essentially a flat record structure that uses the Dublin Core Element Set9 with some Europeana extensions.

Table 2. The 10 largest data providers and their aggregators.

Aggregator                 Data provider                                                   Number of objects
Hispana                    Biblioteca Virtual de Prensa Histórica                          956,496
Norsk Kulturråd            Fylkesarkivet i Sogn og Fjordane                                248,368
The European Library       Österreichische Nationalbibliothek - Austrian National Library  223,847
Hispana                    Galiciana: Biblioteca Digital de Galicia                        136,473
Hispana                    Repositorio Biblioteca virtual de Andalucía                     100,775
Hispana                    Gredos (Universidad de Salamanca, Spain)                        65,567
The European Film Gateway  Det Danske Filminstitut                                         45,041
Hispana                    Biblioteca Digital de Madrid                                    44,825
The European Film Gateway  Deutsches Filminstitut - DIF                                    40,729
The European Library       National and University Library of Slovenia                     40,259

For the Europeana Linked Open Data set we converted this ESE metadata into the new Europeana Data Model (EDM),10 which has been developed with a much stronger Linked Data focus. We thus defined a mapping11 between ESE and EDM and implemented it as an executable ESE-EDM transformation library,12 which can be applied to the legacy ESE data.

Parallel to this, we currently follow two strategies for linking data.europeana.eu resources with other Web resources. First, we fetch the semantic enrichment data that Europeana creates after it has ingested metadata from its data providers. This data consists of links to four types of reference resources:13 GeoNames for places (1.7M links), GEMET for general topics (863K links), the Semium time ontology for time periods (1.9M links), and DBpedia for persons (1,304 links). Since the enrichments are links, they fit the EDM and Linked Data approach well, as seen in the following section. Second, as a simple ad hoc linking strategy, we rely on existing resource identifiers that are part of the metadata and create links to other Linked Open Data services that hold information about objects also served by data.europeana.eu. For the moment this only concerns the Swedish cultural heritage aggregator (SOCH).

8 http://pro.europeana.eu/technical-requirements

9 http://dublincore.org

10 http://pro.europeana.eu/edm-documentation

11 http://europeanalabs.eu/wiki/EDMPrototypingTask15

12 https://github.com/behas/ese2edm

13 Accessible respectively at http://www.geonames.org, http://www.eionet.europa.eu/gemet/, http://semium.org and http://dbpedia.org

At the moment we manually execute the ESE-EDM transformation and fetch the enrichment data whenever we release a new dataset version, and ingest the resulting RDF data into a separate triple store. This is clearly a temporary solution, only suitable for a pilot. In the long term, all human- and machine-readable Europeana interfaces, including the Linked Data one, should be directly fed from one single data repository.

4. EDM data modeling patterns

For publishing metadata at data.europeana.eu, we “upgrade” ESE data to the Europeana Data Model (EDM), which has been developed by the Europeana community and is a more flexible and precise model. It offers the opportunity to attach every statement to the specific resource it applies to and also reflects some basic form of data provenance. The main EDM requirements include:

– distinguish between a “provided item” (painting, book) and its digital representations

– distinguish between an item and the metadata record describing it

– allow ingesting multiple records for the same item, containing potentially contradictory statements about it

EDM makes it possible to represent different perspectives on a given cultural object. It also makes it possible to represent complex, especially hierarchically structured, objects as in the archive or library domains. Finally, it allows us to represent contextual information, in the form of entities (places, agents, time periods) explicitly represented in the data and connected to a cultural object.

In the following we explain in more detail the basic structure of EDM networked resources, which is shown in Figure 1, together with the properties we expect to be applied to their instances. Further information, including dereferenceable example resources, is available at http://pro.europeana.eu/web/guest/tech-details.

4.1. Item (Provided Cultural Heritage Object)

Item resources (typed as Provided Cultural Heritage Object (CHO)) represent objects (painting, book, etc.) for which institutions provide representations to be accessed through Europeana. Provided CHO URIs are the main entry points in data.europeana.eu. A Provided CHO is the hub of the network of relevant resources. When applicable (see Section 3.2), the URIs for these objects link, via owl:sameAs statements, to other linked data resources about the same object. In our pilot, no descriptive metadata (creator, subject, etc.) is directly attached to object URIs. It is instead attached to the proxies that represent a view of the object, from a specific institution’s perspective (either a Europeana provider or Europeana itself, see below). Depending on the feedback received during this pilot, we may change this and duplicate all the descriptive metadata at the level of the item URI. Such an option is costly in terms of data verbosity, but it would enable easier access to metadata for data consumers less concerned about provenance.

4.2. Provider’s proxy

Proxies originate from the OAI-ORE model [2] and are used as subjects of descriptive statements (creator, subject, date of creation, etc.) for the item, which are contributed by a Europeana provider. They enable the separation of different views of the same resource, in the context of different aggregations. This allows us to distinguish the original metadata for the object from the metadata that is created by Europeana. Descriptive properties that apply to these proxies, as we can generate them from ESE metadata (see Section 3.2), mostly come from Dublin Core. Proxies are connected to the item they represent a facet of, using the ore:proxyFor property. They are attached to the aggregation that contextualizes them, using the ore:proxyIn relationship. This design was chosen because of the lack of support for named graphs (aka “quadruples”) in the RDF standard. OAI-ORE introduced Proxies in order to support referencing resources in the context of a specific graph. Eventually, named graphs may be natively supported by RDF, which could supersede the Proxy construct.
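The idea of attaching descriptive statements to proxies rather than to the item can be sketched with plain Python tuples standing in for RDF triples. This is an illustrative sketch only: the describe helper, the EXAMPLE identifier and all literal values are hypothetical, and the namespace URIs are the commonly published ones for OAI-ORE, EDM and Dublin Core, which the pilot is assumed to use.

```python
ORE = "http://www.openarchives.org/ore/terms/"
EDM = "http://www.europeana.eu/schemas/edm/"
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical URIs for one item and its two views.
ITEM = "http://data.europeana.eu/item/00000/EXAMPLE"
PROVIDER_PROXY = "http://data.europeana.eu/proxy/provider/00000/EXAMPLE"
EUROPEANA_PROXY = "http://data.europeana.eu/proxy/europeana/00000/EXAMPLE"


def describe(proxy, item, statements):
    """Attach descriptive statements to a proxy instead of to the item itself."""
    triples = [(proxy, ORE + "proxyFor", item)]
    triples += [(proxy, predicate, value) for predicate, value in statements]
    return triples


# Illustrative values only; they are not taken from the dataset.
triples = describe(PROVIDER_PROXY, ITEM, [
    (DC + "title", "Example title"),
    (DC + "creator", "Example creator"),
])
triples += describe(EUROPEANA_PROXY, ITEM, [
    (EDM + "year", "1900"),
])

# Group statements by the view (proxy) they belong to.
by_view = {}
for subject, predicate, value in triples:
    by_view.setdefault(subject, []).append((predicate, value))
```

After running the sketch, by_view has one entry per proxy and none for the item itself, which mirrors the pilot's choice to keep provider metadata and Europeana-created metadata apart.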

4.3. Provider’s aggregation

These resources provide data related to a Europeana provider’s gathering of digitized representations and descriptive metadata for an item. They are related to digital resources about the item, be they files directly representing it (edm:object and edm:isShownBy) or web pages showing the object in context (edm:isShownAt). They may also provide controlled rights information applying to these resources (edm:rights). Finally, provenance data is given in statements using edm:provider (the direct provider to Europeana in the data aggregation chain) or edm:dataProvider (the cultural institution that curates the object). The aggregation is connected to the item using the edm:aggregatedCHO property.

4.4. Europeana’s proxy

Europeana proxies are the second type of proxies served at data.europeana.eu. They provide access to the metadata created by Europeana for a given item, distinct from the original metadata from the provider. Here one can find edm:year statements, indicating a normalized date associated with the object. Proxies also have statements that link them to places, concepts, persons and periods from external datasets, as mentioned in Section 3.2. Finally, a proxy is connected to the item it represents a view of, using the ore:proxyFor property, as well as to the aggregation that contextualizes it, using ore:proxyIn.
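As an illustration of how a client might reach these statements, the sketch below builds a SPARQL query that follows ore:proxyFor from a proxy to its item and reads the normalized edm:year value. It is a sketch under assumptions: the helper sparql_get_url is hypothetical, the EDM namespace URI is assumed to be the one used in the pilot data, and the endpoint named in Section 1 may no longer be reachable. No request is sent.

```python
from urllib.parse import urlencode

# SPARQL endpoint as given in Section 1 of this paper.
ENDPOINT = "http://europeana-triplestore.isti.cnr.it/sparql"

QUERY = """\
PREFIX ore: <http://www.openarchives.org/ore/terms/>
PREFIX edm: <http://www.europeana.eu/schemas/edm/>

SELECT ?item ?year
WHERE {
  ?proxy ore:proxyFor ?item ;
         edm:year ?year .
}
LIMIT 10
"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a query as a SPARQL-protocol GET URL (no request is sent)."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
```

Passing the query as a "query" parameter of a GET request follows the SPARQL protocol; whether the endpoint expects additional parameters (for example a result format) is not stated in this paper.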

4.5. Europeana’s aggregation

A Europeana aggregation bundles together the result of all data creation and aggregation efforts for a given item. It aggregates the provider’s aggregation (using ore:aggregates), which in turn connects to the provider’s proxy. Next to the provider aggregation, one can find the digitized resources europeana.eu serves for the item, i.e., an object page (edm:landingPage) and a thumbnail (using a combination of edm:hasView and foaf:thumbnail). The Europeana proxy is also connected to this aggregation, as mentioned above.

Fig. 1. Basic structure of EDM networked resources. [Figure not reproduced. It shows two groups of resources, “Data Provider Metadata” and “Europeana Metadata”, around the item eulod:item/00000/2AAA3C6DF09F9FAA6F951FC4C4A9CC80B5D4154: an ore:Aggregation (eulod:aggregation/provider/...), an ore:Proxy (eulod:proxy/provider/...), an edm:EuropeanaAggregation (eulod:aggregation/europeana/...) and an ore:Proxy (eulod:proxy/europeana/...). The edge labels are ore:aggregatedCHO, ore:proxyFor, ore:proxyIn and ore:aggregates.]
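The network of five resources described in Sections 4.1 to 4.5 can be sketched as follows. Several points are assumptions rather than facts from this paper: the eulod: prefix of Figure 1 is assumed to expand to http://data.europeana.eu/, the functions mint_uris, structural_triples and to_ntriples are hypothetical, and the sketch follows the text in using edm:aggregatedCHO (the figure labels this edge ore:aggregatedCHO). Connecting both aggregations to the item is read off the two aggregatedCHO edges in the figure.

```python
BASE = "http://data.europeana.eu/"  # assumed expansion of the eulod: prefix
ORE = "http://www.openarchives.org/ore/terms/"
EDM = "http://www.europeana.eu/schemas/edm/"


def mint_uris(local_id: str) -> dict:
    """Derive the five resource URIs for one item, following the Figure 1 pattern."""
    return {
        "item": f"{BASE}item/{local_id}",
        "provider_proxy": f"{BASE}proxy/provider/{local_id}",
        "provider_aggregation": f"{BASE}aggregation/provider/{local_id}",
        "europeana_proxy": f"{BASE}proxy/europeana/{local_id}",
        "europeana_aggregation": f"{BASE}aggregation/europeana/{local_id}",
    }


def structural_triples(u: dict) -> list:
    """Return the links that tie item, proxies and aggregations together."""
    return [
        (u["provider_aggregation"], EDM + "aggregatedCHO", u["item"]),
        (u["europeana_aggregation"], EDM + "aggregatedCHO", u["item"]),
        (u["europeana_aggregation"], ORE + "aggregates", u["provider_aggregation"]),
        (u["provider_proxy"], ORE + "proxyFor", u["item"]),
        (u["provider_proxy"], ORE + "proxyIn", u["provider_aggregation"]),
        (u["europeana_proxy"], ORE + "proxyFor", u["item"]),
        (u["europeana_proxy"], ORE + "proxyIn", u["europeana_aggregation"]),
    ]


def to_ntriples(triples: list) -> str:
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Identifier of the example item shown in Figure 1.
uris = mint_uris("00000/2AAA3C6DF09F9FAA6F951FC4C4A9CC80B5D4154")
links = structural_triples(uris)
document = to_ntriples(links)
```

Seven triples are enough to express the structure; descriptive statements and digital resources (edm:landingPage, edm:isShownAt, and so on) would be added to the proxies and aggregations on top of it.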

4.6. Resource map

OAI-ORE Resource maps are constructs for indicating meta-level statements about the creation and publication of ORE data (ORE aggregations and their aggregated resources). We are exploring their use as a contextualization mechanism for the Europeana aggregation. Maps are connected to an item they are about using foaf:primaryTopic, and to its corresponding Europeana aggregation using ore:describes. They sum up the provenance of metadata using dc:creator and dc:contributor statements. Crucially, they also indicate, in a machine-readable way, that the data.europeana.eu RDF dataset is provided under the CC0 open license.

4.7. Vocabulary usage and interoperability

EDM is well connected to other established ontologies, most notably the Dublin Core metadata elements, SKOS and OAI-ORE. We have tried to directly re-use elements from these vocabularies whenever this was possible. When it was not, the newly introduced elements are semantically aligned to these ontologies, using either simple RDFS class and property specialization or OWL axioms. Such alignments make it possible, for example, to connect EDM to CIDOC-CRM, an important vocabulary for the museum domain.14

5. Known Shortcomings and discussion

Europeana is often confronted with the critique that its “data quality” could be enhanced. In particular, the “internal connectivity” of the dataset is currently very low. We have the Provided CHO - aggregation - proxy relationships that come with the EDM model, but no “semantic” links between the items, or the proxies that represent them.

This is partly because the ESE metadata format, which is based on simple text fields, conceals the richness of the original metadata: many providers use contextual resources, which could be fed into Europeana and provide internal links. This includes, amongst others, concepts from shared domain thesauri, or place resources, which are already used in the description of different objects in a collection or even across collections. This contextual information is lost when the metadata is transferred to Europeana in ESE. We hope to obtain such valuable information from providers when they can submit metadata in EDM. Europeana is currently working on this, and we have case studies15 that demonstrate how this can be done and what the benefits are. In the Amsterdam Museum Linked Open Data prototype,16 for instance, richer original metadata has been converted to EDM and published as Linked Open Data, together with its companion thesaurus and authority file.

14 http://www.cidoc-crm.org

15 http://pro.europeana.eu/edm-case-studies

For achieving “external connectivity”, we currently rely on Europeana’s enrichment process (see Section 3.2), which generates semantic links from specific fields in the ESE data (e.g., Dublin Core’s dc:subject) but does not record which field a link was derived from. As a result, we do not know whether, say, a given city is the subject of an item or its place of production. For our RDF data we had to use an EDM property that merely expresses that the item is “generally linked” to that place. Because it has to deal with very heterogeneous collections, Europeana is bound, for the moment, to using simple data enrichment techniques, which we know will introduce errors. Still, we can do better at handling the provenance of enrichments to obtain finer-grained data.

Another issue is the transition to the network model of EDM, which led to quite verbose data. We may want to “hide” this complexity when it is not needed, or reveal the full complexity and power of EDM in successive steps, which should make the full picture easier to understand for data providers and consumers alike. This important lesson has directly influenced how EDM should be used for data ingestion into Europeana, i.e., with only a limited part of the pattern used for our pilot. It remains open, however, whether and how data.europeana.eu should handle the complexity differently, as a data publication service.

Finally, we needed to start addressing design issues that the existing EDM specification had not touched at all. The first one is the minting of HTTP Uniform Resource Identifiers (URIs) for all EDM resources in a Linked Data environment. We realized that many patterns were possible, each corresponding to slightly different priorities in terms of representing the underlying model or enabling certain HTTP-based services. The second issue is the representation of provenance for the metadata served on data.europeana.eu, including such things as attribution or licenses. All the provenance information available at Europeana could be represented. The way it has been represented, though, may be revisited in the light of ongoing discussions in the community.

16 http://semanticweb.cs.vu.nl/lod/am (a paper has been submitted to this Semantic Web Journal special call)

6. Summary and Future Work

With data.europeana.eu we created a Linked Data prototype for Europeana, which is a single access point to millions of cultural objects that have been digitized throughout Europe. At the moment, it serves metadata of 2.4M objects under the Creative Commons CC0 public domain dedication. The data originate from aggregators and providers who have reacted early and positively to Europeana’s new Data Exchange Agreements. One goal for future work is to convince more data providers to accept these agreements and to increase the number of objects included in Europeana’s Linked Open Data service.

The exposed metadata are represented in the Europeana Data Model (EDM), which has been developed by the Europeana community and makes it possible to represent different perspectives and basic provenance information on a given cultural object. We expect that future data.europeana.eu dataset releases will reflect the lessons we have learned with respect to the model’s complexity and the identification of digital objects. We will also investigate how to align EDM with other efforts dealing with provenance on the Web, such as the PROV model developed by the W3C Provenance Working Group.17

Increasing Europeana’s internal and external connectivity by means of links between Web resources is another major goal. This can be achieved by convincing data providers to deliver their original rich metadata instead of flat ESE records and by applying named entity linkage techniques in the data ingestion phase.

References

[1] B. Haslhofer and A. Isaac, data.europeana.eu - The Europeana Linked Open Data Pilot, International Conference on Dublin Core and Metadata Applications, 2011, The Hague, NL.

[2] C. Lagoze and H. van de Sompel (eds.), http://www.openarchives.org/ore/1.0/primer.html, Accessed: 2012-05-20.

17 http://www.w3.org/TR/prov-primer/