RPI 28/07/2011 1 With the help of the Datalift team And the support of the French National Research Agency The Datalift Project Ontologies, Datasets, Tools and Methodologies to Publish and Interlink ★★★★★ Datasets François Scharffe University of Montpellier, LIRMM, INRIA [email protected] @lechatpito

20110728 datalift-rpi-troy


Page 1: 20110728 datalift-rpi-troy

RPI 28/07/2011 1

With the help of the Datalift team
And the support of the French National Research Agency

The Datalift Project Ontologies, Datasets, Tools and Methodologies to Publish and Interlink ★★★★★ Datasets

François Scharffe
University of Montpellier, LIRMM, INRIA
[email protected] @lechatpito

Page 2: 20110728 datalift-rpi-troy

State of government open data

(September 2010…)

You’re here

Page 3: 20110728 datalift-rpi-troy

(June 2011)

State of government open data

Page 4: 20110728 datalift-rpi-troy

May 2007

April 2008 September 2008

March 2009

September 2010

Linking Open Data

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Page 5: 20110728 datalift-rpi-troy

Link the world

Linked data

Page 6: 20110728 datalift-rpi-troy

W3C

Page 7: 20110728 datalift-rpi-troy

W3C

Page 8: 20110728 datalift-rpi-troy

Tim Berners Lee, http://www.w3.org/DesignIssues/LinkedData.html

principles

§ Use the RDF format

§ Use URIs to name things

§ Use HTTP URIs (URLs) so that one can look up those names

§ Provide useful information (HTML, RDF) when those URIs are dereferenced

§ Include in this information other URIs pointing to other data to enable discovery
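The "look up those names" principle can be sketched from the client side. This is a minimal illustration using only the Python standard library: it builds, but deliberately does not send, an HTTP request that asks for RDF rather than HTML. The function name `build_lookup_request` and the exact media types in the `Accept` header are assumptions made for this example, not something the slides specify.

```python
from urllib.request import Request


def build_lookup_request(uri: str, want_rdf: bool = True) -> Request:
    """Build (but do not send) an HTTP GET for the URI that names a thing.

    The Accept header tells the server which representation we prefer:
    RDF for machines, HTML for people. The media types are assumptions.
    """
    accept = "text/turtle, application/rdf+xml;q=0.9" if want_rdf else "text/html"
    return Request(uri, headers={"Accept": accept}, method="GET")


# The same name (URI) can be looked up by a program or by a browser.
req = build_lookup_request("http://dbpedia.org/resource/Paris")
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup; that step is left out so the sketch runs without network access.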

Page 9: 20110728 datalift-rpi-troy

goal of datalift

from raw published data
to interconnected semantic data

Page 10: 20110728 datalift-rpi-troy

phase 1: opening the data

develop a platform that eases publication

Page 11: 20110728 datalift-rpi-troy

Published and interlinked data on the Web

Applications

Interconnection

Publication infrastructure

Data conversion

Vocabulary selection

Raw data

Welcome aboard the data lift

Page 12: 20110728 datalift-rpi-troy

Example publication process

Geography

Oil industry equipment

SPARQL

Content negotiation

URI de-referencing

Environmental, weather, geological datasets

Page 13: 20110728 datalift-rpi-troy

SemWebPro 18/01/2011 13

1st floor - Selection

Page 14: 20110728 datalift-rpi-troy

Vocabularies of my friends...

Ø What is a (good) vocabulary for linked data ?

§ Usability criteria

Simplicity, visibility, sustainability, integration, coherence …

Ø Different types of vocabularies

§ metadata, reference, domain, generalist …

§ The pillars of Linked Data : Dublin Core, FOAF, SKOS

Ø Good and less good practices

§ Ex : Programmes BBC vs legislation.gov.uk

§ Vocabulary of a Friend : networked vocabularies

Ø Linguistic problems

§ 99% of existing vocabularies are in English

§ Terminological approach: which vocabularies for « Event » or « Organization »?

Page 15: 20110728 datalift-rpi-troy


Did you say « vocabulary »

… And why not « ontology »?

§ « schema » or « metadata schema »?

§ Or « model » (data ? World ?)

Ø All these terms are used and justifiable

They are all « vocabularies »

§ They define types of objects (or classes) and the properties (or attributes) attached to these objects.

§ Types and attributes are logically defined and named using natural language

§ A (semantic) vocabulary is an explicit formalization of concepts existing in natural language

Page 16: 20110728 datalift-rpi-troy

Vocabularies for linked data

Ø Are meant to describe resources in RDF

Ø Are based on one of the standard W3C languages

§ RDF Schema (RDFS)

• For vocabularies without too much logical complexity

§ OWL

• For more complex ontological constructs

§ These two languages are compatible (almost)

Ø They can be composed « ad libitum »

§ One can reuse a few elements of a vocabulary

§ The original semantics have to be followed

Page 17: 20110728 datalift-rpi-troy

What makes a good vocabulary ?

Ø A good vocabulary is a used vocabulary

§ Data published on CKAN give an idea of vocabulary usage

§ Example: list of datasets using FOAF http://xmlns.com/foaf/0.1/

Ø Other usability criteria

§ Simplicity and readability in natural language

§ Elements documentation (definition in natural language)

§ Visibility and sustainability of the publication

§ Flexibility and extensibility

§ Semantic integration (with other vocabularies)

§ Social integration (with the user community)

Page 18: 20110728 datalift-rpi-troy

A vocabulary is also a community

Ø Bad (but common) practice

§ Build a lonely vocabulary

– For example as a research project

– Without basing it on any existing vocabulary

§ Publish it (or not) and then forget about it

§ Not care about its users

Ø A good vocabulary has an organic life

§ Users and use cases

§ Revisions and extensions

§ Like a « natural » vocabulary

Page 19: 20110728 datalift-rpi-troy

Types of vocabularies

Ø Metadata vocabularies

§ Used to annotate other vocabularies

• Dublin Core, VANN, cc REL, Status, VoID

Ø Reference vocabularies

§ Provide « common » classes and properties

• FOAF, Event, Time, Org Ontology

Ø Domain vocabularies

§ Specific to a domain of knowledge

• Geonames, Music Ontology, WildLife Ontology

Ø « general » vocabularies

§ Describe « everything » at an arbitrary detail level

• DBpedia Ontology, Cyc Ontology, SUMO

Page 20: 20110728 datalift-rpi-troy

Vocabulary of a Friend

Øhttp://www.mondeca.com/foaf/voaf

ØA simple vocabulary...

Ø To represent interconnections between vocabularies

Ø A unique entry point to the vocabularies and datasets of the Linked Data Cloud

ØOngoing work in Datalift

Page 21: 20110728 datalift-rpi-troy


2nd floor - Conversion

Page 22: 20110728 datalift-rpi-troy

Reference datasets, URI design

● Providing reference datasets for the French ecosystem: geographical, topological, statistical, political

● Providing URI design guidelines

● Opaque or transparent URIs ?

● Usage of accents in URIs

● Distinction between

Resources: http://dbpedia.org/resource/Paris

Documents: http://dbpedia.org/page/Paris

Data: http://dbpedia.org/data/Paris

… All served with content negotiation
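The distinction between resource, document and data URIs can be sketched from the server side. The function below decides where to send a client that asked for a resource URI, based on its `Accept` header. It is a simplified illustration: the use of a 303 redirect, the list of RDF media types, and the function name `negotiate` are assumptions for this example, and real servers implement fuller Accept-header parsing.

```python
from typing import Tuple

RESOURCE_PREFIX = "http://dbpedia.org/resource/"
RDF_TYPES = {"application/rdf+xml", "text/turtle", "text/n3"}
HTML_TYPES = {"text/html", "application/xhtml+xml", "*/*"}


def negotiate(resource_uri: str, accept_header: str) -> Tuple[int, str]:
    """Return (status, location) for a request on a resource URI."""
    if not resource_uri.startswith(RESOURCE_PREFIX):
        raise ValueError("not a resource URI")
    name = resource_uri[len(RESOURCE_PREFIX):]

    # Parse "type;q=0.8, type2" into (quality, media type) pairs.
    prefs = []
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media, quality = fields[0].lower(), 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        if media:
            prefs.append((quality, media))

    best_rdf = max((q for q, m in prefs if m in RDF_TYPES), default=0.0)
    best_html = max((q for q, m in prefs if m in HTML_TYPES), default=0.0)

    # The thing itself cannot be sent over HTTP, so redirect to a
    # description of it: data for machines, a page for people.
    if best_rdf > 0 and best_rdf >= best_html:
        return 303, "http://dbpedia.org/data/" + name
    return 303, "http://dbpedia.org/page/" + name


print(negotiate("http://dbpedia.org/resource/Paris", "text/turtle"))
print(negotiate("http://dbpedia.org/resource/Paris", "text/html"))
```

Keeping three URIs apart lets statements about the city (the resource) stay separate from statements about the web page or the RDF file that describes it.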

Page 23: 20110728 datalift-rpi-troy

Many tools exist !

csv2rdf4lod

Page 24: 20110728 datalift-rpi-troy

Define a standard transformation from a relational database to RDF

The relational schema is used :

• Cells of a tuple produce triples with a common subject

• Each cell produces an object

• Different tables of a same database are thus linked together

Standard automatic translation of any relational schema to RDF, based on the database dump

Then we can use SPARQL CONSTRUCT queries to adapt vocabularies and URIs.

Direct Mapping from relational database to RDF
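The rules on this slide can be sketched in a few lines of Python. This is an illustrative approximation, not the W3C Direct Mapping algorithm itself: the URI patterns are loosely modelled on the Book/Author example used later in the deck, and the function name `direct_map`, the choice of `ISBN` as primary key and the way foreign keys are declared are assumptions made for the example.

```python
def direct_map(table, primary_key, rows, foreign_keys=None):
    """Turn the rows of one table into (subject, predicate, object) triples.

    foreign_keys maps a column name to (referenced_table, referenced_key).
    """
    foreign_keys = foreign_keys or {}
    triples = []
    for row in rows:
        # All cells of a tuple share one subject, built from the primary key.
        subject = f"<{table}/{primary_key}={row[primary_key]}#_>"
        triples.append((subject, "a", f"<{table}>"))
        for column, value in row.items():
            if value is None:
                continue  # NULL cells produce no triple
            predicate = f"<{table}#{column}>"
            if column in foreign_keys:
                # A foreign key links to the subject of a row in another table.
                ref_table, ref_key = foreign_keys[column]
                obj = f"<{ref_table}/{ref_key}={value}#_>"
            else:
                # Every other cell becomes a literal object.
                escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
                obj = f'"{escaped}"'
            triples.append((subject, predicate, obj))
    return triples


books = [{"ISBN": "0006511409X", "Title": "The Glass Palace",
          "Year": "2000", "Author": "id_xyz"}]
authors = [{"ID": "id_xyz", "Name": "Ghosh, Amitav",
            "Homepage": "http://www.amitavghosh.com"}]

triples = (direct_map("Book", "ISBN", books, {"Author": ("Author", "ID")})
           + direct_map("Author", "ID", authors))
for s, p, o in triples:
    print(s, p, o, ".")
```

The foreign-key case is what links the tables of a database together in the resulting graph; every other cell stays a plain literal.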

Page 25: 20110728 datalift-rpi-troy

Example

Credits Ivan Herman: http://ivan-herman.name/2010/11/19/my-first-mapping-from-direct-mapping/

Page 26: 20110728 datalift-rpi-troy

Example

Credits Ivan Herman: http://ivan-herman.name/2010/11/19/my-first-mapping-from-direct-mapping/

@base <http://book.example/> .

<Book/ID=0006511409X#_> a <Book> ;
    <Book#ISBN> "0006511409X" ;
    <Book#Title> "The Glass Palace" ;
    <Book#Year> "2000" ;
    <Book#Author> <Author/ID=id_xyz#_> .

<Author/ID=id_xyz#_> a <Author> ;
    <Author#ID> "id_xyz" ;
    <Author#Name> "Ghosh, Amitav" ;
    <Author#Homepage> "http://www.amitavghosh.com" .

Simple result, but not satisfying:

● we want to use different vocabulary terms (like a:name)

● the direct mapping produces literal objects most of the time, except when there is a “jump” from one table to another

● the resulting graph should use a blank node for the author, which is not the case in the generated graph

Page 27: 20110728 datalift-rpi-troy

Example

CONSTRUCT {
  ?id a:title ?title ;
      a:year ?year ;
      a:author _:x .
  _:x a:name ?name ;
      a:homepage ?hp .
}
WHERE {
  SELECT (IRI(fn:concat("http://...", ?isbn)) AS ?id) ?title ?year ?name (IRI(?homepage) AS ?hp)
  {
    ?book a <Book> ;
          <Book#ISBN> ?isbn ;
          <Book#Title> ?title ;
          <Book#Year> ?year ;
          <Book#Author> ?author .
    ?author a <Author> ;
            <Author#Name> ?name ;
            <Author#Homepage> ?homepage .
  }
}

Solution: use SPARQL 1.1 CONSTRUCT queries
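To make the effect of such a CONSTRUCT query concrete, here is a pure-Python sketch of the same restructuring applied to direct-mapped triples held in memory. It only mimics the query for illustration and is not a SPARQL engine: the function name `restructure` is hypothetical, the `id_base` argument stands in for the base IRI that the slide elides as "http://...", and the value passed below is a placeholder. As a simplification, each subject is assumed to have at most one value per property.

```python
from itertools import count


def restructure(triples, id_base):
    """Rewrite direct-mapped Book/Author triples into the target vocabulary."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {})[p] = o

    blank_ids = count()
    out = []
    for props in by_subject.values():
        if props.get("a") != "<Book>":
            continue
        try:
            # Like the WHERE clause: every pattern must match, or no output.
            isbn = props["<Book#ISBN>"]
            title = props["<Book#Title>"]
            year = props["<Book#Year>"]
            author = by_subject[props["<Book#Author>"]]
            name = author["<Author#Name>"]
            homepage = author["<Author#Homepage>"]
        except KeyError:
            continue
        if author.get("a") != "<Author>":
            continue
        book_id = f"<{id_base}{isbn}>"      # new IRI minted from the ISBN
        bnode = f"_:b{next(blank_ids)}"     # fresh blank node per solution
        out += [
            (book_id, "a:title", title),
            (book_id, "a:year", year),
            (book_id, "a:author", bnode),
            (bnode, "a:name", name),
            (bnode, "a:homepage", f"<{homepage}>"),  # literal turned into an IRI
        ]
    return out


direct = [
    ("<Book/ID=0006511409X#_>", "a", "<Book>"),
    ("<Book/ID=0006511409X#_>", "<Book#ISBN>", "0006511409X"),
    ("<Book/ID=0006511409X#_>", "<Book#Title>", "The Glass Palace"),
    ("<Book/ID=0006511409X#_>", "<Book#Year>", "2000"),
    ("<Book/ID=0006511409X#_>", "<Book#Author>", "<Author/ID=id_xyz#_>"),
    ("<Author/ID=id_xyz#_>", "a", "<Author>"),
    ("<Author/ID=id_xyz#_>", "<Author#Name>", "Ghosh, Amitav"),
    ("<Author/ID=id_xyz#_>", "<Author#Homepage>", "http://www.amitavghosh.com"),
]

# "http://example.org/book/" is a placeholder base, not the one from the talk.
result = restructure(direct, "http://example.org/book/")
for s, p, o in result:
    print(s, p, o)
```

The three complaints about the direct mapping are addressed in one pass: vocabulary terms are renamed, the homepage becomes an IRI, and the author becomes a blank node.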

Page 28: 20110728 datalift-rpi-troy


3rd floor - Publication

Page 29: 20110728 datalift-rpi-troy

Datalift Platform

V1 to be released in September with expected features :

- Modular architecture

- Raw conversion module: relational DB (Direct Mapping approach), CSV, XML (based on a user-specified XSLT transformation)

- Selection module: LOV repository, automatic candidate vocabulary proposal using ontology matching from the raw data schema, vocabulary navigation tool, vocabulary usage metrics, sample data for each vocabulary

- Conversion (according to the schema): RDF2RDF conversion module based on SPARQL CONSTRUCT (manual editing), vocabulary mapping facility (textual)

- Interlinking and alignment: a Silk interface, integration of the Alignment API

- Publication: Sesame API, management of information vs non-information resources.


Page 30: 20110728 datalift-rpi-troy

Datalift Platform

Page 31: 20110728 datalift-rpi-troy


4th floor - Interconnection

Page 32: 20110728 datalift-rpi-troy


Web of data and links

- Without links there is no Web, only data silos

- Many types of links : the edges of the Web of data graph are labeled

- Some links are built during the selection phase : reference datasets

- We study here a particular type of links : equivalence links.

Page 33: 20110728 datalift-rpi-troy


owl:sameAs

- points to a logical identity between two resources

- The quality of the available links is not always optimal

Other types of links : owl:differentFrom, rdfs:seeAlso
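Because owl:sameAs expresses identity, it is symmetric and transitive, so a set of links partitions resources into clusters of "the same thing". The sketch below computes those clusters with a small union-find structure. The identifiers are made up for the example, and the function name `same_as_clusters` is an assumption.

```python
def same_as_clusters(links):
    """Group resources connected by owl:sameAs links into identity clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identifiers from three datasets a, b and c.
links = [("a:Paris", "b:Paris"), ("b:Paris", "c:Paris"), ("a:Lyon", "b:Lyon")]
print(same_as_clusters(links))

# One wrong link is enough to merge two otherwise distinct clusters.
print(same_as_clusters(links + [("c:Paris", "a:Lyon")]))
```

The second call illustrates why link quality matters: a single incorrect equivalence makes every resource in both clusters identical from a reasoner's point of view.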

Page 34: 20110728 datalift-rpi-troy


How to link data ?

Page 35: 20110728 datalift-rpi-troy


How to link data ?

Page 36: 20110728 datalift-rpi-troy


How to link data ?

Page 37: 20110728 datalift-rpi-troy


How to link data ?

Page 38: 20110728 datalift-rpi-troy


How to link data ?

Page 39: 20110728 datalift-rpi-troy


Example Silk link specification

<Silk>
  <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
  <Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" />
  <Prefix id="gn" namespace="http://www.geonames.org/ontology#" />

  <DataSource id="dbpedia">
    <EndpointURI>http://demo_sparql_server1/sparql</EndpointURI>
    <Graph>http://dbpedia.org</Graph>
  </DataSource>

  <DataSource id="geonames">
    <EndpointURI>http://demo_sparql_server2/sparql</EndpointURI>
    <Graph>http://sws.geonames.org/</Graph>
  </DataSource>

  <Thresholds accept="0.9" verify="0.7" />
  <Output acceptedLinks="accepted_links.n3" verifyLinks="verify_links.n3" mode="truncate" />

  <Interlink id="cities">
    <LinkType>owl:sameAs</LinkType>
    <SourceDataset dataSource="dbpedia" var="a">
      <RestrictTo> ?a rdf:type dbpedia:City </RestrictTo>
    </SourceDataset>
    <TargetDataset dataSource="geonames" var="b">
      <RestrictTo> ?b rdf:type gn:P </RestrictTo>
    </TargetDataset>
    <LinkCondition>
      <AVG>
        <Compare metric="jaroSimilarity">
          <Param name="str1" path="?a/rdfs:label" />
          <Param name="str2" path="?b/gn:name" />
        </Compare>
        <Compare metric="numSimilarity">
          <Param name="num1" path="?a/dbpedia:populationTotal" />
          <Param name="num2" path="?b/gn:population" />
        </Compare>
      </AVG>
    </LinkCondition>
  </Interlink>
</Silk>
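The link condition in this specification averages a string similarity on names and a numeric similarity on populations, then compares the result with the accept (0.9) and verify (0.7) thresholds. The sketch below reproduces that logic in plain Python under stated assumptions: `difflib.SequenceMatcher` is swapped in for Silk's Jaro metric, the numeric similarity is a simple relative-difference formula rather than Silk's own, and the city records are invented values.

```python
from difflib import SequenceMatcher

ACCEPT, VERIFY = 0.9, 0.7  # thresholds taken from the <Thresholds> element


def string_similarity(a: str, b: str) -> float:
    """Stand-in for jaroSimilarity (uses difflib's ratio, not Jaro)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def num_similarity(x: float, y: float) -> float:
    """Assumed numeric metric: 1 minus the relative difference."""
    if x == y:
        return 1.0
    largest = max(abs(x), abs(y))
    return max(0.0, 1.0 - abs(x - y) / largest)


def classify(source: dict, target: dict):
    """Average both comparisons (<AVG>) and apply the thresholds."""
    score = (string_similarity(source["label"], target["name"])
             + num_similarity(source["population"], target["population"])) / 2
    if score >= ACCEPT:
        return "accept", score   # would go to accepted_links.n3
    if score >= VERIFY:
        return "verify", score   # would go to verify_links.n3 for review
    return "reject", score


# Invented records standing in for a DBpedia city and a Geonames place.
city = {"label": "Paris", "population": 2000000}
same_place = {"name": "Paris", "population": 2000000}
other_place = {"name": "Tokyo", "population": 100}

print(classify(city, same_place))
print(classify(city, other_place)[0])
```

Averaging means a perfect name match cannot compensate for a wildly different population; other aggregations (maximum, weighted average) would trade precision for recall differently.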

Page 40: 20110728 datalift-rpi-troy


Where to find links ?

Page 41: 20110728 datalift-rpi-troy


Towards automatic interlinking

We have seen that some of the Silk specification fields need not be written by hand

- Using alignments between ontologies

- Detecting discriminating properties

- Indicating comparison methods by attaching metadata to ontologies

-> … ongoing work in Datalift
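One simple way to approach "detecting discriminating properties" is to measure, for each property, how many distinct values it takes relative to the number of instances that carry it: a property whose values are nearly all different is a good candidate for comparing resources. The sketch below is only one possible heuristic, not a description of what Datalift implements; the function names and the sample records are assumptions.

```python
def discriminability(instances, prop):
    """Ratio of distinct values to instances having the property (0.0 to 1.0)."""
    values = [inst[prop] for inst in instances if inst.get(prop) is not None]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def rank_properties(instances):
    """Order properties from most to least discriminating."""
    props = sorted({p for inst in instances for p in inst})
    return sorted(props, key=lambda p: discriminability(instances, p), reverse=True)


# Invented sample: names identify a city, the country code does not.
cities = [
    {"name": "Paris", "country": "FR"},
    {"name": "Lyon", "country": "FR"},
    {"name": "Troy", "country": "US"},
]

print(rank_properties(cities))
print(discriminability(cities, "name"), discriminability(cities, "country"))
```

A ranking like this could suggest which properties to place in a link condition, leaving the choice of comparison metric to alignments or to metadata attached to the ontologies.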

Page 42: 20110728 datalift-rpi-troy


5th floor - Applications

Page 43: 20110728 datalift-rpi-troy

phase 2: publishing datasets

validate the platform with real data

Page 44: 20110728 datalift-rpi-troy

Research objectives

§ Methods and metrics for selecting schemas

§ Tradeoff between specific and generic vocabularies

§ Data conversion and URI design patterns

§ Automatic data interlinking

§ Provenance and rights management

§ Integration, architecture and scalability

Page 45: 20110728 datalift-rpi-troy

W3C ©

Who ?

2010-2013

Page 46: 20110728 datalift-rpi-troy

http://labs.mondeca.com/dataset/lov/index.html

Page 47: 20110728 datalift-rpi-troy

http://labs.mondeca.com/vocab/voaf/

Page 48: 20110728 datalift-rpi-troy

The wider French landscape

● Regards Citoyens

● Direction de l’information légale et administrative

● Fédération des parcs naturels régionaux de France

● Eurostat

● Cities of Montpellier, Bordeaux, Rennes, …

● Data Publica

● Etalab

Page 49: 20110728 datalift-rpi-troy
Page 50: 20110728 datalift-rpi-troy
Page 51: 20110728 datalift-rpi-troy

LIRMM D2R Server
http://data.lirmm.fr/nosdeputes/

Page 52: 20110728 datalift-rpi-troy
Page 53: 20110728 datalift-rpi-troy
Page 54: 20110728 datalift-rpi-troy

DATALIFT

next floor: « the web of data »

Page 55: 20110728 datalift-rpi-troy


Credits

This presentation was made possible by the work of the Datalift team. It can be freely distributed under the Creative Commons licence BY-NC-SA 3.0