Data Sets, Vocabularies and Tools Pablo N. Mendes Freie Universität Berlin 1st year review Luxembourg, December 2011 11/02/11


Page 1: Data Sets, Vocabularies and Tools

Data Sets, Vocabularies and Tools

Pablo N. Mendes, Freie Universität Berlin

1st year review, Luxembourg, December 2011

11/02/11

Page 2: Data Sets, Vocabularies and Tools

Work Plan View WP4

Timeline (project months): 0, 6, 12, 18, 24, 30, 36, 42, 48

Partners shown: FUB, KIT, UPM

Tasks
• Task 4.1 Assembly and maintenance of the PlanetData data set catalogue
• Task 4.2 Community-driven creation and maintenance of vocabularies
• Task 4.3 Development of best practices for providing self-describing data
• Task 4.4 Assembly and maintenance of a catalogue of data provisioning tools

Deliverables
• D4.1 Assembly and maintenance of the PlanetData data set catalogue
• D4.2 Best practices on how to provide self-describing data
• D4.3 PlanetData data sets, vocabularies and provisioning tools catalogue and access portal
• D4.4 Data quality benchmark dataset
• D4.5 PlanetData data sets, vocabularies and provisioning tools catalogue and access portal

Page 3: Data Sets, Vocabularies and Tools

Work Plan View WP5

Timeline (project months): 0, 6, 12, 18, 24, 30, 36, 42, 48

Partners shown: EPFL, KIT

Tasks
• Task 5.1 Assembly and maintenance of PlanetData technology catalogue
• Task 5.2 Development of best practices of large-scale data management infrastructures

Deliverables
• D5.1 PlanetData data management tools catalogue and access portal
• D5.2 Best practices on how to deploy tools on large-scale infrastructures
• D5.3 PlanetData data management tools catalogue and access portal

Page 4: Data Sets, Vocabularies and Tools

Summary

WP4

Assembly and maintenance of the PlanetData data set, vocabularies and tools catalogue;

Community-driven creation and maintenance of vocabularies;

Development of best practices;

WP5

Assembly and maintenance of the PlanetData technology catalogue;

Best practices for large-scale data management infrastructure;

Page 5: Data Sets, Vocabularies and Tools

Deliverables in Year 1

D4.1
• Data Sets Catalog
• Vocabularies Catalog

D5.1
• Data Management Tools Catalog

Page 6: Data Sets, Vocabularies and Tools

Data Sets Catalog

• Where to maintain the catalog?

• How to catalog?

• What to catalog?

• How to provide access for humans and machines?

• How to organize a community around the catalog?

Page 7: Data Sets, Vocabularies and Tools

Repository: TheDataHub.org

Maintained by Open Knowledge Foundation (OKF) and world-wide open data community

Widely used catalog
• Dec 1st 2011: 2418 datasets, 314 LOD

Features of the portal:
• Tagging, Rating, Feedback, Discussions, Groups

Page 8: Data Sets, Vocabularies and Tools

Cataloguing Process

• Planet Data Editor

• Collected a list of new datasets → 49 new entries

• Updated existing entries (537 edits)

• Crowdsourcing: data providers and third parties

• Public call for action to mailing lists, OKFN blog

• Supported the community contributions

• Quality Assurance

• Tools to support cataloguing (validator, auto-complete)

• Joint work with LATC

Page 9: Data Sets, Vocabularies and Tools

Catalog Metadata QuickRef

What?
• package name, title, url
• tag:lod
• topic
• shortname
• format-*

Who?
• author || maintainer
• published by
• producer
• provenance metadata
• license

When?
• version
• last updated

Why?
• package description

Where to find?
• example URI
• downloads/dumps
• SPARQL endpoint

How much?
• triples
• links:* (outlinks)
• namespace (inlinks)
• vocab mappings

Page 10: Data Sets, Vocabularies and Tools

http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation

How are datasets described?

Catalog Metadata

Resources:
• example URIs
• SPARQL endpoint
• RDF Dumps
• Sitemaps, VoID files
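The resource types listed on this slide correspond closely to properties of the VoID vocabulary. The following is a minimal sketch of how a catalog entry could be rendered as a VoID description in Turtle. The function name `void_turtle`, its parameters, and the `example.org` URIs are hypothetical and are not taken from the catalog tooling; only the `void:` and `dcterms:` property names come from the published vocabularies.

```python
def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def void_turtle(dataset_uri, title, example_uri=None,
                sparql_endpoint=None, dumps=(), triples=None):
    """Render a small VoID description for one dataset (illustrative only)."""
    header = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
    ]
    # Only emit the properties for which the catalog entry has a value.
    props = [f'    dcterms:title "{escape_literal(title)}"']
    if example_uri:
        props.append(f"    void:exampleResource <{example_uri}>")
    if sparql_endpoint:
        props.append(f"    void:sparqlEndpoint <{sparql_endpoint}>")
    for dump in dumps:
        props.append(f"    void:dataDump <{dump}>")
    if triples is not None:
        props.append(f"    void:triples {int(triples)}")
    return "\n".join(header) + "\n" + " ;\n".join(props) + " .\n"


if __name__ == "__main__":
    print(void_turtle(
        "http://example.org/dataset",          # hypothetical dataset URI
        "Example dataset",
        example_uri="http://example.org/resource/1",
        sparql_endpoint="http://example.org/sparql",
        dumps=["http://example.org/dump.nt.gz"],
        triples=1000,
    ))
```

Sitemaps are not covered by this sketch, since they are a separate XML format rather than part of VoID.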

Page 11: Data Sets, Vocabularies and Tools

Cataloguing process overview

Page 12: Data Sets, Vocabularies and Tools

Catalog Entry Validator

Checks levels of metadata completeness

Step-by-step annotation instructions

Already checks some quality indicators, e.g. availability, provenance, access methods

http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php
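A check for levels of metadata completeness can be sketched as follows. This is not the validator's actual code: the level names, the grouping of fields into levels, and the flat dictionary layout of an entry are assumptions made for illustration, loosely following the fields on the Catalog Metadata QuickRef slide.

```python
# Hypothetical completeness levels. Each requirement is a tuple of
# alternative field names; any one non-empty alternative satisfies it.
LEVELS = [
    ("basic", [("name",), ("title",), ("url",), ("description",)]),
    ("provenance", [("author", "maintainer"), ("license",)]),
    ("access", [("example_uri",), ("sparql_endpoint", "dump")]),
    ("statistics", [("triples",), ("links",)]),
]


def missing_fields(entry, required):
    """Return the requirements that the entry does not satisfy.

    Empty or zero values count as missing in this sketch.
    """
    missing = []
    for alternatives in required:
        if not any(entry.get(key) for key in alternatives):
            missing.append(" or ".join(alternatives))
    return missing


def completeness_level(entry):
    """Return (highest consecutive level reached, per-level missing fields)."""
    reached = 0
    report = {}
    for index, (name, required) in enumerate(LEVELS, start=1):
        missing = missing_fields(entry, required)
        report[name] = missing
        # A level only counts if every lower level is also complete.
        if not missing and reached == index - 1:
            reached = index
    return reached, report


if __name__ == "__main__":
    entry = {
        "name": "example-dataset",
        "title": "Example dataset",
        "url": "http://example.org/",
        "description": "A made-up entry.",
        "maintainer": "Jane Doe",
        "license": "cc-by",
    }
    level, report = completeness_level(entry)
    print(level, report)
```

The per-level report is what would let a tool give step-by-step annotation instructions: it names exactly which fields to add to reach the next level.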

Page 13: Data Sets, Vocabularies and Tools

CKAN Entry Validator (2)

Page 14: Data Sets, Vocabularies and Tools

Auto-completion scripts

For the entries that pass the validator, we can auto-complete metadata with information such as:
• Number of triples
• Links to other sources
• Vocabularies used
• Quality indicators
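As an illustration of the first three items, the sketch below derives a triple count, an outlink count, and the vocabulary namespaces in use from N-Triples lines. It is not the project's auto-completion script: the function names are hypothetical, the line parser is deliberately simplistic (it assumes well-formed N-Triples with whitespace-separated terms), and treating `rdf:type` statements as non-links is an assumption.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def parse_ntriple(line):
    """Split one N-Triples line into (subject, predicate, object) terms."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    subject, predicate, rest = line.split(None, 2)
    obj = rest.rstrip()
    if obj.endswith("."):          # drop the statement terminator
        obj = obj[:-1].rstrip()
    return subject, predicate, obj


def uri_value(term):
    """Return the URI inside <...>, or None for literals and blank nodes."""
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]
    return None


def namespace_of(uri):
    """Namespace = everything up to and including the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1]


def autocomplete_stats(lines, dataset_namespace):
    """Count triples, outlinks, and predicate namespaces for one dataset."""
    triples = 0
    outlinks = 0
    vocabularies = Counter()
    for line in lines:
        parsed = parse_ntriple(line)
        if parsed is None:
            continue
        subject, predicate, obj = parsed
        triples += 1
        p_uri = uri_value(predicate)
        if p_uri:
            vocabularies[namespace_of(p_uri)] += 1
        s_uri, o_uri = uri_value(subject), uri_value(obj)
        # Outlink: local subject pointing at a URI outside the namespace.
        if (p_uri != RDF_TYPE and s_uri and o_uri
                and s_uri.startswith(dataset_namespace)
                and not o_uri.startswith(dataset_namespace)):
            outlinks += 1
    return {"triples": triples, "outlinks": outlinks,
            "vocabularies": dict(vocabularies)}
```

For large dumps a streaming RDF parser would be the safer choice; the hand-written splitter here only keeps the example free of third-party dependencies.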

Page 15: Data Sets, Vocabularies and Tools

Catalog Access Portal

For machines
• CKAN API (continuously improved by OKFN)
• VoID descriptions for LOD group (will be continuously improved in cooperation with LATC)

For humans
• LOD Cloud Diagram
• State of the LOD Report
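For machine access, a client would typically request one package record from the CKAN API and read its tags and resources. The sketch below assumes a CKAN instance that exposes the Action API (version 3) with its `package_show` action; the API version offered by the portal at the time of this review may have differed, and the base URL `https://ckan.example.org` and the helper names are hypothetical. No network request is made here; the functions only build the URL and interpret a response body.

```python
import json
from urllib.parse import urlencode


def package_show_url(base_url, package_name):
    """Build the Action API URL that returns one package (dataset) record."""
    query = urlencode({"id": package_name})
    return f"{base_url.rstrip('/')}/api/3/action/package_show?{query}"


def parse_package_response(body):
    """Extract a few catalog fields from a package_show JSON response."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    result = payload["result"]
    return {
        "name": result.get("name"),
        "title": result.get("title"),
        "tags": [tag["name"] for tag in result.get("tags", [])],
        "resources": [(res.get("format"), res.get("url"))
                      for res in result.get("resources", [])],
    }


if __name__ == "__main__":
    print(package_show_url("https://ckan.example.org", "example-dataset"))
    sample = json.dumps({
        "success": True,
        "result": {
            "name": "example-dataset",
            "title": "Example dataset",
            "tags": [{"name": "lod"}],
            "resources": [{"format": "api/sparql",
                           "url": "http://example.org/sparql"}],
        },
    })
    print(parse_package_response(sample))
```

Fetching the URL could be done with `urllib.request.urlopen` and passing the decoded body to `parse_package_response`.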

Page 16: Data Sets, Vocabularies and Tools

LOD Cloud Diagram

Page 17: Data Sets, Vocabularies and Tools

LOD Cloud Diagram (zoom in)

Page 18: Data Sets, Vocabularies and Tools

State of the LOD Cloud

Charts: Triples by domain; Links by domain

Domain                  # of datasets  Triples          %        (Out-)Links   %
Media                   25             1,841,852,061    5.82 %   50,440,705    10.01 %
Geographic              31             6,145,532,484    19.43 %  35,812,328    7.11 %
Government              49             13,315,009,400   42.09 %  19,343,519    3.84 %
Publications            87             2,950,720,693    9.33 %   139,925,218   27.76 %
Cross-domain            41             4,184,635,715    13.23 %  63,183,065    12.54 %
Life sciences           41             3,036,336,004    9.60 %   191,844,090   38.06 %
User-generated content  20             134,127,413      0.42 %   3,449,143     0.68 %
Total                   295            31,634,213,770            503,998,829

http://www4.wiwiss.fu-berlin.de/lodcloud/state/
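The percentage columns appear to be each domain's share of the totals stated in the last row. A short sketch of that arithmetic, using the figures from the slide (the helper name `share_percent` is hypothetical):

```python
# (domain, triples, out-links) as given on the slide.
DOMAINS = [
    ("Media", 1_841_852_061, 50_440_705),
    ("Geographic", 6_145_532_484, 35_812_328),
    ("Government", 13_315_009_400, 19_343_519),
    ("Publications", 2_950_720_693, 139_925_218),
    ("Cross-domain", 4_184_635_715, 63_183_065),
    ("Life sciences", 3_036_336_004, 191_844_090),
    ("User-generated content", 134_127_413, 3_449_143),
]

# Totals as stated in the last row of the slide's table.
STATED_TOTAL_TRIPLES = 31_634_213_770
STATED_TOTAL_LINKS = 503_998_829


def share_percent(value, total):
    """Share of `value` in `total`, in percent, rounded to two decimals."""
    return round(100.0 * value / total, 2)


if __name__ == "__main__":
    for domain, triples, links in DOMAINS:
        print(f"{domain:24s}"
              f"{share_percent(triples, STATED_TOTAL_TRIPLES):7.2f} %"
              f"{share_percent(links, STATED_TOTAL_LINKS):8.2f} %")
```

The shares are taken against the stated totals rather than against the column sums.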

Page 19: Data Sets, Vocabularies and Tools

State of the LOD Cloud (2)

• SPARQL Endpoint: 68.14 %
• RDF Dumps: 39.66 %
• Provide provenance: 36.63 %
• Provide licensing: 17.84 %

vocabulary use:

Page 20: Data Sets, Vocabularies and Tools

Vocabularies Catalog

• Based on BTC Dataset (2.1 billion triples)
• Shows vocabulary usage in practice
• Executed on a 54 node Hadoop cluster

Access portal:
• Searchable
• URI Lookup
• Top usage statistics

Hosted at http://vocab.cc
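Usage statistics of this kind are a natural fit for a map/reduce computation: the map step emits one record per property use and per class use, and the reduce step sums them. The sketch below shows that idea in plain, single-process Python. It is not the Hadoop job behind vocab.cc, and the function names and the `(kind, term)` key layout are assumptions.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def map_triple(subject, predicate, obj):
    """Map step: emit a key for the property and, for rdf:type, the class."""
    yield ("property", predicate)
    if predicate == RDF_TYPE:
        yield ("class", obj)


def reduce_counts(keys):
    """Reduce step: sum the occurrences of each emitted key."""
    counts = Counter()
    for key in keys:
        counts[key] += 1
    return counts


def top_terms(triples, kind, k=3):
    """Return the k most used terms of one kind ('property' or 'class')."""
    counts = reduce_counts(
        key for triple in triples for key in map_triple(*triple)
    )
    ranked = [(term, n) for (term_kind, term), n in counts.items()
              if term_kind == kind]
    # Sort by descending count, then by term for a stable order on ties.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```

On a cluster the same two steps would run in parallel over partitions of the triples, with the framework grouping the emitted keys between map and reduce.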

Page 21: Data Sets, Vocabularies and Tools

Top Classes per Dataset

Page 22: Data Sets, Vocabularies and Tools

Top Properties per Dataset

Page 23: Data Sets, Vocabularies and Tools

Vocabularies Catalog

vocab.cc search query results

vocab.cc URI Lookup Results

Page 24: Data Sets, Vocabularies and Tools

Tools Catalog

• Initial focus on tools from the consortium

• Currently 15 tools

Entry for Global Sensor Networks (GSN)

Available from planet-data.eu

Page 25: Data Sets, Vocabularies and Tools

Tools Description

• Textual description
• What is it?
• Documentation
• Publications
• Requirements
• License
• Contact person/mailing list
• Organization
• Events

Tags
• Produce
• Publish
• Consume
• Provisioning

Page 26: Data Sets, Vocabularies and Tools

Names of Tools in the Catalog

CumulusRDF

D2R

DBpedia Spotlight

GSN (Global Sensor Networks)

Geometry2RDF

LDIF

LDSpider (Linked Data Spider)

LarKC (Large Knowledge Collider)

MonetDB

NOR2O

R2O&ODEMapster

OKKAM

Pubby

R2R

S2O

Silk

Page 27: Data Sets, Vocabularies and Tools

Tools Catalog

Related: LATC Tools Catalog
• 11 tools
• 5 tools in both, 10 new tools in PlanetData

Proposal for next year:
• Join catalogs at linkeddata.org
• Jointly maintain catalog until LATC finishes
• Build a community → people can add their own tools
• Afterwards PlanetData takes over and maintains the catalog for another 2 years