Herbert Van de Sompel & Miel Vander Sande, CNI Spring Meeting, San Antonio, TX, April 5 2016
Access to DBpedia Versions using Memento and Triple Pattern Fragments
Herbert Van de Sompel (@hvdsomp), Los Alamos National Laboratory
Miel Vander Sande (@Miel_vds), Ghent University
Acknowledgments: Lyudmila Balakireva, Harihar Shankar, Ruben Verborgh
Outline
• Prelude: Memento and Linked Data
• First Generation DBpedia Archive
• Devising Affordable/Useful Linked Data Archives
• Intermezzo: Triple Pattern Fragments (TPF)
• Intermezzo: Binary RDF Representation (HDT)
• Devising Affordable/Useful Linked Data Archives
• Second Generation DBpedia Archive
• Try this At Home
Memento Framework
Memento LDOW 2010 Submission
Herbert Van de Sompel et al. (2010) An HTTP-Based Versioning Mechanism for Linked Data. http://arxiv.org/abs/1003.3661
Memento and Linked Data
Time-Series Analysis across DBpedia Versions
Data collected through “follow your nose” HTTP Navigation
First Generation DBpedia Archive: Storage
First Generation DBpedia Archive: Storage Characteristics
upload software: custom
upload time: ~24 hours per version
storage software: MongoDB
storage space: 383 GB for 10 versions
DBpedia versions: 10 versions, 2.0 through 3.9
number of triples: ~3 billion
First Generation DBpedia Archive: Subject-URI Access
http://dbpedia.mementodepot.org/memento/2009052/http://dbpedia.org/page/Oaxaca
First Generation DBpedia Archive: Subject-URI Access Characteristics
TimeGate software: custom
access type: Subject-URI & datetime
external integration: current DBpedia
clients:
• all clients: direct access to Memento Subject-URI
• Memento clients: datetime negotiation with Subject-URI
DBpedia Archive @ LANL Since 2010
• Access based on Subject-URI (DBpedia Topic URI) only
• MongoDB storage
• A blob per Subject-URI per version
• Dynamically transformed to other RDF serializations
• No updates since version 3.9 (2013) of DBpedia as a result of scalability problems
Affordable & Useful Linked Data Archives
• A Linked Data Archive consists of temporal snapshots of one or more Linked Data sets, whereby each temporal snapshot reflects the state of a Linked Data set at a specific moment or interval in time.
• How to make Linked Data Archives accessible in a manner that is
  • affordable/sustainable for the publisher
  • useful for the consumer
Linked Data Archive: Characteristics
General Characteristics Publisher Consumer
Availability
Bandwidth
Cost
Functionality
Interface Expressiveness
LOD Integration
Memento Support
Cross Time/Data
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Linked Data Publishing
• The typical ways of publishing Linked Data on the Web:
  • Subject URI access
  • Data dump
  • SPARQL endpoint
Let’s consider these from the perspective of Linked Data Archives, i.e. archival storage and access
Linked Data Archive with Subject-URI Access
• For each temporal snapshot of a Linked Data set, and for each Subject in that snapshot, publish an RDF description (of the Subject) at a URI that is specific per snapshot/subject
Linked Data Archive with Subject-URI Access: Characteristics
General Characteristics     Publisher        Consumer
Availability                rather high      rather high
Bandwidth                   ~ description    ~ description
Cost                        rather low       rather high
Functionality
Interface Expressiveness    rather low
LOD Integration             yes
Memento Support             possible
Cross Time/Data             follow your nose
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Linked Data Archive Using Dumps
• For each temporal snapshot of a Linked Data set, publish a data dump that places all triples of the data set (as they were at a specific moment in time) into one or more files
Linked Data Archive Using Dumps: Characteristics
General Characteristics     Publisher    Consumer
Availability                high         high
Bandwidth                   high         high
Cost                        low          high
Functionality
Interface Expressiveness    download dataset
LOD Integration             no
Memento Support             not possible
Cross Time/Data             download various datasets
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Linked Data Archive with SPARQL Endpoint(s)
• For each temporal snapshot of a Linked Data set, support arbitrary SPARQL queries
• Different architectural set-ups possible; no standard approach
Linked Data Archive Using SPARQL Endpoint(s): Characteristics
General Characteristics     Publisher      Consumer
Availability                problematic    problematic
Bandwidth                   ~ query        ~ query
Cost                        high           low
Functionality
Interface Expressiveness    highly expressive
LOD Integration             no
Memento Support             hard
Cross Time/Data             custom distributed queries
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Affordable & Useful Linked Data Archives
Linked Data Archive Type    Publishing    Consuming
Data Dump                   $$$$          ++++
SPARQL Endpoint(s)          $$$$          ++++
Subject URI Access          $$$$          ++++
Linked Data Fragments (Ghent U)
• Every Linked Data interface offers specific fragments of a Linked Data set
• A fragment is described by
  • Selector: what questions can I ask?
  • Controls: how do I get more fragments?
  • Metadata: helpful information for consumption?
• Each interface type comes with tradeoffs
  • cf. the analysis thus far
http://linkeddatafragments.org
Verborgh, R. et al. (2014) Querying datasets on the web with high availability. ISWC 2014. http://ruben.verborgh.org/publications/verborgh_iswc_2014/
Triple Pattern Fragments (Ghent U)
• Triple Pattern Fragments is a new interface with a different set of tradeoffs that are attractive from an archival perspective
http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/
• Allows querying a Linked Data set according to ?Subject ?Predicate ?Object patterns
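To make the triple pattern selector concrete, here is a minimal Python sketch of pattern matching over an in-memory list of triples. The sample triples, the prefixed names, and the `match` helper are illustrative assumptions, not part of the TPF specification or of DBpedia; `None` stands for an unbound variable.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical sample data, loosely modelled on DBpedia-style names.
TRIPLES = [
    ("dbr:Person_A", "dbo:birthPlace", "dbr:Ghent"),
    ("dbr:Person_B", "dbo:birthPlace", "dbr:Ghent"),
    ("dbr:Person_A", "dbo:occupation", "dbr:Painter"),
]


def match(
    triples: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with all bound positions of the pattern."""
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


# Pattern "?s dbo:birthPlace dbr:Ghent": subject unbound, predicate and object bound.
born_in_ghent = list(match(TRIPLES, predicate="dbo:birthPlace", obj="dbr:Ghent"))
```

A pattern with all three positions unbound selects the whole data set, which is why a real interface would page its responses.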
Controls: responses provide navigational help for clients
• Based on the emerging Hydra vocabulary for self-describing Hypermedia-Driven Web APIs
Metadata: dataset info, estimated count (to aid client applications)
http://www.hydra-cg.com/spec/latest/core/
Binary RDF Representation for Publication and Exchange (HDT)
http://www.w3.org/Submission/HDT/
• Header-Dictionary-Triples (HDT) is a compact, binary representation of RDF datasets.
• Able to represent massive data sets
• Dictionary/Triples structure achieves
  • rapid search for ?subject ?predicate ?object patterns
  • high compression rates
• Header provides metadata about the dataset
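The idea behind the Dictionary/Triples split can be illustrated with a toy Python sketch: each distinct term is stored once and given an integer ID, and the triples are kept as sorted ID tuples. This only illustrates the principle, assuming a single shared dictionary; the actual HDT binary layout is more elaborate, and the names `encode` and `decode` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
IdTriple = Tuple[int, int, int]


def encode(triples: List[Triple]) -> Tuple[Dict[str, int], List[IdTriple]]:
    """Map each distinct term to an integer ID and rewrite triples as ID tuples."""
    dictionary: Dict[str, int] = {}
    id_triples: List[IdTriple] = []
    for triple in triples:
        ids = []
        for term in triple:
            if term not in dictionary:
                dictionary[term] = len(dictionary) + 1  # IDs start at 1
            ids.append(dictionary[term])
        id_triples.append((ids[0], ids[1], ids[2]))
    # Sorting groups the ID triples by subject, then predicate, then object.
    return dictionary, sorted(id_triples)


def decode(dictionary: Dict[str, int], id_triples: List[IdTriple]) -> List[Triple]:
    """Turn ID tuples back into term triples."""
    terms = {term_id: term for term, term_id in dictionary.items()}
    return [(terms[s], terms[p], terms[o]) for s, p, o in id_triples]


# Hypothetical sample data: long URIs repeated across triples are stored only once.
sample = [
    ("http://example.org/Person_A", "http://example.org/birthPlace", "http://example.org/Ghent"),
    ("http://example.org/Person_B", "http://example.org/birthPlace", "http://example.org/Ghent"),
    ("http://example.org/Person_A", "http://example.org/occupation", "http://example.org/Painter"),
]
dictionary, id_triples = encode(sample)
```

Because repeated terms collapse into small integers, the triples part stays compact, and the sorted order is what makes pattern lookups fast in this toy version.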
HDT Linked Data Archive with TPF Support
• For each temporal snapshot of a Linked Data set, generate an HDT serialization that provides access according to ?subject ?predicate ?object patterns
Linked Data Archive with ?s?p?o Access: Characteristics
General Characteristics     Publisher    Consumer
Availability                high         high
Bandwidth                   ~ query      ~ query
Cost                        low          medium
Functionality
Interface Expressiveness    better than subject-URI only
LOD Integration             yes
Memento Support             possible
Cross Time/Data             follow your nose
Verdict:
• Publication perspective: $$$$
• Access perspective: ++++
Affordable & Useful Linked Data Archives
Linked Data Archive Type    Publishing    Consuming
Data Dump                   $$$$          ++++
SPARQL Endpoint(s)          $$$$          ++++
Subject URI Access          $$$$          ++++
HDT & TPF                   $$$$          ++++
Second Generation DBpedia Archive: Storage
Second Generation DBpedia Archive: Storage Characteristics
upload software: HDT-CPP
upload time: ~4 hours per version
storage software: HDT binary files
storage space: 70 GB for 12 versions
DBpedia versions: 12 versions, 2.0 through 2015
number of triples: ~5 billion
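The two generations can be compared per version and per billion triples. The short Python calculation below uses only the figures reported on the two Storage Characteristics slides, reading the storage sizes as gigabytes (an assumption); since the triple counts are approximate, the ratios are rough estimates.

```python
# Figures as reported on the first- and second-generation storage slides.
first_gen = {"storage_gb": 383, "versions": 10, "triples_billion": 3, "upload_hours": 24}
second_gen = {"storage_gb": 70, "versions": 12, "triples_billion": 5, "upload_hours": 4}


def gb_per_version(archive: dict) -> float:
    """Average storage per DBpedia version."""
    return archive["storage_gb"] / archive["versions"]


def gb_per_billion_triples(archive: dict) -> float:
    """Average storage per billion triples."""
    return archive["storage_gb"] / archive["triples_billion"]


# How much less space the second generation needs per billion triples.
storage_ratio = gb_per_billion_triples(first_gen) / gb_per_billion_triples(second_gen)
# How much faster a version is loaded.
upload_speedup = first_gen["upload_hours"] / second_gen["upload_hours"]

print(f"GB per version: {gb_per_version(first_gen):.1f} vs {gb_per_version(second_gen):.1f}")
print(f"GB per billion triples: {gb_per_billion_triples(first_gen):.1f} "
      f"vs {gb_per_billion_triples(second_gen):.1f}")
print(f"storage ratio: ~{storage_ratio:.1f}x, upload speedup: {upload_speedup:.0f}x")
```

Under these assumptions the HDT archive needs roughly one ninth of the space per billion triples and loads a version about six times faster.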
Second Generation DBpedia Archive: ?s?p?o Query-URI Access
http://fragments.mementodepot.org/dbpedia_3_8?subject=&predicate=http://dbpedia.org/ontology/birthPlace&object=http://dbpedia.org/resource/Ghent
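A Query-URI like the one above can be assembled from a version name and a triple pattern. The sketch below is a hedged illustration: the helper name `fragment_uri` is hypothetical, and it percent-encodes the parameter values, which should be equivalent to the unencoded form shown on the slide for any server that decodes query strings in the usual way.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://fragments.mementodepot.org"  # host taken from the slide


def fragment_uri(version: str, subject: str = "", predicate: str = "", obj: str = "") -> str:
    """Build a ?s?p?o Query-URI for one archived version; empty string means unbound."""
    query = urlencode({"subject": subject, "predicate": predicate, "object": obj})
    return f"{BASE}/{version}?{query}"


# Everything with birthPlace Ghent in the DBpedia 3.8 snapshot.
uri = fragment_uri(
    "dbpedia_3_8",
    predicate="http://dbpedia.org/ontology/birthPlace",
    obj="http://dbpedia.org/resource/Ghent",
)

# Decoding the query string recovers the original pattern.
pattern = parse_qs(urlsplit(uri).query, keep_blank_values=True)
```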
?s?p?o Query-URI Access
TimeGate URI: http://fragments.mementodepot.org/timegate/dbpedia?subject={DBpediaURI}&predicate={DBpediaURI}&object={DBpediaURI}
  Example: http://fragments.mementodepot.org/timegate/dbpedia?subject=&predicate=&object=http://dbpedia.org/resource/Ghent
TimeMap URI: not supported
Memento URI: http://fragments.mementodepot.org/{DBpediaVersion}?subject={DBpediaURI}&predicate={DBpediaURI}&object={DBpediaURI}
  Example: http://fragments.mementodepot.org/dbpedia_3_0?subject=&predicate=&object=http://dbpedia.org/resource/Ghent
Further info: http://mementoweb.org/depot/native/fragments/
Try it with Memento for Chrome – http://bit.ly/memento-for-chrome
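Outside a browser, datetime negotiation amounts to sending the TimeGate an `Accept-Datetime` request header, as defined by the Memento protocol (RFC 7089). The following Python sketch only builds that header; the helper name is hypothetical and no request is sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when: datetime) -> dict:
    """Return request headers asking a TimeGate for the version closest to `when`."""
    if when.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    # The header value uses the HTTP date format, always expressed in GMT.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


headers = accept_datetime_headers(datetime(2009, 1, 1, tzinfo=timezone.utc))
# headers == {"Accept-Datetime": "Thu, 01 Jan 2009 00:00:00 GMT"}
```

Sending a GET with these headers to a TimeGate URI such as the example above should, per the protocol, lead to the Memento whose version was current at the requested datetime.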
Second Generation DBpedia Archive: Subject-URI Access
Subject-URI Access
TimeGate URI: http://dbpedia.mementodepot.org/timegate/{DBpediaURI}
  Example: http://dbpedia.mementodepot.org/timegate/http://dbpedia.org/data/Ghent
TimeMap URI: http://dbpedia.mementodepot.org/timemap/link/{DBpediaURI}
  Example: http://dbpedia.mementodepot.org/timemap/link/http://dbpedia.org/data/Ghent
Memento URI: http://dbpedia.mementodepot.org/{yyyymmdd}/{DBpediaURI}
  Example: http://dbpedia.mementodepot.org/20080103/http://dbpedia.org/data/Ghent
Further info: http://mementoweb.org/depot/native/dbpedia/
Try it with Memento for Chrome – http://bit.ly/memento-for-chrome
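The three URI templates above can be filled in mechanically. A minimal Python sketch, with hypothetical helper names, that reproduces the example URIs from the slide:

```python
from datetime import date

DEPOT = "http://dbpedia.mementodepot.org"  # host taken from the slide


def timegate_uri(dbpedia_uri: str) -> str:
    """TimeGate: entry point for datetime negotiation on a Subject-URI."""
    return f"{DEPOT}/timegate/{dbpedia_uri}"


def timemap_uri(dbpedia_uri: str) -> str:
    """TimeMap: list of the archived versions of a Subject-URI."""
    return f"{DEPOT}/timemap/link/{dbpedia_uri}"


def memento_uri(dbpedia_uri: str, day: date) -> str:
    """Memento: one archived version, addressed by a yyyymmdd datetime."""
    return f"{DEPOT}/{day:%Y%m%d}/{dbpedia_uri}"


ghent = "http://dbpedia.org/data/Ghent"
ghent_2008 = memento_uri(ghent, date(2008, 1, 3))
```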
Second Generation DBpedia Archive: Access Characteristics
TimeGate software: ① node.js LDF server 2.0.0; ② LDF js client
access type: ① ?s?p?o Query-URI & datetime; ② Subject-URI & datetime
external integration: ① DBpedia LDF server; ② current DBpedia
clients:
• all clients: direct access to Mementos of Subject-URI and ?s?p?o Query-URI
• Memento clients: datetime negotiation with Subject-URI and ?s?p?o Query-URI
Building a Linked Data Archive
• Convert the archival data set(s) to HDT using HDT-CPP
HDT Software (C++)
https://github.com/rdfhdt/hdt-cpp
• input data requires cleaning before processing, especially regarding URI characters
  • DBpedia data not clean
  • DBpedia v3.5 was not successfully processed
  • No meaningful error messages to help locate problems
• memory intensive
  • Kyoto Cabinet was used to optimize storage requirement and speed during processing
• Java version exists but has memory problems
Building a Linked Data Archive
• Download the Triple Fragment Server code
Linked Data Fragment Server (Node.js)
https://github.com/LinkedDataFragments/Server.js
• provides ?s?p?o access to local and/or remote Linked Data sets
• supports HDT, Turtle files, N-Triples files, JSON-LD files, SPARQL endpoints, in-memory store, and BlazeGraph Linked Data sets
• version 2.0.0 (released March 31 2016) has built-in Memento support
Building a Linked Data Archive
• Create the JSON config file for Memento
Linked Data Fragment Server, Memento Configuration
https://github.com/LinkedDataFragments/Server.js/wiki/Configuring-Memento
• declare archival data set(s)
• add datetime ranges for the archival data set(s)
• add a TimeGate
• list the archival data set(s) for which the TimeGate should support datetime negotiation
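The role of the datetime ranges can be sketched as follows: given a requested datetime, a TimeGate picks the archival data set whose range covers it. The Python below is a simplified illustration with made-up data set names and dates; it is not the LDF server's configuration format or its implementation, for which the wiki page above is the reference.

```python
from datetime import datetime, timezone
from typing import List, Optional, Tuple

UTC = timezone.utc

# Hypothetical data sets with validity intervals [start, end); None means open-ended.
RANGES: List[Tuple[str, datetime, Optional[datetime]]] = [
    ("snapshot_a", datetime(2012, 1, 1, tzinfo=UTC), datetime(2013, 1, 1, tzinfo=UTC)),
    ("snapshot_b", datetime(2013, 1, 1, tzinfo=UTC), datetime(2014, 1, 1, tzinfo=UTC)),
    ("snapshot_c", datetime(2014, 1, 1, tzinfo=UTC), None),
]


def select_snapshot(when: datetime) -> str:
    """Return the data set whose datetime range covers `when`."""
    for name, start, end in RANGES:
        if start <= when and (end is None or when < end):
            return name
    # Assumption: a request that predates the archive gets the earliest data set.
    return RANGES[0][0]


chosen = select_snapshot(datetime(2013, 6, 15, tzinfo=UTC))
```

Contiguous, non-overlapping ranges keep the choice unambiguous; how a real TimeGate treats gaps or requests outside the archive is a design decision this sketch does not settle.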
Building a Linked Data Archive
• Convert the archival data set(s) to HDT using HDT-CPP
• Download the Triple Fragment Server code
• Create the JSON config file for Memento
• Run the server