Querying Linked Data with SPARQL

ISWC 2009 Tutorial "How to Consume Linked Data on the Web"



Brief Introduction to SPARQL

● SPARQL: Query Language for RDF data
● Main idea: pattern matching
  ● Describe subgraphs of the queried RDF graph
  ● Subgraphs that match your description yield a result
  ● Means: graph patterns (i.e. RDF graphs with variables)

Graph pattern:
  ?v rdf:type <http://.../Volcano>


Brief Introduction to SPARQL

Graph pattern:
  ?v rdf:type <http://.../Volcano>

Queried graph:
  <http://.../Mount_Baker> rdf:type <http://.../Volcano> .
  <http://.../Mount_Baker> p:lastEruption "1880" .
  <http://.../Mount_Etna> rdf:type <http://.../Volcano> .

Results:
  ?v
  http://.../Mount_Baker
  http://.../Mount_Etna
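The matching step can be sketched in plain Java without any RDF library. This is only an illustration under stated assumptions: the class `PatternMatchSketch`, its `Triple` type and the `match` method are hypothetical names, the `http://example.org/` namespace stands in for the abbreviated URIs above, and only a single triple pattern is handled, which is far less than a real SPARQL engine does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PatternMatchSketch {

    /** One RDF triple, or a triple pattern when a term starts with '?'. */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    /** Binds a variable, or checks that a constant term equals the data term. */
    static boolean bind(String patternTerm, String dataTerm, Map<String, String> solution) {
        if (!patternTerm.startsWith("?")) {
            return patternTerm.equals(dataTerm);
        }
        String previous = solution.putIfAbsent(patternTerm, dataTerm);
        return previous == null || previous.equals(dataTerm);
    }

    /** Returns one solution (variable -> value) per triple that matches the pattern. */
    static List<Map<String, String>> match(List<Triple> graph, Triple pattern) {
        List<Map<String, String>> solutions = new ArrayList<>();
        for (Triple t : graph) {
            Map<String, String> solution = new LinkedHashMap<>();
            if (bind(pattern.s, t.s, solution)
                    && bind(pattern.p, t.p, solution)
                    && bind(pattern.o, t.o, solution)) {
                solutions.add(solution);
            }
        }
        return solutions;
    }

    public static void main(String[] args) {
        String ex = "http://example.org/"; // placeholder namespace (assumption)
        List<Triple> graph = List.of(
                new Triple(ex + "Mount_Baker", "rdf:type", ex + "Volcano"),
                new Triple(ex + "Mount_Baker", "p:lastEruption", "\"1880\""),
                new Triple(ex + "Mount_Etna", "rdf:type", ex + "Volcano"));
        Triple pattern = new Triple("?v", "rdf:type", ex + "Volcano");
        // Prints the binding of ?v for every matching triple.
        for (Map<String, String> solution : match(graph, pattern)) {
            System.out.println(solution.get("?v"));
        }
    }
}
```

Only the two rdf:type triples match the pattern; the p:lastEruption triple is skipped because its predicate differs.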


SPARQL Endpoints

● Linked data sources usually provide a SPARQL endpoint for their dataset(s)

● SPARQL endpoint: SPARQL query processing service that supports the SPARQL protocol*

● Send your SPARQL query, receive the result

* http://www.w3.org/TR/rdf-sparql-protocol/


SPARQL Endpoints

More complete list: http://esw.w3.org/topic/SparqlEndpoints

Data Source           Endpoint Address
--------------------  ------------------------------------
DBpedia               http://dbpedia.org/sparql
Musicbrainz           http://dbtune.org/musicbrainz/sparql
U.S. Census           http://www.rdfabout.com/sparql
Semantic Crunchbase   http://cb.semsol.org/sparql


Accessing a SPARQL Endpoint

● SPARQL endpoints: RESTful Web services
● Issuing SPARQL queries to a remote SPARQL endpoint is basically an HTTP GET request to the SPARQL endpoint with parameter query

GET /sparql?query=PREFIX+rd... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1

The value of the query parameter is the URL-encoded string with the SPARQL query.
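Building such a request URL could look like the following minimal sketch. It uses only the Java standard library; the class name `SparqlGetUrl`, the method `buildGetUrl` and the sample query are assumptions made for illustration, and no request is actually sent.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SparqlGetUrl {

    /** Appends the URL-encoded query as the "query" parameter of the endpoint address. */
    static String buildGetUrl(String endpoint, String sparqlQuery) {
        // URLEncoder produces form encoding: spaces become '+', reserved characters become %XX.
        String encoded = URLEncoder.encode(sparqlQuery, StandardCharsets.UTF_8);
        return endpoint + "?query=" + encoded;
    }

    public static void main(String[] args) {
        String url = buildGetUrl("http://dbpedia.org/sparql",
                                 "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1");
        System.out.println(url);
    }
}
```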


Query Results Formats

● SPARQL endpoints usually support different result formats:
  ● XML, JSON, plain text (for ASK and SELECT queries)
  ● RDF/XML, NTriples, Turtle, N3 (for DESCRIBE and CONSTRUCT queries)


Query Results Formats

PREFIX dbp: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>

SELECT ?name ?bday WHERE {
  ?p dbp:birthplace <http://dbpedia.org/resource/Berlin> ;
     dbpprop:dateOfBirth ?bday ;
     dbpprop:name ?name .
}

 name                   | bday
------------------------+------------
 Alexander von Humboldt | 1769-09-14
 Ernst Lubitsch         | 1892-01-28
 ...


<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="name"/>
    <variable name="bday"/>
  </head>
  <results distinct="false" ordered="true">
    <result>
      <binding name="name">
        <literal xml:lang="en">Alexander von Humboldt</literal>
      </binding>
      <binding name="bday">
        <literal datatype="http://www.w3.org/2001/XMLSchema#date">1769-09-14</literal>
      </binding>
    </result>
    <result>
      <binding name="name">
        <literal xml:lang="en">Ernst Lubitsch</literal>
      </binding>
      <binding name="bday">
        <literal datatype="http://www.w3.org/2001/XMLSchema#date">1892-01-28</literal>
      </binding>
    </result>
  </results>
</sparql>

http://www.w3.org/TR/rdf-sparql-XMLres/
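A result document in this XML format can be read with the DOM parser that ships with Java. The sketch below is one possible way to do it, not a reference implementation: the class `SparqlXmlResults` and its `parse` method are hypothetical names, and datatypes and language tags are simply ignored.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SparqlXmlResults {

    /** Turns each <result> element into a map from variable name to value. */
    static List<Map<String, String>> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Map<String, String>> rows = new ArrayList<>();
            NodeList results = doc.getElementsByTagName("result");
            for (int i = 0; i < results.getLength(); i++) {
                Map<String, String> row = new LinkedHashMap<>();
                NodeList bindings = ((Element) results.item(i)).getElementsByTagName("binding");
                for (int j = 0; j < bindings.getLength(); j++) {
                    Element binding = (Element) bindings.item(j);
                    // The value is the text of the <uri>, <literal> or <bnode> child.
                    row.put(binding.getAttribute("name"), binding.getTextContent().trim());
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parsable SPARQL XML result", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<sparql xmlns=\"http://www.w3.org/2005/sparql-results#\">"
                + "<head><variable name=\"name\"/><variable name=\"bday\"/></head>"
                + "<results><result>"
                + "<binding name=\"name\"><literal xml:lang=\"en\">Ernst Lubitsch</literal></binding>"
                + "<binding name=\"bday\"><literal>1892-01-28</literal></binding>"
                + "</result></results></sparql>";
        for (Map<String, String> row : parse(xml)) {
            System.out.println(row.get("name") + " | " + row.get("bday"));
        }
    }
}
```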


{
  "head": { "link": [], "vars": ["name", "bday"] },
  "results": { "distinct": false, "ordered": true, "bindings": [
    { "name": { "type": "literal", "xml:lang": "en",
                "value": "Alexander von Humboldt" },
      "bday": { "type": "typed-literal",
                "datatype": "http://www.w3.org/2001/XMLSchema#date",
                "value": "1769-09-14" } },
    { "name": { "type": "literal", "xml:lang": "en",
                "value": "Ernst Lubitsch" },
      "bday": { "type": "typed-literal",
                "datatype": "http://www.w3.org/2001/XMLSchema#date",
                "value": "1892-01-28" } },
    // ...
  ] }
}

http://www.w3.org/TR/rdf-sparql-json-res/


Query Result Formats

● Use the Accept header to request the preferred result format:

GET /sparql?query=PREFIX+rd... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1
Accept: application/sparql-results+json
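With the HTTP client API of the Java standard library (Java 11 or later), a request carrying this header might be assembled as in the sketch below. The class `SparqlAcceptHeader` and the method `buildRequest` are assumed names, and the sketch only builds the request object without contacting any endpoint.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class SparqlAcceptHeader {

    /** Builds a GET request for the query and asks for the given result format. */
    static HttpRequest buildRequest(String endpoint, String sparqlQuery, String resultFormat) {
        String url = endpoint + "?query="
                + URLEncoder.encode(sparqlQuery, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", resultFormat) // preferred result format
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://dbpedia.org/sparql",
                "ASK { ?s ?p ?o }", "application/sparql-results+json");
        System.out.println(request.method() + " " + request.uri());
        System.out.println("Accept: " + request.headers().firstValue("Accept").orElse(""));
    }
}
```

To execute the request, it can be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; whether the endpoint honours the requested format depends on the endpoint.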


Query Result Formats

● As an alternative, some SPARQL endpoint implementations (e.g. Joseki) provide an additional parameter out

GET /sparql?out=json&query=... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1



● More convenient: use a library
● Libraries:
  ● SPARQL JavaScript Library http://www.thefigtrees.net/lee/blog/2006/04/sparql_calendar_demo_a_sparql.html
  ● ARC for PHP http://arc.semsol.org/
  ● RAP – RDF API for PHP http://www4.wiwiss.fu-berlin.de/bizer/rdfapi/index.html

● Libraries (cont.):
  ● Jena / ARQ (Java) http://jena.sourceforge.net/
  ● Sesame (Java) http://www.openrdf.org/
  ● SPARQL Wrapper (Python) http://sparql-wrapper.sourceforge.net/
  ● PySPARQL (Python) http://code.google.com/p/pysparql/



● Example with Jena / ARQ:

import com.hp.hpl.jena.query.*;

String service = "...";      // address of the SPARQL endpoint
String query = "SELECT ..."; // your SPARQL query
QueryExecution e = QueryExecutionFactory.sparqlService( service, query );
ResultSet results = e.execSelect();
while ( results.hasNext() ) {
    QuerySolution s = results.nextSolution();
    // ...
}
e.close();


● Querying a single dataset is quite boring compared to:
  ● Issuing SPARQL queries over multiple datasets

● How can you do this?
  1. Issue follow-up queries to different endpoints
  2. Query a central collection of datasets
  3. Build a store with copies of relevant datasets
  4. Use a query federation system


Follow-up Queries

● Idea: issue follow-up queries over other datasets based on results from previous queries

● Substituting placeholders in query templates


String s1 = "http://cb.semsol.org/sparql";
String s2 = "http://dbpedia.org/sparql";

String qTmpl = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
               "SELECT ?c WHERE { <%s> rdfs:comment ?c }";

// q1: find a list of companies filtered by some criteria
// and return their DBpedia URIs
String q1 = "SELECT ?s WHERE { ...";
QueryExecution e1 = QueryExecutionFactory.sparqlService(s1, q1);
ResultSet results1 = e1.execSelect();
while ( results1.hasNext() ) {
    QuerySolution sol = results1.nextSolution();
    // fill the placeholder in the template with the URI bound to ?s
    String q2 = String.format( qTmpl, sol.getResource("s").getURI() );
    QueryExecution e2 = QueryExecutionFactory.sparqlService(s2, q2);
    ResultSet results2 = e2.execSelect();
    while ( results2.hasNext() ) {
        // ...
    }
    e2.close();
}
e1.close();
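The control flow of the follow-up approach can be tried out without network access by replacing the second endpoint with a stub. In the sketch below the endpoint is modelled as a plain `Function` from query string to result values; this modelling, the class name `FollowUpSketch`, the sample URI and the sample comment are all assumptions made for illustration and are not part of Jena.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class FollowUpSketch {

    static final String TEMPLATE = "SELECT ?c WHERE { <%s> rdfs:comment ?c }";

    /** For each URI from the first result set, fills the template and queries the second endpoint. */
    static List<String> followUp(List<String> firstResults,
                                 Function<String, List<String>> secondEndpoint) {
        List<String> comments = new ArrayList<>();
        for (String uri : firstResults) {
            String q2 = String.format(TEMPLATE, uri); // substitute the placeholder
            comments.addAll(secondEndpoint.apply(q2)); // one request per first-stage result
        }
        return comments;
    }

    public static void main(String[] args) {
        // Stub data standing in for the second endpoint (made up for this sketch).
        Map<String, String> fakeComments = Map.of(
                "http://dbpedia.org/resource/ExampleCompany", "An example company.");
        Function<String, List<String>> stub = query -> {
            List<String> out = new ArrayList<>();
            for (Map.Entry<String, String> e : fakeComments.entrySet()) {
                if (query.contains("<" + e.getKey() + ">")) {
                    out.add(e.getValue());
                }
            }
            return out;
        };
        List<String> firstResults = List.of("http://dbpedia.org/resource/ExampleCompany");
        System.out.println(followUp(firstResults, stub));
    }
}
```

Every result of the first query triggers one further request, so the number of requests grows with the size of the first result set.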


Follow-up Queries

● Advantage:
  ● Queried data is up-to-date

● Drawbacks:
  ● Requires the existence of a SPARQL endpoint for each dataset
  ● Requires program logic
  ● Very inefficient


Querying a Collection of Datasets

● Idea: Use an existing SPARQL endpoint that provides access to a set of copies of relevant datasets

● Example:
  ● SPARQL endpoint by OpenLink SW over a majority of datasets from the LOD cloud at: http://lod.openlinksw.com/sparql


Querying a Collection of Datasets

● Advantage:
  ● No need for specific program logic

● Drawbacks:
  ● Queried data might be out of date
  ● Not all relevant datasets in the collection


Own Store of Dataset Copies

● Idea: Build your own store with copies of relevant datasets and query it

● Possible stores:
  ● Jena TDB http://jena.hpl.hp.com/wiki/TDB
  ● Sesame http://www.openrdf.org/
  ● OpenLink Virtuoso http://virtuoso.openlinksw.com/
  ● 4store http://4store.org/
  ● AllegroGraph http://www.franz.com/agraph/
  ● etc.


Own Store of Dataset Copies

● Advantages:
  ● No need for specific program logic
  ● Can include all datasets
  ● Independent of the existence, availability, and efficiency of SPARQL endpoints

● Drawbacks:
  ● Requires effort to set up and to operate the store
  ● Ideally, data sources provide RDF dumps; if not?
  ● How to keep the copies in sync with the originals?
  ● Queried data might be out of date


Federated Query Processing

● Idea: Querying a mediator which distributes subqueries to relevant sources and integrates the results



● Instance-based federation
  ● Each thing described by only one data source
  ● Untypical for the Web of Data

● Triple-based federation
  ● No restrictions
  ● Requires more distributed joins

● Statistics about datasets required (both cases)



● DARQ (Distributed ARQ) http://darq.sourceforge.net/
  ● Query engine for federated SPARQL queries
  ● Extension of ARQ (query engine for Jena)
  ● Last update: June 28, 2006



● Semantic Web Integrator and Query Engine (SemWIQ) http://semwiq.sourceforge.net/
  ● Actively maintained by Andreas Langegger



● Advantages:
  ● No need for specific program logic
  ● Queried data is up to date

● Drawbacks:
  ● Requires the existence of a SPARQL endpoint for each dataset
  ● Requires effort to set up and configure the mediator


In any case:

● You have to know the relevant data sources
  ● When developing the app using follow-up queries
  ● When selecting an existing SPARQL endpoint over a collection of dataset copies
  ● When setting up your own store with a collection of dataset copies
  ● When configuring your query federation system

● You restrict yourself to the selected sources


There is an alternative:

Remember, URIs link to data



Automated Link Traversal

● Idea: Discover further data by looking up relevant URIs in your application

● Can be combined with the previous approaches


Link Traversal Based Query Execution

● Applies the idea of automated link traversal to the execution of SPARQL queries

● Idea:
  ● Intertwine query evaluation with traversal of RDF links
  ● Discover data that might contribute to query results during query execution

● Alternately:
  ● Evaluate parts of the query
  ● Look up URIs in intermediate solutions



● Example: Return unemployment rate of the countries in which the movie http://mymovie.db/movie2449 was filmed.

SELECT ?c ?u WHERE {
  <http://mymovie.db/movie2449> mov:filming_location ?c .
  ?c geo:statistics ?cStats .
  ?cStats stat:unempRate ?u . }

Execution of the example query, step by step:

1. Look up the URI mentioned in the query, http://mymovie.db/movie2449, and add the retrieved data to the queried data. Among other triples it contains:
   <http://mymovie.db/movie2449> mov:filming_location <http://geo.../Italy> .
2. Evaluate the first triple pattern over the queried data. Intermediate solution:
   ?c = http://geo.../Italy
3. Look up the URI from the intermediate solution, http://geo.../Italy, and add the retrieved data to the queried data. Among other triples it contains:
   <http://geo.../Italy> geo:statistics <http://example.db/stat/IT> .
4. Evaluate the second triple pattern over the queried data. Intermediate solution:
   ?c = http://geo.../Italy, ?cStats = http://example.db/stat/IT
5. Continue with the third triple pattern, ?cStats stat:unempRate ?u, in the same way.

● Proceed with this strategy (traverse RDF links during query execution)
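The alternation of pattern evaluation and URI look-ups can be sketched in a few lines of Java. Everything here is a simplifying assumption for illustration: the Web is replaced by an in-memory map from URI to triples, triples are lists of three strings, predicates are written as prefixed names so they are never looked up, the URIs under `geo.example` and `stats.example` and the rate value are made up, and the class `LinkTraversalSketch` is a hypothetical name that does not reflect how any of the implementations listed in this tutorial work internally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LinkTraversalSketch {

    /** In-memory stand-in for the Web: URI -> triples returned when the URI is looked up. */
    final Map<String, List<List<String>>> web;
    final Set<List<String>> queriedData = new LinkedHashSet<>();
    final Set<String> lookedUp = new HashSet<>();

    LinkTraversalSketch(Map<String, List<List<String>>> web) { this.web = web; }

    /** "Dereferences" a URI at most once and adds the retrieved triples to the queried data. */
    void lookUp(String term) {
        if (term.startsWith("http") && lookedUp.add(term)) {
            queriedData.addAll(web.getOrDefault(term, List.of()));
        }
    }

    /** Evaluates the patterns in order, looking up URIs of intermediate solutions as it goes. */
    List<Map<String, String>> execute(List<List<String>> patterns) {
        List<Map<String, String>> solutions = new ArrayList<>();
        solutions.add(new HashMap<>());
        for (List<String> pattern : patterns) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> solution : solutions) {
                // Substitute bound variables, then look up every URI in the pattern.
                List<String> bound = new ArrayList<>();
                for (String term : pattern) {
                    String value = solution.getOrDefault(term, term);
                    bound.add(value);
                    lookUp(value);
                }
                for (List<String> triple : queriedData) {
                    Map<String, String> extended = new HashMap<>(solution);
                    if (matches(bound, triple, extended)) {
                        next.add(extended);
                    }
                }
            }
            solutions = next;
        }
        return solutions;
    }

    /** Checks one pattern against one triple, binding variables (terms starting with '?'). */
    static boolean matches(List<String> pattern, List<String> triple, Map<String, String> solution) {
        for (int i = 0; i < 3; i++) {
            String term = pattern.get(i);
            if (term.startsWith("?")) {
                String previous = solution.putIfAbsent(term, triple.get(i));
                if (previous != null && !previous.equals(triple.get(i))) return false;
            } else if (!term.equals(triple.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String movie = "http://mymovie.db/movie2449";
        String italy = "http://geo.example/Italy";  // placeholder URI
        String stats = "http://stats.example/it";   // placeholder URI
        Map<String, List<List<String>>> web = Map.of(
                movie, List.of(List.of(movie, "mov:filming_location", italy)),
                italy, List.of(List.of(italy, "geo:statistics", stats)),
                stats, List.of(List.of(stats, "stat:unempRate", "8.0"))); // made-up value
        LinkTraversalSketch engine = new LinkTraversalSketch(web);
        List<Map<String, String>> result = engine.execute(List.of(
                List.of(movie, "mov:filming_location", "?c"),
                List.of("?c", "geo:statistics", "?cStats"),
                List.of("?cStats", "stat:unempRate", "?u")));
        for (Map<String, String> row : result) {
            System.out.println(row.get("?c") + " " + row.get("?u"));
        }
    }
}
```

The sketch never re-evaluates earlier patterns against data discovered later, which is one way in which results of this kind of execution can end up incomplete.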


● Advantages:
  ● No need to know all data sources in advance
  ● No need for specific programming logic
  ● Queried data is up to date
  ● Independent of the existence of SPARQL endpoints provided by the data sources

● Drawbacks:
  ● Not as fast as a centralized collection of copies
  ● Unsuitable for some queries
  ● Results might be incomplete



Implementations

● Semantic Web Client library (SWClLib) for Java http://www4.wiwiss.fu-berlin.de/bizer/ng4j/semwebclient/

● SWIC for Prolog http://moustaki.org/swic/


Implementations

● SQUIN http://squin.org
  ● Provides SWClLib functionality as a Web service
  ● Accessible like a SPARQL endpoint
  ● Public SQUIN service at: http://squin.informatik.hu-berlin.de/SQUIN/
  ● Install package: unzip and start
  ● Convenient access with SQUIN PHP tools:

$s = 'http:// …'; // address of the SQUIN service
$q = new SparqlQuerySock( $s, '… SELECT ...' );
$res = $q->getJsonResult(); // or getXmlResult()


Real-World Examples

Return phone numbers of authors of ontology engineering papers at ESWC'09.

SELECT DISTINCT ?author ?phone WHERE {
  ?pub swc:isPartOf <http://data.semanticweb.org/conference/eswc/2009/proceedings> .
  ?pub swc:hasTopic ?topic .
  ?topic rdfs:label ?topicLabel .
  FILTER regex( str(?topicLabel), "ontology engineering", "i" ) .
  ?pub swrc:author ?author .
  { ?author owl:sameAs ?authorAlt }
  UNION
  { ?authorAlt owl:sameAs ?author }
  ?authorAlt foaf:phone ?phone .
}

# of query results       2
# of retrieved graphs    297
# of accessed servers    16
avg. execution time      1min 30sec
