Shape Expressions: An RDF validation and transformation language
Eric Prud'hommeaux, World Wide Web Consortium, MIT, Cambridge, MA, USA
Harold Solbrig, Mayo Clinic College of Medicine, Rochester, MN, USA
Jose Emilio Labra Gayo, WESO Research group, University of Oviedo
This talk in 1 slide
Motivating example: represent issues and users in RDF... and validate that data
Shape Expressions = simple language to:
- Describe the topology of RDF data
- Validate if an RDF graph matches a given shape
Shape Expressions can be extended with actions
Possible application: transform RDF into XML
Motivating example
Represent an issue tracking system in RDF
- Issues are reported by users on some date
- Issues have some status (assigned/unassigned)
- Issues can also be reproduced on some date by users
E-R Diagram
User: foaf:name: xsd:string; foaf:givenName: xsd:string*; foaf:familyName: xsd:string; foaf:mbox: IRI
Issue: :status: (:Assigned :Unassigned); :reportedOn: xsd:date; :reproducedOn: xsd:date
Relations (cardinalities as labeled in the diagram): :reportedBy (0..* to 1), :reproducedBy (0..* to 0..1), :related (0..* to 0..1)
...and several constraints
A user:
- has a full name, or several given names and one family name
- can have one mbox
An issue:
- has status Assigned/Unassigned
- is reported by a user
- is reported on a date
- can be reproduced by a user on a date
- is related to other issues
Example data in RDF
:Issue1 :status :Unassigned ;
  :reportedBy :Bob ;
  :reportedOn "2013-01-23"^^xsd:date ;
  :reproducedBy :Thompson.J ;
  :reproducedOn "2013-01-23"^^xsd:date .
:Bob foaf:name "Bob Smith" ;
  foaf:mbox <mail:[email protected]> .
:Thompson.J foaf:givenName "Joe", "Joseph" ;
  foaf:familyName "Thompson" ;
  foaf:mbox <mail:[email protected]> .
:Issue2 :status :Checked ;
  :reportedBy :Issue1 ;
  :reportedOn 2014 ;
  :reproducedBy :Tom .
:Tom foaf:name "Tom Smith", "Tam" .
:Anna foaf:givenName "Anna" ;
  foaf:mbox 23 .
Problem statement
We want to detect possible errors in RDF like:
- Issues without status
- Issues with a status other than Assigned/Unassigned
- Issues reported by something other than a user
- Issues reported on a date with a non-date type
- Issues reproduced on a date before the reported date
- Users without mbox
- Users with 2 names
- Users with a name of type integer
- ...lots of other errors...
Q: How can we describe RDF data to be able to detect those errors?
A: Our proposal = Shape Expressions
Shape Expressions - Users
A user can have either:
- one foaf:name, or
- one or more foaf:givenName and one foaf:familyName
All of them must be of type xsd:string
A user can have one foaf:mbox with value any IRI
<UserShape> {
  ( foaf:name xsd:string
  | foaf:givenName xsd:string+ ,
    foaf:familyName xsd:string
  ),
  foaf:mbox IRI ?
}
The example uses compact syntax
Shape Expressions can also be represented in RDF
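To make the reading of <UserShape> concrete, here is a minimal Python sketch that checks the arcs leaving a node against the constraints listed above. It is not one of the ShEx implementations: the tuple-based triple representation, the helper names and the placeholder mailbox address are assumptions made for illustration, and the sketch assumes a closed reading in which no other predicates are allowed.

```python
XSD_STRING = "xsd:string"

def lit(lex, datatype=XSD_STRING):
    """Hypothetical literal representation: (lexical form, datatype)."""
    return (lex, datatype)

def is_string(o):
    return isinstance(o, tuple) and o[1] == XSD_STRING

def is_iri(o):
    # In this sketch IRIs are plain Python strings.
    return isinstance(o, str)

def matches_user_shape(node, triples):
    """Check the arcs leaving `node` against <UserShape> (closed reading assumed)."""
    by_pred = {}
    for s, p, o in triples:
        if s == node:
            by_pred.setdefault(p, []).append(o)
    allowed = {"foaf:name", "foaf:givenName", "foaf:familyName", "foaf:mbox"}
    if set(by_pred) - allowed:
        return False
    names = by_pred.get("foaf:name", [])
    given = by_pred.get("foaf:givenName", [])
    family = by_pred.get("foaf:familyName", [])
    mbox = by_pred.get("foaf:mbox", [])
    # Left branch of the alternative: exactly one foaf:name.
    full_name = (len(names) == 1 and not given and not family
                 and is_string(names[0]))
    # Right branch: one or more given names and exactly one family name.
    split_name = (not names and len(given) >= 1 and len(family) == 1
                  and all(is_string(o) for o in given + family))
    # foaf:mbox IRI ? means zero or one IRI-valued arc.
    mbox_ok = len(mbox) <= 1 and all(is_iri(o) for o in mbox)
    return (full_name or split_name) and mbox_ok

DATA = [
    (":Bob", "foaf:name", lit("Bob Smith")),
    (":Bob", "foaf:mbox", "mailto:bob@example.org"),  # placeholder, not from the slides
    (":Thompson.J", "foaf:givenName", lit("Joe")),
    (":Thompson.J", "foaf:givenName", lit("Joseph")),
    (":Thompson.J", "foaf:familyName", lit("Thompson")),
    (":Tom", "foaf:name", lit("Tom Smith")),
    (":Tom", "foaf:name", lit("Tam")),
    (":Anna", "foaf:givenName", lit("Anna")),
    (":Anna", "foaf:mbox", lit("23", "xsd:integer")),
]

for user in (":Bob", ":Thompson.J", ":Tom", ":Anna"):
    print(user, matches_user_shape(user, DATA))
```

Under these assumptions :Bob and :Thompson.J conform, while :Tom (two names) and :Anna (no family name, non-IRI mbox) do not.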
Shape Expressions - Issues
<IssueShape> {
  :status (:Assigned :Unassigned),
  :reportedBy @<UserShape>,
  :reportedOn xsd:date,
  ( :reproducedBy @<UserShape> ,
    :reproducedOn xsd:date
  )?,
  :related @<IssueShape>*
}
- An issue's :status must be either :Assigned or :Unassigned
- Issues are :reportedBy a user
- Issues are :reportedOn an xsd:date
- An issue may be :reproducedBy a user and :reproducedOn an xsd:date
- An issue can be :related to several issues
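The issue constraints can be sketched the same way. The Python below is an illustrative approximation only: triples are tuples, literals are (lexical form, datatype) pairs, `is_user` is a simplified stand-in for <UserShape>, and the recursion guard for :related is an assumption of this sketch rather than the formal semantics.

```python
XSD_DATE = "xsd:date"

def objects(node, pred, triples):
    """All objects of arcs `node pred ?o`."""
    return [o for s, p, o in triples if s == node and p == pred]

def is_date(o):
    return isinstance(o, tuple) and o[1] == XSD_DATE

def is_user(node, triples):
    # Simplified stand-in for <UserShape>: a full name, or given names plus one family name.
    has_full = len(objects(node, "foaf:name", triples)) == 1
    has_split = (len(objects(node, "foaf:givenName", triples)) >= 1
                 and len(objects(node, "foaf:familyName", triples)) == 1)
    return has_full != has_split

def is_issue(node, triples, visiting=frozenset()):
    if node in visiting:
        return True  # already being checked; assumed to conform (recursion guard)
    visiting = visiting | {node}
    status = objects(node, ":status", triples)
    reported_by = objects(node, ":reportedBy", triples)
    reported_on = objects(node, ":reportedOn", triples)
    repro_by = objects(node, ":reproducedBy", triples)
    repro_on = objects(node, ":reproducedOn", triples)
    if len(status) != 1 or status[0] not in (":Assigned", ":Unassigned"):
        return False
    if len(reported_by) != 1 or not is_user(reported_by[0], triples):
        return False
    if len(reported_on) != 1 or not is_date(reported_on[0]):
        return False
    # Optional group ( ... )?: both arcs absent, or exactly one of each.
    if (len(repro_by), len(repro_on)) == (1, 1):
        if not is_user(repro_by[0], triples) or not is_date(repro_on[0]):
            return False
    elif (len(repro_by), len(repro_on)) != (0, 0):
        return False
    # :related @<IssueShape>* : every related node must itself be an issue.
    return all(is_issue(o, triples, visiting)
               for o in objects(node, ":related", triples))

DATA = [
    (":Issue1", ":status", ":Unassigned"),
    (":Issue1", ":reportedBy", ":Bob"),
    (":Issue1", ":reportedOn", ("2013-01-23", XSD_DATE)),
    (":Issue1", ":reproducedBy", ":Thompson.J"),
    (":Issue1", ":reproducedOn", ("2013-01-23", XSD_DATE)),
    (":Bob", "foaf:name", ("Bob Smith", "xsd:string")),
    (":Thompson.J", "foaf:givenName", ("Joe", "xsd:string")),
    (":Thompson.J", "foaf:givenName", ("Joseph", "xsd:string")),
    (":Thompson.J", "foaf:familyName", ("Thompson", "xsd:string")),
    (":Issue2", ":status", ":Checked"),
    (":Issue2", ":reportedBy", ":Issue1"),
    (":Issue2", ":reportedOn", ("2014", "xsd:integer")),
    (":Issue2", ":reproducedBy", ":Tom"),
]

print(is_issue(":Issue1", DATA), is_issue(":Issue2", DATA))
```

In this sketch :Issue1 conforms and :Issue2 is rejected at the first failed constraint (its status is :Checked).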
Full example
prefix : <http://example.org/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
<UserShape> {
  ( foaf:name xsd:string
  | foaf:givenName xsd:string+ ,
    foaf:familyName xsd:string
  ),
  foaf:mbox IRI ?
}
<IssueShape> {
  :status (:Assigned :Unassigned),
  :reportedBy @<UserShape>,
  :reportedOn xsd:date,
  ( :reproducedBy @<UserShape> ,
    :reproducedOn xsd:date
  )?,
  :related @<IssueShape>*
}
Online Shape Expressions validators:
http://www.w3.org/2013/ShEx
http://rdfshape.weso.es
FAQ: Why not use SPARQL?
<UserShape> {
  ( foaf:name xsd:string
  | foaf:givenName xsd:string+ ,
    foaf:familyName xsd:string
  ),
  foaf:mbox IRI ?
}
<IssueShape> {
  :status (:Assigned :Unassigned),
  :reportedBy @<UserShape>,
  :reportedOn xsd:date,
  ( :reproducedBy @<UserShape> ,
    :reproducedOn xsd:date
  )?,
  :related @<IssueShape>*
}
CONSTRUCT {
  ?IssueShape :hasShape <IssueShape> .
  ?UserShape :hasShape <UserShape> .
} {
  { SELECT ?IssueShape { ?IssueShape :status ?o . } GROUP BY ?IssueShape HAVING (COUNT(*)=1)}
  { SELECT ?IssueShape { ?IssueShape :status ?o . FILTER ((?o = :Assigned || ?o = :Unassigned)) } GROUP BY ?IssueShape HAVING (COUNT(*)=1)}
  { SELECT ?IssueShape (COUNT(*) AS ?IssueShape_c0) { ?IssueShape :reportedBy ?o . } GROUP BY ?IssueShape HAVING (COUNT(*)=1)}
  { SELECT ?IssueShape { ?IssueShape :reportedBy ?o . FILTER ((isIRI(?o) || isBlank(?o))) } GROUP BY ?IssueShape HAVING (COUNT(*)=1)}
  { SELECT ?IssueShape (COUNT(*) AS ?IssueShape_c1) {
    { SELECT ?IssueShape ?UserShape { ?IssueShape :reportedBy ?UserShape . FILTER (isIRI(?UserShape) || isBlank(?UserShape)) } }
    { SELECT ?UserShape WHERE {
      {
        { SELECT ?UserShape { ?UserShape foaf:name ?o . } GROUP BY ?UserShape HAVING (COUNT(*)=1)}
        { SELECT ?UserShape { ?UserShape foaf:name ?o . FILTER ((isLiteral(?o) && datatype(?o) = xsd:string))} GROUP BY ?UserShape HAVING (COUNT(*)=1)}
      } UNION {
        { SELECT ?UserShape (COUNT(*) AS ?UserShape_c0) { ?UserShape foaf:givenName ?o . } GROUP BY ?UserShape HAVING (COUNT(*)>=1)}
        { SELECT ?UserShape (COUNT(*) AS ?UserShape_c1) { ?UserShape foaf:givenName ?o . FILTER ((isLiteral(?o) && datatype(?o) = xsd:string))} GROUP BY ?UserShape HAVING (COUNT(*)>=1)}
        FILTER (?UserShape_c0 = ?UserShape_c1)
        { SELECT ?UserShape { ?UserShape foaf:familyName ?o . } GROUP BY ?UserShape HAVING (COUNT(*)=1)}
        { SELECT ?UserShape { ?UserShape foaf:familyName ?o . FILTER ((isLiteral(?o) && datatype(?o) = xsd:string))} GROUP BY ?UserShape HAVING (COUNT(*)=1)}
      } } GROUP BY ?UserShape HAVING (COUNT(*) = 1)}
    { SELECT ?UserShape (COUNT(*) AS ?UserShape_c2) { ?UserShape foaf:mbox ?o . } GROUP BY ?UserShape HAVING (COUNT(*)<=1)}
    { SELECT ?UserShape (COUNT(*) AS ?UserShape_c3) { ?UserShape foaf:mbox ?o . FILTER (isIRI(?o)) } GROUP BY ?UserShape HAVING (COUNT(*)<=1)}
    FILTER (?UserShape_c2 = ?UserShape_c3)
    FILTER (?UserShape_c2 = ?UserShape_c3)
  } GROUP BY ?IssueShape }
  FILTER (?IssueShape_c0 = ?IssueShape_c1)
  OPTIONAL { ?IssueShape :reportedBy ?IssueShape_UserShape_ref0 . FILTER (isIRI(?IssueShape_UserShape_ref0) || isBlank(?IssueShape_UserShape_ref0)) }
  { SELECT ?IssueShape { ?IssueShape :reportedOn ?o . } GROUP BY ?IssueShape HAVING (COUNT(*)=1)}
  { SELECT ?IssueShape { ?IssueShape :reportedOn ?o . FILTER ((isLiteral(?o) && datatype(?o) = xsd:date))} GROUP BY ?IssueShape HAVING (COUNT(*)=1)}
  {
    { SELECT ?IssueShape (COUNT(*) AS ?IssueShape_c2) { ?IssueShape :reproducedBy ?o . } GROUP BY ?IssueShape}
    { SELECT ?IssueShape (COUNT(*) AS ?IssueShape_c3) { ?IssueShape :reproducedBy ?o . FILTER ((isIRI(?o) || isBlank(?o))) } GROUP BY ?IssueShape}
    FILTER (?IssueShape_c2 = ?IssueShape_c3)
    { SELECT ?IssueShape (COUNT(*) AS ?IssueShape_c5) { ?IssueShape :reproducedOn ?o . } GROUP BY ?IssueShape}
    { SELECT ?IssueShape (COUNT(*) AS ?IssueShape_c6) { ?IssueShape :reproducedOn ?o . FILTER ((isLiteral(?o) && datatype(?o) = xsd:date))} GROUP BY ?IssueShape}
    FILTER (?IssueShape_c5 = ?IssueShape_c6)
    FILTER (?IssueShape_c2=0 && ?IssueShape_c5=0 || ?IssueShape_c2>=1&&?IssueShape_c2<=1 && ?IssueShape_c5>=1&&?IssueShape_c5<=1)
  }
  { SELECT ?IssueShape (COUNT(*) AS ?IssueShape_c7) { ?IssueShape :related ?o . } GROUP BY ?IssueShape}
  { SELECT ?IssueShape (COUNT(*) AS ?IssueShape_c8) { ?IssueShape :related ?o . } GROUP BY ?IssueShape}
  FILTER (?IssueShape_c7 = ?IssueShape_c8)
  { SELECT ?UserShape WHERE {
    {
      { SELECT ?UserShape { ?UserShape foaf:name ?o . } GROUP BY ?UserShape HAVING (COUNT(*)=1)}
      { SELECT ?UserShape { ?UserShape foaf:name ?o . FILTER ((isLiteral(?o) && datatype(?o) = xsd:string)) } GROUP BY ?UserShape HAVING (COUNT(*)=1)}
    } UNION {
      { SELECT ?UserShape (COUNT(*) AS ?UserShape_c0) { ?UserShape foaf:givenName ?o . } GROUP BY ?UserShape HAVING (COUNT(*)>=1)}
      { SELECT ?UserShape (COUNT(*) AS ?UserShape_c1) { ?UserShape foaf:givenName ?o . FILTER ((isLiteral(?o) && datatype(?o) = xsd:string))} GROUP BY ?UserShape HAVING (COUNT(*)>=1)}
      FILTER (?UserShape_c0 = ?UserShape_c1)
      { SELECT ?UserShape { ?UserShape foaf:familyName ?o .
      } GROUP BY ?UserShape HAVING (COUNT(*)=1)}
      { SELECT ?UserShape { ?UserShape foaf:familyName ?o . FILTER ((isLiteral(?o) && datatype(?o) = xsd:string)) } GROUP BY ?UserShape HAVING (COUNT(*)=1)}
    } } GROUP BY ?UserShape HAVING (COUNT(*) = 1)}
  { SELECT ?UserShape (COUNT(*) AS ?UserShape_c2) { ?UserShape foaf:mbox ?o . } GROUP BY ?UserShape HAVING (COUNT(*)<=1)}
  { SELECT ?UserShape (COUNT(*) AS ?UserShape_c3) { ?UserShape foaf:mbox ?o . FILTER (isIRI(?o)) } GROUP BY ?UserShape HAVING (COUNT(*)<=1)}
  FILTER (?UserShape_c2 = ?UserShape_c3)
}
Shape Expression
Shape Expressions can be converted to SPARQL
But Shape Expressions are simpler and more readable for solving this problem
Shape Expressions Language
Schema = set of Shape Expressions
Shape Expression = labeled pattern: <label> { ...pattern... }
Typical pattern = conjunction of several expressions
Conjunction represented by ,
<IssueShape> {
  :status (:Assigned :Unassigned),
  :reportedBy @<UserShape>,
  :reportedOn xsd:date
  ...
}
In this example <IssueShape> is the label and each , marks a conjunction
Arcs
Basic expression: an Arc
Arc = name definition followed by value definition
<IssueShape> {
  :status (:Assigned :Unassigned),
  :reportedBy @<UserShape>,
  :reportedOn xsd:date
  ...
}
Example graph: :issue1 has the arcs :reportedBy :bob, :status :Unassigned and :reportedOn 23-01-2013
In each arc the predicate is matched by the name definition and the object by the value definition
Value definition
Value definitions can be:
Kind       | Example                   | Meaning
Value type | xsd:date                  | Matches a value of type xsd:date
Value set  | ( :Assigned :Unassigned ) | The object is an element of the given set
Reference  | @<UserShape>              | The object has shape <UserShape>
Stem       | foaf:~                    | Starts with the IRI associated with foaf
Any        | - :Checked                | Any value except :Checked
<IssueShape> {
  :status (:Assigned :Unassigned),
  :reportedBy @<UserShape>,
  :reportedOn xsd:date
  ...
}
Here (:Assigned :Unassigned) is a value set, @<UserShape> is a value reference and xsd:date is a value type
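One way to picture the kinds of value definition is as predicates over the object of an arc. The Python sketch below does this for value types, value sets, stems and "any except"; references are left out because they need the whole schema. The function names and the representation of literals as (lexical form, datatype) pairs are assumptions of the sketch, not part of any ShEx implementation.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

def value_type(datatype):
    """Value type: the object is a literal with the given datatype."""
    return lambda o: isinstance(o, tuple) and o[1] == datatype

def value_set(*members):
    """Value set: the object is one of the listed values."""
    return lambda o: o in members

def stem(prefix_iri):
    """Stem: the object is an IRI that starts with the given IRI."""
    return lambda o: isinstance(o, str) and o.startswith(prefix_iri)

def any_except(*excluded):
    """Any: every value is accepted except the excluded ones."""
    return lambda o: o not in excluded

is_date = value_type("xsd:date")
is_status = value_set(":Assigned", ":Unassigned")
in_foaf = stem(FOAF)
not_checked = any_except(":Checked")

print(is_date(("2013-01-23", "xsd:date")), is_status(":Checked"),
      in_foaf(FOAF + "name"), not_checked(":Assigned"))
```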
Name definition
Name definitions can be:
Kind      | Example     | Meaning
Name term | foaf:name   | Matches the given IRI
Name stem | foaf:~      | Any predicate that starts with foaf
Name any  | - foaf:name | Any predicate except foaf:name
<IssueShape> {
  :status (:Assigned :Unassigned),
  :reportedBy @<UserShape>,
  :reportedOn xsd:date
  ...
}
Here :status, :reportedBy and :reportedOn are name terms
Alternatives
Alternatives (disjunctions) are marked by |
Example 1: An agent has either foaf:name or rdfs:label
<Agent> {
  ( foaf:name xsd:string
  | rdfs:label xsd:string
  )
  ...
}
Example 2: A list of integers
<listOfInt> {
  rdf:first xsd:integer ,
  ( rdf:rest ( rdf:nil )
  | rdf:rest @<listOfInt>
  )
}
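The recursive <listOfInt> shape can be read as a recursive function: a node is a list of integers if its rdf:first is an integer and its rdf:rest is either rdf:nil or another list of integers. The Python sketch below follows that reading. The blank node names and the tuple encoding are assumptions, and rejecting cyclic lists is a simplification chosen for this sketch.

```python
XSD_INTEGER = "xsd:integer"
RDF_NIL = "rdf:nil"

def objects(node, pred, triples):
    return [o for s, p, o in triples if s == node and p == pred]

def is_list_of_int(node, triples, seen=()):
    if node in seen:
        return False  # cyclic list; rejected in this sketch
    firsts = objects(node, "rdf:first", triples)
    rests = objects(node, "rdf:rest", triples)
    if len(firsts) != 1 or len(rests) != 1:
        return False
    first, rest = firsts[0], rests[0]
    if not (isinstance(first, tuple) and first[1] == XSD_INTEGER):
        return False
    # Alternative: rdf:rest (rdf:nil) | rdf:rest @<listOfInt>
    return rest == RDF_NIL or is_list_of_int(rest, triples, seen + (node,))

GOOD = [
    ("_:l1", "rdf:first", ("1", XSD_INTEGER)),
    ("_:l1", "rdf:rest", "_:l2"),
    ("_:l2", "rdf:first", ("2", XSD_INTEGER)),
    ("_:l2", "rdf:rest", RDF_NIL),
]
BAD = [
    ("_:b1", "rdf:first", ("one", "xsd:string")),
    ("_:b1", "rdf:rest", RDF_NIL),
]

print(is_list_of_int("_:l1", GOOD), is_list_of_int("_:b1", BAD))
```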
Cardinalities
The same as in common regular expressions:
*      0 or more
+      1 or more
?      0 or 1
{m}    m repetitions
{m,n}  Between m and n repetitions
<IssueShape> {
  ...
  ( :reproducedBy @<UserShape>,
    :reproducedOn xsd:date
  )? ,
  :related @<IssueShape>*
}
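As a small illustration, each cardinality marker can be turned into a (minimum, maximum) pair and compared with the number of matching arcs. The function names below are hypothetical, and treating a missing marker as exactly one occurrence is an assumption based on how the example shapes are described.

```python
import re

def parse_cardinality(marker):
    """Return (min, max) for a cardinality marker; max None means unbounded."""
    fixed = {"": (1, 1), "*": (0, None), "+": (1, None), "?": (0, 1)}
    if marker in fixed:
        return fixed[marker]
    m = re.fullmatch(r"\{(\d+)(?:,(\d+))?\}", marker)
    if not m:
        raise ValueError(f"unknown cardinality: {marker!r}")
    low = int(m.group(1))
    # {m} means exactly m repetitions, {m,n} between m and n.
    high = int(m.group(2)) if m.group(2) is not None else low
    return (low, high)

def within(count, marker):
    """True if `count` occurrences satisfy the cardinality marker."""
    low, high = parse_cardinality(marker)
    return count >= low and (high is None or count <= high)

print(parse_cardinality("{2,4}"), within(0, "*"), within(0, "+"))
```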
Semantic actions
Define actions to be executed during validation
%lang{ ...actions... %}
Calls the lang processor, passing it the given actions
Example: Check that :reportedOn is before :reproducedOn
<Issue> {
  ...
  :reportedOn xsd:date
    %js{ report = _.o; return true; %} ,
  ( :reproducedBy @<UserShape> ,
    :reproducedOn xsd:date
      %js{ return _.o.lex > report.lex; %}
  ) ?
}
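The check performed by the two %js actions can be sketched outside a validator as a plain function. The Python below is only an analogy: the function name is hypothetical, and it compares parsed dates where the slide's action compares lexical forms. Like the slide's `>`, the comparison is strict, so an issue reproduced on the same day it was reported fails the check.

```python
from datetime import date

def reproduced_after_reported(reported_on, reproduced_on=None):
    """Mirror of the semantic actions: remember :reportedOn, then compare :reproducedOn."""
    reported = date.fromisoformat(reported_on)   # first action: store the value
    if reproduced_on is None:
        return True                              # optional group absent, nothing to check
    return date.fromisoformat(reproduced_on) > reported  # second action: strict comparison

print(reproduced_after_reported("2013-01-23", "2013-01-24"))
print(reproduced_after_reported("2013-01-23", "2013-01-23"))
```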
Semantics of Shape Expressions
Operational semantics using inference rules
- Inspired by the semantics of RelaxNG
- Formalism used to define type inference systems
- Matching infers shape typings
Axioms and rules of the form: [rule schema shown as a figure on the slide; its labels identify the context, the graph and the type assignment]
Example: matching rules [figure; its annotations read "Graph can be decomposed in g1 and g2" and "Combine typings t1 and t2"]
More details in the paper
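The two annotations on the matching rule (decompose the graph into g1 and g2, then combine typings t1 and t2) can be illustrated with a brute-force matcher for conjunctions. This Python sketch is not the formal system from the paper: the expression encoding, the exhaustive search over decompositions, and a typing that merely collects the (node, label) pairs required by reference arcs without verifying them are all assumptions made for illustration.

```python
from itertools import combinations

def decompositions(graph):
    """Yield every split of `graph` (a list of triples) into two parts g1, g2."""
    idx = range(len(graph))
    for k in range(len(graph) + 1):
        for chosen in combinations(idx, k):
            g1 = [graph[i] for i in chosen]
            g2 = [graph[i] for i in idx if i not in chosen]
            yield g1, g2

def match(expr, graph):
    """Return a typing (set of (node, label) pairs) if graph matches expr, else None."""
    kind = expr[0]
    if kind == "arc":
        _, pred, check, label = expr
        # An arc matches a graph made of exactly one suitable triple.
        if len(graph) == 1 and graph[0][1] == pred and check(graph[0][2]):
            return {(graph[0][2], label)} if label else set()
        return None
    if kind == "and":
        _, e1, e2 = expr
        for g1, g2 in decompositions(graph):
            t1 = match(e1, g1)
            if t1 is None:
                continue
            t2 = match(e2, g2)
            if t2 is not None:
                return t1 | t2  # combine typings t1 and t2
        return None
    raise ValueError(f"unknown expression kind: {kind}")

EXPR = ("and",
        ("arc", ":status", lambda o: o in (":Assigned", ":Unassigned"), None),
        ("arc", ":reportedBy", lambda o: isinstance(o, str), "UserShape"))
GRAPH = [(":Issue1", ":reportedBy", ":Bob"),
         (":Issue1", ":status", ":Unassigned")]

print(match(EXPR, GRAPH))
```

The search is exponential in the number of triples, which is acceptable for a sketch; the slides credit ShExcala with a more efficient approach using derivatives.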
Transforming RDF using ShEx
Semantic actions can be combined with specialized languages
Possible languages: sparql, js
Other examples:
- GenX = very simple language to generate XML
  Goal: Semantic lowering
  Map RDF clinical records to XML
- GenJ generates JSON
Example
RDF (Turtle), processed with Shape Expressions + GenX:
:Issue1 :status :Unassigned ;
  :reportedBy :Bob ;
  :reportedOn "2013-01-23"^^xsd:date ;
  :reproducedBy :Thompson.J ;
  :reproducedOn "2013-01-23"^^xsd:date .
:Bob foaf:name "Bob Smith" ;
  foaf:mbox <mail:[email protected]> .
:Thompson.J foaf:givenName "Joe", "Joseph" ;
  foaf:familyName "Thompson" ;
  foaf:mbox <mail:[email protected]> .
XML:
<issue xmlns="http://ex.example/xml" id="Issue1" status="Unassigned">
  <reported date="2013-01-23">
    <given-name>Bob</given-name>
    <family-name>Smith</family-name>
    <email>mail:[email protected]</email>
  </reported>
  <reproduced date="2013-01-23">
    <given-name>Joe</given-name>
    <given-name>Joseph</given-name>
    <family-name>Thompson</family-name>
    <email>mail:[email protected]</email>
  </reproduced>
</issue>
GenX
GenX syntax
$IRI     Generates elements in that namespace
<name>   Add element <name>
@<name>  Add attribute <name>
=<expr>  XPath function applied to the value
=        Don't emit the value
[-n]     Place the value up n values in the hierarchy
Example transforming RDF to XML
%GenX{ issue $http://ex.example/xml %}
<IssueShape> {
  ex:status (ex:unassigned ex:assigned) %GenX{@status =substr(19)%},
  ex:reportedBy @<UserShape> %GenX{ reported = %},
  ex:reportedOn xsd:date %GenX{ [-1]@date %},
  ( ex:reproducedBy @<UserShape>,
    ex:reproducedOn xsd:date %GenX{ @date %}
  )? %GenX{ reproduced = %},
  ex:related @<IssueShape>*
} %GenX{ @id %}
<UserShape> {
  ( foaf:name xsd:string %GenX{ full-name %}
  | foaf:givenName xsd:string+ %GenX{ given-name %} ,
    foaf:familyName xsd:string %GenX{ family-name %}
  ) ,
  foaf:mbox shex:IRI ? %GenX{ email %}
}
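To show what kind of output the GenX directives describe, the Python sketch below builds a comparable XML tree by hand with the standard library. It is not a GenX interpreter: the mapping is hard-coded, the dictionary layout of the input and the placeholder mailbox addresses are assumptions, and the namespace is written as a plain attribute for simplicity. Following the directives, a user with foaf:name yields a full-name element.

```python
import xml.etree.ElementTree as ET

NS = "http://ex.example/xml"

def user_elements(parent, user):
    """Emit the elements named by the <UserShape> directives."""
    if "name" in user:
        ET.SubElement(parent, "full-name").text = user["name"]
    else:
        for given in user["givenName"]:
            ET.SubElement(parent, "given-name").text = given
        ET.SubElement(parent, "family-name").text = user["familyName"]
    if "mbox" in user:
        ET.SubElement(parent, "email").text = user["mbox"]

def issue_to_xml(issue_id, issue, users):
    """Hand-written counterpart of the <IssueShape> directives."""
    root = ET.Element("issue", {"xmlns": NS, "id": issue_id,
                                "status": issue["status"]})
    reported = ET.SubElement(root, "reported", {"date": issue["reportedOn"]})
    user_elements(reported, users[issue["reportedBy"]])
    if "reproducedBy" in issue:  # optional group
        reproduced = ET.SubElement(root, "reproduced",
                                   {"date": issue["reproducedOn"]})
        user_elements(reproduced, users[issue["reproducedBy"]])
    return root

USERS = {
    "Bob": {"name": "Bob Smith", "mbox": "mailto:bob@example.org"},  # placeholder address
    "Thompson.J": {"givenName": ["Joe", "Joseph"], "familyName": "Thompson",
                   "mbox": "mailto:joe@example.org"},                # placeholder address
}
ISSUE = {"status": "Unassigned", "reportedBy": "Bob", "reportedOn": "2013-01-23",
         "reproducedBy": "Thompson.J", "reproducedOn": "2013-01-23"}

root = issue_to_xml("Issue1", ISSUE, USERS)
print(ET.tostring(root, encoding="unicode"))
```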
Current Implementations
Name       | Main developer     | Language   | Features
FancyDemo  | Eric Prud'hommeaux | Javascript | First implementation; Semantic Actions - GenX, GenJ; Conversion to SPARQL; http://www.w3.org/2013/ShEx/
JsShExTest | Jesse van Dam      | Javascript | Supports RDF and Compact syntax; https://github.com/jessevdam/shextest
ShExcala   | Jose E. Labra      | Scala      | Several extensions: negations, reverse arcs, relations,...; Efficient implementation using Derivatives; http://labra.github.io/ShExcala/
Haws       | Jose E. Labra      | Haskell    | Prototype to check inference semantics; http://labra.github.io/haws/
Applications to linked data portals
2 data portals: WebIndex and LandPortal
Data portal documentation:
http://weso.github.io/wiDoc/
http://weso.github.io/landportalDoc/data
<Observation> {
  cex:md5-checksum xsd:string ,
  cex:computation @<Computation> ,
  dcterms:issued xsd:integer ,
  dcterms:publisher ( wi-org:WebFoundation ),
  qb:dataSet @<Dataset> ,
  rdfs:label (@en) ,
  sdmx-concept:obsStatus @<ObsStatus> ,
  wi-onto:ref-area @<Area>,
  wi-onto:ref-indicator @<Indicator> ,
  wi-onto:ref-year xsd:int ,
  cex:value xsd:double,
  a ( qb:Observation )
}
<Observation> {
  cex:ref-area @<Area>,
  cex:ref-indicator @<Indicator>,
  cex:ref-time @<Time>,
  cex:value xsd:double? ,
  cex:computation @<Computation>,
  dcterms:issued xsd:dateTime,
  qb:dataSet @<DataSet>,
  qb:slice @<Slice>,
  rdfs:label xsd:string,
  lb:source @<Upload> ,
  a ( qb:Observation )
}
Same type: qb:Observation ...but different shapes
More info: Paper on Linked Data Quality Workshop
Conclusions
Shape Expressions = simple language
One goal: Describe and validate RDF graphs
Semantics of Shape Expressions
- Described using inference rules
- ...but Shape Expressions can be converted to SPARQL
- Compatible with other Semantic technologies
Semantic actions = Extensibility mechanism
- Can be applied to transform RDF
Future Work
Improve implementations and language
- Debugging and error messages
- Expressiveness and usability of language
- Performance evaluation
Shape Expressions = role similar to Schema for XML
Future applications:
- Online validators
- Interface generators
- Binding: generate parsers/tools from shapes
- Performance of RDF triplestores?
Future work at W3C
RDF Data Shapes WG chartered
Mailing list: [email protected]
"The discussion on [email protected] is the best entertainment since years; Game of Thrones colors pale." Paul Hermans (@PaulZH)
End of presentation
Slides available at: http://www.slideshare.net/jelabra/semantics-2014