Validating RDF Data Quality using Constraints
to Direct the Development of Constraint Languages
Thomas Hartmann
Benjamin Zapilko, Joachim Wackerow, Kai Eckert
IEEE International Conference on Semantic Computing (ICSC 2016)
XML Validation
<!ELEMENT library (book+, author*)>
<!ELEMENT book (isbn, title, author-ref+)>
<!ATTLIST book
id ID #REQUIRED
>
<!ELEMENT author-ref EMPTY>
<!ATTLIST author-ref
id IDREF #REQUIRED
>
<!ELEMENT author (name)>
<!ATTLIST author
id ID #REQUIRED
>
<!ELEMENT isbn (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT name (#PCDATA)>
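The ID/IDREF declarations in the DTD above mean that every author-ref must point to a declared author. Python's standard xml.etree.ElementTree does not validate against a DTD, so the following is only a minimal sketch that checks this one rule by hand. The sample document and the function name dangling_author_refs are hypothetical and not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document shaped like the DTD above.
# The second author-ref points to an author that is never declared.
SAMPLE = """
<library>
  <book id="b1">
    <isbn>0-000-00000-0</isbn>
    <title>Example Title</title>
    <author-ref id="a1"/>
    <author-ref id="a2"/>
  </book>
  <author id="a1"><name>Example Author</name></author>
</library>
"""


def dangling_author_refs(xml_text):
    """Return author-ref ids that match no declared author id."""
    root = ET.fromstring(xml_text)
    declared = {author.get("id") for author in root.iter("author")}
    return [
        ref.get("id")
        for ref in root.iter("author-ref")
        if ref.get("id") not in declared
    ]


print(dangling_author_refs(SAMPLE))  # ['a2']
```

A validating XML parser would report the same problem as an unresolved IDREF, along with any violations of the content models.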
RDF Validation Workshop
Working Groups on RDF Validation
W3C Data Shapes Working Group
DCMI RDF Application Profiles Task Group
http://purl.org/net/rdf-validation
81 Types of Constraints on RDF Data
Constraint Languages
SPARQL Query Language for RDF
SELECT ?concept
WHERE {
?concept a [ rdfs:subClassOf* skos:Concept ] .
FILTER NOT EXISTS {
?concept ?p ?o .
FILTER ( ?p IN (
skos:related,
skos:relatedMatch,
skos:broader, ... ) ) . } }
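To make the intent of the query above concrete, here is a rough sketch of the same check over an in-memory set of triples: it reports concepts that are not the subject of any of the listed relations. The data, the ex: names, and the helper functions are hypothetical. Only the three properties visible in the query are used, because the rest of the list is elided in the slide.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Only the three properties visible in the query; the remainder is elided ("...").
RELATIONS = {"skos:related", "skos:relatedMatch", "skos:broader"}

# Hypothetical example data as (subject, predicate, object) tuples.
TRIPLES = {
    ("ex:c1", RDF_TYPE, "skos:Concept"),
    ("ex:c1", "skos:broader", "ex:c2"),
    ("ex:c2", RDF_TYPE, "skos:Concept"),
    ("ex:c3", RDF_TYPE, "ex:Topic"),
    ("ex:Topic", SUBCLASS, "skos:Concept"),
}


def subclasses_of(triples, cls):
    """Return cls plus every class reachable via rdfs:subClassOf (zero or more steps)."""
    found, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == SUBCLASS and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def isolated_concepts(triples):
    """Concepts that are not the subject of any triple using a listed relation."""
    classes = subclasses_of(triples, "skos:Concept")
    concepts = {s for s, p, o in triples if p == RDF_TYPE and o in classes}
    related = {s for s, p, o in triples if p in RELATIONS}
    return sorted(concepts - related)


print(isolated_concepts(TRIPLES))  # ['ex:c2', 'ex:c3']
```

Like the query, the sketch only looks at triples in which the concept is the subject, so ex:c2 is reported even though ex:c1 points to it.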
SPARQL Inferencing Notation (SPIN)
# FILTER NOT EXISTS { ?book author ?person }
[ a sp:Filter ;
sp:expression [
a sp:notExists ;
sp:elements (
[ sp:subject [ sp:varName "book" ] ;
sp:predicate author ;
sp:object [ sp:varName "person" ]])]])
Web Ontology Language (OWL)
:Publication rdfs:subClassOf
[ a owl:Restriction ;
owl:onProperty :author ;
owl:allValuesFrom :Person ] .
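Under OWL's open-world semantics, the axiom above would let a reasoner infer that every author of a publication is a person. Read as a constraint instead, it flags authors that are not typed as :Person. The following sketch illustrates the constraint reading only; the data and the function name authors_not_persons are hypothetical.

```python
RDF_TYPE = "rdf:type"

# Hypothetical example data as (subject, predicate, object) tuples.
TRIPLES = {
    (":pub1", RDF_TYPE, ":Publication"),
    (":pub1", ":author", ":alice"),
    (":pub1", ":author", ":acme"),
    (":alice", RDF_TYPE, ":Person"),
    (":acme", RDF_TYPE, ":Organization"),
}


def authors_not_persons(triples):
    """Return (publication, author) pairs whose author is not typed as :Person."""
    publications = {s for s, p, o in triples if p == RDF_TYPE and o == ":Publication"}
    persons = {s for s, p, o in triples if p == RDF_TYPE and o == ":Person"}
    return sorted(
        (s, o)
        for s, p, o in triples
        if p == ":author" and s in publications and o not in persons
    )


print(authors_not_persons(TRIPLES))  # [(':pub1', ':acme')]
```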
Shape Expressions (ShEx)
:Publication {
( :isbn xsd:string, :title xsd:string )
|
( :issn xsd:string, :title xsd:string )}
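The shape above offers two alternatives: an ISBN with a title, or an ISSN with a title. A simplified reading can be sketched as follows, assuming each listed property occurs exactly once, values are plain strings, and properties outside the shape are ignored. The function name and the dictionary representation of a node are hypothetical.

```python
def conforms_to_publication(props):
    """Check one node against the two alternatives of the :Publication shape.

    props maps a property name to the list of its values for that node.
    """

    def one_string(name):
        values = props.get(name, [])
        return len(values) == 1 and isinstance(values[0], str)

    def absent(name):
        return not props.get(name)

    # Alternative 1: ( :isbn xsd:string, :title xsd:string )
    book = one_string(":isbn") and one_string(":title") and absent(":issn")
    # Alternative 2: ( :issn xsd:string, :title xsd:string )
    serial = one_string(":issn") and one_string(":title") and absent(":isbn")
    return book or serial


print(conforms_to_publication({":isbn": ["0-000-00000-0"], ":title": ["A"]}))  # True
print(conforms_to_publication({":title": ["A"]}))  # False
```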
Resource Shapes (ReSh)
:Computer-Science-Book
a oslc:ResourceShape ;
oslc:property [
oslc:propertyDefinition :subject ;
oslc:allowedValues [
oslc:allowedValue
"Computer Science" ,
"Informatics" ,
"Information Technology" ] ] .
Description Set Profiles (DSP)
[ a dsp:DescriptionTemplate ;
dsp:resourceClass :Science-Fiction-Book ;
dsp:statementTemplate [
dsp:property :subject ;
dsp:nonLiteralConstraint [
dsp:valueClass skos:Concept ;
dsp:valueURI
:Science-Fiction, :Sci-Fi, :SF ;
dsp:vocabularyEncodingScheme
:Science-Fiction-Book-Subjects ; ] ] .
Shapes Constraint Language (SHACL)
:BookShape
a sh:Shape ;
sh:scopeClass :Book ;
sh:property [
sh:predicate :author ;
sh:valueShape :PersonShape ;
sh:minCount 1 ; ] .
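The shape above requires every :Book to have at least one :author, and each author to conform to :PersonShape. The slides do not define :PersonShape, so this sketch takes it as a caller-supplied check. The data, the function validate_book_shape, and the person_ok parameter are all hypothetical, and this is not how a SHACL engine is implemented.

```python
RDF_TYPE = "rdf:type"

# Hypothetical example data as (subject, predicate, object) tuples.
TRIPLES = {
    (":book1", RDF_TYPE, ":Book"),
    (":book1", ":author", ":alice"),
    (":book2", RDF_TYPE, ":Book"),
    (":book3", RDF_TYPE, ":Book"),
    (":book3", ":author", ":acme"),
    (":alice", RDF_TYPE, ":Person"),
}


def validate_book_shape(triples, person_ok):
    """Return (book, message) violations of the :BookShape sketch."""
    books = sorted(s for s, p, o in triples if p == RDF_TYPE and o == ":Book")
    violations = []
    for book in books:
        authors = sorted(o for s, p, o in triples if s == book and p == ":author")
        if len(authors) < 1:
            violations.append((book, "sh:minCount 1 violated for :author"))
        for author in authors:
            if not person_ok(author):
                violations.append((book, f"{author} does not conform to :PersonShape"))
    return violations


def is_typed_person(node):
    """Stand-in for :PersonShape (an assumption): the node is typed as :Person."""
    return (node, RDF_TYPE, ":Person") in TRIPLES


for violation in validate_book_shape(TRIPLES, is_typed_person):
    print(violation)
```

With this data, :book2 is reported for the missing author and :book3 for an author that fails the stand-in person check.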
Constraint Types Classification
1. RDFS/OWL Based
2. Constraint Language Based
3. SPARQL Based
RDFS/OWL Based
:Publication rdfs:subClassOf
[ a owl:Restriction ;
owl:onProperty :author ;
owl:allValuesFrom :Person ] .
Constraint Language Based
:Publication {
( :isbn xsd:string, :title xsd:string )
|
( :issn xsd:string, :title xsd:string )}
SPARQL Based
SELECT ?concept
WHERE {
?concept a [ rdfs:subClassOf* skos:Concept ] .
FILTER NOT EXISTS {
?concept ?p ?o .
FILTER ( ?p IN (
skos:related,
skos:relatedMatch,
skos:broader, ... ) ) . } }
Constraints Classification
1. Informational
2. Warning
3. Error
Evaluation Setup
• 115 constraints from vocabularies and experts
• constraints classified and implemented
• on 3 vocabularies in the SBE sciences
– well-established vocabularies (QB, SKOS)
– vocabulary under development (DDI-RDF)
Validated Data Sets
Vocabulary   Data Sets   Triples
QB           9,990       3,775,983,610
SKOS         4,178       477,737,281
DDI-RDF      1,526       9,673,055
Total        15,694      4.26 billion
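The totals in the table can be re-derived from the per-vocabulary rows. The short check below uses only the figures shown above; the variable names are illustrative.

```python
# (data sets, triples) per vocabulary, copied from the table above.
ROWS = {
    "QB": (9_990, 3_775_983_610),
    "SKOS": (4_178, 477_737_281),
    "DDI-RDF": (1_526, 9_673_055),
}

total_sets = sum(sets for sets, _ in ROWS.values())
total_triples = sum(triples for _, triples in ROWS.values())

print(total_sets)                            # 15694
print(total_triples)                         # 4263393946
print(f"{total_triples / 1e9:.2f} billion")  # 4.26 billion
```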
33 SPARQL Endpoints
Finding 1
Constraint type   C [%]   CV [%]
SPARQL            63.2    78.2
CL                34.7    21.8
RDFS/OWL          35.6    21.8
C (constraints), CV (constraint violations)
Finding 2
Constraint type   C [%]   CV [%]
SPARQL            63.2    78.2
CL                34.7    21.8
RDFS/OWL          35.6    21.8
C (constraints), CV (constraint violations)
Finding 3
Severity level   C [%]   CV [%]
Info             42.3    31.3
Warning          18.7    62.7
Error            39.0    6.1
C (constraints), CV (constraint violations)
Limitations
> 3 Vocabularies
> 1 Domain