
Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach


Page 1: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

http://aligned-project.eu COLD@ISWC, 18th October 2016, Kobe

Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

Monika Solanki
https://w3id.org/people/msolanki
@nimonika
University of Oxford

Joint work with Christian Mader
Fraunhofer IAIS, Germany

Page 2: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach


PoolParty (SWC) Use case

PoolParty (PPT): a leading commercial taxonomy management application and authoring tool for knowledge graphs. It provides taxonomy import functionality to interact with third-party datasets.

Taxonomists using PPT integrate a variety of models, schemata, ontologies and vocabularies into their knowledge bases.

Challenge: combining varied data sources while ensuring that these data mashups conform at all times to a set of quality heuristics, as expected by the data processing algorithms.

[email protected], @nimonika Constraint validation and repair for taxonomies

Page 3: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach


Motivation

Consuming and interlinking enterprise data and openly available data within an industry setting.

Ensuring that the interlinked datasets conform to a set of quality heuristics.

Interactively detecting and repairing datasets with constraint violations.

Page 4: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

Ensuring Data Consistency

Current: the checks ensuring that data persisted in the triple store does not violate data consistency are scattered across the code and are sometimes performed multiple times.

Requirements

Provide a mechanism to specify data constraints in a formal way.

Identify and analyse datasets that are imported into PPT and are a source of constraint violations.

Page 5: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

Constraint resolution

Current: the checks ensuring that data persisted in the triple store does not violate data consistency are scattered across the code and are sometimes performed multiple times.

Requirements

Provide a validation mechanism that checks for constraint violations, and evaluate it against the selected datasets.

Combine formal data constraint definitions with reusable repair strategies that end users can easily apply in a (semi-)automatic way.

Page 6: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

Dataset selection

SWC-generated: datasets for which a conversion to a PPT-compatible taxonomy has been performed by SWC (10 datasets).

Custom-generated: datasets for which a conversion to a PPT-compatible taxonomy has been performed by third-party institutions (9 datasets).

Web: datasets that use SKOS, but for which it is currently unknown whether they are compatible with PPT (7 datasets).

Page 7: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

Constraint specification

ConceptTypeAssertion (cta):

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT DISTINCT ?resource WHERE {
      ?resource skos:broader|skos:narrower ?otherRes .
      FILTER NOT EXISTS { ?resource a skos:Concept }
    }

HierarchicalConsistency (hc):

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT DISTINCT ?resource WHERE {
      ?resource a skos:Concept .
      FILTER NOT EXISTS {
        ?resource (skos:broader|^skos:narrower)*/skos:topConceptOf ?parent .
        ?parent a skos:ConceptScheme .
      }
    }
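To make the two checks concrete, here is a minimal Python sketch that mirrors the logic of the cta and hc queries over an in-memory list of triples. It is an illustration only: the tuple-based data model, the function names and the `EXAMPLE` data are assumptions made for this sketch, not part of PoolParty or of the authors' implementation.

```python
# Triples are plain (subject, predicate, object) string tuples; this is an
# illustrative stand-in for a triple store, not PoolParty code.
RDF_TYPE = "rdf:type"


def concept_type_assertion(triples):
    """cta: subjects of skos:broader/skos:narrower that are not typed skos:Concept."""
    concepts = {s for s, p, o in triples if p == RDF_TYPE and o == "skos:Concept"}
    related = {s for s, p, o in triples if p in ("skos:broader", "skos:narrower")}
    return related - concepts


def hierarchical_consistency(triples):
    """hc: concepts with no upward path to a top concept of a concept scheme."""
    concepts = {s for s, p, o in triples if p == RDF_TYPE and o == "skos:Concept"}
    schemes = {s for s, p, o in triples if p == RDF_TYPE and o == "skos:ConceptScheme"}
    tops = {s for s, p, o in triples if p == "skos:topConceptOf" and o in schemes}
    # One upward step: x skos:broader y, or y skos:narrower x.
    parents = {}
    for s, p, o in triples:
        if p == "skos:broader":
            parents.setdefault(s, set()).add(o)
        elif p == "skos:narrower":
            parents.setdefault(o, set()).add(s)
    violations = set()
    for concept in concepts:
        seen, stack, anchored = {concept}, [concept], False
        while stack and not anchored:
            node = stack.pop()
            anchored = node in tops
            for parent in parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        if not anchored:
            violations.add(concept)
    return violations


# Hypothetical example data (not taken from the evaluated datasets).
EXAMPLE = [
    ("ex:scheme", RDF_TYPE, "skos:ConceptScheme"),
    ("ex:animal", RDF_TYPE, "skos:Concept"),
    ("ex:animal", "skos:topConceptOf", "ex:scheme"),
    ("ex:dog", RDF_TYPE, "skos:Concept"),
    ("ex:dog", "skos:broader", "ex:animal"),
    ("ex:orphan", RDF_TYPE, "skos:Concept"),
    ("ex:untyped", "skos:narrower", "ex:orphan"),
]
```

In this example `ex:untyped` takes part in a hierarchical relation without being typed as a concept, and `ex:orphan` cannot reach any top concept, so each check should flag exactly one resource.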

Page 8: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

Validation using SHACL

HierarchicalConsistency (hc):

    ppts:ConceptShape
        a sh:Shape ;
        sh:scopeClass skos:Concept ;
        sh:property [
            a sh:PropertyConstraint ;
            sh:predicate skos:prefLabel ;
            sh:minCount 1 ;
            sh:minLength 1 ;
            sh:datatype rdf:langString ;
            sh:uniqueLang true
        ] ;
        sh:constraint [
            a sh:Constraint ;
            a sh:OrConstraint ;
            sh:shapes (ppts:ConceptHasBroaderShape ppts:ConceptIsTopConceptShape)
        ] .
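As a rough illustration of what this shape asks of each concept, the following Python sketch checks the same conditions on plain values. It does not use a SHACL engine. The function names are hypothetical, and the reading of the two referenced shapes (a concept either has a broader concept or is a top concept) is an assumption based on their names, since their definitions are not shown on the slide.

```python
from collections import Counter


def check_pref_labels(labels):
    """Check a concept's skos:prefLabel values, given as (text, lang) pairs.

    Returns the names of the violated constraints, in a fixed order.
    """
    problems = []
    if len(labels) < 1:
        problems.append("minCount")      # sh:minCount 1
    if any(len(text) < 1 for text, _ in labels):
        problems.append("minLength")     # sh:minLength 1
    if any(not lang for _, lang in labels):
        problems.append("datatype")      # rdf:langString needs a language tag
    per_lang = Counter(lang for _, lang in labels if lang)
    if any(count > 1 for count in per_lang.values()):
        problems.append("uniqueLang")    # sh:uniqueLang true
    return problems


def check_concept(labels, has_broader, is_top_concept):
    """Label checks plus the OR constraint (assumed meaning: broader OR top concept)."""
    problems = check_pref_labels(labels)
    if not (has_broader or is_top_concept):
        problems.append("or")
    return problems
```

For instance, `check_concept([("Dog", "en"), ("Hound", "en")], False, True)` reports only `uniqueLang`, because the concept is a top concept but carries two English preferred labels.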

Page 9: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

Repair strategies

AddInverseStrategy

    ppts:ConceptHavingBroader
        a sh:Shape ;
        sh:scope [
            a sh:Scope ;
            a sh:PropertyScope ;
            sh:predicate skos:broader
        ] ;
        sh:inverseProperty [
            a sh:InversePropertyConstraint ;
            sh:predicate skos:narrower ;
            sh:minCount 1 ;
            rs:strategy [ a rs:AddInverseStrategy ]
        ] .
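The effect of such a strategy can be sketched in a few lines of Python. This is a minimal sketch under the assumption that AddInverseStrategy adds the missing skos:narrower statement for every skos:broader statement; the function name and the tuple-based triples are hypothetical and do not reflect the actual rs: implementation.

```python
def add_inverse_changeset(triples, predicate="skos:broader", inverse="skos:narrower"):
    """Return the triples to add so every `predicate` statement has its inverse."""
    existing = set(triples)
    additions = []
    for s, p, o in triples:
        if p != predicate:
            continue
        missing = (o, inverse, s)
        if missing not in existing:
            existing.add(missing)      # avoid proposing the same triple twice
            additions.append(missing)
    return additions


# Hypothetical example data: the inverse for ex:cat is present, for ex:dog it is not.
DATA = [
    ("ex:dog", "skos:broader", "ex:animal"),
    ("ex:cat", "skos:broader", "ex:animal"),
    ("ex:animal", "skos:narrower", "ex:cat"),
]
CHANGESET = add_inverse_changeset(DATA)
```

Returning a changeset instead of mutating the data keeps the repair reviewable, which fits the (semi-)automatic, user-applied repair described earlier. Running the function again on `DATA + CHANGESET` should yield an empty changeset.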

Page 10: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

Implementation

SHACL implementation (TopQuadrant), Sesame, SWC libraries ⇒ Java application

SKOS data model, dataset file, constraint specification ⇒ violation report

Violation report, SKOS data model, dataset file, constraint specification ⇒ triples changeset

Not yet optimised for runtime performance.
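The data flow above (dataset and constraints in, violation report out, then report in, changeset out) can be outlined as follows. The real system is a Java application; this Python outline only illustrates the two stages, and every name in it (`Constraint`, `Violation`, `validate`, `build_changeset`, `INVERSE_NARROWER`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Violation:
    constraint: str   # name of the violated constraint
    focus: Triple     # the offending triple


@dataclass
class Constraint:
    name: str
    find: Callable[[Set[Triple]], List[Triple]]   # offending triples in a dataset
    repair: Callable[[Triple], List[Triple]]      # triples to add for one offender


def validate(dataset: Set[Triple], constraints: List[Constraint]) -> List[Violation]:
    """Stage 1: dataset + constraint specification -> violation report."""
    return [Violation(c.name, t) for c in constraints for t in c.find(dataset)]


def build_changeset(report: List[Violation], constraints: List[Constraint]) -> List[Triple]:
    """Stage 2: violation report + constraint specification -> triples changeset."""
    by_name = {c.name: c for c in constraints}
    changes: List[Triple] = []
    for violation in report:
        changes.extend(by_name[violation.constraint].repair(violation.focus))
    return changes


# Hypothetical constraint: every skos:broader needs the inverse skos:narrower.
INVERSE_NARROWER = Constraint(
    name="inverse-narrower",
    find=lambda ds: sorted(
        t for t in ds
        if t[1] == "skos:broader" and (t[2], "skos:narrower", t[0]) not in ds
    ),
    repair=lambda t: [(t[2], "skos:narrower", t[0])],
)
```

Keeping the repair function next to the check in one `Constraint` object is one way to reflect the idea of pairing a formal constraint definition with a reusable repair strategy.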

Page 11: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

Validation results

cta was never violated in datasets converted to PPT taxonomies.

upl is a SKOS-level constraint and is better respected by vocabulary providers.

Violations were observed across all datasets.

Page 12: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

Validation performance

Omitted 10 datasets that contained ≤ 50,000 triples.

No correlation between dataset size and the time taken to perform the validation.

The structure of the dataset makes a difference.

Page 13: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

Repair strategy execution performance

Repair strategy applied to a special case of the constraint br (BidirectionalRelationsHierarchical).

Only considered skos:broader and skos:narrower; did not consider owl:inverseOf.

Repair scales well even with larger datasets.

Page 14: Towards maintainable constraint validation and repair for taxonomies: The PoolParty approach

Summary and Conclusions

Interwoven SHACL-based data consistency specification and validation with repair strategies.

Validation of datasets generated by PPT can be done with reasonable performance.

Integrating repair strategies and data constraint specification helps in building a unified, maintainable model.

The model also plays a pivotal role in harmonizing data and software development processes.
