Semi-Automatic Quality Assessment of Linked Data without Requiring Ontology Saemi Jang, Megawati, Jiyeon Choi, and Mun Yong Yi KIRD, KAIST NLP&DBPEDIA 2015 WORKSHOP


Semi-Automatic Quality Assessment of Linked Data without Requiring Ontology

Saemi Jang, Megawati, Jiyeon Choi, and Mun Yong Yi

KIRD, KAIST

NLP&DBPEDIA 2015 WORKSHOP

2

Motivation

• DBpedia
  • extracts structured information from Wikipedia
  • example: Wikipedia page on Pope Saint Felix III

dbpedia:Pope_Felix_III  dbo:birthPlace  dbpedia:Rome
dbpedia:Pope_Felix_III  dbo:deathPlace  dbpedia:Odoacer

3

Motivation

• Errors in DBpedia
  • Incorrect data: type, datatype, value
  • Ambiguity: URI, property
• Quality of the data has become important

dbpedia:Pope_Felix_III  dbo:birthPlace  dbpedia:Rome     (rdf:type dbo:Place)
dbpedia:Pope_Felix_III  dbo:deathPlace  dbpedia:Odoacer  (rdf:type dbo:Person)  Error

4

Motivation

• Data Quality Assessment
  • TripleCheckMate[3], LinkQA[6], WIQA[7], DaCura[8]
  • Based on an ontology that is built from the target data (e.g. DBpedia)
• But
  • It is not feasible for data that has no ontology
  • Ontology generation is difficult and time-consuming work
  • Automatic ontology generation works only for English and limited domains

5

Introduction

• Goal
  • Quality assessment of linked data without requiring ontology
• Idea
  • A large portion of the data in a knowledge resource is valid data
  • Analyze the data patterns in the resource and take the patterns that appear frequently
  • Evaluate the quality based on the patterns

6

Overview of approach

7

Quality Assessment Criteria

• Data Quality Test Pattern (DQTP)
  • DQTP = tuple(V, S)
  • V is a set of typed pattern variables, S is a SPARQL query template
• RDF triples (subject, predicate, object)
  • Domain is all possible types that the subject can have
  • Range is all possible types that the object can have
  • A literal value must have the data type determined by the property used
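To make the (V, S) tuple concrete, here is a minimal Python sketch of a DQTP and its instantiation into a test case. The class name, the `$P1`/`$T1` placeholder syntax, and the query text are illustrative assumptions; the slides do not show the actual template format.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class DQTP:
    """A Data Quality Test Pattern: typed variables V plus a SPARQL template S."""
    variables: dict  # V: variable name -> kind of value it expects
    template: str    # S: SPARQL text with $-placeholders for the variables

    def instantiate(self, **bindings: str) -> str:
        """Bind every pattern variable and return a concrete SPARQL query."""
        missing = set(self.variables) - set(bindings)
        if missing:
            raise ValueError(f"unbound pattern variables: {sorted(missing)}")
        return Template(self.template).substitute(bindings)


# Hypothetical range pattern: report objects of property P1 that lack type T1.
RANGE_PATTERN = DQTP(
    variables={"P1": "property", "T1": "class"},
    template=(
        "SELECT ?s ?o WHERE {\n"
        "  ?s $P1 ?o .\n"
        "  FILTER NOT EXISTS { ?o a $T1 }\n"
        "}"
    ),
)

query = RANGE_PATTERN.instantiate(P1="dbo:deathPlace", T1="dbo:Place")
print(query)
```

Binding a different property and type to the same pattern yields another test case, which is what lets one pattern cover many properties.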

8

Test Case Pattern Generation Algorithm

Property        Object               Object Type
dbo:occupation  dbr:Freddie_Mercury  foaf:Person
                dbr:Michael_Jackson  dbo:Person
                dbr:Alfred_Nobel     foaf:Person
                dbr:Alfred_Nobel     dbo:Agent

Knowledge Resource

STEP 1  Check the pattern in knowledge resource
STEP 2  Compute appearance ratio of each pattern
STEP 3  Select top k patterns & compute ratio
STEP 4  Set threshold (average of top k ratio)
STEP 5  Build test case pattern

Property    Object               Object Type
dbo:Artist  dbr:Freddie_Mercury  dbo:Person
            dbr:Michael_Jackson  dbo:Person
            dbr:Alfred_Nobel     foaf:Person
            dbr:Alfred_Nobel     dbo:Agent

Property        Object       Object Type
dbo:deathPlace  dbr:London   schema:Place
                dbr:Chicago  dbo:Place
                dbr:Paris    dbo:Wikidata:Q532
                dbr:Seoul    dbo:Place

Example: Range pattern (dbo:deathPlace)

Property        Top 5 type          Ratio
dbo:occupation  dbo:Place           32.8004
                schema:Place        32.8004
                dbo:Wikidata:Q532   32.8004
                dbo:PopulatedPlace  0.2368
                dbo:Settlement      0.2368

Property        Top 5 type          Ratio
dbo:deathPlace  dbo:Place           17.0458
                schema:Place        17.0458
                dbo:Wikidata:Q532   17.0458
                dbo:PopulatedPlace  15.0166
                dbo:Settlement      13.7303

Average of top 5 ratio = Threshold (e.g. 17%)

Test case pattern

Property        Pattern type
dbo:deathPlace  dbo:Place, schema:Place, dbo:Wikidata:Q532
dbo:birthPlace  dbo:Place, schema:Place, dbo:Wikidata:Q532
dbo:spouse      dbo:Person, foaf:Person, schema:Person
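The five steps above can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the slides do not define how the appearance ratio is normalised or how the threshold is applied, so the sketch takes the ratio as a type's percentage share of the observed type rows and keeps the types whose ratio is at or above the top-k average. The function names are hypothetical.

```python
from collections import Counter


def appearance_ratios(observed_types):
    """Steps 1-2: percentage share of each type among the observed type rows."""
    counts = Counter(observed_types)
    total = sum(counts.values())
    return {t: 100.0 * n / total for t, n in counts.items()}


def build_pattern(ratios, k=5):
    """Steps 3-5: average the top-k ratios and keep the types at or above it."""
    top_k = sorted(ratios.items(), key=lambda item: item[1], reverse=True)[:k]
    if not top_k:
        return 0.0, []
    threshold = sum(ratio for _, ratio in top_k) / len(top_k)
    # Assumed selection rule: a type joins the pattern if ratio >= threshold.
    selected = [t for t, ratio in top_k if ratio >= threshold]
    return threshold, selected


# Ratios copied from the dbo:deathPlace example on this slide.
death_place = {
    "dbo:Place": 17.0458,
    "schema:Place": 17.0458,
    "dbo:Wikidata:Q532": 17.0458,
    "dbo:PopulatedPlace": 15.0166,
    "dbo:Settlement": 13.7303,
}
threshold, types = build_pattern(death_place)
print(f"{threshold:.2f}", types)
```

With the five ratios listed, the average comes to about 15.98, and the three types that clear it are the same three that appear in the dbo:deathPlace test case pattern.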

9

Evaluation of approach

1) Test Case Pattern Generation
  • Compare the approach patterns and the benchmark patterns
    – The approach generates patterns without using an ontology
    – The benchmark generates patterns using an ontology

2) Quality Assessment Accuracy
  • Evaluate a localized DBpedia which does not have an ontology

10

Validation 1) Test Case Pattern Generation

• Ground truth
  • RDFUnit[4] compiled a library of data quality test case patterns for quality assessment
  • Ontology of English DBpedia

• Definition of Test Case Patterns

Approach                        RDFUnit     Definition
Domain Quality Pattern (DQP)    RDFSDOMAIN  The attribution of a resource's property (with a certain value) is only valid if the resource is of a certain type.
Range Quality Pattern (RQP)     RDFSRANGE   The attribution of a resource's property is only valid if the value is of a certain type.
Datatype Quality Pattern (TQP)  RDFSRANGED  The attribution of a resource's property is only valid if the literal value has a certain datatype.
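The three pattern kinds can be read as set-membership checks on a triple. The sketch below is a toy illustration: the type assignments, the dbo:birthDate datatype pattern, and the rule that one matching type is enough to pass are all assumptions, and `violations` is a hypothetical helper rather than part of RDFUnit or of the approach.

```python
# Toy type assignments (assumed for illustration only).
TYPES = {
    "dbpedia:Pope_Felix_III": {"dbo:Person", "dbo:Agent"},
    "dbpedia:Rome": {"dbo:Place"},
    "dbpedia:Odoacer": {"dbo:Person"},
}

# Pattern types per property; the dbo:birthDate entry is hypothetical.
DQP = {"dbo:deathPlace": {"dbo:Agent", "dbo:Person"}}
RQP = {"dbo:deathPlace": {"dbo:Place", "dbo:PopulatedPlace", "dbo:Wikidata:Q532"}}
TQP = {"dbo:birthDate": {"xsd:date"}}


def violations(subject, prop, obj, obj_datatype=None):
    """Return the pattern kinds (DQP, RQP, TQP) that this triple fails."""
    failed = []
    # Domain check: the subject needs at least one of the pattern types.
    if prop in DQP and not (TYPES.get(subject, set()) & DQP[prop]):
        failed.append("DQP")
    if obj_datatype is None:
        # Resource object: range check on the object's types.
        if prop in RQP and not (TYPES.get(obj, set()) & RQP[prop]):
            failed.append("RQP")
    elif prop in TQP and obj_datatype not in TQP[prop]:
        # Literal object: datatype check.
        failed.append("TQP")
    return failed


print(violations("dbpedia:Pope_Felix_III", "dbo:deathPlace", "dbpedia:Odoacer"))
print(violations("dbpedia:Pope_Felix_III", "dbo:deathPlace", "dbpedia:Rome"))
```

Under these toy assignments the first triple fails only the range check, since its object is typed as a person rather than a place.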

11

Validation 1) Test Case Pattern Generation

• Data

                 Property  DQP   RQP  TQP
English DBpedia  2750      1368  601  739

DBpedia 2015 (dbo, dbp)

• Test Case Pattern Generation
  • Top 5 type average ratio is 22% for DQP, 17% for RQP
  • For TQP, most of the triples have a single data pattern
  • The approach generates patterns from the triples in DBpedia, whereas RDFUnit uses the ontology

Pattern  Property        Pattern type
DQP      dbo:deathPlace  dbo:Agent, dbo:Person
RQP      dbo:deathPlace  dbo:Place, dbo:PopulatedPlace, dbo:Wikidata:Q532

12

Validation 1) Test Case Pattern Generation

Figure: bar charts of A and B (in %) for each pattern type

     A     B
DQP  99.2  89.4
RQP  97.8  80.2
TQP  99.0  67.7

A: pattern generation rate
B: pattern generation accuracy of the approach
Quantities named in the figure: total number of patterns with benchmark, total number of generated patterns with approach, total number of consistent patterns with approach

13

Validation 1) Test Case Pattern Generation

Figure: A (pattern generation rate) and B (pattern generation accuracy), in %
DQP: A 99.2, B 89.4; RQP: A 97.8, B 80.2; TQP: A 99.0, B 67.7

In the case of TQP, the generated patterns are equivalent in meaning to those of RDFUnit, but they come from different resources, e.g. rdf:langString, xsd:string

14

Validation 2) Quality Assessment Accuracy

• How to validate the quality assessment accuracy?

The approach is able to handle a localized DBpedia and evaluate the quality of its data

• The localized versions of DBpedia in 125 languages do not have their own ontologies

• Most of the labels in the DBpedia Ontology are English labels

15

Validation 2) Quality Assessment Accuracy

• Data
  • Localized version of DBpedia (Korean DBpedia)
  • 32 million triples with 18617 different properties
  • 1070 localized properties that are carried by more than 100 triples
• Test Case Pattern Generation
  • Top 5 type average ratio is 18% for DQP, 16% for RQP
  • For TQP, most of the triples have a single data pattern, which covers not only the datatype but also the language tag (e.g. @en)

                Property  DQP  RQP  TQP
Korean DBpedia  1070      955  317  166

Pattern  Property                  Pattern type
DQP      dbo:죽은곳 (=deathPlace)  dbo:Agent, dbo:Person
RQP      dbo:죽은곳 (=deathPlace)  dbo:Place, dbo:PopulatedPlace, dbo:Wikidata:Q532

Korean DBpedia 2015

16

Validation 2) Quality Assessment Accuracy

• Result of Data Quality Assessment
  • 1438 test case patterns generated by 1070 properties
  • 1.4 million triples tested from Korean DBpedia

          Triples    TC         Pass       Error    Error rate
Total     1,492,331  2,452,023
Domain               1,470,389  1,075,953  394,436  26.82%
Range                613,535    176,423    437,112  71.24%
Datatype             368,099    309,286    58,813   15.97%
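The error rates in the table follow directly from the Error and TC counts. The short Python check below recomputes them; the helper name is hypothetical, and the truncation to two decimals is an observation about the reported figures (they match truncation rather than rounding), not something the slides state.

```python
# (TC, Pass, Error) per pattern kind, copied from the table above.
RESULTS = {
    "Domain": (1_470_389, 1_075_953, 394_436),
    "Range": (613_535, 176_423, 437_112),
    "Datatype": (368_099, 309_286, 58_813),
}


def truncated_percent(part, whole):
    """Percentage truncated (not rounded) to two decimals, via integer math."""
    return (part * 10_000 // whole) / 100


for kind, (tc, passed, errors) in RESULTS.items():
    assert passed + errors == tc  # each row is internally consistent
    print(kind, truncated_percent(errors, tc))

# Summing the three kinds reproduces the Total TC figure in the table.
total_tc = sum(tc for tc, _, _ in RESULTS.values())
total_errors = sum(err for _, _, err in RESULTS.values())
print("Total", total_tc, truncated_percent(total_errors, total_tc))
```

Summed over the three kinds, the errors amount to 36.31% of all test cases, which is the overall error rate quoted in the error analysis.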

17

Validation 2) Quality Assessment Accuracy

• Gold standard data
  – Randomly selected 1000 triples (95% confidence, 3.5% error)
  – 2 human evaluators (kappa 0.7207)
  – Annotated the correct type of subject and object based on the predicate
• Evaluation measure
  • Precision, recall, and F1-measure
• Accuracy

     Triples  Precision  Recall  F1-measure
DQP  981      0.7100     0.8022  0.7533
RQP  424      0.9308     0.3438  0.5021
TQP  263      0.7395     0.8503  0.7910
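The F1-measure column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A small Python check against the reported values (the function name `f1` is just for illustration):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the accuracy table above.
REPORTED = {
    "DQP": (0.7100, 0.8022, 0.7533),
    "RQP": (0.9308, 0.3438, 0.5021),
    "TQP": (0.7395, 0.8503, 0.7910),
}

for name, (p, r, reported) in REPORTED.items():
    print(name, round(f1(p, r), 4), reported)
```

The RQP row shows why F1 is reported alongside precision: its precision is the highest of the three, but the low recall pulls F1 down to about 0.50.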

18

Validation 2) Error Analysis

• Error Analysis on Korean DBpedia
  • The error occurrence rate over all tested triples is 36.31%
  • The most common error case is rdfs:range violation[3,4,18]
    • The object is literal or string data, not a URI
    • Object range validation cannot be performed[4]

Error rate (%): Pass 63.69%, Error 36.31%

19

Validation 2) Error Analysis

• Error Analysis on Korean DBpedia
  • Incorrect datatype setting
    e.g. the date must be set as xs:date, but it is set to xs:integer
  • Incorrect object value
    e.g. the object value of prop-ko:활동기간 (=active period) should be a period of time, but only the starting point of the period is given
  • Property ambiguity
    e.g. prop-ko:종목 (=event) can take two entirely different kinds of object: the name of an event or the number of events

20

Limitations

• Lack of specific domain/range setting
  e.g.
  Property        DQP
  dbo:deathPlace  dbo:Agent, dbo:Person

• Quality assessment with only one triple
  e.g.
  dbpedia:Michael_Jackson  dbo:birthDate  1958-08-29 (xsd:date)
  dbpedia:Michael_Jackson  dbo:deathDate  1009-06-25 (xsd:date)

  dbo:birthDate has to be earlier than dbo:deathDate
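The second limitation concerns constraints that span two triples of the same subject. As a sketch of what such a check would need, assuming the facts for one subject are already collected into a dictionary (the helper name and data layout are hypothetical, and this check is not part of the presented approach):

```python
from datetime import date


def birth_before_death(facts):
    """True unless both dates are present and birth is not earlier than death."""
    birth = facts.get("dbo:birthDate")
    death = facts.get("dbo:deathDate")
    if birth is None or death is None:
        return True  # nothing to compare
    return birth < death


# Values from the example above; each one is a valid xsd:date on its own.
michael_jackson = {
    "dbo:birthDate": date(1958, 8, 29),
    "dbo:deathDate": date(1009, 6, 25),
}
print(birth_before_death(michael_jackson))  # False
```

Each triple passes a datatype test in isolation; the inconsistency only becomes visible when the two values are compared.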

21

Conclusion

• Semi-automatically generates patterns from knowledge resource

• Patterns are instantiated into test cases to measure the quality of data

• More than 97% of the patterns are generated by the approach

• This work opens a new possibility of conducting quality assessment without requiring ontology

• It can be applied to any language and any domain

22

Ongoing works

• Utilizing external resources e.g. WordNet, Thesaurus

• Pattern expansion

• Create a complete validation system for determining trustworthiness

23

Questions?

25

Reference

• Linked data quality assessment

[2] Quality assessment methodologies for linked open data. Zaveri, A. et al. Submitted to Semantic Web Journal (2013)

[5] Weaving the pedantic web. Hogan, A. et al. (2010)

[6] Assessing linked data mappings using network measures. Guéret et al. In The Semantic Web: Research and Applications (pp. 87-102). Springer Berlin Heidelberg (2012)

[8] Improving curated web-data quality with structured harvesting and assessment. Feeney et al. International Journal on Semantic Web and Information Systems (IJSWIS), 10(2), 35-62 (2014)

[16] Swiqa-a semantic web information quality assessment framework. Fürber et al. In ECIS (Vol. 15, p. 19) (2011)

[17] Using semantic web resources for data quality management. Fürber et al. In Knowledge Engineering and Management by the Masses (pp. 211-225). Springer Berlin Heidelberg (2010)

26

Reference

• Data Quality Assessment of DBpedia

[3] User-driven quality evaluation of dbpedia. Zaveri, A. et al. In Proceedings of the 9th International Conference on Semantic Systems (pp. 97-104). ACM (2013)

[4] Test-driven evaluation of linked data quality. Kontokostas et al. In Proceedings of the 23rd international conference on World Wide Web (pp. 747-758). ACM (2014)

[18] Crowdsourcing linked data quality assessment. Acosta et al. In The Semantic Web - ISWC 2013 (pp. 260-276). Springer Berlin Heidelberg (2013)

[19] Detecting incorrect numerical data in dbpedia. Wienand et al. In The Semantic Web: Trends and Challenges (pp. 504-518). Springer International Publishing (2014)

[20] DL-Learner: learning concepts in description logics. Lehmann, J. The Journal of Machine Learning Research, 10, 2639-2642 (2009)

• Automatic Ontology generation

[13] Automatic ontology generation using schema information. Sie et al. In Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on (pp.526-531). IEEE (2006)

[14] Text2Onto. Cimiano et al. In Natural language processing and information systems (pp. 227-238). Springer Berlin Heidelberg (2005)

[21] Automatic generation of OWL ontology from XML data source. Yahia et al. arXiv preprint arXiv:1206.0570 (2012)

[24] A robust approach to aligning heterogeneous lexical resources. Pilehvar et al. AP A 1 (2014): c2.