How redundant is it? – An empirical analysis on linked datasets
Honghan Wu¹, Boris Villazon-Terrazas², Jeff Z. Pan¹ and José Manuel Gómez Pérez²
¹ University of Aberdeen, UK; ² iSOCO, Spain
20/10/2014

Redundancy analysis on linked data #cold2014 #ISWC2014

Wu, Honghan, Boris Villazon-Terrazas, Jeff Z. Pan, and Jose Manuel Gomez-Perez. “How Redundant Is It? – An Empirical Analysis on Linked Datasets.” In ISWC COLD Workshop. 2014. http://ceur-ws.org/Vol-1264/cold2014_WuVPG.pdf

Content

• What is data redundancy in linked data?
• Why is it of special interest to linked data consumption?
• Linked Data redundancy categorisation
• How to analyse it?
• Dataset selection & results
• Conclusion

What is data redundancy in LD?

• Data Redundancy
  – [Database systems] The same piece of data stored in multiple places
  – [Information theory] Wasted "space" used to transmit certain data
• (In this work) Linked Data Redundancy
  – Wasted "space" used to represent a certain meaning (expressed under a certain semantics)
  – Duplication-free

Why is it of special interest to LD consumption?

• Bad Redundancy & Good Redundancy
  – Bad for exchange: storage, transmission
  – Good for inference computation
• Relevant consumption tasks
  – Hosting/Sharing
  – Query Answering (SPARQL)
  – Ontology Based Data Access
  – Reasoning

Redundancy in Linked Data

• Redundancy categorisation for RDF data
• Redundancies caused by the "Linked" nature

RDF Redundancies vs. Succinct Representations

[Rule based] A. K. Joshi, P. Hitzler, and G. Dong. Logical linked data compression. In The Semantic Web: Semantics and Big Data, pages 170–184. Springer, 2013.
[HDT] J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, A. Polleres, and M. Arias. Binary RDF representation for publication and exchange (HDT). Web Semant., 19:22–41, Mar. 2013.
[WaterFowl] O. Curé, G. Blin, D. Revuz, and D. C. Faye. WaterFowl: A compact, self-indexed and inference-enabled immutable RDF store. In The Semantic Web: Trends and Challenges, pages 302–316. Springer, 2014.
J. Z. Pan, J. M. Gomez-Perez, Y. Ren, H. Wu, H. Wang, and M. Zhu. Graph pattern based RDF data compression. In Proc. of the 4th Joint International Semantic Technology Conference (JIST), 2014. (To appear)

Semantic redundancy

Rule Representation
- DL Axioms (T-Box)
- Other semantics (graph pattern substitution)
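A minimal sketch of the T-Box case, assuming only subclass axioms over an acyclic class hierarchy: a type assertion is treated as semantically redundant when it can be re-derived from a more specific asserted type. The helper names (`superclasses`, `split_redundant`) and the `ex:` terms are hypothetical and are not taken from the paper.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"


def superclasses(cls: str, sub_class_of: Dict[str, Set[str]]) -> Set[str]:
    """All strict superclasses of cls, following subclass axioms transitively."""
    seen: Set[str] = set()
    stack = list(sub_class_of.get(cls, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(sub_class_of.get(parent, ()))
    return seen


def split_redundant(abox: Set[Triple], sub_class_of: Dict[str, Set[str]]):
    """Return (kept, redundant): redundant type triples follow from kept ones.

    Assumption: the class hierarchy is acyclic, so an asserted type is never
    its own superclass.
    """
    types: Dict[str, Set[str]] = {}
    for s, p, o in abox:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    redundant: Set[Triple] = set()
    for s, classes in types.items():
        for c in classes:
            for sup in superclasses(c, sub_class_of):
                if sup in classes:
                    redundant.add((s, RDF_TYPE, sup))
    return abox - redundant, redundant


# Hypothetical data: ex:Student is an invented class for illustration.
tbox = {"ex:Student": {"foaf:Person"}, "foaf:Person": {"foaf:Agent"}}
abox = {
    ("ex:alice", RDF_TYPE, "ex:Student"),
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", RDF_TYPE, "foaf:Agent"),
    ("ex:alice", "foaf:name", "Alice"),
}
kept, redundant = split_redundant(abox, tbox)
```

Here the two more general type assertions can be dropped and re-derived on demand, which is the trade-off between storage and inference cost that rule-based compression exploits.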

Syntactic Redundancy

Concise syntax
- RDF abbreviation & striping syntax
- Intra-structure & Inter-structure
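To illustrate the abbreviation idea only (not the paper's intra-/inter-structure metrics), the sketch below groups triples by subject and predicate, much as Turtle's `;` and `,` shorthand does, and counts how many term occurrences each layout writes out. The function names and sample triples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def group_by_subject(triples: List[Triple]) -> Dict[str, Dict[str, List[str]]]:
    """Nest triples as subject -> predicate -> [objects]."""
    grouped: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        grouped[s][p].append(o)
    return grouped


def occurrences_flat(triples: List[Triple]) -> int:
    """One triple per line: every triple spells out all three terms."""
    return 3 * len(triples)


def occurrences_grouped(grouped: Dict[str, Dict[str, List[str]]]) -> int:
    """Abbreviated layout: each subject once, each predicate once per subject."""
    total = 0
    for predicates in grouped.values():
        total += 1  # the subject is written once
        for objects in predicates.values():
            total += 1 + len(objects)  # the predicate once, then its objects
    return total


triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:carol"),
    ("ex:bob", "rdf:type", "foaf:Person"),
]
flat = occurrences_flat(triples)
abbreviated = occurrences_grouped(group_by_subject(triples))
```

The gap between `flat` and `abbreviated` is the number of repeated subject and predicate occurrences that a concise syntax avoids.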

Symbolic Redundancy

• http://xmlns.com/foaf/0.1/name – 30 bytes in ASCII

URI                              ID (4 bytes)
…                                …
http://xmlns.com/foaf/0.1/name   128
…                                …

Fewer bytes for basic data units
- (Fixed-length) Dictionary based
- (Variable-length) Huffman coding
- Predictive encoding
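A minimal sketch of the fixed-length dictionary option, assuming every term is mapped to a 4-byte unsigned integer ID in first-seen order. `TermDictionary`, `pack_triples` and `unpack_triples` are hypothetical names, and the `example.org` URIs are made up for illustration.

```python
import struct
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


class TermDictionary:
    """Bidirectional mapping between RDF terms and integer IDs."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._terms: List[str] = []

    def encode(self, term: str) -> int:
        """Return the ID of term, assigning the next free ID on first use."""
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def decode(self, term_id: int) -> str:
        return self._terms[term_id]


def pack_triples(triples: List[Triple], dictionary: TermDictionary) -> bytes:
    """Encode each triple as three big-endian 4-byte IDs (12 bytes per triple)."""
    out = bytearray()
    for triple in triples:
        out += struct.pack(">III", *(dictionary.encode(t) for t in triple))
    return bytes(out)


def unpack_triples(data: bytes, dictionary: TermDictionary) -> List[Triple]:
    """Inverse of pack_triples, given the same dictionary."""
    return [
        tuple(dictionary.decode(i) for i in ids)
        for ids in struct.iter_unpack(">III", data)
    ]


triples = [
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/name", "Bob"),
]
dictionary = TermDictionary()
packed = pack_triples(triples, dictionary)
raw_size = sum(len(t.encode("utf-8")) for triple in triples for t in triple)
```

The saving grows with the number of times a long URI such as the FOAF `name` property is repeated; the dictionary itself still has to be stored or shipped once alongside the packed triples.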

Semantic Redundancy Caused by “Linked” Nature

• Vocabulary Linkage
  – Reuse of other vocabularies: more rules
  – Lower redundancy ratio: more triples derivable
  – More redundancy: co-occurrence triples removable
• Instance Linkage
  – sameAs linkages
  – Bring in new assertions (e.g., type assertions)
  – Bring in new axioms
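One way to picture how instance linkage brings in new assertions is to treat sameAs links as an equivalence relation and share type assertions within each group. The sketch below does this with a small union-find; it is an illustration under that assumption, not the procedure used in the paper, and all identifiers (`ex1:`, `ex2:`, the function names) are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def same_as_groups(links: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map every linked entity to a representative of its sameAs group."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    return {x: find(x) for x in list(parent)}


def propagate_types(
    types: Dict[str, Set[str]], links: Iterable[Tuple[str, str]]
) -> Dict[str, Set[str]]:
    """Give every entity the union of the types asserted within its sameAs group."""
    rep = same_as_groups(links)
    merged: Dict[str, Set[str]] = {}
    for entity, classes in types.items():
        merged.setdefault(rep.get(entity, entity), set()).update(classes)
    entities = set(types) | set(rep)
    return {e: set(merged.get(rep.get(e, e), set())) for e in entities}


# Two hypothetical datasets describing the same city under different vocabularies.
types = {"ex1:aberdeen": {"ex1:City"}, "ex2:aberdeen": {"ex2:Place"}}
links = [("ex1:aberdeen", "ex2:aberdeen")]
linked = propagate_types(types, links)
new_assertions = sum(len(linked[e]) - len(types.get(e, ())) for e in linked)
```

Each link here adds one type assertion per side, and any T-Box axioms attached to the newly visible classes then become applicable as well.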

How to analyse it?

• Two-dimensional analysis
• Methodology
• Metrics

Two dimension analysis

                              RDF Redundancy Dimension
Linked Semantic Dimension     Semantic   Syntactic   Symbolic
A-Box                         ✔          ✔
A-Box & T-Box: No Linkage     ✔          -           -
A-Box & T-Box: T-Box Reuse    ✔          -           -
A-Box & T-Box: A-Box Linkage             -           -

Methodology: EDP Summarisation

Virtually Materialised A-Box: expanded EDP

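A-Box: A1(o1), B1(o1), A2(o2), B2(o2), R(o1, o2)
T-Box: A1 ⊆ A, A2 ⊆ A, B1 ⊆ B, B2 ⊆ B

[Figure: EDP graph with two nodes, {A1, B1} (1 entity) and {A2, B2} (1 entity), linked by R (1:1); in the expanded EDP each node label is extended with A, B.]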
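The example above can be replayed with a simplified stand-in for an EDP. As an assumption for illustration, a pattern here is just the set of classes an entity belongs to (relations such as R are ignored), and expansion adds every superclass implied by the T-Box. `expand` and `class_patterns` are hypothetical helper names.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple


def expand(classes: Set[str], sub_class_of: Dict[str, Set[str]]) -> FrozenSet[str]:
    """Add all superclasses reachable through the subclass axioms."""
    result = set(classes)
    stack = list(classes)
    while stack:
        for parent in sub_class_of.get(stack.pop(), ()):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return frozenset(result)


def class_patterns(
    type_assertions: Iterable[Tuple[str, str]],
    sub_class_of: Optional[Dict[str, Set[str]]] = None,
) -> Counter:
    """Count entities per class-set pattern, optionally expanded with a T-Box."""
    asserted: Dict[str, Set[str]] = {}
    for cls, individual in type_assertions:
        asserted.setdefault(individual, set()).add(cls)
    tbox = sub_class_of or {}
    return Counter(expand(classes, tbox) for classes in asserted.values())


# A-Box type assertions and T-Box axioms from the slide.
abox_types = [("A1", "o1"), ("B1", "o1"), ("A2", "o2"), ("B2", "o2")]
tbox = {"A1": {"A"}, "A2": {"A"}, "B1": {"B"}, "B2": {"B"}}

plain = class_patterns(abox_types)           # patterns over asserted classes only
expanded = class_patterns(abox_types, tbox)  # patterns after virtual materialisation
shared = frozenset.intersection(*expanded)   # classes common to every expanded pattern
```

After expansion both patterns contain A and B, so the derivable type assertions show up at the pattern level without materialising them entity by entity.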

Linked Dataset Analysis Results

• Dataset Selection & Summary
• Analysis Results

Dataset Selection and Summary LOD 2011

A-Box Only: Semantic Redundancies

– Redundant Triples

– Semantic redundancy ratio, i.e.

– # Graph Patterns used to substitute redundant triples

A-Box Only: Syntactic Redundancies

– the redundant resource occurrences of inter-structural redundancies

– the syntactic redundancy ratio, i.e.

A-Box & T-Box: No Linkage

• DBLP2013: SWRC ontology
• Ordnance Survey: officially published OS ontology

[Chart callouts: 1.7%, 184%, 108%, 4.7%]

A-Box & T-Box: No Linkage

The first 3 datasets reuse the FOAF ontology

– the number of directly used terms from the reused T-Box
– the number of applicable axioms from the (materialised) reused T-Box

[Chart callouts: 26.9%, 4%, 45.4%, 1.3%]

Conclusion

• LOD redundancies are heterogeneous & huge

• Vocabulary linkage might lead to a huge number of derivable triples

• Redundancy-aware techniques are needed

Redundancy-aware Consumption

• Compression: different redundancies might need different techniques

• For Data Access: (high inter-structure redundancy) skewed entity distributions over EDPs -> efficient access?

• OBDA/Reasoning: A-Box redundancy = fewer T-Box axioms

• Data Publisher: should be aware of the consequences of reusing vocabularies

Thanks! Q & A