Wu, Honghan, Boris Villazon-Terrazas, Jeff Z. Pan, and Jose Manuel Gomez-Perez. “How Redundant Is It? – An Empirical Analysis on Linked Datasets.” In ISWC COLD Workshop. 2014. http://ceur-ws.org/Vol-1264/cold2014_WuVPG.pdf
How redundant is it? – An empirical analysis on linked datasets
Honghan Wu¹, Boris Villazon-Terrazas², Jeff Z. Pan¹ and José Manuel Gómez Pérez²
¹University of Aberdeen, UK; ²iSOCO, Spain
20/10/2014
Content
• What is data redundancy with linked data?
• Why is it of special interest to linked data consumption?
• Linked Data redundancy categorisation
• How to analyse it?
• Dataset selection & the result
• Conclusion
What is data redundancy in LD?
• Data Redundancy
  – [Database systems] Same piece of data in multiple places
  – [Information theory] Wasted "space" used to transmit certain data
• (In this work) Linked Data Redundancy
  – Wasted “space” used to represent a certain meaning (expressed in certain semantics)
  – Duplication-free
Why is it of special interest to LD consumption?
• Bad Redundancy & Good Redundancy
  – Bad for exchange: storage, transmission
  – Good for inference computation
• Relevant consumption tasks
  – Hosting/Sharing
  – Query Answering (SPARQL)
  – Ontology Based Data Access
  – Reasoning
Redundancy in Linked Data
• Redundancy Categorisation for RDF Data
• Redundancies caused by the “Linked” nature
RDF Redundancies vs. Succinct Representations
[Rule based] A. K. Joshi, P. Hitzler, and G. Dong. Logical linked data compression. In The Semantic Web: Semantics and Big Data, pages 170–184. Springer, 2013.
[HDT] J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, A. Polleres, and M. Arias. Binary RDF representation for publication and exchange (HDT). Web Semant., 19:22–41, Mar. 2013.
[WaterFowl] O. Curé, G. Blin, D. Revuz, and D. C. Faye. WaterFowl: A compact, self-indexed and inference-enabled immutable RDF store. In The Semantic Web: Trends and Challenges, pages 302–316. Springer, 2014.
Pan, Jeff Z., Jose Manuel Gomez-Perez, Yuan Ren, Honghan Wu, Haofen Wang and Man Zhu. “Graph Pattern based RDF Data Compression”. In Proc. of 4th Joint International Semantic Technology Conference (JIST). 2014. (To appear)
Semantic redundancy
Rule Representation
  - DL Axioms (T-Box)
  - Other semantics (graph pattern substitution)
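As a minimal sketch of the rule-based idea, assume a single hypothetical T-Box axiom (Student is a subclass of Person) and made-up instance names. An explicit type triple counts as redundant here when it can be re-derived from another asserted triple in one subclass step. This is an illustration only, not the method used in the cited work.

```python
# Hypothetical T-Box: each pair (sub, sup) states that sub is a subclass of sup.
SUBCLASS_OF = {("Student", "Person")}

# Hypothetical A-Box as (subject, predicate, object) triples.
ABOX = {
    ("alice", "type", "Student"),
    ("alice", "type", "Person"),   # derivable from the triple above
    ("bob", "type", "Person"),     # not derivable: bob is not asserted as a Student
}

def derivable(triple, abox, subclass_of):
    """Return True if a type triple follows from another asserted triple
    via a single subclass axiom (no chaining, no cycle handling)."""
    s, p, o = triple
    if p != "type":
        return False
    return any((s, "type", sub) in abox for sub, sup in subclass_of if sup == o)

redundant = {t for t in ABOX if derivable(t, ABOX, SUBCLASS_OF)}
compact = ABOX - redundant  # what would need to be stored alongside the T-Box
```

Storing `compact` together with the axiom keeps the same meaning, which is the sense in which the dropped triple is "wasted space".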
Syntactic Redundancy
Concise syntax
  - RDF abbreviation & striping syntax
  - Intra-structure & Inter-structure
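To illustrate what a more concise syntax saves, the sketch below writes each subject and each (subject, predicate) pair only once, using Turtle-style predicate lists (`;`) and object lists (`,`) in place of the RDF/XML abbreviation and striping named on the slide. The prefixed names and the `abbreviate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical triples using prefixed names.
triples = [
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:carol"),
]

def abbreviate(triples):
    """Serialise triples so each subject and each (subject, predicate) appears once."""
    grouped = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        grouped[s][p].append(o)
    lines = []
    for s, preds in grouped.items():
        parts = [f"{p} {', '.join(objs)}" for p, objs in preds.items()]
        lines.append(f"{s} " + " ; ".join(parts) + " .")
    return "\n".join(lines)

# One statement per line, repeating the subject and predicate every time.
flat = "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
short = abbreviate(triples)
```

The difference between `len(flat)` and `len(short)` is space spent on repeated resource occurrences rather than on meaning.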
Symbolic Redundancy
• http://xmlns.com/foaf/0.1/name
  – 31 bytes in ASCII

URI                              ID (4 bytes)
…                                …
http://xmlns.com/foaf/0.1/name   128
…                                …

Fewer bytes for basic data units
  - (Fixed-length) dictionary-based
  - (Variable-length) Huffman coding
  - Predictive encoding
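A fixed-length dictionary encoding of this kind can be sketched as follows. The helper names (`build_dictionary`, `encode`) and the ID assignment order are assumptions made for illustration, so the IDs differ from the 128 shown in the table.

```python
import struct

def build_dictionary(terms):
    """Assign each distinct term an integer ID in first-seen order."""
    ids = {}
    for term in terms:
        ids.setdefault(term, len(ids))
    return ids

def encode(terms, ids):
    """Pack each term as a fixed-length 4-byte unsigned big-endian ID."""
    return b"".join(struct.pack(">I", ids[t]) for t in terms)

terms = [
    "http://xmlns.com/foaf/0.1/name",
    "http://xmlns.com/foaf/0.1/knows",
    "http://xmlns.com/foaf/0.1/name",
]
ids = build_dictionary(terms)
encoded = encode(terms, ids)
raw_size = sum(len(t.encode("ascii")) for t in terms)  # bytes without a dictionary
```

Each occurrence now costs 4 bytes instead of the full URI. The dictionary itself still has to be stored once, so the saving grows with how often each term repeats.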
Semantic Redundancy Caused by “Linked” Nature
• Vocabulary Linkage
  – Reuse of other vocabularies: more rules
  – Lower redundancy ratio: more triples derivable
  – More redundancy: co-occurrence triples removable
• Instance Linkage
  – sameAs linkages
  – Bring in new assertions (e.g., type assertions)
  – Bring in new axioms
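How a sameAs link brings in new type assertions can be sketched as below. The entity and class names are invented for illustration, and `propagate_types` is a hypothetical helper that simply shares types between linked entities until nothing changes.

```python
def propagate_types(types, same_as):
    """Copy type assertions across sameAs links until nothing new is added."""
    result = {e: set(ts) for e, ts in types.items()}
    changed = True
    while changed:
        changed = False
        for a, b in same_as:
            merged = result.get(a, set()) | result.get(b, set())
            for e in (a, b):
                if result.get(e, set()) != merged:
                    result[e] = set(merged)
                    changed = True
    return result

# Hypothetical local and external type assertions, joined by one sameAs link.
local_types = {"ex:aberdeen": {"ex:City"}}
external_types = {"dbpedia:Aberdeen": {"dbo:Place"}}
links = [("ex:aberdeen", "dbpedia:Aberdeen")]

expanded = propagate_types({**local_types, **external_types}, links)
```

After propagation the local entity carries the external type as well, so axioms from the external vocabulary become applicable to the local data.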
How to analyse?
• Two-dimensional analysis
• Methodology
• Metrics
Two-dimensional analysis
Columns: RDF Redundancy Dimension. Rows: Linked Semantic Dimension.

                      Semantic   Syntactic   Symbolic
A-Box                 ✔          ✔
A-Box & T-Box
  No Linkage          ✔          -           -
  T-Box Reuse         ✔          -           -
  A-Box Linkage                  -           -
Methodology: EDP Summarisation
Virtually Materialised A-Box: expanded EDP
A-Box: A1(o1), B1(o1), A2(o2), B2(o2), R(o1, o2)
T-Box: A1⊆A, A2⊆A, B1⊆B, B2⊆B
[Figure: EDP nodes {A1, B1} (1) and {A2, B2} (1), linked by R (1:1); each node is expanded with the derived types A and B]
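The slide's example can be replayed in a small sketch. It treats an entity's pattern as just its set of types, which is a simplifying assumption and not necessarily the full EDP definition used by the authors. The relation R(o1, o2) is left out because only the type expansion is shown.

```python
from collections import Counter

# Type assertions and subclass axioms taken from the slide's A-Box and T-Box.
type_assertions = {"o1": {"A1", "B1"}, "o2": {"A2", "B2"}}
subclass_of = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}

def expand(types, subclass_of):
    """Add every superclass reachable from the asserted types."""
    expanded = set(types)
    frontier = list(types)
    while frontier:
        sup = subclass_of.get(frontier.pop())
        if sup is not None and sup not in expanded:
            expanded.add(sup)
            frontier.append(sup)
    return frozenset(expanded)

# Pattern -> number of entities, before and after virtual materialisation.
asserted = Counter(frozenset(ts) for ts in type_assertions.values())
materialised = Counter(expand(ts, subclass_of) for ts in type_assertions.values())
```

Expanding one pattern per group of entities, rather than every entity, is what makes the materialisation "virtual": the derived types A and B are added to two patterns instead of to each individual.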
Linked Dataset Analysis Results
• Dataset Selection & Summary • Analysis Results
Dataset Selection and Summary LOD 2011
A-Box Only: Semantic Redundancies
– Redundant Triples
– Semantic redundancy ratio, i.e.
– # Graph Patterns used to substitute redundant triples
A-Box Only: Syntactic Redundancies
– the redundant resource occurrences of inter-structural redundancies
– the syntactic redundancy ratio, i.e.
A-Box & T-Box: No Linkage
DBLP2013: SWRC ontology
Ordnance Survey: official published OS ontology
[Chart labels: 1.7%, 184%, 108%, 4.7%]
A-Box & T-Box: No Linkage
The first 3 datasets reuse the FOAF ontology
– the number of directly used terms from reused T-Box
– the number of applicable axioms from (materialised) reused T-Box
26.9% 4% 45.4% 1.3%
Conclusion
• LOD redundancies are heterogeneous & huge
• Vocabulary linkage might lead to a huge number of derivable triples
• Redundancy-aware techniques are needed
Redundancy-aware Consumption
• Compression: different redundancies might need different techniques
• For Data Access: (high inter-structure redundancy) skewed entity distributions over EDPs -> efficient access?
• OBDA/Reasoning: A-Box redundancy = fewer T-Box axioms
• Data Publisher: should be aware of the consequences of reusing vocabularies
Thanks! Q & A