HAL Id: hal-01657144
https://hal.inria.fr/hal-01657144
Submitted on 6 Dec 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

To cite this version: Ioana Manolescu. Data Discovery in RDF Graphs. DEXA 2017 - 28th International Conference on Database and Expert System Applications, Aug 2017, Lyon, France. pp.1-63. hal-01657144


Data Discovery in RDF Graphs

Ioana Manolescu

INRIA and Ecole Polytechnique

http://pages.saclay.inria.fr/Ioana.Manolescu

DEXA Conference, Aug 29, 2017

Ioana Manolescu Discovering RDF Graphs DEXA, Aug 2017 1 / 54

Outline

1 Background: semantic RDF graphs

2 Summarizing semantic-rich RDF graphs [CGM15a, CGM15b, CGM17a]
Joint work with Sejla Cebiric (Inria) and Francois Goasdoue (U. Rennes 1 and Inria)

3 Finding insights in RDF graphs [DMS17]
Joint work with Yanlei Diao and Shu Shang (Ecole Polytechnique and Inria)

Part I

Background: RDF graphs


RDF

Big Data needs semantics

AI Magazine, Spring 2015


Do we really need the semantics?

Yes. All the time.

Application knowledge / constraints:

Every Senator is an ElectedOfficial, which is a Person

(On Wikipedia) being BornInAPlace means being a Person

Without the semantics, we may miss query answers

Data: John is a Senator
Constraints: Every Senator is a Person
Query: Who is a Person?

Semantic constraints are a compact way of encoding information

“Every ElectedOfficial is a Person” is stated only once, even if there are thousands of ElectedOfficials.
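The Senator example can be illustrated with a minimal Python sketch. The data structures and the function name `instances` are assumptions made for illustration only; they are not taken from the talk. The sketch answers "Who is a Person?" once without the constraint and once with it.

```python
# Hypothetical toy data mirroring the slide: one fact and one constraint.
facts = {("John", "Senator")}          # "John is a Senator"
subclass_of = {"Senator": "Person"}    # "Every Senator is a Person"


def instances(cls, facts, constraints=None):
    """Return everyone known to be an instance of `cls`.

    When `constraints` is given, follow subclass links upward from each
    asserted class before comparing with `cls`.
    """
    constraints = constraints or {}
    found = set()
    for who, asserted in facts:
        current, seen = asserted, set()
        while current is not None and current not in seen:
            if current == cls:
                found.add(who)
                break
            seen.add(current)
            current = constraints.get(current)
    return found


without_semantics = instances("Person", facts)
with_semantics = instances("Person", facts, subclass_of)
print(without_semantics)  # set()
print(with_semantics)     # {'John'}
```

Without the constraint the query returns nothing, which is the missed answer the slide warns about.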

Semantics for Web data

Data and metadata on the Web is often structured in graphs, e.g., RDF (W3C’s Resource Description Framework)

Famous application: the Linked Open Data cloud (2017)

The Resource Description Framework (RDF)

RDF graph: set of triples

Assertion   Triple          Relational notation   Intuition
Class       s rdf:type o    o(s)                  “s is an o”
Property    s p o           p(s, o)               “The p of s is o”

[Figure: sample RDF graph. The resource doi1 has rdf:type Book, hasTitle “El Aleph”, publishedIn “1949”, and writtenBy the blank node _:b1, which hasName “J. L. Borges”. Legend: Class, resource (URI), blank node, “literal (string)”, property.]
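As a rough illustration of "RDF graph: set of triples", the sample graph can be written as a Python set. This is a sketch only: the `Triple` type, the `relational` helper and the edge assignment (read off the figure labels) are assumptions, not part of the talk.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    s: str  # subject
    p: str  # property
    o: str  # object


RDF_TYPE = "rdf:type"

# Edges as read from the figure labels (an interpretation of the figure).
graph = {
    Triple("doi1", RDF_TYPE, "Book"),
    Triple("doi1", "hasTitle", '"El Aleph"'),
    Triple("doi1", "publishedIn", '"1949"'),
    Triple("doi1", "writtenBy", "_:b1"),
    Triple("_:b1", "hasName", '"J. L. Borges"'),
}


def relational(t):
    """Render a triple in the relational notation of the table."""
    if t.p == RDF_TYPE:
        return f"{t.o}({t.s})"        # class assertion: o(s)
    return f"{t.p}({t.s}, {t.o})"     # property assertion: p(s, o)


class_assertions = sorted(relational(t) for t in graph if t.p == RDF_TYPE)
property_assertions = sorted(relational(t) for t in graph if t.p != RDF_TYPE)
print(class_assertions)     # ['Book(doi1)']
print(property_assertions)
```

Because the graph is a set, asserting the same triple twice has no effect, which matches the set-of-triples view.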

RDF RDFS

RDF Schema (RDFS)

Declare deductive constraints between classes and properties

Constraint      Triple                      OWA interpretation              Intuition
Subclass        c1 rdfs:subClassOf c2       $c_1 \subseteq c_2$             “Any c1 is also a c2”
Subproperty     p1 rdfs:subPropertyOf p2    $p_1 \subseteq p_2$             “If two resources are related by p1, they are also related by p2”
Domain typing   p rdfs:domain c             $\Pi_{domain}(p) \subseteq c$   “Anyone having p is a c”
Range typing    p rdfs:range c              $\Pi_{range}(p) \subseteq c$    “Anyone who is a value of p is a c”

[Figure: sample schema over the classes Book, Publication and Person and the properties writtenBy and hasAuthor, connected by rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain and rdfs:range edges.]

RDF RDF entailment

Open-world assumption and RDF entailment

The RDF data model is based on the open-world assumption.

Deductive constraints lead to implicit triples: these are part of the graph even though they are not explicitly present.

explicit triples + entailment rules $\rightarrow$ implicit triples

Exhaustive application of entailment rules leads to saturation (closure).

The semantics of an RDF graph $G$ is its saturation $G^\infty$

Sample instance entailment rules from schema and instance triples:

$(c_1\ \text{rdfs:subClassOf}\ c_2) \wedge (s\ \text{rdf:type}\ c_1) \vdash_{RDF} (s\ \text{rdf:type}\ c_2)$
$(p_1\ \text{rdfs:subPropertyOf}\ p_2) \wedge (s\ p_1\ o) \vdash_{RDF} (s\ p_2\ o)$
$(p\ \text{rdfs:domain}\ c) \wedge (s\ p\ o) \vdash_{RDF} (s\ \text{rdf:type}\ c)$
$(p\ \text{rdfs:range}\ c) \wedge (s\ p\ o) \vdash_{RDF} (o\ \text{rdf:type}\ c)$

[Figure: the sample graph and its schema, together with the implicit triples (rdf:type, hasAuthor and rdfs:domain edges) derived through entailment.]
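The four instance-level rules above can be applied to a fixpoint with a short Python sketch. It is a simplified illustration, not the talk's implementation: it covers only those four rules (no schema-level entailment), and the concrete schema triples below are an assumed reading of the sample schema figure.

```python
SC, SP, DOM, RNG, TYPE = ("rdfs:subClassOf", "rdfs:subPropertyOf",
                          "rdfs:domain", "rdfs:range", "rdf:type")


def saturate(triples):
    """Apply the four instance-level rules until no new triple appears."""
    g = set(triples)
    while True:
        sub_class = {(s, o) for s, p, o in g if p == SC}
        sub_prop = {(s, o) for s, p, o in g if p == SP}
        domain = {(s, o) for s, p, o in g if p == DOM}
        rng = {(s, o) for s, p, o in g if p == RNG}
        new = set()
        for s, p, o in g:
            if p == TYPE:  # subclass rule
                new |= {(s, TYPE, c2) for c1, c2 in sub_class if c1 == o}
            new |= {(s, p2, o) for p1, p2 in sub_prop if p1 == p}  # subproperty
            new |= {(s, TYPE, c) for q, c in domain if q == p}     # domain
            new |= {(o, TYPE, c) for q, c in rng if q == p}        # range
        if new <= g:       # fixpoint reached: nothing new was derived
            return g
        g |= new


# Assumed schema and data, following the running Book example.
explicit = {
    ("Book", SC, "Publication"),
    ("writtenBy", SP, "hasAuthor"),
    ("writtenBy", DOM, "Book"),
    ("writtenBy", RNG, "Person"),
    ("doi1", TYPE, "Book"),
    ("doi1", "writtenBy", "_:b1"),
}
implicit = saturate(explicit) - explicit
print(sorted(implicit))
```

On this input the implicit triples are that doi1 is a Publication, that doi1 hasAuthor _:b1, and that _:b1 is a Person.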

RDF Discovery

RDF graph discovery

An RDF graph can be large and complex, lack a fixed schema, and include many heterogeneous values...

RDF graph discovery

Two approaches:

1 RDF summarization: compactly representing the explicit and implicit structure of a graph

2 Insight discovery in RDF graphs: automatically identify aggregation queries with interesting results

Part II

RDF summarization


Concepts

RDF summaries

Problem

RDF graph G is large, heterogeneous, partially implicit. How to compactly represent all its structure?

Existing solutions

Partial representation (frequent patterns, statistics etc.), e.g., [NM11, LYL13]
Potentially not compact, e.g., [GW97, CFKP15]
Only for explicit data, e.g., [CDT13, ZDYZ14]

A summary of DBLP data

150M triples

A summary of geographic data

French territory division in regions, departments, urban areas, cities, districts etc. 368K triples

RDF summaries

We define:

1 RDF node equivalence relation $\equiv$: an equivalence relation such that class and property nodes are only equivalent to themselves

2 RDF summary $G_{/\equiv}$ of an RDF graph $G$: the quotient of $G$ through $\equiv$

Recall: quotient of a directed graph $G = (V, E)$ by an equivalence relation $\equiv$ on $V$

$G_{/\equiv}$ nodes: one for each $\equiv$ equivalence class of $V$

$G_{/\equiv}$ edges: $n_{1/\equiv} \xrightarrow{a} n_{2/\equiv}$ iff $\exists\, n_1 \xrightarrow{a} n_2 \in G$ such that $n_1$ is represented by $n_{1/\equiv}$ and $n_2$ is represented by $n_{2/\equiv}$
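The quotient construction can be sketched in a few lines of Python. The function name `quotient`, the toy edges and the class names "R", "A" and "T" are hypothetical, chosen only to show that every edge of the graph is mapped to an edge between the representatives of its endpoints.

```python
def quotient(edges, rep):
    """Quotient of a labeled directed graph.

    edges: iterable of (source, label, target) triples
    rep:   function mapping each node to the representative of its
           equivalence class
    """
    return {(rep(s), a, rep(t)) for s, a, t in edges}


# Hypothetical toy graph: two resources with similar structure.
edges = {
    ("r1", "author", "a1"), ("r1", "title", "t1"),
    ("r4", "author", "a2"), ("r4", "title", "t3"),
    ("r1", "rdf:type", "Book"),
}
# Hypothetical equivalence classes; nodes absent from the mapping (here the
# class node Book) are equivalent only to themselves.
classes = {"r1": "R", "r4": "R", "a1": "A", "a2": "A", "t1": "T", "t3": "T"}


def represent(node):
    return classes.get(node, node)


summary = quotient(edges, represent)
print(sorted(summary))
# [('R', 'author', 'A'), ('R', 'rdf:type', 'Book'), ('R', 'title', 'T')]
```

Since the result is a set, parallel edges that collapse onto the same pair of representatives appear once, so the summary is never larger than the input.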

Why do we need a special RDF equivalence?

Why not use any graph node equivalence? E.g., forward and backward bisimilarity $\sim_{fb}$ [HHK95]

[Figure: sample graph $G$ and its quotient through $\sim_{fb}$. $G$ has the schema triples $p_1 \prec_{sp} p_2$ and $p_3 \prec_{sp} p_4$, the classes A and B, and the resources u1 to u6 connected by edges labeled p1, p2, p3 and p4.]

Problems with the quotient through $\sim_{fb}$:

Loss of class and (some) property names
Loss of schema triples
Loss of implicit triples

[Figure: quotient of the same graph through the RDF node equivalence $\equiv_{fb}$, showing the schema triples $p_1 \prec_{sp} p_2$ and $p_3 \prec_{sp} p_4$, the classes A and B, and edges labeled p1, p2, p3 and p4.]

Properties

Formal summary properties

For any RDF equivalence relation $\equiv$:

Size limit: The summary is at most as large as the graph.
Schema preservation: The schema of $G_{/\equiv}$ is the schema of $G$.
Representativeness: Any conjunctive query $q$ with answers on $G$ also has answers on its summary: $q(G^\infty) \neq \emptyset \Rightarrow q((G_{/\equiv})^\infty) \neq \emptyset$
This enables query pruning (for query answering) without saturating $G$.


Which equivalence relations to use?

Equivalence notions previously studied

Forward / backward / forward and backward simulation

Forward / backward / forward and backward bisimulation

Adapted to semantic RDF graphs

Novel equivalence notions we introduce (see next)

Flexible similarity suited to heterogeneous graphs

Based on property cliques and possibly on RDF types


Clique-based summaries

RDF node equivalence based on property cliques

Intuition: a1, a2 are similar; r1, r2, r3, r4, r5 are similar

[Figure: sample RDF graph with the resources r1 to r6, the values a1, a2, t1 to t4, e1, e2 and c1, the properties author (a), title (t), editor (e), comment (c), reviewed (r) and published (p), and the classes Book, Journal and Spec.]

Output property cliques: {a, t, e, c}; {r}; {p}; $\emptyset$
Input property cliques: {a}; {t}; {e}; {c}; {r, p}; $\emptyset$
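The clique lists above can be reproduced with a small Python sketch. It assumes the definition used in the cited summarization papers: two properties belong to the same output clique when some resource is the source of both, closed transitively, and symmetrically for input cliques and targets. The edge list is hypothetical; in particular the nodes x1 and x2 are invented so that reviewed and published share a target but not a source, and the empty clique of nodes without outgoing or incoming properties is not materialized.

```python
from collections import defaultdict


def cliques(edges, side):
    """Partition properties into cliques.

    side=0 groups properties sharing a source (output cliques);
    side=2 groups properties sharing a target (input cliques).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    by_node = defaultdict(set)
    for e in edges:
        by_node[e[side]].add(e[1])
        find(e[1])                         # register the property
    for props in by_node.values():
        first, *rest = sorted(props)
        for p in rest:                     # merge all properties of one node
            parent[find(p)] = find(first)

    groups = defaultdict(set)
    for p in parent:
        groups[find(p)].add(p)
    return {frozenset(g) for g in groups.values()}


# Hypothetical data edges, loosely modeled on the sample graph.
edges = [
    ("r1", "author", "a1"), ("r1", "title", "t1"),
    ("r2", "title", "t2"), ("r2", "editor", "e1"),
    ("r3", "editor", "e2"), ("r3", "comment", "c1"),
    ("r4", "author", "a2"), ("r4", "title", "t3"),
    ("x1", "reviewed", "r4"), ("x2", "published", "r4"),
]
output_cliques = cliques(edges, 0)
input_cliques = cliques(edges, 2)
print(sorted(sorted(c) for c in output_cliques))
print(sorted(sorted(c) for c in input_cliques))
```

Here author and comment end up in one output clique although no single resource carries both, because title and editor chain them together.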

Weak clique-based summaries

Two nodes are weakly equivalent ($\equiv_W$) iff they have the same input clique or the same output clique.

[Figure: weak summary $G_{/\equiv_W}$ of the sample RDF graph $G$, with edges labeled a, t, e, c, r and p, the classes Book, Journal and Spec, and $\tau$ edges.]

Property: In $G_{/\equiv_W}$, each data property appears exactly once $\Rightarrow$ its nodes are “source of p, target of p” for each p [CGM15b].

Weak clique-based summaries

Property: $G_{/\equiv_W}$ nodes are “source of p, target of p” for each p.

Detecting errors in the data: why do the birthplace and deathplace loop?

Looking in the data, we find:
<http://dbpedia.org/resource/Kunitomo_Ikkansai> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://dbpedia.org/resource/Kunitomo_Ikkansai> <http://dbpedia.org/ontology/birthPlace> <http://dbpedia.org/resource/Kunitomo_Ikkansai> .

Strong clique-based summaries

Two nodes are strongly equivalent ($\equiv_S$) iff they have the same input clique and the same output clique.

[Figure: strong summary $G_{/\equiv_S}$ of the sample RDF graph $G$, with edges labeled a, t, e, c, r and p and the classes Book, Journal and Spec.]
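The difference between weak and strong equivalence can be illustrated with the following Python sketch. It takes, for each node, its (input clique, output clique) pair and builds both partitions. Two points are assumptions rather than statements from the slides: an empty clique is taken to relate no nodes, and weak equivalence is closed transitively. The node signatures are hypothetical.

```python
from collections import defaultdict


def strong_partition(signature):
    """Group nodes with the same input clique AND the same output clique."""
    groups = defaultdict(set)
    for node, (inp, out) in signature.items():
        groups[(inp, out)].add(node)
    return {frozenset(g) for g in groups.values()}


def weak_partition(signature):
    """Group nodes sharing a non-empty input OR output clique, transitively."""
    parent = {n: n for n in signature}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}
    for node, (inp, out) in signature.items():
        for key in (("in", inp), ("out", out)):
            if not key[1]:
                continue  # assumption: an empty clique relates no nodes
            if key in first_seen:
                parent[find(node)] = find(first_seen[key])
            else:
                first_seen[key] = node

    groups = defaultdict(set)
    for n in signature:
        groups[find(n)].add(n)
    return {frozenset(g) for g in groups.values()}


EMPTY = frozenset()
OUT_A = frozenset({"author", "title", "editor", "comment"})
# Hypothetical (input clique, output clique) pairs per node.
signature = {
    "r1": (EMPTY, OUT_A),
    "r4": (frozenset({"reviewed", "published"}), OUT_A),
    "t1": (frozenset({"title"}), EMPTY),
    "t3": (frozenset({"title"}), EMPTY),
    "x1": (EMPTY, frozenset({"reviewed"})),
}
weak = weak_partition(signature)
strong = strong_partition(signature)
print(sorted(sorted(g) for g in weak))    # [['r1', 'r4'], ['t1', 't3'], ['x1']]
print(sorted(sorted(g) for g in strong))  # [['r1'], ['r4'], ['t1', 't3'], ['x1']]
```

In this toy input r1 and r4 share an output clique but not an input clique, so the weak partition merges them while the strong one keeps them apart.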

Using types for summarization

Group nodes first by their types; then group untyped nodes by their property cliques.

[Figure: typed weak summary $G_{/\equiv_{TW}}$ of the sample RDF graph $G$, with edges labeled author, title, editor, comment, published and reviewed, the classes Book, Journal and Spec, and $\tau$ edges.]

On this example, this is also the typed strong summary $G_{/\equiv_{TS}}$.

RDF summaries outline

Summary                 Weak?  Strong?  FW bisim?  BW bisim?  Types first?
$G_{/\equiv_W}$         X
$G_{/\equiv_S}$                X
$G_{/\equiv_{TW}}$      X                                     X
$G_{/\equiv_{TS}}$             X                              X
$G_{/\equiv_{fw}}$                      X
$G_{/\equiv_{bw}}$                                 X
$G_{/\equiv_{fb}}$                      X          X
$G_{/\equiv_{fw,T}}$                    X                     X
$G_{/\equiv_{bw,T}}$                               X          X
$G_{/\equiv_{fb,T}}$                    X          X          X

Relations between RDF summaries [CGM17b]

[Figure: diagram relating the graph $G$ and its summaries $G_{/fb}$, $G_{/S}$, $G_{/W}$, $G_{/TW}$ and $G_{/TS}$ through arrows labeled /fb, /S, /W, /TW and /TS.]

Summary sizes

Summary size comparison (more in [CGM17b])

Graph G    $|G|$          Summary $G_{/\equiv}$   $|G_{/\equiv}|$   $cf_\equiv$
DBLP       150,787,464    $G_{/W}$                71                2,123,767
DBLP       150,787,464    $G_{/S}$                206               731,978
DBLP       150,787,464    $G_{/fw}$               262,695           574
LUBM1M     1,227,868      $G_{/W}$                161               7,579
LUBM1M     1,227,868      $G_{/S}$                207               5,903
LUBM1M     1,227,868      $G_{/fw}$               1,982             617
LUBM10M    11,990,183     $G_{/W}$                162               74,013
LUBM10M    11,990,183     $G_{/S}$                206               58,204
LUBM10M    11,990,183     $G_{/fw}$               24,958            480
LUBM10M    11,990,183     $G_{/bw}$               6,162             1,944
LUBM10M    11,990,183     $G_{/fb}$               11,990,076        1

Summarizing $G^\infty$

Recall: With an RDF Schema, the semantics of $G$ is $G^\infty$ $\Rightarrow$ we really need $(G^\infty)_{/\equiv}$!

1 Saturate $G$, then summarize

2 Can we avoid saturating $G$?...

Shortcut theorem [CGM17a]

For the summaries $G_{/W}$, $G_{/S}$, $G_{/fw}$, $G_{/bw}$, $G_{/fb}$:

$(G^\infty)_{/\equiv}$ is the same as $((G_{/\equiv})^\infty)_{/\equiv}$

Also: a sufficient condition for any $\equiv$ to admit the shortcut.

Shortcut toward the summary of $G^\infty$

Direct: $G \rightarrow$ sat. $\rightarrow G^\infty \rightarrow$ summ. $\rightarrow (G^\infty)_{/\equiv}$
Shortcut: $G \rightarrow$ summ. $\rightarrow G_{/\equiv} \rightarrow$ sat. $\rightarrow (G_{/\equiv})^\infty \rightarrow$ summ. $\rightarrow ((G_{/\equiv})^\infty)_{/\equiv}$

If $G_{/\equiv}$ is much smaller than $G$, the shortcut may be faster! Up to 20 times in our experiments [CGM17b]

Shortcut example: $G_{/W}$

[Figure: a graph $G$ over the nodes r1, r2, x, y1, y2 and z, with edges labeled a, b1, b2 and c and the schema triples $b_1 \prec_{sp} b$ and $b_2 \prec_{sp} b$, shown together with $G_{/W}$, $(G_{/W})^\infty$, $G^\infty$ and the final summary, for which $(G^\infty)_{/W} = ((G_{/W})^\infty)_{/W}$.]

Shortcut counter-example: $G_{/TW}$

[Figure: a graph $G$ over the nodes r1, r2, x, y1 and y2, with edges labeled a and b and one schema triple relating a and c, shown together with $G_{/TW}$, $(G_{/TW})^\infty = ((G_{/TW})^\infty)_{/TW}$, $G^\infty$ and $(G^\infty)_{/TW}$.]

Summary-enabled LOD cloud exploration

ILDA Inria team (E. Pietriga, H. Ozaygen)
Use summary to derive visualisation instead of the original graph (smaller, faster)

Part III

Finding insights in RDF graphs


Insight in an RDF graph

We consider an insight to be the result of an aggregation query over the RDF graph.
We focus on one-dimensional aggregates $\Rightarrow$ 2D layout

An insight is interesting if a certain measure (e.g., variance) on its set of aggregation values is high.

Problem: given a graph G, find the top-k insights

Dagger approach

Dagger: Digging for Interesting Aggregates in RDF Graphs [DMS17] (ongoing)

1. Candidate facts: Resources from G of a certain type, or having certain property sets
2. Candidate dimension: Properties of the candidate facts, with strong support and relatively few distinct values. Also: derived properties, e.g., authors count
3. Candidate measure: Another property of the candidate facts. Also: automatic value typing
4. Candidate aggregation function: Chosen depending on the measure type
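The four steps above can be illustrated with a minimal Python sketch that enumerates (dimension, measure) candidates over a set of facts and ranks them by the variance of the aggregated values. This is not Dagger's actual algorithm: the function names, the record layout, the choice of the mean as aggregation function and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pvariance


def aggregate(facts, dimension, measure, agg=mean):
    """One-dimensional aggregate: group by `dimension`, aggregate `measure`."""
    groups = defaultdict(list)
    for f in facts:
        if dimension in f and measure in f:  # skip facts missing a property
            groups[f[dimension]].append(f[measure])
    return {k: agg(v) for k, v in groups.items()}


def top_k_insights(facts, candidates, k):
    """Rank (dimension, measure) pairs by variance of their aggregated values."""
    scored = []
    for dimension, measure in candidates:
        values = list(aggregate(facts, dimension, measure).values())
        if len(values) >= 2:                 # a single group is never interesting
            scored.append((pvariance(values), dimension, measure))
    scored.sort(reverse=True)
    return [(d, m) for _, d, m in scored[:k]]


# Hypothetical candidate facts: articles with a derived "authors" count.
articles = [
    {"year": 2000, "venue": "J1", "authors": 1},
    {"year": 2000, "venue": "J2", "authors": 1},
    {"year": 2010, "venue": "J1", "authors": 3},
    {"year": 2010, "venue": "J2", "authors": 5},
]
per_year = aggregate(articles, "year", "authors")
best = top_k_insights(articles, [("year", "authors"), ("venue", "authors")], k=1)
print(per_year)  # {2000: 1, 2010: 4}
print(best)      # [('year', 'authors')]
```

On this toy data the average author count varies more across years than across venues, so the per-year aggregate is ranked first.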

Dagger-selected aggregates in DBLP data

Average number of authors of journal articles, per publication year

Number of book authors, per book publication year

The number of books by each publisher (highest: Springer)

Part IV

Conclusion


The need for RDF graph discovery tools

RDF graphs can be large and complex; they lack a prescriptive schema

Semantic rules lead to implicit data

Toward helping users to discover RDF graphs:

1 Structural quotient summaries representing the complete graph structure; compact clique-based summaries; available at: https://team.inria.fr/cedar/projects/rdfsummary/

2 Insight discovery: interesting aggregate queries; project Web page: https://team.inria.fr/cedar/projects/dagger/

Many follow-up directions: parallelization, more interestingness measures, extensions to ML.

References

[CDT13] Stephane Campinas, Renaud Delbru, and Giovanni Tummarello. Efficiency and precision trade-offs in graph summary algorithms. In IDEAS, 2013.

[CFKP15] Mariano P. Consens, Valeria Fionda, Shahan Khatchadourian, and Giuseppe Pirro. S+EPPs: Construct and explore bisimulation summaries + optimize navigational queries (demo). PVLDB, 8(12), 2015.

[CGM15a] Sejla Cebiric, Francois Goasdoue, and Ioana Manolescu. Query-oriented summarization of RDF graphs. In BICOD, 2015.

[CGM15b] Sejla Cebiric, Francois Goasdoue, and Ioana Manolescu. Query-oriented summarization of RDF graphs (demonstration). PVLDB, 8(12), 2015.

[CGM17a] Sejla Cebiric, Francois Goasdoue, and Ioana Manolescu. A framework for efficient representative summarization of RDF graphs. In International Semantic Web Conference (ISWC), 2017.

[CGM17b] Sejla Cebiric, Francois Goasdoue, and Ioana Manolescu. Query-Oriented Summarization of RDF Graphs. Research Report RR-8920, INRIA, 2017.

[DMS17] Yanlei Diao, Ioana Manolescu, and Shu Shang. Dagger: Digging for interesting aggregates in RDF graphs. In International Semantic Web Conference (ISWC), 2017.

[GW97] Roy Goldman and Jennifer Widom. Dataguides: Enabling query formulation and optimization in semistructured databases. In VLDB, 1997.

[HHK95] Monika Rauch Henzinger, Thomas A. Henzinger, and Peter W. Kopke. Computing simulations on finite and infinite graphs. In FOCS, 1995.

[LYL13] Shou-De Lin, Mi-Yen Yeh, and Cheng-Te Li. Sampling and summarization for social networks (tutorial), 2013.

[NM11] Thomas Neumann and Guido Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In ICDE, 2011.

[ZDYZ14] Haiwei Zhang, Yuanyuan Duan, Xiaojie Yuan, and Ying Zhang. ASSG: adaptive structural summary for RDF graph data. In ISWC (Posters and Demonstrations), 2014.