Tommaso Soru, Edgard Marx, Axel-Cyrille Ngonga Ngomo
AKSW, Department of Computer Science, University of Leipzig, Germany
May 22, 2015 WWW 2015 — Florence, Italy
ROCKER: A Refinement Operator for Key Discovery
State of the Linked Open Data cloud.
353 accessible RDF datasets; ~74 billion triples.
Sources: State of the LOD cloud, LODStats, 2015.
Decentralized data publication.
The real-world entity “Florence, Italy” is described in DBpedia, LinkedGeoData and GeoNames.
Unique descriptions of resources.
Entity search.
Data integration.
Linked data compression.
Link discovery.
Question answering.
Data quality.
Keys.
Background.
A key is a set of properties which can distinguish all instances of a class in a knowledge base.
[Figure: example RDF graph with the nodes :Oceans_Eleven, :The_Mexican, :Brad_Pitt and :Julia_Roberts, four :hasActor edges, and the rdfs:label values “Ocean’s Eleven”, “The Mexican”, “Brad Pitt” and “Julia Roberts”.]
A key is a minimal key if none of its subsets is also a key.
Class: dbpedia-owl:Film

candidate key             distinguishable resources   key?   min-key?
{rdfs:label}              2 / 2                       yes    yes
{:hasActor}               1 / 2                       no     no
{rdfs:label, :hasActor}   2 / 2                       yes    no
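The yes/no columns of the table can be reproduced with a small sketch. The data layout (a dict from resource to property values), the helper names `description`, `is_key` and `is_minimal_key`, and the reading of the example graph in which both films share both actors are assumptions made for illustration, not part of ROCKER.

```python
from itertools import combinations

# Hypothetical encoding of the example graph: resource -> property -> set of values.
# Assumption: both films have both actors, as the four :hasActor edges suggest.
FILMS = {
    ":Oceans_Eleven": {"rdfs:label": {"Ocean's Eleven"},
                       ":hasActor": {":Brad_Pitt", ":Julia_Roberts"}},
    ":The_Mexican": {"rdfs:label": {"The Mexican"},
                     ":hasActor": {":Brad_Pitt", ":Julia_Roberts"}},
}

def description(resource, props, data):
    """Values of `resource` restricted to `props`, in a hashable form."""
    return tuple(frozenset(data[resource].get(p, ())) for p in sorted(props))

def is_key(props, data):
    """A key gives every instance a different description."""
    seen = {description(r, props, data) for r in data}
    return len(seen) == len(data)

def is_minimal_key(props, data):
    """A minimal key is a key none of whose non-empty proper subsets is a key."""
    if not is_key(props, data):
        return False
    return not any(
        is_key(set(sub), data)
        for k in range(1, len(props))
        for sub in combinations(sorted(props), k)
    )

for candidate in ({"rdfs:label"}, {":hasActor"}, {"rdfs:label", ":hasActor"}):
    print(sorted(candidate), is_key(candidate, FILMS), is_minimal_key(candidate, FILMS))
```

Comparing whole value sets per property means that two films count as different only if at least one property has a different set of objects.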
A set of properties is called an n-almost-key for a class if it can distinguish all except n instances of that class.
[Figure: example films and their :filmedIn values. :Interstellar: :Canada, :Iceland, :United_States. :Blade_Runner: :United_States, :United_Kingdom. :2001_A_Space_Odyssey: :United_States, :United_Kingdom. :WALL-E: no value shown. One instance is marked ✗.]
ROCKER’s score function.
The score function expresses the rate of distinguishable instances in a class, given a set of properties (i.e., a candidate key).

[Figure: the instances :Interstellar, :Blade_Runner, :2001_A_Space_Odyssey and :WALL-E, one of them marked ✗.]

$$\mathrm{score}(\{\text{:filmedIn}\}) = \frac{\left|\{\, s \in S : \forall s' \in S,\ s \neq s' \Rightarrow \mathrm{discr}(s, s', \{\text{:filmedIn}\}) \,\}\right|}{|S|} = .75$$

An n-almost-key has a score of at least

$$\alpha = \frac{|S| - n}{|S|}$$
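As a rough illustration, the score of a single property and the n-almost-key threshold can be computed as below. The mapping of :filmedIn values to films is read off the example figure and is an assumption. The counting rule is an assumption too: the sketch counts distinct descriptions, because that reading reproduces the .75 shown above, whereas a literal reading of the set-builder formula would exclude both members of an indistinguishable pair. The function names `score` and `alpha` are illustrative only.

```python
# Hypothetical data read off the slide's example; the value-to-film mapping is an assumption.
FILMED_IN = {
    ":Interstellar": {":Canada", ":Iceland", ":United_States"},
    ":Blade_Runner": {":United_States", ":United_Kingdom"},
    ":2001_A_Space_Odyssey": {":United_States", ":United_Kingdom"},
    ":WALL-E": set(),  # no :filmedIn value (null), still a valid description
}

def score(values_by_instance):
    """Share of distinct descriptions among all instances (assumed counting rule)."""
    distinct = {frozenset(values) for values in values_by_instance.values()}
    return len(distinct) / len(values_by_instance)

def alpha(num_instances, n):
    """Minimum score of an n-almost-key: (|S| - n) / |S|."""
    return (num_instances - n) / num_instances

s = score(FILMED_IN)
print(s)                                # 0.75
print(s >= alpha(len(FILMED_IN), 1))    # True: {:filmedIn} is a 1-almost-key
print(s >= alpha(len(FILMED_IN), 0))    # False: {:filmedIn} is not a key
```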
Contribution #1
A more complete definition of key.
All object values are considered (e.g., :United_States).
Null values are accepted (e.g., :WALL-E).
[Figure: :filmedIn values of :Interstellar (:Canada, :Iceland, :United_States) and :Blade_Runner (:United_States, :United_Kingdom); :WALL-E is shown without a :filmedIn value.]
Properties of a key.
Key monotonicity. Adding a property to a key yields another key.
{:p1, :p2} ∪ {:p3} = {:p1, :p2, :p3}
Non-key monotonicity. Removing a property from a non-key yields another non-key.
{:p1, :p2, :p4} ∖ {:p2} = {:p1, :p4}
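The two monotonicity properties translate directly into subset tests, which is what makes them usable for pruning. The sketch below assumes candidates are stored as Python frozensets; the function names `implied_key` and `implied_non_key` are hypothetical.

```python
def implied_key(candidate, known_keys):
    """Key monotonicity: any superset of a known key is itself a key."""
    return any(key <= candidate for key in known_keys)

def implied_non_key(candidate, known_non_keys):
    """Non-key monotonicity: any subset of a known non-key is itself a non-key."""
    return any(candidate <= non_key for non_key in known_non_keys)

# Scenario 1 (independent of scenario 2): {:p1, :p2} is assumed to be a known key.
keys = [frozenset({":p1", ":p2"})]
print(implied_key(frozenset({":p1", ":p2", ":p3"}), keys))    # True
print(implied_key(frozenset({":p1", ":p3"}), keys))           # False: must be scored

# Scenario 2: {:p1, :p2, :p4} is assumed to be a known non-key.
non_keys = [frozenset({":p1", ":p2", ":p4"})]
print(implied_non_key(frozenset({":p1", ":p4"}), non_keys))   # True
print(implied_non_key(frozenset({":p1", ":p3"}), non_keys))   # False: must be scored
```

When either test returns True, the candidate's status is already known and its score does not need to be computed.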
Proposed approach.
We adopt a refinement operator to refine candidates.
[Figure: refinement tree over three properties, from ∅ through {:p1}, {:p2}, {:p3} and {:p1, :p2}, {:p1, :p3}, {:p2, :p3} to {:p1, :p2, :p3}.]
Pro. The score function induces a quasi-ordering ‘≼’ over the set of all candidates.
P ≼ Q means score(P) ≤ score(Q)
Contra. Visiting the refinement tree is an intractable problem!
n properties → 2ⁿ − 1 nodes
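To make the size of the search space concrete, the sketch below assumes that refining a candidate means adding exactly one property to it, which matches the tree in the figure but is not confirmed as ROCKER's exact operator. The names `refine` and `all_candidates` are hypothetical.

```python
from itertools import combinations

def refine(candidate, properties):
    """Children of a candidate: add exactly one property not yet in it."""
    return [candidate | {p} for p in sorted(properties - candidate)]

def all_candidates(properties):
    """Every non-empty subset of the properties."""
    props = sorted(properties)
    return [frozenset(c)
            for k in range(1, len(props) + 1)
            for c in combinations(props, k)]

PROPS = frozenset({":p1", ":p2", ":p3"})
print(refine(frozenset(), PROPS))    # the three singletons
print(len(all_candidates(PROPS)))    # 7, i.e. 2**3 - 1
```

With 30 properties the same count is already above one billion, which is why the exhaustive visit is not an option.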
Solutions to intractability.
Prune branches using key monotonicity:
  for all descendants of a key;
  for all ancestors of a non-key.
Consider only a subset of popular properties.
Provide a “fast search” option which selects one of the multiple discovery strategies.
Algorithm.
[Flowchart. Steps: Frontier := {∅}; Sort by score; Top el. score? (branches: < α, ≥ α); Halt; Refine pivot, remove pivot & add children to frontier; Has children? (yes / no); Next child; Ancestor of !key? (true / false); Descendant of key? (yes / no); Score? (< α, ≥ α); Add to keys; Add to !keys.]
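One possible reading of the flowchart is a best-first search over the refinement tree. The sketch below is a simplification under stated assumptions, not the authors' implementation: it halts when the frontier is empty (the flowchart's Halt condition may differ), it keeps only non-keys in the frontier, it skips descendants of known keys instead of recording them, and it scores a candidate as the share of distinct descriptions. The toy rows and the names `make_score` and `discover_keys` are hypothetical.

```python
def make_score(rows):
    """Build score(P): share of distinct descriptions under P (assumed counting rule)."""
    def score(props):
        ordered = sorted(props)
        # row.get(p) yields None for a missing value, so nulls are accepted.
        distinct = {tuple(row.get(p) for p in ordered) for row in rows}
        return len(distinct) / len(rows)
    return score

def discover_keys(properties, score, alpha=1.0):
    keys, non_keys = [], []
    frontier, visited = [frozenset()], {frozenset()}
    while frontier:                                   # assumed halt: frontier exhausted
        frontier.sort(key=score, reverse=True)        # "Sort by score"
        pivot = frontier.pop(0)                       # top element becomes the pivot
        for prop in sorted(properties - pivot):       # "Refine pivot"
            child = pivot | {prop}
            if child in visited:
                continue
            visited.add(child)
            if any(child <= nk for nk in non_keys):   # "Ancestor of !key?"
                non_keys.append(child)
                frontier.append(child)
            elif any(k <= child for k in keys):       # "Descendant of key?"
                continue                              # skipped here: cannot be minimal
            elif score(child) >= alpha:               # "Score?"
                keys.append(child)
            else:
                non_keys.append(child)
                frontier.append(child)
    # Best-first order may find a key before one of its subsets; keep minimal ones only.
    return [k for k in keys if not any(other < k for other in keys)]

# Toy instances with single-valued properties (illustrative data only).
ROWS = [
    {":p1": "a", ":p2": "x", ":p3": "1"},
    {":p1": "a", ":p2": "y", ":p3": "1"},
    {":p1": "b", ":p2": "x", ":p3": "2"},
    {":p1": "b", ":p2": "y", ":p3": "3"},
]
PROPERTIES = frozenset({":p1", ":p2", ":p3"})
for key in discover_keys(PROPERTIES, make_score(ROWS)):
    print(sorted(key))
```

On this toy data {:p1, :p2, :p3} is never scored, because it is reached only after {:p2, :p3} has been recorded as a key.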
Refinement operator.
[Animated figure: step-by-step traversal of the refinement tree over {:p1, :p2, :p3}. Legend: frontier, min-keys, max-non-keys, unvisited nodes, visited nodes.]
Related work on key discovery.
Linkkey (Atencia et al., 2014)
• Tool able to retrieve keys.
• Relies on an incomplete definition of key.
• State of the art for small datasets.
SAKey (Symeonidou et al., 2014)
• Tool able to retrieve keys and n-almost-keys.
• Relies on an incomplete definition of key.
• State of the art on bigger datasets.
KD2R (Symeonidou et al., 2011)
• Tool able to retrieve keys.
• Relies on an incomplete definition of key.
Evaluation.
Runtime.
Memory consumption.
Quality of the keys found.
Results – Runtime.
Dataset                           ROCKER      Linkkey    SAKey
OAEI Restaurant1 (10              1,880       1,698      1,028
DBpedia Person Function (10       14,565      OutOfMem   6,221
DBpedia Career Station (10        79,964      OutOfMem   2,199,854
DBpedia Organisation Member (10   1,075,679   227,336    OutOfMem
DBpedia Village (10               4,224,338   OutOfMem   OutOfMem
DBpedia Musical Work (10          2,524,120   OutOfMem   OutOfMem
Dataset sizes in triples. Results in milliseconds.
Results – RAM consumption.
Dataset                           ROCKER    Linkkey    SAKey
OAEI Restaurant1 (10              ~5 MB     ~2 MB      ~2 MB
DBpedia Person Function (10       2.5 GB    > 16 GB    1.8 GB
DBpedia Career Station (10        3.5 GB    > 16 GB    14.0 GB
DBpedia Organisation Member (10   3.8 GB    14.5 GB    > 16 GB
DBpedia Village (10               4.1 GB    > 16 GB    > 16 GB
DBpedia Musical Work (10          5.0 GB    > 16 GB    > 16 GB
Dataset sizes in triples. Experiments were run on an Ubuntu Linux machine with 16 GB of RAM.
Runtime by threshold.
Retrieve all candidates whose score is above a threshold α.
[Chart: runtime for α = 1 and α = .999. Results in milliseconds.]
[Chart: runtime (ms) by threshold. Results for dataset dbpedia:Monument.]
Contributions.
Complete definition of keys by considering multi-object properties and null values.
Better scalability, in terms of:
  faster execution on larger datasets;
  lower memory consumption.
Running ROCKER without restrictions is guaranteed to return minimal keys.
Info and future work.
ROCKER is part of LIMES, a link discovery framework. Its source code is online at http://github.com/AKSW/rocker.
A demo is currently under development to show how ROCKER can improve data quality by searching for n-almost-keys.
We will evaluate ROCKER within the link discovery workflow, i.e., how can keys help find good link specifications?
Tommaso Soru PhD student at University of Leipzig
Room P905, Fakultät für Mathematik und Informatik Augustusplatz 10, D-04109 Leipzig, Germany
http://tommaso-soru.it
Proceedings http://www.www2015.it/documents/proceedings/proceedings/p1025.pdf