Workload Matters: Why RDF Databases Need a New Design

Güneş Aluç, M. Tamer Özsu, Khuzaima Daudjee

Outline

• Why do RDF data management systems need a new design?

• How do we envision RDF data management systems to be re-designed?

A Running Example

Consider the following SPARQL query, drawn as a graph pattern:

Tamer hasPost → ?post ← ??? ?person worksAt → UWaterloo

where ??? stands for likes, taggedIn, retweets, favorites, etc.
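As a rough illustration, the running example can be written as three triple patterns and evaluated with a naive nested-loop matcher. This sketch is not taken from the slides: the triples are transcribed from the example tables, `UWaterloo` is used where the tables abbreviate it to `UW`, and the helper names (`match`, `evaluate`) and the variable `?p` standing in for `???` are assumptions made for illustration.

```python
# Example triples (subject, predicate, object) transcribed from the slides' tables.
TRIPLES = [
    ("Tamer", "hasPost", "Post2"), ("Tamer", "hasPost", "Post23"),
    ("Tamer", "hasPost", "Post235"), ("Tamer", "hasPost", "Post2357"),
    ("Tamer", "hasPost", "Post23571"),
    ("Bob", "favorites", "Post235"), ("Gunes", "favorites", "Post1"),
    ("Olaf", "favorites", "Post234"),
    ("Alice", "likes", "Post23"), ("Ken", "likes", "Post24"),
    ("Gunes", "retweets", "Post2358"), ("Ken", "retweets", "Post23570"),
    ("Alice", "taggedIn", "Post2"), ("Bob", "taggedIn", "Post2357"),
    ("Olaf", "taggedIn", "Post23571"),
    ("Gunes", "worksAt", "UWaterloo"), ("Ken", "worksAt", "UWaterloo"),
    ("Olaf", "worksAt", "UWaterloo"),
]

# The query as triple patterns; "?p" is a hypothetical name for the unbound predicate (???).
QUERY = [
    ("Tamer", "hasPost", "?post"),
    ("?person", "?p", "?post"),
    ("?person", "worksAt", "UWaterloo"),
]


def match(pattern, triple, binding):
    """Return `binding` extended so that `pattern` equals `triple`, or None on mismatch."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier, if any.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def evaluate(patterns, triples):
    """Naive evaluation: join the patterns one by one with nested loops."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


results = evaluate(QUERY, TRIPLES)
print(results)
```

With only the rows that the tables show explicitly, the single match binds `?person` to Olaf and `?post` to Post23571.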

Single Table Layout

P        S      O
…        …      …
hasPost  …      …
hasPost  Tamer  Post2
hasPost  Tamer  Post23
hasPost  Tamer  Post235
hasPost  Tamer  Post2357
hasPost  Tamer  Post23571
hasPost  …      …
…        …      …

O          S      P
…          …      …
Post1      Gunes  …
Post2      Alice  …
Post23     Alice  …
Post24     Ken    …
Post234    Olaf   …
Post235    Bob    …
Post2357   Bob    …
Post2358   Gunes  …
Post23570  Ken    …
Post23571  Olaf   …
…          …      …

P        O   S
…        …   …
worksAt  …   …
worksAt  UW  Gunes
worksAt  UW  Ken
worksAt  UW  Olaf
worksAt  …   …
…        …   …

Triple pattern: Tamer hasPost ?post

Triple pattern: ?person worksAt UWaterloo

Triple pattern: ?person ??? ?post

(1) Many irrelevant intermediate result tuples

(2) These tuples are fragmented across the OSP table

(3) Indexes are not very useful in locating the relevant tuples
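The three problems can be made concrete with a small sketch of an OSP lookup. It is an illustration under stated assumptions, not the authors' implementation: it holds only the person-to-post rows shown in the OSP table, orders rows by the numeric suffix of the post name to mirror the slide's row order, and uses hypothetical names (`post_id`, `lookup`, `WORKS_AT_UW`).

```python
from bisect import bisect_left

# Person-to-post triples (subject, predicate, object) from the example tables.
PERSON_POST = [
    ("Gunes", "favorites", "Post1"), ("Alice", "taggedIn", "Post2"),
    ("Alice", "likes", "Post23"), ("Ken", "likes", "Post24"),
    ("Olaf", "favorites", "Post234"), ("Bob", "favorites", "Post235"),
    ("Bob", "taggedIn", "Post2357"), ("Gunes", "retweets", "Post2358"),
    ("Ken", "retweets", "Post23570"), ("Olaf", "taggedIn", "Post23571"),
]
TAMER_POSTS = ["Post2", "Post23", "Post235", "Post2357", "Post23571"]
WORKS_AT_UW = {"Gunes", "Ken", "Olaf"}


def post_id(post):
    """Numeric suffix of a post name, e.g. 'Post235' -> 235 (assumed naming scheme)."""
    return int(post[len("Post"):])


# OSP permutation: rows stored as (object, subject, predicate), sorted by object.
OSP = sorted(((o, s, p) for s, p, o in PERSON_POST),
             key=lambda row: (post_id(row[0]), row[1], row[2]))
KEYS = [post_id(row[0]) for row in OSP]


def lookup(post):
    """Return (row position, row) for every OSP row whose object is `post`."""
    key = post_id(post)
    i = bisect_left(KEYS, key)
    hits = []
    while i < len(OSP) and KEYS[i] == key:
        hits.append((i, OSP[i]))
        i += 1
    return hits


# Join "Tamer hasPost ?post" with "?person ??? ?post" through the OSP table.
intermediate = [hit for post in TAMER_POSTS for hit in lookup(post)]
positions = [pos for pos, _ in intermediate]

# Only now can "?person worksAt UWaterloo" discard the irrelevant tuples.
final = [(o, s) for _, (o, s, p) in intermediate if s in WORKS_AT_UW]

print("row positions touched:", positions)
print("intermediate tuples:", len(intermediate), "final results:", len(final))
```

On this data the lookups touch rows 1, 2, 5, 6 and 9 of the ten-row table, so the accessed rows are not contiguous, and four of the five intermediate tuples are discarded by the last pattern.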

Group-by-Predicates

favorites
S      O
…      …
Bob    Post235
Gunes  Post1
Olaf   Post234
…      …

likes
S      O
…      …
Alice  Post23
Ken    Post24
…      …

retweets
S      O
…      …
Gunes  Post2358
Ken    Post23570
…      …

taggedIn
S      O
…      …
Alice  Post2
Bob    Post2357
Olaf   Post23571
…      …

Triple pattern: ?person ??? ?post
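One way to read this slide: when the predicate of a triple pattern is unbound, a predicate-partitioned store cannot tell which table holds the answer, so every table has to be consulted. The sketch below illustrates that reading; the dictionary layout and the function name `people_for_post` are assumptions, not part of the original design.

```python
# One (subject, object) table per predicate, as in the Group-by-Predicates slide.
TABLES = {
    "favorites": [("Bob", "Post235"), ("Gunes", "Post1"), ("Olaf", "Post234")],
    "likes": [("Alice", "Post23"), ("Ken", "Post24")],
    "retweets": [("Gunes", "Post2358"), ("Ken", "Post23570")],
    "taggedIn": [("Alice", "Post2"), ("Bob", "Post2357"), ("Olaf", "Post23571")],
}


def people_for_post(post, predicate=None):
    """Return ((person, predicate) matches, number of tables visited).

    With `predicate=None` the predicate is unbound, so all tables are scanned.
    """
    names = [predicate] if predicate is not None else list(TABLES)
    hits = [(s, name) for name in names for s, o in TABLES[name] if o == post]
    return hits, len(names)


print(people_for_post("Post2357"))              # unbound predicate: all tables visited
print(people_for_post("Post2357", "taggedIn"))  # bound predicate: one table visited
```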

Group-by-Entities

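FacebookEntities (likes, taggedIn)
       Post2  Post23  Post24  Post2357  Post23571
Alice  X      X       -       -         -
Bob    -      -       -       X         -
Gunes  -      -       -       -         -
Ken    -      -       X       -         -
Olaf   -      -       -       -         X

TwitterEntities (retweets, favorites)
       Post1  Post234  Post235  Post2358  Post23570
Alice  -      -        -        -         -
Bob    -      -        X        -         -
Gunes  X      -        -        X         -
Ken    -      -        -        -         X
Olaf   -      X        -        -         -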

Triple pattern: ?person ??? ?post

Group-by-Vertices

Post1 ← favorites Gunes
Post2 ← taggedIn Alice
Post23 ← likes Alice
Post24 ← likes Ken
Post234 ← favorites Olaf
Post235 ← favorites Bob
Post2357 ← taggedIn Bob
Post2358 ← retweets Gunes
Post23570 ← retweets Ken
Post23571 ← taggedIn Olaf

Triple pattern: ?person ??? ?post

Does The Winner Take It All?

• With a single query, we were able to conceptually show problems with existing solutions

• SPARQL workloads that RDF data management systems support
  – contain a very diverse selection of queries
  – and this selection of queries changes dynamically

[Figure: mean query execution time in milliseconds (y-axis, ticks at powers of ten from 1 to 100000) against percentage of test query templates (x-axis, 1 to 98). Series: RDF-3x, Fastest System.]

G. Aluç, O. Hartig, M. T. Özsu and K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In Proc. International Semantic Web Conference, 2014. Forthcoming.

Group-by-Query

RDF Physical Design

• Fixed
  – Single Table Layout
  – Group-by-Predicates
  – Group-by-Entities
  – Group-by-Vertices

• Workload-Driven
  – Group-by-Query

Outline

• Why do RDF data management systems need a new design?

• How do we envision RDF data management systems to be re-designed?
  – Group-by-Query Representation
  – Partial Tuning

Group-by-Query

Tamer hasPost → ?post ← ??? ?person worksAt → UWaterloo

Tamer hasPost → Post23571 ← taggedIn Olaf worksAt → UWaterloo

Tamer hasPost → Post2357 ← taggedIn Bob worksAt → UWaterloo
Tamer hasPost → Post235 ← favorites Bob

Tamer hasPost → Post23 ← likes Bob worksAt → UWaterloo
Tamer hasPost → Post2 ← taggedIn Bob

Group-by-Query (Advantages)

• Triples relevant to the evaluation of a query are physically clustered

• Indexes are more efficient in localizing query evaluation to only the relevant triples

• Fewer intermediate result tuples are generated
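A minimal sketch of the clustering idea, specialised to the running-example query: the triples that take part in a match are collected into one group per matching person, so that a query touching that person reads one cluster instead of scattered rows. The function name `cluster_by_query` and the per-person grouping are illustrative assumptions. The `worksAt` triple for Bob follows the Group-by-Query slide, and `UWaterloo` is used where the tables say `UW`.

```python
from collections import defaultdict

# A subset of the example triples (subject, predicate, object).
TRIPLES = [
    ("Tamer", "hasPost", "Post235"), ("Tamer", "hasPost", "Post2357"),
    ("Tamer", "hasPost", "Post23571"), ("Tamer", "hasPost", "Post2"),
    ("Bob", "favorites", "Post235"), ("Bob", "taggedIn", "Post2357"),
    ("Olaf", "taggedIn", "Post23571"), ("Alice", "taggedIn", "Post2"),
    ("Ken", "likes", "Post24"),
    ("Bob", "worksAt", "UWaterloo"), ("Olaf", "worksAt", "UWaterloo"),
    ("Ken", "worksAt", "UWaterloo"),
]


def cluster_by_query(triples, author, org):
    """Group the triples of each match of
    `author hasPost ?post . ?person ??? ?post . ?person worksAt org` by ?person."""
    posts = {o for s, p, o in triples if s == author and p == "hasPost"}
    staff = {s for s, p, o in triples if p == "worksAt" and o == org}
    clusters = defaultdict(list)
    for s, p, o in triples:
        if s in staff and o in posts:
            if s not in clusters:
                # Store the shared worksAt triple once per cluster.
                clusters[s].append((s, "worksAt", org))
            clusters[s].append((author, "hasPost", o))
            clusters[s].append((s, p, o))
    return dict(clusters)


clusters = cluster_by_query(TRIPLES, "Tamer", "UWaterloo")
for person, group in clusters.items():
    print(person, group)
```

Triples that match no part of the query (Ken's and Alice's in this subset) end up in no cluster, which is where the saving in intermediate results comes from.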

Challenge: Dynamism

1) Types of queries
2) Parts of the database that are being queried
3) Hotspots

Proposal #1: Updating Physical Storage Layout

Initially, triples are not clustered in the storage system for any particular workload

As queries are executed (that is, as triples flow through the cache), there is an opportunity to cluster (hot) triples that are co-accessed within the same query or across multiple queries

Assume a hash function (oracle) decides on a good placement of triples and that the hash function is capable of adapting to changing workloads

Then, one of the challenges is to develop this hash function
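The slides leave the hash function open. Purely as a thought experiment, the sketch below shows one naive way a placement function could adapt: every triple starts in a bucket chosen by a fixed hash, and whenever a query touches a set of triples together they are all moved to the bucket that most of them already occupy. The class `AdaptivePlacement`, its methods, and the majority-vote heuristic are hypothetical and are not the authors' proposal.

```python
import zlib
from collections import Counter


class AdaptivePlacement:
    """Hypothetical adaptive placement: a fixed hash plus workload-learned overrides."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        self.learned = {}  # triple -> bucket chosen after observing queries

    def bucket(self, triple):
        """Current bucket of a triple: learned placement if any, else a fixed hash."""
        if triple in self.learned:
            return self.learned[triple]
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
        return zlib.crc32(repr(triple).encode("utf-8")) % self.num_buckets

    def observe(self, accessed):
        """Record that one query accessed these triples together and co-locate them."""
        accessed = list(accessed)
        if not accessed:
            return
        votes = Counter(self.bucket(t) for t in accessed)
        target = votes.most_common(1)[0][0]  # bucket already holding most of them
        for triple in accessed:
            self.learned[triple] = target


placement = AdaptivePlacement(num_buckets=8)
query_match = [
    ("Tamer", "hasPost", "Post23571"),
    ("Olaf", "taggedIn", "Post23571"),
    ("Olaf", "worksAt", "UWaterloo"),
]
placement.observe(query_match)
co_located = len({placement.bucket(t) for t in query_match}) == 1
print("co-located after one query:", co_located)
```

A real design would also have to bound bucket sizes and forget placements that are no longer hot, which is part of what makes developing such a function a challenge.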

Proposal #2: Partial Indexing

On top of the aforementioned scheme, consider an index that returns false positives, that is, irrelevant triples (striped), for some queries in the workload

This is not a big problem, because these false-positive triples can be eliminated from the query evaluation pipeline with only a little extra computational cost

On the other hand, this index is much easier to update and maintain
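The trade-off can be sketched with a deliberately coarse index: triples are filed under a hash of their object modulo a small number of slots, so a lookup may return triples for other objects (false positives), and a cheap filter removes them afterwards. The class `LossyIndex` and the function `lookup` are assumptions chosen for illustration, not the index described in the paper.

```python
import zlib


class LossyIndex:
    """Hypothetical coarse index: a few slots keyed by a hash of the triple's object."""

    def __init__(self, num_slots):
        self.slots = [[] for _ in range(num_slots)]

    def _slot(self, obj):
        return zlib.crc32(obj.encode("utf-8")) % len(self.slots)

    def insert(self, triple):
        # Maintenance is a single append; no per-key structure has to be rebalanced.
        self.slots[self._slot(triple[2])].append(triple)

    def candidates(self, obj):
        """All triples in the slot of `obj`; may include triples with other objects."""
        return list(self.slots[self._slot(obj)])


def lookup(index, obj):
    """Return (exact matches, number of false positives filtered out)."""
    candidates = index.candidates(obj)
    exact = [t for t in candidates if t[2] == obj]
    return exact, len(candidates) - len(exact)


TRIPLES = [
    ("Gunes", "favorites", "Post1"), ("Alice", "taggedIn", "Post2"),
    ("Alice", "likes", "Post23"), ("Ken", "likes", "Post24"),
    ("Olaf", "favorites", "Post234"), ("Bob", "favorites", "Post235"),
    ("Bob", "taggedIn", "Post2357"), ("Gunes", "retweets", "Post2358"),
    ("Ken", "retweets", "Post23570"), ("Olaf", "taggedIn", "Post23571"),
]

index = LossyIndex(num_slots=4)
for triple in TRIPLES:
    index.insert(triple)

exact, false_positives = lookup(index, "Post2357")
print(exact, "false positives filtered:", false_positives)
```

Fewer slots mean cheaper maintenance and more false positives per lookup; the filter guarantees that the final answer is still exact.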

Proposal #n…

• In the paper

Conclusions

• Problems with fixed, workload-oblivious approaches

• Purely workload-driven design is compelling but not trivial, especially when it comes to adapting to dynamic workloads
