25
WORQ: Wo rkload-Driven R DF Q uery Processing Department of Computer Science Purdue University Amgad Madkour (Purdue) Ahmed Aly (Google) Walid G. Aref (Purdue)

WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

WORQ: Workload-Driven RDF Query Processing

Department of Computer SciencePurdue University

Amgad Madkour (Purdue)Ahmed Aly (Google)Walid G. Aref (Purdue)

Page 2: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

IntroductionRDF Data Is Everywhere

• RDF is an integral component in many systems:­ Semantic Search, Smart Governments (Data.gov), Medical Systems

• (Linked) RDF data contains very rich relations:­ Data.gov – 5 billion triples­ Linked Cancer Genome Atlas – 7.36 billion triples­ US Census Data – 1 billion triples

• Cloud-based systems are ideal for RDF data management (e.g., Storage, Query Processing)

Figure: Linked RDF Data Cloud containing thousands of datasets

2

Page 3: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

IntroductionProcessing RDF Queries

• Network shuffling overhead degradesquery performance in a distributed environment

• Intermediate results represent the data that satisfies the binary join and contributes to the final result of the query

• Reducing the network shuffling relies on how the data is partitioned across the nodes and the intermediate results size

3

SELECT ?x ?y WHERE {?x :mention :Mary . ?x :tweet ?y . }

mention

SUB OBJ

:John :T1

:Mike :T2

Mike :T3

:Alex :T4

tweet

SUB OBJ

:John :Mary

:John :Mike

:Alex :Mary

:Sally :Mike

:Mary :John

Original Data

Reductions

Join(mention sub , tweet sub )

Join(tweet sub , mention sub )

SUB OBJ

:John :Mary

:John :Mike

:Alex :Mary

SUB OBJ

:John :T1

:Alex :T4

Page 4: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Problem Statement

• Data partitioning incurs a preprocessing overhead as it needs to be performed over the whole data

• Intermediate results may contain redundant data triples that do not match all the query joins

• Caching the unique query results incurs significant memory storage overhead

4

Page 5: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Proposal• We present online method for computing reductions of RDF data

using Bloom filters

• We present workload-driven partitioning of RDF triples that can join together in order to minimize the network shuffling overhead

• We show that caching the RDF join reductions can boost the query performance while keeping the cache size minimal

• We study an efficient technique for answering RDF queries with unbound properties using Bloom filters

5

Page 6: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Online Reduction of RDF DataJoin Patterns

• SPARQL queries consist of Basic Graph Patterns (BGP)

• Every BGP consists of a set of triples

• Join patterns represent correlations between triples in a SPARQL Basic Graph Pattern (BGP)

SELECT ?x ?y ?wWHERE ?x :tweet :T1?x :mention ?y?y :likes ?w

tweet_S_JOIN_mention_Smention_S_JOIN_tweet_Smention_O_JOIN_likes_Slikes_S_JOIN_mention_O

Join Patterns

mention

likes

tweet

6

Page 7: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Subject Object

:John :Mary

:John :Mike

:Alex :Mary

:Sally :Mike

:Mary :John

Online Reduction of RDF DataBloom Join

Subject Object

:John :T1

:Mike :T2

:Mike :T3

:Alex :T4

:John :Alex:Sally:Mary

:John :Mike :Alex

SELECT ?x ?y WHERE

?x :mention :Mary . ?x :tweet ?y .

joinx(:mentionsub,:tweetsub)

SUB OBJ

:John :Mary

:John :Mike

:Alex :MaryBloomFilter sub(tweet)

BloomFilter sub(mention)

SUB OBJ

:John :T1

:Alex :T4

BGPJoin

mention tweet

Reduced Triples

Probe

ProbeSelection(:Mary)

?x ?y

:John :T1

:Alex :T4

Result

JoinSPARQL Query

Determine join patterns

mention tweet

Page 8: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Online Reduction of RDF DataN-ary Join

8

SELECT ?x ?y ?z ?w WHERE ?x :mention ?y?x :tweet ?z?x :likes ?w

Subject Object

:John :Mary

:John :Mike

:Alex :Mary

:Sally :Mike

:Mary :John

Subject Object

:John :T1

:Mike :T2

:Mike :T3

:Alex :T4

Subject Object

:John :Alex

:Sally :Mike

mention tweet likes

:John :Alex:Sally:Mary

:John :Mike:Alex

:John:Sally

Subject Object

:John :Mary

:John :Mike

Subject Object

:John :T1

Subject Object

:John :Alex

?y ?y ?z ?w

:John :Mary :T1 :Alex

:John :Mike :T1 :Alex

Reduction

Reduction

Reduction

Result

Query

Computed from

Page 9: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Online Reduction of RDF DataCaching

9

SELECT ?x ?y WHERE {?x :mention :Mary . ?x :tweet ?y . }

mention

SUB OBJ

:John :T1

:Mike :T2

Mike :T3

:Alex :T4

tweet

SUB OBJ

:John :Mary

:John :Mike

:Alex :Mary

:Sally :Mike

:Mary :John

Original Data

Reductions

Join(mention sub , tweet sub )

Join(tweet sub , mention sub )

SUB OBJ

:John :Mary

:John :Mike

:Alex :Mary

SUB OBJ

:John :T1

:Alex :T4

Selection(:Mary)

CACHED

Page 10: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Workload-Driven PartitioningOverview

10

Subject Object

:John :Mary

:John :Mike

:Alex :Mary

:Sally :Mike

:Mary :John

Machine 1 Machine 2 Machine 3

Subject Object

:John :T1

:Mike :T2

:Mike :T3

:Alex :T4

:John :Mary:John :Mike

:John :T1

:Alex :Mary:Mary :John

:Alex :T4

:Sally :Mike

:Mike :T2:Mike :T3

mention tweet

mention tweet

Page 11: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Workload-Driven PartitioningProposal

SELECT ?x ?y ?wWHERE ?x :tweet :T1?x :mention ?y?y :likes ?w

Reductions Reduction ID

Reduction(tweetsub, mentionsub) R1

Reduction(mentionsub, tweetsub) R2

Reduction(likessub, mentionobj) R3

Subject Object

:John :T1

:Alex :T4

Reduction 1 (R1)

Subject Object

:John :Mary

:John :Mike

:Alex :Mary

Reduction 2 (R2)

Subject Object

:John :Alex

Reduction 3 (R3)

11

Machine 1

Machine 2

:John :T1

:Alex :T4

:John :Alex

:John :Mary:John :Mike

Part

ition

ing

R1

R1

R3

R2

Possible Reductions Redu

ctio

ns:Alex :Mary R2

Page 12: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Queries with Unbound PropertiesOverview

SELECT ?x ?zWHERE ?x ?z :Mike

Scan All Tables

QUERY: Check all tables for Obj = ‘:Mike’

12

Page 13: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Queries with Unbound PropertiesProposal

SELECT ?xWHERE {:Mary ?x ?y . }

13

:John :Alex:Sally:Mary

:John :Mike :Alex

BloomFilter sub (:tweet)

BloomFilter sub (:mention)

:John :Sally

BloomFilter sub (:like)

[MATCH]

[DOES NOT MATCH]

[FALSE POSITIVE MATCH]

Probe all existing Bloom Filters

IDENTIFICATION VERIFICATION

:mention :like

Filter sub

(:Mary)Filter sub

(:Mary)

1 Entry 0 Entry

?x

:mention

Result

Page 14: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Experimental Setup• Systems

­ WORQ: Implemented inside Knowledge Cubes (KC)­ S2RDF: State of the art Spark-based RDF engine

• Benchmarks­ WatDiv

­ Dataset: 1 Billion Triple, Query Workload: 5K queries­ Patterns: Covers 100 diverse SPARQL patterns, each containing 50 variations

­ Unbound Property Queries: 500 queries­ LUBM­ Dataset: 1 Billion Triple, Query Workload: 1K queries

­ Patterns: Covers 20 diverse SPARQL patterns­ YAGO­ Dataset: 245 million triples

14

GitHub Homepagehttps://github.com/amgadmadkour/knowledgecubes

Page 15: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Number of Files

15

1.E+00

1.E+01

1.E+02

1.E+03

1.E+04

LUBM 1B WatDiv 1B YAGO2s

Num

. File

s (C

ount

)

Datasets

VP WORQ S2RDF

Page 16: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Data Size on HDFS

16

1.E+00

1.E+01

1.E+02

1.E+03

LUBM 1B WatDiv 1B YAGO2s

Stor

age

Size

(G

B)

Datasets

VP WORQ S2RDF

Page 17: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Preprocessing Time

17

0.E+00

5.E+03

1.E+04

2.E+04

2.E+04

3.E+04

3.E+04

LUBM 1B WatDiv 1B YAGO2sPrep

roce

ssin

g Ti

me

(sec

)

Datasets

VP WORQ S2RDF

Page 18: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Query Execution PerformanceWorkload Generators

18

0

10

20

30

WatDiv LUBM

Exec

utio

n Ti

me

(sec

)

Datasets

WORQ S2RDF

0

5

10

15

20

WatDiv LUBM

Exec

utio

n Ti

me

(hou

rs)

Datasets

WORQ S2RDF

Mean Execution Time Total Execution Time

5000 queries over WatDiv (1 Billion triples) and1000 queries over LUBM (1 Billion triples)

Page 19: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Query Execution PerformanceQuery Patterns

19

1.E+02

1.E+03

1.E+04

1.E+05

1 3 5 7 9 11 13 15 17 19

Exec

utio

n Ti

me

(ms)

Query Patterns

WORQ S2RDF

5.E+02

5.E+03

5.E+04

1 3 5 7 9 11 13 15 17 19

Exec

utio

n Ti

me

(ms)

Query Patterns

WORQ S2RDF

WatDiv 1 Billion dataset LUBM 1 Billion dataset

Page 20: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Query Execution PerformanceQuery Patterns

20

400

4000

40000

2 3 4 5 6 7 8 9 10 11 12

Mea

n Ex

ecut

ion

Tim

e (m

s)

Number of query triples

WORQ S2RDF

500

5000

50000

2 3 4 5 6 7 8 9Mea

n Ex

ecut

ion

Tim

e (m

s)

Number of joins

WORQ S2RDF

Mean execution time over WatDiv 1 Billion

Page 21: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Query Execution PerformanceWorkload-Driven Partitioning

21

5000

5500

6000

6500

7000

WatDiv LUBMExec

utio

n Ti

me

(ms)

Datasets

Workload-driven Static

Page 22: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Query Execution PerformanceCaching

22

0.E+002.E+034.E+036.E+038.E+031.E+041.E+041.E+04

120

140

160

180

110

0112

0114

0116

0118

0120

0122

0124

0126

0128

0130

0132

0134

0136

0138

0140

0142

0144

0146

0148

01

Mem

ory

Usa

ge (

MB)

Timeline

Caching Results Caching Reductions

Page 23: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Performance of Unbound-Property Queries

23

System BSO-Mean BSO-Sum BS-Mean BS-Sum BO-Mean BO-Sum

WORQ 1.25 ms 10.49 min 4.18 ms 34.84 min 3.52 ms 29.34 min

RDF-Table 5.3 ms 44.44 min 3.80 ms 31.67 min 4.35 ms 36.26 min

(BSO) Bound Subject and Object(BS) Bound Subject(BO) Bound Object

Page 24: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

Conclusion

• WORQ is an online method for computing reductions of RDF data using Bloom filters

• WORQ is a method for workload-driven partitioning that minimizes the network shuffling overhead

• WORQ demonstrates how caching reductions can boost the query performance

• WORQ helps answer RDF queries with unbound properties efficiently

24

Page 25: WORQ: Workload-Driven RDF Query Processing · Proposal •We present onlinemethod for computingreductionsof RDF data using Bloom filters •We present workload-driven partitioning

25

Thank You !