An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce

Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu

COUL – Semantic COmpUting research Lab

Outline
- Introduction
- Background
  - MapReduce, Pig and Join Processing
  - RDF Graph Pattern Matching in Pig
- Approach
  - TripleGroup data model and Nested TripleGroup Algebra (NTGA)
  - Comparing NTGA-based plans and Pig Latin plans for graph pattern matching queries
- Evaluation
- Related Work
- Conclusion and Future Work

Basics: MapReduce
- Large-scale processing of data on a cluster of commodity-grade machines
- Users encode a task as map / reduce functions, which are executed in parallel across the cluster
- Apache Hadoop* – open-source implementation

Key Terms
- Hadoop Distributed File System (HDFS)
- Slave nodes / Task Tracker – Mappers (Reducers) execute the map (reduce) function
- Master node / Job Tracker – manages and assigns tasks to Mappers / Reducers

* http://hadoop.apache.org/

Data Processing on Hadoop
- Supports partition parallelism
- Each MR cycle incurs I/O and communication costs

[Figure: the Job Tracker coordinates Mapper1 ... MapperN (map()) and Reducer1 ... ReducerM (reduce()). Stages: Input (HDFS reads), Map exec (local writes to disk), Sort / Shuffle (remote reads), Reduce exec, Output (HDFS writes). Example key-value flow: (k1, v1), (k1, v2), (k1, v3) are grouped into (k1, {v1, v2, v3}) and reduced to (k1, val).]
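The map, sort/shuffle and reduce stages of one MR cycle can be imitated in memory. The following is only a minimal sketch in plain Python, not Hadoop code: `run_mapreduce`, `count_map` and `count_reduce` are hypothetical names, and there is no disk or network I/O, which is exactly the cost a real cycle pays.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MR cycle in memory: map -> sort/shuffle -> reduce."""
    shuffled = defaultdict(list)
    # Map phase: each input record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            # Shuffle: all values with the same key reach the same reducer.
            shuffled[key].append(value)
    # Reduce phase: one reduce call per key, keys visited in sorted order.
    return [reduce_fn(key, values) for key, values in sorted(shuffled.items())]

def count_map(triple):
    """Emit (Subject, 1) for every triple."""
    subject, _prop, _obj = triple
    yield subject, 1

def count_reduce(key, values):
    """(k1, {v1, v2, ...}) -> (k1, val): here val is the number of triples."""
    return key, sum(values)

triples = [
    ("&V1", "label", "Vendor1"),
    ("&V1", "country", "US"),
    ("&Offer1", "price", "108"),
]
result = run_mapreduce(triples, count_map, count_reduce)
print(result)
```

Here the key is the Subject of each triple, so the reduce step sees all triples of one Subject together.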

Joins in MapReduce
- Map phase – scan input records; the map function annotates each record based on the join column, e.g. (joinKey, Record)
- Reduce phase – records with the same joinKey are collected by the same reduce task; the reduce function joins the tuples; output is written into HDFS

[Figure: Single Join Workload]
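A reduce-side join along these lines might look as follows. This is a hedged in-memory sketch, not Hadoop or Pig code: the function names and the relation tags "A" and "B" are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(relation_name, records, join_col):
    """Annotate each record with its join key and its source relation."""
    for rec in records:
        yield rec[join_col], (relation_name, rec)

def reduce_phase(tagged_records):
    """Join the records of relations 'A' and 'B' that share one join key."""
    left = [r for tag, r in tagged_records if tag == "A"]
    right = [r for tag, r in tagged_records if tag == "B"]
    return [a + b for a in left for b in right]

def reduce_side_join(rel_a, col_a, rel_b, col_b):
    # Stand-in for the shuffle: collect tagged records per join key.
    groups = defaultdict(list)
    for key, tagged in map_phase("A", rel_a, col_a):
        groups[key].append(tagged)
    for key, tagged in map_phase("B", rel_b, col_b):
        groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_phase(groups[key]))
    return joined

# Two single-property relations taken from the vendor example, joined on Subject.
homepage = [("&V1", "homepage", "www.ven...")]
label = [("&V1", "label", "Vendor1")]
joined = reduce_side_join(homepage, 0, label, 0)
print(joined)
```

Each such join needs its own shuffle, which is why a plan with six joins costs six MR cycles.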

Data Processing in Pig
- Express data flow using high-level query primitives: usability, code reuse, automatic optimization
- Pig Latin
  - Data model: atom, tuple, bag (nesting)
  - Operators: LOAD, STORE, JOIN, GROUP BY, COGROUP, FOREACH, SPLIT, aggregate functions
  - Ex. Equijoin on REL A (column 0) and REL B (column 1): JOIN A by $0, B by $1;
- Extensibility support via UDFs
- Dataflow is compiled into a workflow of MapReduce jobs

Example Pig Query Plan

SELECT ?vlabel ?hpage ?price ?prod
WHERE {
  ?v homepage ?hpage . ?v label ?vlabel . ?v country ?vcountry .
  ?o vendor ?v . ?o price ?price . ?o delDays ?delDays . ?o product ?prod .
}

[Figure: Pig query plan.
  MR1: A = LOAD Input.rdf FILTER(homepage); B = LOAD Input.rdf FILTER(label); T1 = JOIN A ON Sub, B ON Sub;
  MR2: C = LOAD Input.rdf FILTER(country); T2 = JOIN C ON Sub, T1 ON Sub;
  ...
  MR6: H = LOAD Input.rdf FILTER(product); T3 = JOIN H ON Sub, T7 ON Sub;
  STORE]

- #MR cycles = #Joins = 6
- (I/O & communication costs) * 6
- Loads have I/Os as well
- Expensive!!! (SPLIT Operator)*

Possible Optimizations: m-way Join

SELECT ?vlabel ?hpage ?price ?prod
WHERE {
  ?v homepage ?hpage . ?v label ?vlabel . ?v country ?vcountry .
  ?o vendor ?v . ?o price ?price . ?o delDays ?delDays . ?o product ?prod .
}

[Figure: the query consists of two star patterns and a join between the stars. The plan uses three MR cycles, each reading its input (initially from HDFS) and running map, reduce and a write to disk: MR1 computes star-join SJ1, MR2 computes star-join SJ2, MR3 computes J1, the join between SJ1 and SJ2.]

#MR cycles reduced from 6 to 3

But?
- Still expensive! 1 MR cycle per star-join
- Many pattern matching queries involve multiple star-join subpatterns
  - 50% of BSBM* benchmark queries have two or more star patterns
- Our proposal: coalesce the computation of ALL star-join subpatterns into a single MR cycle
- How? Don't think of them as a set of joins! Think of them as a GROUP BY operation

Sub       Prop            Obj
&V1       type            Vendor
&V1       label           Vendor1
&V1       country         US
&V1       homepage        www.ven...
&Offer1   type            Offer
&Offer1   vendor          &V1
&Offer1   product         &P1
&Offer1   price           108
&Offer1   delDays         2
&Offer1   validToDate     01/01/2011
&Offer1   validFromDate   08/01/2011
&Rev1     type            Review
&Rev1     reviewFor       &P1
&Rev1     rating1         9
&Rev1     reviewer        &R1

WHERE {
  ?v homepage ?hpage . ?v label ?vlabel . ?v country ?vcountry .
  ?o vendor ?v . ?o price ?price . ?o delDays ?delDays . ?o product ?prod .
}

GROUP BY Subject: 1 MapReduce cycle!!!
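The idea of computing all star subpatterns with one grouping pass can be sketched as below. This is a minimal in-memory illustration under stated assumptions, not the RAPID+ implementation: the names `STARS`, `vendor_star` and `offer_star` are hypothetical, and a dictionary stands in for the shuffle.

```python
from collections import defaultdict

triples = [
    ("&V1", "type", "Vendor"), ("&V1", "label", "Vendor1"),
    ("&V1", "country", "US"), ("&V1", "homepage", "www.ven..."),
    ("&Offer1", "type", "Offer"), ("&Offer1", "vendor", "&V1"),
    ("&Offer1", "product", "&P1"), ("&Offer1", "price", "108"),
    ("&Offer1", "delDays", "2"),
    ("&Rev1", "type", "Review"), ("&Rev1", "reviewFor", "&P1"),
]

# Property sets of the two star subpatterns in the example query.
STARS = {
    "vendor_star": {"homepage", "label", "country"},
    "offer_star": {"vendor", "price", "delDays", "product"},
}
relevant = set().union(*STARS.values())

# "Map": keep triples whose Property occurs in the query, keyed by Subject.
groups = defaultdict(list)
for s, p, o in triples:
    if p in relevant:
        groups[s].append((s, p, o))

# "Reduce": one look at each Subject group decides which star it matches.
matches = {}
for subject, group in groups.items():
    props = {p for _, p, _ in group}
    for star, required in STARS.items():
        if required <= props:
            matches[subject] = star

print(matches)
```

Both stars are found with a single grouping on Subject, where the join-based plan needs one cycle per star.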

What are we proposing?

A new data model (TripleGroup) and algebra (Nested TripleGroup Algebra - NTGA) for more efficient graph pattern matching on MapReduce platforms

Our Approach: RAPID+
- Goal: minimize I/O and communication costs by reducing MR cycles
- Reinterpret and refactor operations into a more suitable (coalesced) set of operators – NTGA
- Foundation: re-interpreting multiple star-joins as a grouping operation leads to "groups of triples" (TripleGroups) instead of n-tuples
  - different structure BUT "content equivalent"
- NTGA: algebra on TripleGroups

NTGA – Data Model
- Data model based on nested TripleGroups
- More naturally captures graphs
- TripleGroup – group of triples sharing the Subject / Object component
- Can be nested at the Object component

TripleGroup:
{(&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2)}

Nested TripleGroup:
{(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.vendors….)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
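One way to picture this data model in code is a small immutable container whose triples may carry another container in the Object position. The class below is a hypothetical Python stand-in chosen for illustration; the slides do not show how RAPID+ represents TripleGroups internally.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class TripleGroup:
    """A group of triples (hypothetical stand-in for the NTGA data model)."""
    triples: Tuple["Triple", ...]

    def subjects(self):
        """Subjects that occur at the top level of this group."""
        return {s for s, _, _ in self.triples}

# The Object of a triple is either an atomic value or a nested TripleGroup.
Triple = Tuple[str, str, Union[str, TripleGroup]]

vendor_tg = TripleGroup((
    ("&V1", "label", "vendor1"),
    ("&V1", "country", "US"),
    ("&V1", "homepage", "www.ven..."),
))
offer_tg = TripleGroup((
    ("&Offer1", "price", "108"),
    ("&Offer1", "vendor", vendor_tg),  # nested at the Object component
    ("&Offer1", "product", "&P1"),
    ("&Offer1", "delDays", "2"),
))

# Properties whose Object is itself a TripleGroup.
nested_props = [p for _, p, o in offer_tg.triples if isinstance(o, TripleGroup)]
print(nested_props)
```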

NTGA Operators…(1)

TG_Unnest – unnest a nested TripleGroup

Input:
{(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}

Output of TG_Unnest:
{(&Offer1, price, 108), (&Offer1, vendor, &V1), (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..), (&Offer1, product, &P1), (&Offer1, delDays, 2)}

TG_Flatten – generate the equivalent n-tuple

Input (triples t1, t2, t3):
{(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven...)}

Output of TG_Flatten:
(&V1, label, vendor1, &V1, country, US, &V1, homepage, www.ven...)

The TripleGroup and the n-tuple are "content equivalent".
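The behaviour of the two operators on the slide's example can be sketched in a few lines. This is an illustrative sketch only: a TripleGroup is modelled as a Python list of triples, a nested Object as an inner list, and the functions assume a nested group is non-empty and shares a single Subject.

```python
def tg_unnest(tg):
    """Replace each nested Object by its Subject and splice in its triples."""
    out = []
    for s, p, o in tg:
        if isinstance(o, list):              # Object is a nested TripleGroup
            inner = tg_unnest(o)
            out.append((s, p, inner[0][0]))  # link to the nested group's Subject
            out.extend(inner)
        else:
            out.append((s, p, o))
    return out

def tg_flatten(tg):
    """Concatenate the triples of a flat TripleGroup into one n-tuple."""
    return tuple(component for triple in tg for component in triple)

vendor = [("&V1", "label", "vendor1"), ("&V1", "country", "US"),
          ("&V1", "homepage", "www.ven..")]
nested = [("&Offer1", "price", "108"), ("&Offer1", "vendor", vendor),
          ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", "2")]

flat = tg_unnest(nested)
ntuple = tg_flatten(vendor)
print(flat)
print(ntuple)
```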

NTGA Operators…(2)

TG_GroupFilter – retain only TripleGroups that satisfy the required query substructure
- Structure-based filtering
- Eliminates TripleGroups with missing triples (edges)

Input TG:
{ { (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..) },
  { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) },
  { (&Offer2, vendor, &V2), (&Offer2, product, &P3), (&Offer2, delDays, 1) } }

Output of TG_GroupFilter(TG, {price, vendor, delDays, product}), labelled TG{price, vendor, delDays, product}:
{ (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) }
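Structure-based filtering amounts to a subset test on the properties of each group. The sketch below assumes a TripleGroup is a list of (Subject, Property, Object) tuples; `tg_group_filter` is a hypothetical name for illustration.

```python
def tg_group_filter(triple_groups, required_props):
    """Keep only the TripleGroups that contain every required property."""
    required = set(required_props)
    return [tg for tg in triple_groups
            if required <= {p for _, p, _ in tg}]

vendor1 = [("&V1", "label", "vendor1"), ("&V1", "country", "US"),
           ("&V1", "homepage", "www.ven..")]
offer1 = [("&Offer1", "price", "108"), ("&Offer1", "vendor", "&V1"),
          ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", "2")]
offer2 = [("&Offer2", "vendor", "&V2"), ("&Offer2", "product", "&P3"),
          ("&Offer2", "delDays", "1")]  # no price triple: a missing edge

kept = tg_group_filter([vendor1, offer1, offer2],
                       {"price", "vendor", "delDays", "product"})
print(kept)
```

&Offer2 is dropped because it has no price triple, and the vendor group because it matches a different star.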

NTGA Operators…(3)

TG_Filter – eliminate TripleGroups containing triples that do not satisfy the filter condition (FILTER clause)
- Value-based filtering

Example: TG_Filter with condition price < 200, applied to TG

Input TG{price, vendor, delDays, product}:
{ { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) },
  { (&Offer3, vendor, &V2), (&Offer3, product, &P3), (&Offer3, price, 306), (&Offer3, delDays, 1) } }
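Value-based filtering on the example above can be sketched as follows. Two choices here are assumptions made for the illustration rather than statements from the slides: prices are stored as integers, and a group with no triple for the filtered property is dropped.

```python
def tg_filter(triple_groups, prop, predicate):
    """Keep TripleGroups whose values for `prop` all satisfy `predicate`."""
    kept = []
    for tg in triple_groups:
        values = [o for _, p, o in tg if p == prop]
        # Assumption: a group without the property cannot satisfy the filter.
        if values and all(predicate(v) for v in values):
            kept.append(tg)
    return kept

offer1 = [("&Offer1", "price", 108), ("&Offer1", "vendor", "&V1"),
          ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", 2)]
offer3 = [("&Offer3", "vendor", "&V2"), ("&Offer3", "product", "&P3"),
          ("&Offer3", "price", 306), ("&Offer3", "delDays", 1)]

cheap = tg_filter([offer1, offer3], "price", lambda price: price < 200)
print(cheap)
```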

NTGA Operators…(4)

TG_Join – join between TripleGroups of different structure, based on join triple patterns

Join triple patterns: ?o vendor ?v and ?v country ?vcountry

Input TG{label, country, homepage}:
{(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven...)}

Output of TG_Join:
{(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
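A join that nests the matching vendor group at the Object of the vendor triple might be sketched like this. It is an in-memory illustration under stated assumptions: groups are lists of triples, every right-hand group shares one Subject, and `tg_join` is a hypothetical name, not the RAPID+ operator itself.

```python
def tg_join(left_groups, join_prop, right_groups):
    """Nest the right-hand group whose Subject equals the join Object."""
    # Index the right-hand TripleGroups by their shared Subject.
    by_subject = {tg[0][0]: tg for tg in right_groups if tg}
    joined = []
    for tg in left_groups:
        targets = [o for _, p, o in tg if p == join_prop]
        # Keep the group only if every join Object has a partner group.
        if targets and all(o in by_subject for o in targets):
            joined.append([
                (s, p, by_subject[o]) if p == join_prop else (s, p, o)
                for s, p, o in tg
            ])
    return joined

vendor1 = [("&V1", "label", "vendor1"), ("&V1", "country", "US"),
           ("&V1", "homepage", "www.ven..")]
offer1 = [("&Offer1", "price", "108"), ("&Offer1", "vendor", "&V1"),
          ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", "2")]

result = tg_join([offer1], "vendor", [vendor1])
print(result)
```

The output keeps the Offer triples and carries the vendor TripleGroup inside them, matching the nested result shown on the slide.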

Pattern Matching using NTGA in Pig

[Figure: three-step pipeline over the example data.
1. LoadFilter (load + TG_Filter): from the input triples (Subject, Property, Object) only the triples with the properties label, country, homepage, vendor, product, price and delDays remain; the type, validToDate and validFromDate triples are dropped.
2. StarGroupFilter (TG_GroupBy + TG_GroupFilter): produces the TripleGroups
   { { (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..) },
     { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) } }
3. RDFJoin (TG_Join): produces the nested TripleGroup
   {(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}]

Mapping to Pig Latin/Relational Algebra

RDFMap: Efficient Data Representation

Compact representation of intermediate results during TripleGroup based processing

Efficient look-up of triples matching a given Property type via property-based indexing scheme

Ability to represent structure-label information for groups of triples.
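The three properties listed above suggest a map keyed by Property with the Subject stored once. The class below is a guess at that shape for illustration only; the slides do not describe RDFMap's actual layout, and `RDFMapSketch` and its methods are hypothetical names.

```python
from collections import defaultdict

class RDFMapSketch:
    """Hypothetical stand-in: one Subject, Objects indexed by Property."""

    def __init__(self, subject, label=None):
        self.subject = subject    # stored once instead of once per triple
        self.label = label        # structure label, e.g. the property set
        self._by_prop = defaultdict(list)

    def add(self, prop, obj):
        self._by_prop[prop].append(obj)

    def lookup(self, prop):
        """Return the Objects of all triples with the given Property."""
        return list(self._by_prop.get(prop, []))

    def triples(self):
        """Expand back into (Subject, Property, Object) triples."""
        return [(self.subject, p, o)
                for p, objs in self._by_prop.items() for o in objs]

offer = RDFMapSketch("&Offer1", label="TG{price, vendor, delDays, product}")
offer.add("price", "108")
offer.add("vendor", "&V1")
print(offer.lookup("price"))
print(offer.triples())
```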


Evaluation
- Setup: 5-node to 25-node Hadoop clusters on NCSU's Virtual Computing Lab*
- Dataset: synthetic benchmark dataset generated using the BSBM** tool (max. 40GB data, approx. 175 million triples)
- Evaluation of Pig (Pig_opt) vs. RAPID+
  - Task 1 – Scalability with size of RDF graphs
  - Task 2 – Scalability with denser star patterns
  - Task 3 – Scalability with increasing cluster sizes

* https://vcl.ncsu.edu
** http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/

Experimental Results…(1)
Cost Analysis across Increasing Size of RDF Graphs (5-node)

Key Observations:
- Benefit of TripleGroup-based processing seen across data sizes: up to 60% in some cases (RAPID+ << Pig_opt < Pig)
- Pig approaches did not complete for large data size

Experimental Results…(2)
Cost Analysis across Increasing Star Density

Query   #Triple Patterns   #Edges in Stars   %gain
Q1      3                  1:2               56.8
Q2      4                  2:2               46.7
Q3      5                  2:3               47.8
Q4      6                  3:3               51.6
Q5      7                  3:4               57.4
Q6      8                  4:4               58.4
Q7      9                  5:4               58.6
Q8      10                 6:4               57.3
Q9*     6                  2:4               65.4
Q10*    10                 2:4:4             61.5

%gain of RAPID+ over Pig (10-node / 32GB)
(5-node / 20GB)

Key Observations:
- RAPID+ maintains a consistent %gain of 50% across the varying density
- Cost savings from eliminating redundant Subject values and join triples

Experimental Results…(3)
Cost Analysis across Increasing Cluster Sizes

Query pattern with three star-joins and two chain-joins (32GB)

Key Observations:
- RAPID+ has 56% gain over Pig approaches for the 10-node cluster
- Pig approaches catch up with increasing cluster size
  - Increasing the number of nodes decreases the probability of disk spills with the SPLIT approach
- RAPID+ still maintains 45% gain across the experiments

And some Updates…
- Additional evaluation
  - Up to 65% performance gain on another synthetic benchmark dataset* for three/two star-join queries
  - Experiments extended to 1 billion 3-ary triples (43GB): 31% (10-node) to 41% (30-node) performance gain
- RAPID+ now includes a SPARQL interface
- In future: cost-based optimizations to select Pig vs. NTGA execution plans
- Join us for a demo of RAPID+ @ VLDB 2011**

* Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A Comparison of Approaches to Large-scale Data Analysis. In: Proc. of the 35th SIGMOD International Conference on Management of Data (2009)
** Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: The Journey using a Nested TripleGroup Algebra. To appear in: Proc. International Conference on Very Large Data Bases (VLDB 2011)

Related Work

MapReduce-based Processing
- Indexing, Partitioning Schemes, Rule-based Optimizations: [Newman08]*, [Hunter08]*, [Afrati10], [Husain10]*, Hadoop++ [Dittrich10], HadoopDB [Abadi09]
- Reasoning: [Urbani07]*

High-level Dataflow Languages
- Pig Latin [Olston08], [HiveQL], [JAQL]

Other Extensions
- Map-Reduce-Merge [Yang07]

Conclusion
- TripleGroup-based processing for evaluating pattern matching queries on MapReduce platforms
- NTGA operators refactored to minimize #MR cycles and thereby minimize costs
- Reduced costs of repeated data handling via operator coalescing
- Efficient data representation (RDFMap)

References
[Dean04] Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51 (2008) 107–113
[Olston08] Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: a not-so-foreign language for data processing. In: Proc. International Conference on Management of Data. (2008)
[Abadi09] Abouzied, A., Bajda-Pawlikowski, K., Huang, J., Abadi, D.J., Silberschatz, A.: HadoopDB in action: building real world applications. In: Proc. International Conference on Management of Data. (2010)
[Sridhar09] Sridhar, R., Ravindra, P., Anyanwu, K.: RAPID: Enabling scalable ad-hoc analytics on the semantic web. ISWC 2009
[Yu08] Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, U., Gunda, P.K., Currey, J.: DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. OSDI 2008
[Newman08] Newman, A., Li, Y.F., Hunter, J.: Scalable semantics: The silver lining of cloud computing. In: eScience. IEEE International Conference on. (2008)
[Hunter08] Newman, A., Hunter, J., Li, Y., Bouton, C., Davis, M.: A scale-out RDF molecule store for distributed processing of biomedical data. In: Semantic Web for Health Care and Life Sciences Workshop. (2008)
[Urbani07] Urbani, J., Kotoulas, S., Oren, E., Harmelen, F.: Scalable distributed reasoning using MapReduce. In: Proc. International Semantic Web Conference. (2009)
[Abadi07] Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable Semantic Web Data Management Using Vertical Partitioning. VLDB 2007
[Dittrich10] Dittrich, J., Quiane-Ruiz, J., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). VLDB 2010/PVLDB
[Yang07] Yang, H., Dasdan, A., Hsiao, R., Parker Jr., D.S.: Map-reduce-merge: simplified relational data processing on large clusters. SIGMOD 2007
[Afrati10] Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proc. International Conference on Extending Database Technology. (2010)
[Husain10] Husain, M., Khan, L., Kantarcioglu, M., Thuraisingham, B.: Data intensive query processing for large RDF graphs using cloud computing tools. In: Cloud Computing (CLOUD), IEEE International Conference on. (2010)
[HiveQL] http://hadoop.apache.org/hive/
[JAQL] http://code.google.com/p/jaql

Thank You!

Environment
- Node specifications
  - Single / duo core Intel X86
  - 2.33 GHz processor speed
  - 4G memory
  - Red Hat Linux
- Pig 0.5.0
- Hadoop 0.20
  - Block size 256MB

Benchmark Data*
- Log files of HTTP server traffic
- Column-delimited text file
- Rankings: pageRank | PageURL | avgDuration
- UserVisits: sourceIPAddr | destinationURL | visitDate | adRevenue | UserAgent | cCode | lCode | sKeyword | avgTimeOnSite

* Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A Comparison of Approaches to Large-scale Data Analysis. In: Proc. of the 35th SIGMOD International Conference on Management of Data (2009)

Scripts (Q1)

-- Load the triples and keep only the properties used by the query.
A = load '/data/' using PigStorage(' ');
A1 = filter A by $1 eq 'pageRank' or $1 eq 'pageURL' or $1 eq 'destURL' or $1 eq 'srcIP' or $1 eq 'adRevenue' or ($1 eq 'type' and ($2 eq 'Ranking' or $2 eq 'UserVisits'));
-- Group by Subject ($0); ReassembleRDF is a UDF whose definition is not shown on the slide.
B = group A1 by $0 PARALLEL 5;
C = foreach B generate flatten(ReassembleRDF($1, 'pageURL|destURL', '1'));
D = group C by $0 PARALLEL 5;
E = foreach D generate flatten(ReassembleRDF($1, 'srcIP|', '2')) as (srcIP:chararray, vals:bytearray);
store E into '/q1_app1';

Scripts (Q1)

-- Load the triples once and SPLIT them into one relation per property.
A1 = load '/data/' using PigStorage(' ');
split A1 into pageRank IF $1 eq 'pageRank',
    srcIP IF $1 eq 'srcIP',
    pageURL IF $1 eq 'pageURL',
    destURL IF $1 eq 'destURL',
    adRevenue IF $1 eq 'adRevenue',
    typeRanking IF $1 eq 'type' and $2 eq 'Ranking',
    typeUV IF $1 eq 'type' and $2 eq 'UserVisits';
-- One m-way join per star pattern, then the join between the two stars.
Ranking = join pageURL by $0, pageRank by $0, typeRanking by $0 PARALLEL 5;
UserVisits = join srcIP by $0, destURL by $0, adRevenue by $0, typeUV by $0 PARALLEL 5;
C1 = join Ranking by $2, UserVisits by $5 PARALLEL 5;
D1 = foreach C1 generate $11, $17, $5;
store D1 into '/q1_app2';

Experiment Results

\[ \text{Percentage Performance Gain} = \frac{(\text{exec time 1}) - (\text{exec time 2})}{(\text{exec time 1})} \]
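As a worked illustration of this ratio, the helper below scales it by 100 to report a percentage. The scaling factor and the sample times are assumptions made for the example, not values from the experiments.

```python
def percentage_gain(exec_time_1, exec_time_2):
    """Relative saving of run 2 over run 1, expressed in percent."""
    if exec_time_1 <= 0:
        raise ValueError("exec_time_1 must be positive")
    return 100.0 * (exec_time_1 - exec_time_2) / exec_time_1

# Hypothetical timings in seconds: the baseline takes 200 s, the other plan 100 s.
gain = percentage_gain(200.0, 100.0)
print(gain)
```

A gain of 50 therefore means the second plan finished in half the time of the first.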

Possible Optimizations (2)
- Coalesce join operations into as few MR cycles as possible
- Compute star patterns via m-way JOIN
  - Star-join using m-way JOIN = 1 MR cycle
- Reduced #MR cycles
- Reduced I/O + communication costs

Structured Data Processing in Pig

UserVisits
srcIP          destURL   visitDate    adRevenue   …
158.112.27.3   url1      1979/12/12   339.08142   …
158.112.27.3   url5      1979/12/15   180.334     …
150.121.18.6   url1      1979/12/28   550.7889    …
…              …         …            …           …

Ranking
pageRank   pageURL   avgDur
11         url1      96
23         url2      3
18         url3      87
…          …         …

Query: Retrieve the pageRank and adRevenue of pages visited by particular users between "1979/12/01" and "1979/12/30"

[Figure: query plan. LOAD UserVisits; LOAD Ranking; FILTER(visitDate); JOIN UserVisits ON destURL, Ranking ON pageURL; STORE]

JOIN: Pig Latin to MapReduce

JOIN UserVisits ON destURL, Ranking ON pageURL;

[Figure: the map phase annotates the tuples of UserVisits and Ranking based on the join key (url1, url2); the annotated tuples are packaged and routed by key, url1 to Reducer 1 and url2 to Reducer 2; the reduce phase emits the joined tuples (…, srcIP, destURL, pageURL, pageRank, …).]

RDF Data Model (Resource Description Framework)
- Statements (triples) and their graph representation
- A statement has the form (Subject, Prop, Object), e.g. (&UV1, srcIP, 158.112.27.3)

Ranking
Sub    Prop          Obj
&R1    type          Ranking
&R1    pageRank      11
&R1    pageURL       url1
&R1    avgDuration   97

UserVisits
Sub    Prop          Obj
&UV1   type          UserVisits
&UV1   srcIP         158.112.27.3
&UV1   destURL       url1
&UV1   adRevenue     339.08142
&UV1   visitDate     1979/12/12
&UV1   userAgent     SCOPE
&UV1   cCode         VNM
&UV1   lCode         VNM-KH
&UV1   sKeyword      comets
&UV1   avgTime       3

Example SPARQL Query

Data: Description of Vendors, their product Offers, and Reviews of products (BSBM* dataset)

Sub       Prop            Obj
&V1       type            Vendor
&V1       label           Vendor1
&V1       country         US
&V1       homepage        www.ven...
&Offer1   type            Offer
&Offer1   vendor          &V1
&Offer1   product         &P1
&Offer1   price           108
&Offer1   delDays         2
&Offer1   validToDate     01/01/2011
&Offer1   validFromDate   08/01/2011
&Rev1     type            Review
&Rev1     reviewFor       &P1
&Rev1     rating1         9
&Rev1     reviewer        &R1

Query: Retrieve the details of US-based Vendors

SELECT ?vlabel ?hpage
WHERE {
  ?v type Vendor . ?v country ?vcountry . ?v label ?vlabel . ?v homepage ?hpage .
  FILTER (?vcountry = "US")
}

* http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
