An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce

Page 1: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce

Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu

COUL – Semantic COmpUting research Lab

Page 2: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Outline
  • Introduction
  • Background
      • MapReduce, Pig and Join Processing
      • RDF Graph Pattern Matching in Pig
  • Approach
      • TripleGroup data model and Nested TripleGroup Algebra (NTGA)
      • Comparing NTGA based plans and Pig Latin plans for graph pattern matching queries
  • Evaluation
  • Related Work
  • Conclusion and Future Work

Page 3: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Basics: MapReduce
  • Large scale processing of data on a cluster of commodity grade machines
  • Users encode task as map / reduce functions, which are executed in parallel across the cluster
  • Apache Hadoop* – open-source implementation
  • Key Terms
      • Hadoop Distributed File System (HDFS)
      • Slave nodes / Task Tracker – Mappers (Reducers) execute the map (reduce) function
      • Master node / Job Tracker – manages and assigns tasks to Mappers / Reducers

* http://hadoop.apache.org/
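The map / reduce contract described above can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop code: `run_mapreduce`, `count_map` and `count_reduce` are hypothetical names, and the shuffle is simulated with an in-memory dictionary. The example counts triples per Subject.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MR cycle in a single process: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):      # map phase emits (key, value) pairs
            shuffled[key].append(value)        # shuffle: same key -> same group
    output = []
    for key in sorted(shuffled):               # reduce phase, one call per key
        output.extend(reduce_fn(key, shuffled[key]))
    return output

def count_map(triple):
    subject, prop, obj = triple
    yield subject, 1

def count_reduce(subject, ones):
    yield subject, sum(ones)

triples = [
    ("&V1", "label", "Vendor1"),
    ("&V1", "country", "US"),
    ("&Offer1", "vendor", "&V1"),
    ("&Offer1", "price", "108"),
]
result = run_mapreduce(triples, count_map, count_reduce)
print(result)
```

On a real cluster the mappers and reducers run on different machines, and the shuffle step is where the local writes and remote reads occur.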

Page 4: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Data Processing on Hadoop
  • Supports Partition Parallelism
  • Each MR cycle incurs I/O and communication costs

[Figure: MapReduce data flow coordinated by the Job Tracker. Input → Mapper1 … MapperN (map()) → Sort / Shuffle → Reducer1 … ReducerM (reduce()) → Output. Cost labels on the flow: HDFS Reads, Map exec, Local Writes (to Disk), Remote Reads, Reduce exec, HDFS Writes. Key/value example: (k1, v1), (k1, v2), (k1, v3) are collected as (k1, {v1, v2, v3}) and reduced to (k1, val).]

Page 5: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Joins in MapReduce
  • Map phase – scan input records
      • map func. annotates each record based on join column, e.g. (joinKey, Record)
  • Reduce phase – records with same joinKey collected by same reduce task
      • reduce func. joins the tuples
      • Output written into HDFS

[Figure: Single Join Workload]
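A minimal sketch of the join described above, written in plain Python rather than against the Hadoop API. The function names and the relation tags "A" and "B" are assumptions for illustration, and the `&V2` triple is made-up data added to show a non-matching key.

```python
from collections import defaultdict
from itertools import product

def map_phase(tagged_relations, key_index=0):
    """Annotate each record as (joinKey, (relation tag, record))."""
    for tag, records in tagged_relations.items():
        for record in records:
            yield record[key_index], (tag, record)

def reduce_phase(annotated, left, right):
    """Collect records per joinKey, then pair left and right records."""
    groups = defaultdict(lambda: defaultdict(list))
    for key, (tag, record) in annotated:
        groups[key][tag].append(record)
    joined = []
    for key in sorted(groups):
        for l, r in product(groups[key][left], groups[key][right]):
            joined.append(l + r)               # concatenate the matching tuples
    return joined

A = [("&V1", "homepage", "www.ven...")]
B = [("&V1", "label", "Vendor1"),
     ("&V2", "label", "Vendor2")]              # hypothetical, has no match in A

annotated = list(map_phase({"A": A, "B": B}))
joined = reduce_phase(annotated, "A", "B")
print(joined)
```

Each such join is one full MR cycle, which is why the number of joins drives the cost of a query plan.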

Page 6: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Data Processing in Pig
  • Express data flow using high-level query primitives
      • usability, code reuse, automatic optimization
  • Pig Latin
      • Data model: atom, tuple, bag (nesting)
      • Operators: LOAD, STORE, JOIN, GROUP BY, COGROUP, FOREACH, SPLIT, aggr. functions
      • Ex. Equijoin on REL A (column 0) and REL B (column 1): JOIN A by $0, B by $1;
  • Extensibility support via UDFs
  • Dataflow is compiled into a workflow of MapReduce jobs

Page 7: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Example Pig Query Plan

SELECT ?vlabel ?hpage ?price ?prod
WHERE {
  ?v homepage ?hpage .
  ?v label ?vlabel .
  ?v country ?vcountry .
  ?o vendor ?v .
  ?o price ?price .
  ?o delDays ?delDays .
  ?o product ?prod .
}

[Figure: Pig query plan as a chain of MapReduce cycles MR1, MR2, …, MR6]
  MR1: A = LOAD Input.rdf FILTER(homepage); B = LOAD Input.rdf FILTER(label); T1 = JOIN A ON Sub, B ON Sub;
  MR2: C = LOAD Input.rdf FILTER(country); T2 = JOIN C ON Sub, T1 ON Sub;
  …….
  MR6: H = LOAD Input.rdf FILTER(product); T3 = JOIN H ON Sub, T7 ON Sub; STORE

  • #MR cycles = #Joins = 6
  • (I/O & communication costs) * 6
  • Loads have I/Os as well
  • Expensive!!! (SPLIT Operator)*

Page 8: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Possible Optimizations: m-way Join

SELECT ?vlabel ?hpage ?price ?prod
WHERE {
  ?v homepage ?hpage .
  ?v label ?vlabel .
  ?v country ?vcountry .
  ?o vendor ?v .
  ?o price ?price .
  ?o delDays ?delDays .
  ?o product ?prod .
}

[Figure: query plan with m-way joins. Input is read from HDFS and processed in three MapReduce cycles (MR1, MR2, MR3), each a map / reduce pass with results written to Disk: JOIN SJ1 and JOIN SJ2 compute the two star-joins, and JOIN J1 computes the join between stars.]

#MR cycles reduced from 6 to 3
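The cycle counts quoted on the last two slides (6 and 3) follow from simple arithmetic, sketched below. The function names are hypothetical, and the formula for joins between stars assumes that n stars are connected by n - 1 joins, which holds for the two-star example but is not stated in general on the slides.

```python
def cycles_binary_joins(stars):
    """One MR cycle per binary join: (#triple patterns - 1) joins in total."""
    return sum(stars) - 1

def cycles_mway_joins(stars):
    """One MR cycle per star-join, plus one per join between stars.

    Assumes n stars are linked by n - 1 joins (true for the example query).
    """
    return len(stars) + (len(stars) - 1)

# The example query: a star with 3 triple patterns on ?v and one with 4 on ?o.
example = [3, 4]
print(cycles_binary_joins(example))   # 6
print(cycles_mway_joins(example))     # 3
```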

Page 9: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

BUT?
  • Still expensive! 1 MR cycle / star-join
  • Many pattern matching queries involve multiple star-join subpatterns
      • 50% of BSBM* benchmark queries have two or more star patterns
  • Our proposal: Coalesce the computation of ALL star-join subpatterns into a single MR cycle
  • How? Don’t think of them as a set of joins! Think of it as a GROUP BY operation

Page 10: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Sub       Prop            Obj
&V1       type            Vendor
&V1       label           Vendor1
&V1       country         US
&V1       homepage        www.ven...
&Offer1   type            Offer
&Offer1   vendor          &V1
&Offer1   product         &P1
&Offer1   price           108
&Offer1   delDays         2
&Offer1   validToDate     01/01/2011
&Offer1   validFromDate   08/01/2011
&Rev1     type            Review
&Rev1     reviewFor       &P1
&Rev1     rating1         9
&Rev1     reviewer        &R1

WHERE {
  ?v homepage ?hpage .
  ?v label ?vlabel .
  ?v country ?vcountry .
  ?o vendor ?v .
  ?o price ?price .
  ?o delDays ?delDays .
  ?o product ?prod .
}

GROUP BY Subject: 1 MapReduce Cycle!!!
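The grouping idea can be illustrated with a short Python sketch: a single pass that groups triples by Subject yields both star subpatterns at once. `group_by_subject` is a hypothetical helper, and the in-memory dictionary stands in for the shuffle of one MR cycle.

```python
from collections import defaultdict

def group_by_subject(triples):
    """Group triples by their Subject component in a single pass."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[s].append((s, p, o))
    return dict(groups)

# The triples from the table that match the query's properties.
triples = [
    ("&V1", "label", "Vendor1"),
    ("&V1", "country", "US"),
    ("&V1", "homepage", "www.ven..."),
    ("&Offer1", "vendor", "&V1"),
    ("&Offer1", "product", "&P1"),
    ("&Offer1", "price", "108"),
    ("&Offer1", "delDays", "2"),
]
groups = group_by_subject(triples)
for subject, group in groups.items():
    print(subject, [p for _, p, _ in group])
```

The `&V1` group corresponds to the `?v` star and the `&Offer1` group to the `?o` star, with no join executed.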

Page 11: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

What are we proposing?
  • A new data model (TripleGroup) and algebra (Nested TripleGroup Algebra - NTGA) for more efficient graph pattern matching on MapReduce platforms


Page 13: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Our Approach: RAPID+
  • Goal: Minimize I/O and communication costs by reducing MR cycles
  • Reinterpret and refactor operations into a more suitable (coalesced) set of operators – NTGA
  • Foundation: Re-interpreting multiple star-joins as a grouping operation leads to “groups of Triples” (TripleGroups) instead of n-tuples
      • different structure BUT “content equivalent”
  • NTGA – algebra on TripleGroups

Page 14: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

NTGA – Data Model
  • Data model based on nested TripleGroups
      • More naturally capture graphs
  • TripleGroup – groups of triples sharing Subject / Object component
      • Can be nested at the Object component

TripleGroup:
{(&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2)}

Nested TripleGroup:
{(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.vendors….)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}

Page 15: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

NTGA Operators…(1)

TG_Unnest – unnest a nested TripleGroup

  Input:
  {(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}

  Output of TG_Unnest:
  {(&Offer1, price, 108), (&Offer1, vendor, &V1), (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..), (&Offer1, product, &P1), (&Offer1, delDays, 2)}

TG_Flatten – generate equivalent n-tuple (“Content Equivalence”)

  Input (triples t1, t2, t3):
  {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven...)}

  Output of TG_Flatten:
  (&V1, label, vendor1, &V1, country, US, &V1, homepage, www.ven...)
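The two operators can be mimicked in Python to make their effect concrete. This is an illustrative sketch, not the RAPID+ implementation: a TripleGroup is modeled as a list of (Subject, Property, Object) tuples, a nested TripleGroup as a list in the Object position, and the function names are assumptions.

```python
def tg_unnest(triplegroup):
    """Replace a nested TripleGroup by its join triple followed by its triples."""
    flat = []
    for s, p, o in triplegroup:
        if isinstance(o, list):                 # nested at the Object component
            inner = tg_unnest(o)
            flat.append((s, p, inner[0][0]))    # e.g. (&Offer1, vendor, &V1)
            flat.extend(inner)
        else:
            flat.append((s, p, o))
    return flat

def tg_flatten(triplegroup):
    """Concatenate the triples of a (non-nested) TripleGroup into one n-tuple."""
    return tuple(component for triple in triplegroup for component in triple)

vendor = [("&V1", "label", "vendor1"), ("&V1", "country", "US"),
          ("&V1", "homepage", "www.ven..")]
nested = [("&Offer1", "price", "108"), ("&Offer1", "vendor", vendor),
          ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", "2")]

unnested = tg_unnest(nested)
flattened = tg_flatten(vendor)
print(unnested)
print(flattened)
```

The sketch assumes every nested group is non-empty and that all of its triples share one Subject, as in the slide example.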

Page 16: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

NTGA Operators…(2)
  • TG_GroupFilter – retain only TripleGroups that satisfy the required query substructure
      • Structure-based filtering

TG_GroupFilter(TG, {price, vendor, delDays, product})

  Input TG:
  { { (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..) },
    { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) },
    { (&Offer2, vendor, &V2), (&Offer2, product, &P3), (&Offer2, delDays, 1) } }

  Output TG{price, vendor, delDays, product}:
  { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) }

Eliminate TripleGroups with missing triples (edges)
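A structure-based filter of this kind reduces to a subset test on property names, as the following Python sketch shows. `tg_group_filter` is a hypothetical function; it keeps a group whenever all required properties are present and, as a simplification, does not remove extra triples from that group.

```python
def tg_group_filter(triplegroups, required_props):
    """Keep only TripleGroups that contain every required property (edge)."""
    required = set(required_props)
    return [tg for tg in triplegroups
            if required <= {p for _, p, _ in tg}]

tg = [
    [("&V1", "label", "vendor1"), ("&V1", "country", "US"),
     ("&V1", "homepage", "www.ven..")],
    [("&Offer1", "price", 108), ("&Offer1", "vendor", "&V1"),
     ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", 2)],
    [("&Offer2", "vendor", "&V2"), ("&Offer2", "product", "&P3"),
     ("&Offer2", "delDays", 1)],               # no price edge, so it is dropped
]
kept = tg_group_filter(tg, {"price", "vendor", "delDays", "product"})
print(kept)
```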

Page 17: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

NTGA Operators…(3)
  • TG_Filter – filter out triples that do not satisfy the filter condition (FILTER clause)
      • Value-based filtering

TG_Filter price<200 (TG)

  Input TG{price, vendor, delDays, product}:
  { { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) },
    { (&Offer3, vendor, &V2), (&Offer3, product, &P3), (&Offer3, price, 306), (&Offer3, delDays, 1) } }

  Output TG{price, vendor, delDays, product}:
  { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) }

Eliminate TripleGroups with triples that do not satisfy filter condition
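The value-based filter can be sketched in Python as follows. `tg_filter` and its parameters are assumptions for illustration; the sketch drops a whole TripleGroup when any triple of the filtered property fails the condition, matching the slide example, and keeps groups that have no triple for that property.

```python
def tg_filter(triplegroups, prop, predicate):
    """Drop TripleGroups in which a `prop` triple fails the filter condition."""
    kept = []
    for tg in triplegroups:
        if all(predicate(o) for _, p, o in tg if p == prop):
            kept.append(tg)
    return kept

tg = [
    [("&Offer1", "price", 108), ("&Offer1", "vendor", "&V1"),
     ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", 2)],
    [("&Offer3", "vendor", "&V2"), ("&Offer3", "product", "&P3"),
     ("&Offer3", "price", 306), ("&Offer3", "delDays", 1)],
]
cheap = tg_filter(tg, "price", lambda value: value < 200)
print(cheap)
```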

Page 18: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

NTGA Operators…(4)
  • TG_Join – join between different structure TripleGroups based on join triple patterns

TG_Join on join triple patterns: ?o vendor ?v and ?v country ?vcountry

  Input TG{price, vendor, delDays, product}:
  { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) }

  Input TG{label, country, homepage}:
  { (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven...) }

  Output:
  {(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
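A rough Python sketch of this join, under the assumption that the join is Object-to-Subject (the Object of the `vendor` triple matches the Subject of the other TripleGroup). `tg_join` is a hypothetical name and the dictionary lookup stands in for the distributed join.

```python
def tg_join(left_groups, right_groups, join_prop):
    """Nest each matching right TripleGroup at the Object of the join triple."""
    by_subject = {tg[0][0]: tg for tg in right_groups if tg}
    joined = []
    for tg in left_groups:
        targets = [o for _, p, o in tg if p == join_prop]
        if targets and all(o in by_subject for o in targets):
            joined.append([(s, p, by_subject[o]) if p == join_prop else (s, p, o)
                           for s, p, o in tg])
    return joined

offers = [[("&Offer1", "price", 108), ("&Offer1", "vendor", "&V1"),
           ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", 2)]]
vendors = [[("&V1", "label", "vendor1"), ("&V1", "country", "US"),
            ("&V1", "homepage", "www.ven..")]]

joined = tg_join(offers, vendors, "vendor")
print(joined)
```

The result is a nested TripleGroup, so no Subject value or join triple needs to be repeated in the output.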

Page 19: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Pattern Matching using NTGA in Pig

Input triples:

Subject   Property        Object
&V1       type            VENDOR
&V1       label           Vendor1
&V1       country         US
&V1       homepage        www.ven...
&Offer1   type            OFFER
&Offer1   vendor          &V1
&Offer1   product         &P1
&Offer1   price           108
&Offer1   delDays         2
&Offer1   validToDate     01/01/2011
&Offer1   validFromDate   08/01/2011

LoadFilter (load + TG_Filter):

Subject   Property   Object
&V1       label      Vendor1
&V1       country    US
&V1       homepage   www.ven..
&Offer1   vendor     &V1
&Offer1   product    &P1
&Offer1   price      108
&Offer1   delDays    2

StarGroupFilter (TG_GroupBy + TG_GroupFilter):

{ { (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..) },
  { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) } }

RDFJoin (TG_Join):

{(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}

Page 20: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Mapping to Pig Latin/Relational Algebra


Page 21: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

RDFMap: Efficient Data Representation
  • Compact representation of intermediate results during TripleGroup based processing
  • Efficient look-up of triples matching a given Property type via property-based indexing scheme
  • Ability to represent structure-label information for groups of triples
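The slide does not show how RDFMap is built, so the following Python class is only a guess at the general idea: store the Subject once, index Objects by Property, and expose the set of properties as a structure label. The class name, method names and layout are all hypothetical and should not be read as the authors' implementation.

```python
from collections import defaultdict

class PropertyIndexedGroup:
    """Hypothetical stand-in for an RDFMap-like structure (an assumption)."""

    def __init__(self, subject, triples):
        self.subject = subject                  # stored once, not once per triple
        self.by_property = defaultdict(list)    # Property -> list of Objects
        for s, p, o in triples:
            if s != subject:
                raise ValueError("all triples must share the Subject")
            self.by_property[p].append(o)

    def lookup(self, prop):
        """Return the triples matching a given Property via the index."""
        return [(self.subject, prop, o) for o in self.by_property.get(prop, [])]

    def structure_label(self):
        """The set of properties present in this group of triples."""
        return frozenset(self.by_property)

group = PropertyIndexedGroup("&V1", [
    ("&V1", "label", "Vendor1"),
    ("&V1", "country", "US"),
    ("&V1", "homepage", "www.ven..."),
])
print(group.lookup("country"))
print(sorted(group.structure_label()))
```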


Page 23: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Evaluation
  • Setup: 5-node to 25-node Hadoop clusters on NCSU’s Virtual Computing Lab*
  • Dataset: Synthetic benchmark dataset generated using BSBM** tool (max. 40GB data – approx. 175 million triples)
  • Evaluation of Pig (Pig_opt) vs. RAPID+
      • Task 1 – Scalability with size of RDF graphs
      • Task 2 – Scalability with denser star patterns
      • Task 3 – Scalability with increasing cluster sizes

* https://vcl.ncsu.edu
** http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/

Page 24: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Experimental Results…(1)
Cost Analysis across Increasing size of RDF graphs (5-node)

Key Observations:
  • Benefit of TripleGroup based processing seen across data sizes – up to 60% in some cases (RAPID+ << Pig_opt < Pig)
  • Pig approaches did not complete for large data size

Page 25: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Experimental Results…(2)
Cost Analysis across Increasing Star Density

%gain of RAPID+ over Pig (10-node / 32GB)

Query   #Triple Patterns   #Edges in Stars   %gain
Q1      3                  1:2               56.8
Q2      4                  2:2               46.7
Q3      5                  2:3               47.8
Q4      6                  3:3               51.6
Q5      7                  3:4               57.4
Q6      8                  4:4               58.4
Q7      9                  5:4               58.6
Q8      10                 6:4               57.3
Q9*     6                  2:4               65.4
Q10*    10                 2:4:4             61.5

(5-node / 20GB)

Key Observations:
  • RAPID+ maintains a consistent %gain of roughly 50% across the varying density (46.7% to 65.4% in the table above)
  • Cost savings by eliminating redundant Subject values and join triples

Page 26: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Experimental Results…(3)
Cost Analysis across Increasing Cluster Sizes

Query pattern with three star-joins and two chain-joins (32GB)

Key Observations:
  • RAPID+ has 56% gain for 10-node cluster over Pig approaches
  • Pig approaches catch up with increasing cluster size
      • Increasing nodes decrease probability of disk spills with the SPLIT approach
  • RAPID+ still maintains 45% gain across the experiments

Page 27: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

And some Updates…
  • Additional evaluation –
      • Up to 65% performance gain on another synthetic benchmark dataset* for three/two star-join queries
      • Experiments extended to 1 billion 3-ary triples (43GB) – 31% (10-node) to 41% (30-node) performance gain
  • RAPID+ now includes a SPARQL interface
  • In Future: Cost-based optimizations to select Pig vs. NTGA execution plans
  • Join us for a demo of RAPID+@VLDB2011**

* Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A Comparison of Approaches to Large-scale Data Analysis. In: Proc. of the 35th SIGMOD International Conference on Management of data (2009)

** Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: The Journey using a Nested TripleGroup Algebra. To appear in: Proc. International Conference on Very Large Data Bases. (VLDB 2011)

Page 28: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Related Work
  • MapReduce-based Processing
      • Indexing, Partitioning Schemes, Rule-based Optimizations: [Newman08]*, [Hunter08]*, [Afrati10], [Husain10]*, Hadoop++ [Dittrich10], HadoopDB [Abadi09]
      • Reasoning: [Urbani07]*
  • High-level Dataflow Languages
      • Pig Latin [Olston08], [HiveQL], [JAQL]
  • Other Extensions
      • Map-Reduce-Merge [Yang07]

Page 29: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Conclusion
  • TripleGroup based processing for evaluating pattern matching queries on MapReduce platforms
  • NTGA Operators re-factored to minimize #MR cycles and thereby minimize costs
  • Reduce costs of repeated data handling via operator coalescing
  • Efficient data representation (RDFMap)

Page 30: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

References

[Dean04] Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51 (2008) 107–113

[Olston08] Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proc. International Conference on Management of data. (2008)

[Abadi09] Abouzied, A., Bajda-Pawlikowski, K., Huang, J., Abadi, D.J., Silberschatz, A.: Hadoopdb in action: building real world applications. In: Proc. International Conference on Management of data. (2010)

[Sridhar09] Sridhar, R., Ravindra, P., Anyanwu, K.: RAPID: Enabling scalable ad-hoc analytics on the semantic web. ISWC 2009

[Yu08] Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, U., Gunda, P.K., and Currey, J.: DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. OSDI 2008

[Newman08] Newman, A., Li, Y.F., Hunter, J.: Scalable semantics: The silver lining of cloud computing. In: eScience. IEEE International Conference on. (2008)

[Hunter08] Newman, A., Hunter, J., Li, Y., Bouton, C., Davis, M.: A scale-out rdf molecule store for distributed processing of biomedical data. In: Semantic Web for Health Care and Life Sciences Workshop. (2008)

[Urbani07] Urbani, J., Kotoulas, S., Oren, E., Harmelen, F.: Scalable distributed reasoning using mapreduce. In: Proc. International Semantic Web Conference. (2009)

[Abadi07] Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable Semantic Web Data Management Using Vertical Partitioning. VLDB 2007

[Dittrich10] Dittrich, J., Quiane-Ruiz, J., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). VLDB 2010/PVLDB

[Yang07] Yang, H., Dasdan, A., Hsiao, R., Parker Jr., D.S.: Map-reduce-merge: simplified relational data processing on large clusters. SIGMOD 2007

[Afrati10] Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proc. International Conference on Extending Database Technology. (2010)

[Husain10] Husain, M., Khan, L., Kantarcioglu, M., Thuraisingham, B.: Data intensive query processing for large rdf graphs using cloud computing tools. In: Cloud Computing (CLOUD), IEEE International Conference on. (2010)

[HiveQL] http://hadoop.apache.org/hive/

[JAQL] http://code.google.com/p/jaql

Page 31: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Thank You!

Page 32: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Environment
  • Node Specifications
      • Single / duo core Intel X86
      • 2.33 GHz processor speed
      • 4G memory
      • Red Hat Linux
  • Pig 0.5.0
  • Hadoop 0.20
  • Block size 256MB

Page 33: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Benchmark Data*
  • Log files of HTTP server traffic
  • Column-delimited text file

Rankings: pageRank | PageURL | avgDuration

UserVisits: sourceIPAddr | destinationURL | visitDate | adRevenue | UserAgent | cCode | lCode | sKeyword | avgTimeOnSite

* Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A Comparison of Approaches to Large-scale Data Analysis. In: Proc. of the 35th SIGMOD International Conference on Management of data (2009)

Page 34: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Scripts (Q1)

  -- Load the space-delimited RDF triples (Subject, Property, Object)
  A = load '/data/' using PigStorage(' ');
  -- Keep only the triples whose Property (or type) is relevant to the query
  A1 = filter A by $1 eq 'pageRank' or $1 eq 'pageURL' or $1 eq 'destURL' or $1 eq 'srcIP' or $1 eq 'adRevenue' or ($1 eq 'type' and ($2 eq 'Ranking' or $2 eq 'UserVisits'));
  -- Group by Subject; ReassembleRDF is a UDF whose definition is not shown on the slide
  B = group A1 by $0 PARALLEL 5;
  C = foreach B generate flatten(ReassembleRDF($1,'pageURL|destURL','1'));
  D = group C by $0 PARALLEL 5;
  E = foreach D generate flatten(ReassembleRDF($1,'srcIP|','2')) as (srcIP:chararray, vals:bytearray);
  store E into '/q1_app1';

Page 35: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Scripts (Q1)

  -- Load the space-delimited RDF triples (Subject, Property, Object)
  A1 = load '/data/' using PigStorage(' ');
  -- Split the triples into one relation per Property (or type)
  split A1 into pageRank IF $1 eq 'pageRank',
                srcIP IF $1 eq 'srcIP',
                pageURL IF $1 eq 'pageURL',
                destURL IF $1 eq 'destURL',
                adRevenue IF $1 eq 'adRevenue',
                typeRanking IF $1 eq 'type' and $2 eq 'Ranking',
                typeUV IF $1 eq 'type' and $2 eq 'UserVisits';
  -- One m-way join per star, then a join between the two stars
  Ranking = join pageURL by $0, pageRank by $0, typeRanking by $0 PARALLEL 5;
  UserVisits = join srcIP by $0, destURL by $0, adRevenue by $0, typeUV by $0 PARALLEL 5;
  C1 = join Ranking by $2, UserVisits by $5 PARALLEL 5;
  D1 = foreach C1 generate $11, $17, $5;
  store D1 into '/q1_app2';

Page 36: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Experiment Results

Percentage Performance Gain = $\dfrac{(\text{exec time 1}) - (\text{exec time 2})}{(\text{exec time 1})}$
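As a worked example of this ratio, the Python helper below computes the gain for two made-up execution times. The function name and the timings are hypothetical, and the multiplication by 100 is an assumption based on the word "Percentage" (the slide shows only the ratio).

```python
def percentage_gain(exec_time_1, exec_time_2):
    """Gain of approach 2 over approach 1, as a percentage of exec time 1."""
    if exec_time_1 <= 0:
        raise ValueError("exec_time_1 must be positive")
    return (exec_time_1 - exec_time_2) / exec_time_1 * 100

# Hypothetical timings in seconds, for illustration only.
gain = percentage_gain(200.0, 100.0)
print(gain)   # 50.0
```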

Page 37: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Possible Optimizations (2)
  • Coalesce join operations into as few MR cycles as possible
  • Compute star patterns via m-way JOIN
      • Star-join using m-way JOIN = 1 MR cycle
      • Reduced #MR cycles, hence reduced I/O + communication costs

Page 38: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Structured Data Processing in Pig

UserVisits

srcIP          destURL   visitDate    adRevenue   …
158.112.27.3   url1      1979/12/12   339.08142   ….
158.112.27.3   url5      1979/12/15   180.334     ….
150.121.18.6   url1      1979/12/28   550.7889    ….
…              …         …            …           …

Ranking

pageRank   pageURL   avgDur
11         url1      96
23         url2      3
18         url3      87
…          …         …

Query: Retrieve the pageRank and adRevenue of pages visited by particular users between “1979/12/01” and “1979/12/30”

[Figure: Pig query plan]
  LOAD UserVisits → FILTER(visitDate)
  LOAD Ranking
  JOIN UserVisits ON destURL, Ranking ON pageURL;
  STORE

Page 39: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

JOIN: Pig Latin → MapReduce

JOIN UserVisits ON destURL, Ranking ON pageURL;

[Figure: the JOIN compiled into one MapReduce cycle. map: annotate each UserVisits and Ranking tuple based on its join key (url1, url2). reduce: package tuples with the same join key at the same reducer (Reducer 1, Reducer 2) and join them.]

UserVisits

srcIP          destURL   visitDate    adRev
158.112.27.3   url1      1979/12/12   339.081
158.112.27.3   url2      1979/12/15   180.334
150.121.18.6   url1      1979/12/28   550.78

Ranking

pageRank   pageURL   avgDur
11         url1      96
23         url2      3

Join output

…   srcIP          destURL   pageURL   pageRank   …
…   158.112.27.3   url1      url1      339.081    …
…   150.121.18.6   url1      url1      550.78     …
…   158.112.27.3   url2      url2      180.334    …

Page 40: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

RDF Data Model (Resource Description Framework)
  • Statements (triples): Subject Prop Object, e.g. (&UV1 srcIP 158.112.27.3)
  • Graph representation

Sub    Prop          Obj
&R1    type          Ranking
&R1    pageRank      11
&R1    pageURL       Url1
&R1    avgDuration   97
&UV1   type          UserVisits
&UV1   srcIP         158.112.27.3
&UV1   destURL       url1
&UV1   adRevenue     339.08142
&UV1   visitDate     1979/12/12
&UV1   userAgent     SCOPE
&UV1   cCode         VNM
&UV1   lCode         VNM-KH
&UV1   sKeyword      comets
&UV1   avgTime       3

[Figure: graph representation of the Ranking and UserVisits triples]

Page 41: An Intermediate Algebra for  Optimizing RDF Graph Pattern Matching on  MapReduce

Example SPARQL Query

Sub       Prop            Obj
&V1       type            Vendor
&V1       label           Vendor1
&V1       country         US
&V1       homepage        www.ven...
&Offer1   type            Offer
&Offer1   vendor          &V1
&Offer1   product         &P1
&Offer1   price           108
&Offer1   delDays         2
&Offer1   validToDate     01/01/2011
&Offer1   validFromDate   08/01/2011
&Rev1     type            Review
&Rev1     reviewFor       &P1
&Rev1     rating1         9
&Rev1     reviewer        &R1

Data: Description of Vendors, their product Offers, and Reviews of products (BSBM* dataset)

Query: Retrieve the details of US-based Vendors

SELECT ?vlabel ?hpage
WHERE {
  ?v type Vendor .
  ?v country ?vcountry .
  ?v label ?vlabel .
  ?v homepage ?hpage .
  FILTER (?vcountry = "US")
}

* http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
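To show what the query returns on the sample data, here is a small Python sketch that evaluates the same star pattern and FILTER by grouping triples on Subject. It is not a SPARQL engine: `us_vendors` is a hypothetical function, and it assumes each Subject has at most one value per Property.

```python
from collections import defaultdict

def us_vendors(triples):
    """Return (label, homepage) for Subjects typed Vendor with country US."""
    groups = defaultdict(dict)
    for s, p, o in triples:
        groups[s][p] = o                        # assumes one Object per Property
    results = []
    for props in groups.values():
        if (props.get("type") == "Vendor" and props.get("country") == "US"
                and "label" in props and "homepage" in props):
            results.append((props["label"], props["homepage"]))
    return results

triples = [
    ("&V1", "type", "Vendor"),
    ("&V1", "label", "Vendor1"),
    ("&V1", "country", "US"),
    ("&V1", "homepage", "www.ven..."),
    ("&Offer1", "type", "Offer"),
    ("&Offer1", "vendor", "&V1"),
    ("&Offer1", "price", "108"),
    ("&Rev1", "type", "Review"),
    ("&Rev1", "reviewFor", "&P1"),
]
answers = us_vendors(triples)
print(answers)
```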