Towards Scalable RDF Graph Analytics on MapReduce Padmashree Ravindra Vikas V. Deshpande

Towards Scalable RDF Graph Analytics on MapReduce

Padmashree Ravindra, Vikas V. Deshpande, Kemafor Anyanwu

{pravind2, vvdeshpa, kogan}@ncsu.edu

COUL - Semantic COmpUting research Lab

Introduction

Growing interest in exploiting RDF data for decision-making

Requires support for analytical-style querying

e.g. : Sales (Cust, prod, price, loc, month, year)

* For each prod, count for each month of 2008, the sales that were between previous month’s avg sale and next month’s avg sale

- More complex than traditional SPJ queries

- Often include multiple groupings and / or aggregations

- Next release of SPARQL expected to include such constructs

Prod Month Count (sales between prev_avg_sale and next_avg_sale)

Prod1 Feb 3

* Example from [1]
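The Sales query above can be sketched in a few lines of Python. This is only an illustration of what the query asks for, not the MD-join operator from [1]. It assumes months are integers 1..12, the bounds are inclusive, either average may be the larger one, and months without a previous or next month in the same year are skipped. The sample rows are made up.

```python
from collections import defaultdict

# Hypothetical Sales rows: (cust, prod, price, loc, month, year)
SALES = [
    ("c1", "Prod1", 10.0, "NC", 1, 2008),
    ("c2", "Prod1", 20.0, "NC", 1, 2008),
    ("c1", "Prod1", 12.0, "NC", 2, 2008),
    ("c2", "Prod1", 18.0, "NC", 2, 2008),
    ("c3", "Prod1", 25.0, "NC", 2, 2008),
    ("c4", "Prod1", 30.0, "NC", 2, 2008),
    ("c1", "Prod1", 30.0, "NC", 3, 2008),
]


def monthly_counts(sales, year=2008):
    """Count, per (prod, month), the sales priced between the
    previous month's and the next month's average sale."""
    # Pass 1: average price per (prod, month) for the chosen year.
    totals = defaultdict(lambda: [0.0, 0])
    for _cust, prod, price, _loc, month, yr in sales:
        if yr == year:
            totals[(prod, month)][0] += price
            totals[(prod, month)][1] += 1
    avg = {key: s / n for key, (s, n) in totals.items()}

    # Pass 2: compare each sale with the neighbouring months' averages.
    counts = defaultdict(int)
    for _cust, prod, price, _loc, month, yr in sales:
        if yr != year:
            continue
        prev_avg = avg.get((prod, month - 1))
        next_avg = avg.get((prod, month + 1))
        if prev_avg is None or next_avg is None:
            continue  # assumption: skip months lacking a neighbour
        low, high = sorted((prev_avg, next_avg))
        if low <= price <= high:
            counts[(prod, month)] += 1
    return dict(counts)


COUNTS = monthly_counts(SALES)
```

The two passes are the point: the aggregates for neighbouring months must exist before any single sale can be classified, which is why such queries go beyond plain SPJ.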

Analytical Query Processing

Traditional OLAP techniques

- Requires star / snowflake schema

- Enterprise-scale

But Semantic Web data (RDF)

- Semi-structured (labeled graphs)

- Absence of star-like schema

- Billion triple data sets

Goal : Exploit MapReduce-based frameworks to develop a scalable, cost-effective platform for Semantic Web analytics.

MapReduce-based Data Processing

High-level dataflow languages - Pig Latin, DryadLINQ, HiveQL, JAQL

Hybrid approach - HadoopDB [5]

MapReduce in RDF processing

- Graph pattern queries [8], [9]

- Graph closure computation [10]

RAPID [6]

- Succinct expression of complex queries

- Optimize multiple groupings / aggregations

RDF data model

Statements (triples)

Graph representation

Sub Prop Obj

R1 type Ranking

R1 pageRank 11

R1 pageURL Url1

R1 avgDuration 97

UV1 type UserVisits

UV1 srcIP 158.112.27.3

UV1 destURL url1

UV1 adRevenue 339.08142

UV1 visitDate 1979/12/12

UV1 userAgent SCOPE

UV1 cCode VNM

UV1 iCode VNM-KH

UV1 sKeyword comets

UV1 avgTime 3

Rankings

UserVisits

Groups = Stars

Traditional Querying of RDF

Graph pattern matching

E.g. Get details about all pages visited by particular users between “1979/12/01” and “1979/12/30”

[Figure: SPARQL query and the matching graph pattern]

Example Analytical Query on RDF data

Compute the average pageRank and total adRevenue for all pages visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30

Pattern matching

- Star sub graphs: Rankings, UserVisits

- Join between the stars

Grouping based on value of srcIP property

Aggregation on value of pageRank and adRevenue
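As a reference point for what this query should return, here is a small single-machine sketch in Python. It is not how Pig or RAPID+ evaluates the query. It assumes one value per property per subject, dates as YYYY/MM/DD strings (so string comparison orders them correctly), and a join between pageURL and destURL. UV3 is a made-up visit added so the aggregation has more than one row.

```python
from collections import defaultdict

TRIPLES = [
    ("R1", "type", "Ranking"), ("R1", "pageRank", "11"), ("R1", "pageURL", "url1"),
    ("R2", "type", "Ranking"), ("R2", "pageRank", "27"), ("R2", "pageURL", "url2"),
    ("UV1", "type", "UserVisits"), ("UV1", "srcIP", "158.112.27.3"),
    ("UV1", "destURL", "url1"), ("UV1", "adRevenue", "339.08142"),
    ("UV1", "visitDate", "1979/12/12"),
    ("UV2", "type", "UserVisits"), ("UV2", "srcIP", "159.222.21.9"),
    ("UV2", "destURL", "url1"), ("UV2", "adRevenue", "330.51248"),
    ("UV2", "visitDate", "1980/02/02"),
    # Hypothetical extra visit (not in the slides).
    ("UV3", "type", "UserVisits"), ("UV3", "srcIP", "158.112.27.3"),
    ("UV3", "destURL", "url2"), ("UV3", "adRevenue", "10.0"),
    ("UV3", "visitDate", "1979/12/20"),
]

VISIT_PROPS = {"srcIP", "destURL", "adRevenue", "visitDate"}


def analytical_query(triples, start, end):
    """Return {srcIP: (avg pageRank, total adRevenue)}."""
    # Pattern matching: collect each subject's properties (its star).
    stars = defaultdict(dict)
    for sub, prop, obj in triples:
        stars[sub][prop] = obj

    # Rankings star, indexed by the join property pageURL.
    rank_by_url = {
        p["pageURL"]: float(p["pageRank"])
        for p in stars.values()
        if p.get("type") == "Ranking" and "pageURL" in p and "pageRank" in p
    }

    # UserVisits star, date filter, join, and grouping on srcIP.
    groups = defaultdict(list)
    for p in stars.values():
        if p.get("type") != "UserVisits" or not VISIT_PROPS.issubset(p):
            continue
        if not (start <= p["visitDate"] <= end):
            continue
        rank = rank_by_url.get(p["destURL"])
        if rank is not None:
            groups[p["srcIP"]].append((rank, float(p["adRevenue"])))

    # Aggregation per group.
    return {
        ip: (sum(r for r, _ in rows) / len(rows), sum(a for _, a in rows))
        for ip, rows in groups.items()
    }


RESULT = analytical_query(TRIPLES, "1979/12/01", "1979/12/30")
```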

Pig : Data Processing

Express data processing tasks using high-level query primitives: usability, code reuse, automatic optimization

Pig Latin data model : atom, tuple, bag (nesting)

Operators : LOAD, STORE, JOIN, GROUP BY, COGROUP, FOREACH, SPLIT, aggr. functions

Extensibility support via UDFs

Operators compile into MapReduce jobs

Partition REL A using values in age column ($1)

SPLIT A into minors IF $1 < 18, majors IF $1 >= 18;

Equijoin on REL A (column 0) and REL B (column 1)

JOIN A by $0, B by $1;

Compiling Pig Latin’s JOIN to MapReduce

JOIN A by $1, B by $0;

REL A
$0 $1
C1 P1
C1 P2
C2 P1

REL B
$0 $1
P1 18
P2 25

map: Annotate based on $1 (join key)

reduce: Package tuples

Reducer 1 (key P1): (C1, P1, P1, 18), (C2, P1, P1, 18)

Reducer 2 (key P2): (C1, P2, P2, 25)

Result
$0 $1 $2 $3
C1 P1 P1 18
C2 P1 P1 18
C1 P2 P2 25
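The join compilation on this slide can be imitated in plain Python. This is a sketch of a reduce-side join in general, not Pig's actual generated code. The function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative, and the shuffle is simulated with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import product

REL_A = [("C1", "P1"), ("C1", "P2"), ("C2", "P1")]
REL_B = [("P1", 18), ("P2", 25)]


def map_phase(rel_a, rel_b):
    """Annotate each tuple with its source relation and emit it under its join key."""
    for t in rel_a:
        yield t[1], ("A", t)  # join key of REL A is $1
    for t in rel_b:
        yield t[0], ("B", t)  # join key of REL B is $0


def shuffle(pairs):
    """Stand-in for the MapReduce shuffle: collect values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Package the tuples of one key and emit their cross product."""
    left = [t for tag, t in values if tag == "A"]
    right = [t for tag, t in values if tag == "B"]
    return [a + b for a, b in product(left, right)]


def join(rel_a, rel_b):
    out = []
    for _key, values in sorted(shuffle(map_phase(rel_a, rel_b)).items()):
        out.extend(reduce_phase(values))
    return out


RESULT = join(REL_A, REL_B)
```

Each reducer only ever sees tuples that share one join key, so the cross product inside `reduce_phase` yields exactly the joined rows for that key.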

Pattern Matching in Pig : Approach 1

Triple relation (used three times: triples1, triples2, triples3)

Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL Url1
UV1 type UserVisits
UV1 srcIP 158.112.27.3

Rankings star pattern: R1 with edges type (Ranking), pageRank (11), pageURL (url1)

RankingsStarPattern = JOIN triples1 BY Sub, triples2 BY Sub, triples3 BY Sub;

Rankings star pattern = 3-way self-join

UserVisits star pattern = 5-way self-join

Issues

- Self-joins on very large relations: high I/O costs

- Generates meaningless tuples, so an additional filtering step is needed, e.g. (R1, type, Ranking, R1, type, Ranking, R1, type, Ranking)
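A small Python sketch makes the cost of the self-join approach concrete. It uses a nested-loop join for clarity, which is an assumption made for illustration and not how Pig executes the JOIN. It counts how many joined tuples are produced before the one meaningful Rankings star is filtered out.

```python
from itertools import product

TRIPLES = [
    ("R1", "type", "Ranking"),
    ("R1", "pageRank", "11"),
    ("R1", "pageURL", "Url1"),
    ("UV1", "type", "UserVisits"),
    ("UV1", "srcIP", "158.112.27.3"),
]


def three_way_self_join(triples):
    """Join three copies of the triple relation on Sub."""
    return [
        t1 + t2 + t3
        for t1, t2, t3 in product(triples, repeat=3)
        if t1[0] == t2[0] == t3[0]
    ]


def rankings_star(joined):
    """Additional filtering step: keep only rows that bind
    (type = Ranking, pageRank, pageURL) in that order."""
    return [
        row for row in joined
        if row[1] == "type" and row[2] == "Ranking"
        and row[4] == "pageRank" and row[7] == "pageURL"
    ]


JOINED = three_way_self_join(TRIPLES)
STARS = rankings_star(JOINED)
MEANINGLESS = ("R1", "type", "Ranking") * 3
```

On these five triples the join emits 35 rows (27 for R1, 8 for UV1) and the filter keeps one, which is the I/O overhead the slide points at.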

Approach 2: Vertical Partitioning

Triple store: LOAD all the RDF triples

SPLIT into one relation per property:

typeRanking: (R1, type, Ranking), (R2, type, Ranking)
destURL: (UV1, destURL, url1), (UV2, destURL, url1)
pageURL: (R1, pageURL, url1), (R2, pageURL, url2)
pageRank: (R1, pageRank, 11), (R2, pageRank, 27)
typeUV: (UV1, type, userVisits), (UV2, type, userVisits)
srcIP: (UV1, srcIP, 158.112.27.3), (UV2, srcIP, 159.222.21.9)
adRev: (UV1, adRev, 339.08142), (UV2, adRev, 330.51248)
visitDate: (UV1, visitDate, 1979/12/12), (UV2, visitDate, 1980/02/02)

Ranking = JOIN(compute Star Pattern)

UserVisits = JOIN(compute Star Pattern)

JOIN between Ranking, UserVisits

GROUP BY srcIP

FOREACH group GENERATE aggregations

Dataflow: LOAD all the RDF triples, Filter, SPLIT into typeRanking, destURL, pageURL, pageRank, typeUV, srcIP, adRev, visitDate, then Ranking = JOIN(compute Star Pattern)

Issues

- SPLIT : Concurrent sub flows

- Risk of disk spills: I/O costs

- Structure of intermediate relations
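Vertical partitioning can be sketched in Python as follows. This is an illustrative, in-memory version that assumes single-valued properties. The partition naming (`type` triples split further by class, giving `typeRanking`) and the helper names are choices made for this sketch.

```python
from collections import defaultdict

TRIPLES = [
    ("R1", "type", "Ranking"), ("R2", "type", "Ranking"),
    ("R1", "pageURL", "url1"), ("R2", "pageURL", "url2"),
    ("R1", "pageRank", "11"), ("R2", "pageRank", "27"),
    ("UV1", "srcIP", "158.112.27.3"),
]


def partition_key(prop, obj):
    """type triples are partitioned by class; others by property name."""
    return f"type{obj}" if prop == "type" else prop


def vertical_partition(triples):
    """SPLIT the triple relation into one (Sub, Obj) relation per property."""
    parts = defaultdict(list)
    for sub, prop, obj in triples:
        parts[partition_key(prop, obj)].append((sub, obj))
    return parts


def star_join(parts, names):
    """Join the named partitions on Sub; keep subjects present in all of them."""
    tables = [dict(parts[name]) for name in names]
    common = set(tables[0]).intersection(*tables[1:])
    return {sub: tuple(t[sub] for t in tables) for sub in sorted(common)}


PARTS = vertical_partition(TRIPLES)
RANKING = star_join(PARTS, ["typeRanking", "pageRank", "pageURL"])
```

Every partition is a separate intermediate relation. That is the source of the concurrent sub flows and spill risk listed under Issues.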

Compilation to MapReduce Jobs

[Figure: the Pig dataflow over Rankings and UserVisits (FILTER, JOIN, GROUP BY, FOREACH) compiled into four MapReduce jobs, map1/reduce1 through map4/reduce4]

Step 1 : Pattern Matching

Step 2 : Grouping

Step 3 : Aggregation

Our Approach : RAPID+

Goal : Minimize I/O costs

Strategy:

Concurrent computation of star patterns using grouping-based algorithm

Can improve efficiency using Operator-coalescing and Look-ahead processing

Concurrent Star Pattern Matching

Use grouping-based algorithm on a triple storage model: GROUP BY Subject

More efficient if irrelevant triples are filtered out first

Query: Compute the average pageRank and total adRevenue for all pageURLs visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30

Input triples

Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL Url1
R1 avgDuration 97
UV1 type UserVisits
UV1 srcIP 158.112.27.3
UV1 destURL url1
UV1 adRevenue 339.08142
UV1 visitDate 1979/12/12
UV1 userAgent SCOPE
UV1 cCode VNM
UV1 iCode VNM-KH
UV1 sKeyword comets
UV1 avgTime 3

Filter irrelevant properties

Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL Url1
UV1 type UserVisits
UV1 srcIP 158.112.27.3
UV1 destURL url1
UV1 adRevenue 339.08142
UV1 visitDate 1979/12/12

Star sub graphs: Ranking, UserVisits

Concurrent Star Pattern Matching -2

Filter irrelevant triples by coalescing LOAD and FILTER operators

input = LOAD '\data' using loadFilter ( pageRank, pageURL, type:Ranking, destURL, adRevenue, srcIP, visitDate, type:UserVisits );

Using Pig Latin: map1 runs LOAD, then FILTER

Our Approach (Operator Coalescing): map1 runs a single loadFilter

Savings by Coalescing:

- Context switching

- Parameter passing

- Multiple handling of same data
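The idea behind coalescing LOAD and FILTER can be sketched in Python. This is not the actual loadFilter UDF. It assumes whitespace-separated triples with no spaces inside values, and the function names are illustrative.

```python
RELEVANT_PROPS = {"pageRank", "pageURL", "destURL", "adRevenue", "srcIP", "visitDate"}
RELEVANT_TYPES = {"Ranking", "UserVisits"}

DATA = """\
R1 type Ranking
R1 pageRank 11
R1 avgDuration 97
UV1 userAgent SCOPE
UV1 srcIP 158.112.27.3
"""


def is_relevant(prop, obj):
    if prop == "type":
        return obj in RELEVANT_TYPES
    return prop in RELEVANT_PROPS


def load(lines):
    """Plain LOAD: every triple is parsed and materialized."""
    return [tuple(line.split()) for line in lines if line.strip()]


def filter_relevant(triples):
    """Separate FILTER: a second pass over the loaded triples."""
    return [t for t in triples if is_relevant(t[1], t[2])]


def load_filter(lines):
    """Coalesced operator: parse and test in one pass, so irrelevant
    triples are never handed to a downstream operator."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and is_relevant(parts[1], parts[2]):
            yield tuple(parts)


TWO_STEP = filter_relevant(load(DATA.splitlines()))
COALESCED = list(load_filter(DATA.splitlines()))
```

Both paths give the same triples. The coalesced version touches each record once instead of twice.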

Grouping-based Pattern Matching

Sub

Prop Obj

R1 type Ranking

R1 pageRank 11

R1 pageURL Url1

UV1 type UserVisits

UV1 srcIP 158.112.27.3

UV1 destURL url1

UV1 adRevenue 339.08142

UV1 visitDate 1979/12/12

GROUP BY Subject

BUT heterogeneous bags

starSubgraphs = GROUP input BY $0;
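What the GROUP step produces can be shown with a short Python sketch. It is an in-memory stand-in for the Pig GROUP above, using the triples from this slide, and `group_by_subject` is an illustrative name.

```python
from collections import defaultdict

TRIPLES = [
    ("R1", "type", "Ranking"),
    ("R1", "pageRank", "11"),
    ("R1", "pageURL", "Url1"),
    ("UV1", "type", "UserVisits"),
    ("UV1", "srcIP", "158.112.27.3"),
    ("UV1", "destURL", "url1"),
    ("UV1", "adRevenue", "339.08142"),
    ("UV1", "visitDate", "1979/12/12"),
]


def group_by_subject(triples):
    """One bag of (Prop, Obj) pairs per subject, i.e. one star sub graph each."""
    bags = defaultdict(list)
    for sub, prop, obj in triples:
        bags[sub].append((prop, obj))
    return dict(bags)


STAR_SUBGRAPHS = group_by_subject(TRIPLES)
R1_PROPS = {prop for prop, _ in STAR_SUBGRAPHS["R1"]}
UV1_PROPS = {prop for prop, _ in STAR_SUBGRAPHS["UV1"]}
```

A single pass yields both stars at once, but the bags are heterogeneous: R1 and UV1 carry different property sets, so they cannot be read as rows of one fixed schema.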

Filtering the Groups

BUT all possible sub patterns computed

Filter non-matching sub patterns

- Structure-based filtering: eliminate sub graphs with missing properties (e.g. missing srcIP)

- Value-based filtering: validate each sub graph against filter condition (e.g. visitDate between 1979/12/01 and 1979/12/30)
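The two filters can be sketched in Python over bags of (Prop, Obj) pairs. The required-property sets are assumptions read off the example query, UV2 and UV3 are made-up sub graphs, and dates are assumed to be YYYY/MM/DD strings so that string comparison orders them.

```python
REQUIRED = {
    "Ranking": {"type", "pageRank", "pageURL"},
    "UserVisits": {"type", "srcIP", "destURL", "adRevenue", "visitDate"},
}

BAGS = {
    "R1": [("type", "Ranking"), ("pageRank", "11"), ("pageURL", "Url1")],
    "UV1": [("type", "UserVisits"), ("srcIP", "158.112.27.3"), ("destURL", "url1"),
            ("adRevenue", "339.08142"), ("visitDate", "1979/12/12")],
    # Hypothetical: srcIP is missing, so the structure check fails.
    "UV2": [("type", "UserVisits"), ("destURL", "url1"),
            ("adRevenue", "330.51248"), ("visitDate", "1979/12/15")],
    # Hypothetical: visitDate outside the range, so the value check fails.
    "UV3": [("type", "UserVisits"), ("srcIP", "159.222.21.9"), ("destURL", "url1"),
            ("adRevenue", "330.51248"), ("visitDate", "1980/02/02")],
}


def structure_ok(bag):
    """Structure-based filtering: all required properties are present."""
    props = dict(bag)
    required = REQUIRED.get(props.get("type"))
    return required is not None and required.issubset(props)


def value_ok(bag, start="1979/12/01", end="1979/12/30"):
    """Value-based filtering: visitDate, when present, lies in the range."""
    date = dict(bag).get("visitDate")
    return date is None or start <= date <= end


def filter_stars(bags):
    return {sub: bag for sub, bag in bags.items()
            if structure_ok(bag) and value_ok(bag)}


MATCHING = filter_stars(BAGS)
```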

Joining the Stars : Look-ahead Processing

Without look-ahead

Star Pattern Matching Cycle

- map: Annotate based on Subject

- reduce: Group by Subject; process each bag (structure-based and value-based filtering)

Next Cycle (Joining the Stars)

- map: Process each bag; annotate based on value of join property

- reduce: Join between the star sub graphs

With look-ahead

- reduce of the Star Pattern Matching Cycle: Group by Subject; process each bag (structure-based and value-based filtering); annotate based on value of join prop

- No repeated processing

Example : Look-ahead Processing

Star Pattern Matching

- Structure-based filtering

- Value-based filtering

- Look-Ahead - Annotate bag based on join key

Joining the Stars

- Join between the star sub graphs

- Eliminate properties irrelevant for future processing (join and filter prop)

- Minimize size of intermediate results
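A rough Python sketch of look-ahead processing follows. It is one reading of the slide, not the RAPID+ implementation: the reduce step of the star-matching cycle already emits each star keyed by its join-property value and keeps only the properties needed later. The join properties (pageURL, destURL), the kept properties, and all function names are assumptions for this example. Bags are assumed to have passed structure and value filtering.

```python
from collections import defaultdict

JOIN_PROP = {"Ranking": "pageURL", "UserVisits": "destURL"}
KEEP = {"Ranking": ("pageRank",), "UserVisits": ("srcIP", "adRevenue")}

BAGS = {
    "R1": [("type", "Ranking"), ("pageRank", "11"), ("pageURL", "url1")],
    "UV1": [("type", "UserVisits"), ("srcIP", "158.112.27.3"), ("destURL", "url1"),
            ("adRevenue", "339.08142"), ("visitDate", "1979/12/12")],
}


def star_reduce(bag):
    """Reduce step of the star-matching cycle with look-ahead:
    annotate with the join key and drop properties not needed later."""
    props = dict(bag)
    kind = props["type"]
    join_key = props[JOIN_PROP[kind]]
    projected = tuple(props[p] for p in KEEP[kind])
    return join_key, (kind, projected)


def join_reduce(values):
    """Reduce step of the next cycle: join Ranking and UserVisits stars."""
    ranks = [row for kind, row in values if kind == "Ranking"]
    visits = [row for kind, row in values if kind == "UserVisits"]
    return [r + v for r in ranks for v in visits]


def run(bags):
    by_key = defaultdict(list)
    for bag in bags.values():
        key, value = star_reduce(bag)
        by_key[key].append(value)  # the next cycle's map has nothing left to do
    joined = {}
    for key, values in by_key.items():
        rows = join_reduce(values)
        if rows:
            joined[key] = rows
    return joined


JOINED = run(BAGS)
```

Because the join key is attached while the bag is already being processed, the map phase of the join cycle does not need to reopen each bag.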

Comparison : Pig vs RAPID+

Pig Approach

- Multiple map-reduce cycles: N star sub graphs, N cycles

- Potential for increased I/O: (i) Disk spills (SPLIT operator) (ii) Materialization of several intermediate results due to sequential computation of star patterns

- Would require advanced optimization techniques: introduce project operator to eliminate unneeded columns

- Look-ahead processing: not applicable

RAPID+

- Single cycle: N star sub graphs, 1 cycle

- Minimized I/O: (i) Filtering in triple storage model + load-filter coalescing (ii) Concurrent computation of star patterns (single intermediate result)

- Smaller intermediate result sizes: eliminate tuples and columns not necessary in future steps of processing

- Minimize repeated tuple handling by look-ahead processing

Case Study

Setup: 5-node / 20-node Hadoop clusters on NCSU’s Virtual Computing Lab [13]

Dataset: Synthetic benchmark data set [4]

Tasks: Baseline case

- Task A (PM): basic pattern matching (2 star patterns and a join between the stars)

- Task B (PM+GA): pattern matching with grouping and aggregation (two look-ahead processing opportunities)

Experimental Results

[Figure: Cost Analysis for Task A (PM), 5-node cluster]

[Figure: Cost Analysis for Task B (PM+GA), 5-node cluster]

Experimental Results

Scalability Study: 5-node vs 20-node

[Figure: 1.8GB per node, 2.8GB per node]

Conclusion and Ongoing work

Promising results even for baseline case

Further opportunities for improvement

- First-class operators vs UDFs

- Exploit combiners during aggregations

- More efficient data structures for processing bags

- Further look-ahead optimizations during multiple groupings and aggregations

References

[1] D. Chatziantoniou, M. Akinde, T. Johnson, and S. Kim. "The MD-join: an operator for Complex OLAP". ICDE 2001, 108-121

[2] J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". In Proc. of OSDI'04, 2004

[3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. "Pig Latin: a not-so-foreign language for data processing". In Proc. of ACM SIGMOD 2008, p. 1099-1110

[4] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. "A Comparison of Approaches to Large-Scale Data Analysis". In Proc. of SIGMOD 2009

[5] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, and A. Silberschatz. "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads". VLDB 2009

[6] R. Sridhar, P. Ravindra, and K. Anyanwu. "RAPID: Enabling scalable ad-hoc analytics on the semantic web". ISWC 2009

[7] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. "DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language". OSDI 2008

[8] A. Newman, Y. Li, and J. Hunter. "Scalable Semantics - The Silver Lining of Cloud Computing". IEEE Fourth International Conference on eScience '08, 2008

[9] A. Newman, J. Hunter, Y-F. Li, C. Bouton, and M. Davis. "A Scale-Out RDF Molecule Store for Distributed Processing of Biomedical Data". HCLS'08 at WWW 2008

[10] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen. "Scalable Distributed Reasoning using MapReduce". In Proceedings of ISWC '09, 2009

[11] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. "Scalable Semantic Web Data Management Using Vertical Partitioning". VLDB 2007

[12] E. Prud'hommeaux and A. Seaborne. "SPARQL query language for RDF". Technical report, World Wide Web Consortium (2005) http://www.w3.org/TR/rdf-sparql-quer

[13] VCL Setup at NC State University, https://vcl.ncsu.edu/

[14] HiveQL, http://hadoop.apache.org/hive/

[15] JAQL, http://code.google.com/p/jaql

[16] RDF, http://www.w3.org/RDF/

Thank You!
