

Processing SPARQL Queries Over Linked Data—A Distributed Graph-based Approach

Peng Peng1, Lei Zou1, M. Tamer Özsu2, Lei Chen3, Dongyan Zhao1

1 Peking University, China; 2 University of Waterloo, Canada; 3 Hong Kong University of Science and Technology, China

{pku09pp,zoulei,zhaodongyan}@pku.edu.cn, [email protected], [email protected]

ABSTRACT

We propose techniques for processing SPARQL queries over linked data. We follow a graph-based approach where answering a query Q is equivalent to finding its matches over a distributed RDF data graph G. We adopt a "partial evaluation and assembly" framework. Partial evaluation results of query Q over each repository, called local partial matches, are found. In the assembly stage, we propose a centralized and a distributed assembly strategy. We analyze our algorithms both theoretically and experimentally. Extensive experiments over both real and benchmark RDF repositories with billions of triples demonstrate the high performance and scalability of our methods compared with existing solutions.

1. INTRODUCTION

The semantic web data model, called the "Resource Description Framework", or RDF, represents data as a collection of triples of the form ⟨subject, property, object⟩. A triple can be naturally seen as a pair of entities connected by a named relationship or an entity associated with a named attribute value. Linked Open Data (LOD), which is the focus of this paper, uses RDF to link web objects through their Uniform Resource Identifiers (URIs). LOD forms a directed graph where each vertex represents a Web document that contains data in the form of a set of RDF triples. These documents can be retrieved by looking up HTTP-scheme based URIs. A pair of document vertices is connected by a directed edge if any of the RDF triples in the source document contains a URI in its subject, predicate, or object position such that a lookup of this URI retrieves the target document. LOD forms a federated (or virtually integrated) database with the documents stored at autonomous sites, some of which have the ability to run SPARQL queries over their own data (these are called "SPARQL endpoints").

In this paper we address the issue of executing SPARQL queries over this type of federated RDF repositories. The approach we take to evaluating SPARQL queries is graph-based, which is consistent with the RDF and SPARQL specifications. Consequently, answering a SPARQL query Q is equivalent to finding subgraph matches of query graph Q over an RDF graph G, as employed, for example, in the gStore system [36]. We note that the problem in this context is different from subgraph matching in general graph processing systems, since those rely on subgraph isomorphism whereas SPARQL query semantics are based on graph homomorphism.

There are several ways to address this issue [14]. One method is analogous to data warehousing, where all of the data is downloaded into a single RDF database and queries are run over it using centralized techniques, such as Jena [32], RDF-3x [24], SW-store [1] and gStore [36]. This is a reasonable approach, but at web scale the feasibility of these approaches is a serious concern, as is the freshness of query results.

A second approach that favors query result freshness is live querying [13]. In this approach, one starts with a set of seed LOD documents and follows the RDF links inside them until no further documents are reachable. These approaches provide up-to-date results, but have high latency that is hard to optimize.

The third alternative is to treat the interconnected RDF repositories as a virtually integrated distributed database, where each individual RDF repository is regarded as a fragment of this database. Each fragment Fi resides at its own site Si, and each query is executed at the relevant sites. Within this context, there are two approaches. One is to decompose queries and evaluate each subquery at the appropriate sites. The query decomposition strategy requires full knowledge about the underlying schema of each fragment Fi, which is not always feasible for RDF datasets since their fundamental characteristic is that they are self-describing, without requiring an explicit schema. This is the approach followed by most existing works [18, 28, 27]. For example, DARQ [27] needs a service description that describes which triple patterns can be answered at each site, based on which the SPARQL query is decomposed into several subqueries. The alternative approach is to use the "partial evaluation and assembly" strategy [17]. The basic idea of partial evaluation is the following: given a function f(s, d), where s is the known input and d is the yet unavailable input, the part of f's computation that depends only on s generates a partial answer [8]. In our setting, each site Si treats fragment Fi as the known input in the partial evaluation stage. The partial evaluation technique has been used in compiler optimization [17], querying XML trees [4], and evaluating reachability queries [8] and graph simulation [23, 9] over graphs. In this paper, we follow the partial evaluation and assembly strategy to evaluate SPARQL queries over a distributed and interconnected set of RDF repositories.
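The following toy sketch (ours, not from the paper; the function and values are purely illustrative) shows the idea of partial evaluation for a function f(s, d): the part that depends only on the known input s is computed immediately, and a residual function that waits for the unknown input d is returned. In our setting, s plays the role of the local fragment Fi and d the role of the other, not-yet-available fragments.

    # Partial evaluation of f(s, d) = work_on(s) + d, with s known and d unknown.
    def partially_evaluate(s):
        known_part = sum(s)              # computation that depends only on s
        def residual(d):                 # the rest of f, still waiting for d
            return known_part + d
        return residual

    partial_answer = partially_evaluate([1, 2, 3])   # computed at the site holding s
    print(partial_answer(10))                        # 16, once d becomes available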

Because of the interconnections between datasets, processing of queries requires special care. For example, consider the four RDF datasets at different sites in LOD shown in Figure 1 and their graph representation in Figure 2. Each entity in RDF is represented by a URI (uniform resource identifier), whose prefix denotes the location of the dataset. For example, "s1:dir1" has the prefix "s1", meaning that the entity is located at site s1.


(a) RDF repository F1

Subject    Property      Object
s1:dir1    directed      s1:fil1
s1:dir1    directed      s1:fil2
s1:dir1    hasWonPrize   s3:awa1
s1:dir1    rdfs:label    Woody Allen
s3:act2    actedIn       s1:fil5
s1:act4    rdfs:label    Hank Azaria
s1:act4    actedIn       s1:fil7
s1:fil1    rdfs:label    Sleeper
s1:fil2    rdfs:label    Small Time Crooks
s1:fil5    rdfs:label    A Very Young Lady
s1:fil7    rdfs:label    Mystery Men
s3:act3    actedIn       s1:fil2
s2:act1    actedIn       s1:fil7
s2:act1    isMarriedTo   s1:dir1

(b) RDF repository F2

Subject    Property      Object
s2:act1    actedIn       s2:fil3
s2:act1    actedIn       s2:fil4
s2:act1    actedIn       s1:fil7
s2:act1    isMarriedTo   s1:dir1
s2:act1    livesIn       s2:pla1
s2:act1    rdfs:label    Louise Lasser
s4:dir2    livesIn       s2:pla1
s2:fil3    rdfs:label    Mary Hartman
s2:fil4    rdfs:label    Slither
s2:pla1    rdfs:label    New York

(c) RDF repository F3

Subject    Property      Object
s3:act2    rdfs:label    Nancy Kelly
s3:act3    actedIn       s1:fil2
s3:act3    hasWonPrize   s3:awa1
s3:act3    rdfs:label    Hugh Grant
s3:act2    isMarriedTo   s4:dir2
s3:awa1    rdfs:label    Cesar Award
s3:act2    actedIn       s1:fil5

(d) RDF repository F4

Subject    Property      Object
s3:act2    isMarriedTo   s4:dir2
s4:dir2    directed      s4:fil6
s4:dir2    livesIn       s2:pla1
s4:dir2    rdfs:label    Edmond O'Brien
s4:fil6    rdfs:label    Man-Trap



Figure 1: Sample RDF Datasets in LOD

[Figure omitted: the integrated distributed RDF graph, i.e., the graph rendering of fragment F1 at site S1, F2 at site S2, F3 at site S3, and F4 at site S4; vertices are annotated with numeric IDs (001, 002, ...) and crossing edges connect vertices in different fragments.]

Figure 2: An Integrated Distributed Graph

There are crossing links between two datasets, identified in bold font. For example, ⟨s2:act1 isMarriedTo s1:dir1⟩ is a crossing link (a link between different datasets), which means that act1 (at site s2) is married to dir1 (at site s1).

Now consider the following SPARQL query Q that consists of five triple patterns (e.g., ?a isMarriedTo ?d) over these four interconnected RDF repositories:

SELECT ?a ?d WHERE {
    ?a isMarriedTo ?d .
    ?a actedIn ?f1 .
    ?f1 rdfs:label ?n1 .
    ?d directed ?f2 .
    ?f2 rdfs:label ?n2 .
}

Some of the triple patterns may be resolved completely within a fragment, which we call an inner match or local partial match. These inner matches can be found locally by existing centralized techniques at each site. However, if we consider the four datasets independently and ignore the crossing links, some correct answers will be missed, such as (?a=s2:act1, ?d=s1:dir1). The key issue in the distributed environment is how to find subgraph matches that cross multiple fragments; these are called crossing matches. For the query Q in Figure 3, the subgraph induced by vertices 014, 007, 001, 002, 009 and 018 is a crossing match between fragments F1 and F2 in Figure 2 (shown in the shaded vertices and red edges). This is the focus of this paper.

[Figure omitted: the query graph Q, with vertices v1 = ?a, v2 = ?d, v3 = ?f1, v4 = ?f2, v5 = ?n1, v6 = ?n2 and edges labeled isMarriedTo (v1 to v2), actedIn (v1 to v3), rdfs:label (v3 to v5), directed (v2 to v4), and rdfs:label (v4 to v6).]

Figure 3: SPARQL Query Graph Q

There are two important issues to be addressed in this framework. The first is to compute the partial evaluation results at each site given a query graph Q (i.e., the local partial match), which is intuitively the overlapping part between a crossing match and a fragment. This is discussed in Section 4. The second is the assembly of these local partial matches to compute crossing matches. We consider two different strategies: centralized assembly, where all local partial matches are sent to a single site (Section 5.2); and distributed assembly, where the local partial matches are assembled at a number of sites in parallel (Section 5.3).

This is the first work that considers the LOD repositories (or similarly distributed and autonomous RDF repositories) as a virtually integrated distributed database and proposes the partial evaluation and assembly framework for executing SPARQL queries. Within this context we make the following contributions.

1. We formally define the partial evaluation result for SPARQL queries and propose an efficient algorithm to find them. We prove theoretically the correctness and the optimality of our algorithm.

2. We propose centralized and distributed methods to assemble the local partial matches and experimentally study their performance advantages and bottlenecks in different settings.

3. We compare our method with state-of-the-art distributed RDF systems over several real and benchmark RDF repositories with billions of triples. The extensive experimental results confirm the advantages of our approach.

2. RELATED WORK

There are two threads of related work: distributed SPARQL query processing and partial evaluation.

Distributed SPARQL Query Processing. As noted above, generally speaking, there are two different approaches to distributed SPARQL query processing: federated SPARQL query processing and data partition-based query processing.

(1) Federated SPARQL Query Processing

Federated queries run SPARQL queries over multiple distributed and autonomous RDF datasets. A typical example is linked data, where different RDF repositories are interconnected, providing a virtually integrated distributed database.


A common technique is to precompute metadata for each individual RDF repository. Based on the metadata, the original SPARQL query is decomposed into several subqueries, where each subquery is sent to the RDF repositories that can answer it. The subquery results are then joined together to answer the original SPARQL query. In DARQ [27], the metadata is called a service description and describes which triple patterns (i.e., predicates) can be answered. Consider each triple pattern t in a SPARQL query Q. If t matches exactly one data source, the triple pattern is added to the subquery for that RDF data source. If t matches multiple data sources, it must be sent to all matching RDF data sources as separate subqueries. In [26, 12], the metadata is called a Q-Tree, based on which the SPARQL query is likewise decomposed into several parts.

FedX [28] follows the same approach, but it does not require preprocessing. Instead, for each triple pattern t in a SPARQL query Q, FedX sends a "SPARQL ASK" request to collect the metadata on the fly. Based on the returned results, it annotates each triple pattern with its relevant sources. Each triple pattern is a basic query unit. To improve query performance, several triple patterns in the same exclusive group [28] can be sent in a single subquery to the respective source.

The problem studied in this paper belongs to the federated SPARQL query approach, but we adopt a different strategy, namely partial evaluation and assembly. Rather than decomposing a query, we send the whole SPARQL query to each RDF data source, and the partial evaluation results from each RDF data source are merged together. The main benefits of our solution are twofold: (1) the number of remote requests is minimized. For example, FedX sends a "SPARQL ASK" request for each triple pattern t of a SPARQL query Q to each RDF data source, so the number of remote requests per data source is |E(Q)| (the number of edges in the query graph Q), whereas we send only a single SPARQL query to each RDF data source. (2) The intermediate result size is reduced: we propose several graph structure-based pruning techniques to reduce the intermediate result size. Experimental results show that our approach outperforms the existing ones. We note that our approach requires minor modifications to SPARQL endpoints to facilitate partial query evaluation.

(2) Data Partition-based Methods

A number of approaches [20, 15, 16, 35, 34, 21, 22, 11] start with a large RDF graph G (i.e., they follow what we referred to as the data warehousing approach) and partition G into several fragments. At run time, a SPARQL query Q is decomposed into several subqueries such that each subquery can be answered locally at one site. Each of these works proposes its own data partitioning strategy, and different partitioning strategies result in different processing methods. Generally speaking, there are three major partitioning strategies: table-based, graph-based, and unit-based.

Table-based methods [20, 16, 35] partition the RDF data by considering it as a set of relational tables. Jena-HBase [20] is a distributed version of Jena [32] that maps Jena tables to HBase. A query Q is also translated into a query over HBase and is executed on the mapped tables. HadoopRDF [16] and P-Partition [35] extend SW-store [1] by further partitioning the vertically partitioned tables. They also decompose the SPARQL query according to the predicates in the triple patterns.

Graph-based methods [15, 15, 33] partition the RDF graph using graph partitioning techniques. In GraphPartition [15], an RDF graph G is partitioned into n fragments, and each fragment is extended by including the N-hop neighbors of its boundary vertices. According to this partitioning strategy, the diameter of the graph corresponding to each decomposed subquery should not be larger than N to enable subquery processing at each local site. In EAGRE [15], the graph fragmentation is based on the predicates adjacent to the entities: if two entities' incident predicates are similar, they belong to one class and should be in the same fragment. At run time, the decomposition of a query graph Q always depends on the predicates incident to each variable in Q. Trinity.RDF [33] stores the whole RDF graph in a distributed graph system over a memory cloud (called Trinity [30]). Trinity.RDF uses a vertex in the RDF graph as the key and its adjacency list as the value. Since Trinity is a main-memory system, it can randomly access all neighbors of a vertex. Hence, Trinity.RDF uses a graph exploration-based method to answer SPARQL queries.

Unit-based methods [21, 22, 11] decompose the RDF data into partition units and distribute these units among different sites. Lee et al. [21, 22] define the partition unit as a vertex and its neighbors, which they call a "vertex block". The vertex blocks are distributed based on a set of heuristic rules. A query is partitioned into blocks that can be executed at all sites in parallel and without any communication. TriAD [11] distributes RDF-3x [24]. TriAD uses METIS [19] to divide the RDF graph into many partitions, where the number of partitions is much larger than the number of sites. Each resulting partition is considered a unit and distributed among the sites. Meanwhile, TriAD constructs a summary graph to maintain the partitioning information. During query processing, TriAD determines a query plan based on this summary graph.

All of the above methods require partitioning and distributing the RDF data according to the specific requirements of their approaches. In contrast, in LOD, the RDF data are already distributed over different sites. Therefore, these data partition-based solutions cannot be used for processing queries over LOD unless the LOD datasets are collected at one central site and redistributed, which is impractical.

Partial Evaluation. Partial evaluation has been used in many applications, ranging from compiler optimization to the distributed evaluation of functional programming languages [17]. Recently, partial evaluation has also been used for evaluating queries on distributed XML trees and graphs [3, 4, 5, 8]. In [3, 4, 5], partial evaluation is used to evaluate some XPath queries on distributed XML. These works serialize XPath queries into a vector of subqueries, and find the partial results of all subqueries at each site using a top-down [4] or bottom-up [3] traversal over the XML tree. Finally, all partial results are assembled at the server site to form the final results. Note that, since XML is a tree-based data structure, these works serialize XPath queries and traverse XML trees in a topological order. However, RDF data and SPARQL queries are graphs rather than trees, so serializing SPARQL queries and traversing the RDF graph in a topological order is not intuitive.

There are some prior works that consider partial evaluation on graphs. For example, Fan et al. [8] study reachability query processing over distributed graphs using the partial evaluation strategy. Partial evaluation-based graph simulation is well studied in Fan et al. [9] and Shuai et al. [23]. However, SPARQL query semantics are different (SPARQL is based on graph homomorphism [25]) and pose additional challenges. To the best of our knowledge, there is no prior work applying partial evaluation to SPARQL query processing. As discussed in [9], graph simulation defines a relation between the vertices of query graph Q (i.e., V(Q)) and those of the data graph G (i.e., V(G)), whereas graph homomorphism is a function (not a relation) from V(Q) to V(G). Thus, the solutions proposed in [9, 23] cannot be applied to the problem studied in this paper.

3. BACKGROUND


An RDF dataset can be represented as a graph where subjects and objects are vertices and triples are labeled edges.

DEFINITION 3.1. (RDF Graph) An RDF graph is denoted as G = ⟨V, E, Σ⟩, where V is a set of vertices that correspond to all subjects and objects in the RDF data; E ⊆ V × V is a set of directed edges that correspond to all triples in the RDF data; and Σ is a set of edge labels. For each edge e ∈ E, its edge label is its corresponding property.

Similarly, a SPARQL query can also be represented as a query graph Q. For simplicity, we ignore FILTER statements in the SPARQL syntax and only consider BGPs (basic graph patterns) in this paper.

DEFINITION 3.2. (SPARQL Query) A SPARQL query is denoted as Q = ⟨VQ, EQ, ΣQ⟩, where VQ ⊆ V ∪ VVar is a set of vertices (V denotes all vertices in the RDF graph G and VVar is a set of variables); EQ ⊆ VQ × VQ is a set of edges in Q; and each edge e in EQ either has an edge label in Σ (i.e., a property) or its edge label is a variable.

In this paper, we assume that Q is a connected graph; otherwise, all connected components of Q are considered separately. Answering a SPARQL query is equivalent to finding all subgraph matches (Definition 3.3) of Q over the RDF graph G.

DEFINITION 3.3. (SPARQL Match) Consider an RDF graph G and a connected query graph Q that has n vertices {v1, ..., vn}. A subgraph M with m vertices {u1, ..., um} (in G) is said to be a match of Q if and only if there exists a function f from {v1, ..., vn} to {u1, ..., um} (n ≥ m), where the following conditions hold:

1. if vi is not a variable, f(vi) and vi have the same URI or literal value (1 ≤ i ≤ n);

2. if vi is a variable, there is no constraint over f(vi) except that f(vi) ∈ {u1, ..., um};

3. if there exists an edge vi→vj in Q, there also exists an edge f(vi)→f(vj) in G; furthermore, f(vi)→f(vj) has the same property as vi→vj unless the label of vi→vj is a variable.

Vector [f(v1), ..., f(vn)] is a serialization of a SPARQL match. Note that we allow f(vi) = f(vj) for 1 ≤ i ≠ j ≤ n. In other words, a match of a SPARQL query Q defines a graph homomorphism.
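To make Definitions 3.1-3.3 concrete, the following sketch (our own illustration, not the authors' code; the helper names and the tiny graphs are ours, built from the running example) represents an RDF graph as a set of labeled directed edges and checks whether a given function f satisfies the conditions of Definition 3.3.

    # An RDF graph is a set of triples (labeled directed edges); a query graph
    # may use variables (strings starting with '?') as vertices or edge labels.
    def is_variable(x):
        return isinstance(x, str) and x.startswith("?")

    G = {("s2:act1", "isMarriedTo", "s1:dir1"),
         ("s2:act1", "actedIn", "s2:fil3"),
         ("s2:fil3", "rdfs:label", "Mary Hartman"),
         ("s1:dir1", "directed", "s1:fil1"),
         ("s1:fil1", "rdfs:label", "Sleeper")}

    Q = {("?a", "isMarriedTo", "?d"), ("?a", "actedIn", "?f1"),
         ("?f1", "rdfs:label", "?n1"), ("?d", "directed", "?f2"),
         ("?f2", "rdfs:label", "?n2")}

    def is_match(f, Q, G):
        """Check the conditions of Definition 3.3 for a total function f."""
        for (vi, p, vj) in Q:
            # Condition 1: a constant vertex must be mapped to itself.
            for v in (vi, vj):
                if not is_variable(v) and f.get(v, v) != v:
                    return False
            fi, fj = f.get(vi, vi), f.get(vj, vj)
            # Condition 3: every query edge needs a counterpart edge in G with
            # the same property (unless the query edge label is a variable).
            if not any(s == fi and o == fj and (is_variable(p) or pp == p)
                       for (s, pp, o) in G):
                return False
        return True

    f = {"?a": "s2:act1", "?d": "s1:dir1", "?f1": "s2:fil3",
         "?n1": "Mary Hartman", "?f2": "s1:fil1", "?n2": "Sleeper"}
    print(is_match(f, Q, G))   # True: f is a graph homomorphism from Q to G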

In the context of this paper, the RDF graph G is distributed over different sites. We call G an integrated distributed RDF graph. Each constituent RDF repository is called a fragment of G. For simplicity, we assume that each entity resides at a single site, i.e., we ignore replication (see condition (1) of Definition 3.4), but handling the more general case is not complicated.

DEFINITION 3.4. (Integrated Distributed RDF Graph) Given k interconnected RDF repositories, each of them represented as an RDF graph Fi (i = 1, ..., k), the integrated distributed RDF graph G = ⟨V, E, Σ⟩ consists of the set {F1, F2, ..., Fk}. Each Fi is called a fragment of G and is specified by (Vi ∪ Vi^e, Ei ∪ Ei^c, Σi) (i = 1, ..., k) such that

1. {V1, ..., Vk} is a partitioning of V, i.e., Vi ∩ Vj = ∅ for 1 ≤ i ≠ j ≤ k and ⋃_{i=1,...,k} Vi = V;

2. Ei ⊆ Vi × Vi, i = 1, ..., k;

3. Ei^c is the set of crossing edges between Fi and the other fragments, i.e.,
   Ei^c = (⋃_{1≤j≤k, j≠i} {u→u′ | u ∈ Fi ∧ u′ ∈ Fj ∧ u→u′ ∈ E}) ∪ (⋃_{1≤j≤k, j≠i} {u′→u | u ∈ Fi ∧ u′ ∈ Fj ∧ u′→u ∈ E});

4. a vertex u′ ∈ Vi^e if and only if u′ resides in another fragment Fj and u′ is an endpoint of a crossing edge between Fi and Fj (Fi ≠ Fj), i.e.,
   Vi^e = (⋃_{1≤j≤k, j≠i} {u′ | u→u′ ∈ Ei^c ∧ u ∈ Fi}) ∪ (⋃_{1≤j≤k, j≠i} {u′ | u′→u ∈ Ei^c ∧ u ∈ Fi});

5. vertices in Vi^e are called extended vertices of Fi, and all vertices in Vi are called internal vertices of Fi;

6. Σi is the set of edge labels in Fi.

EXAMPLE 1. Given the four interconnected RDF repositories in Figure 1, the integrated RDF graph G consists of the four constituent RDF fragment graphs F1, F2, F3 and F4 in Figure 2. The numbers beside the vertices are vertex IDs that are introduced for ease of presentation. In Figure 2, 002→001 is a crossing edge between F1 and F2. As well, edges 004→011, 001→012 and 006→008 are crossing edges between F1 and F3. Hence, V1^e = {002, 006, 012, 004} and E1^c = {002→001, 004→011, 001→012, 006→008}.
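A sketch of Definition 3.4 (our own illustration, not the authors' code): given the global triple set and a mapping that assigns every vertex to its home site, the function below builds, for each fragment Fi, its internal vertices Vi, extended vertices Vi^e, internal edges Ei, and crossing edges Ei^c.

    from collections import defaultdict

    def build_fragments(triples, site_of):
        # site_of maps every vertex (including literals) to the site that owns it
        frags = defaultdict(lambda: {"V": set(), "Ve": set(), "E": set(), "Ec": set()})
        for s, p, o in triples:
            si, sj = site_of[s], site_of[o]
            frags[si]["V"].add(s)
            frags[sj]["V"].add(o)
            if si == sj:                      # an internal edge of one fragment
                frags[si]["E"].add((s, p, o))
            else:                             # a crossing edge, kept by both fragments
                frags[si]["Ec"].add((s, p, o))
                frags[sj]["Ec"].add((s, p, o))
                frags[si]["Ve"].add(o)        # o is an extended vertex of Fi
                frags[sj]["Ve"].add(s)        # s is an extended vertex of Fj
        return frags

    site_of = {"s2:act1": "s2", "s1:dir1": "s1", "s1:fil7": "s1", "s1:act4": "s1"}
    frags = build_fragments({("s2:act1", "isMarriedTo", "s1:dir1"),
                             ("s2:act1", "actedIn", "s1:fil7"),
                             ("s1:act4", "actedIn", "s1:fil7")}, site_of)
    print(sorted(frags["s1"]["Ve"]))          # ['s2:act1'], an extended vertex of F1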

DEFINITION 3.5. (Problem Statement) Given an integrated distributed RDF graph G that consists of k fragments F1, F2, ..., Fk, each located at a different computing node, and a SPARQL query graph Q, our goal is to find all SPARQL matches of Q in G.

Inner matches can be computed locally by a centralized RDF triple store, such as RDF-3x [24], SW-store [1] or gStore [36]. The main issue in answering SPARQL queries over a distributed RDF graph is finding crossing matches efficiently. That is the focus of this paper.

4. PARTIAL EVALUATION

We first discuss what a local partial match is (Section 4.1) and then discuss how to compute it efficiently (Section 4.2).

4.1 Local Partial Match—Definition

Recall that each site Si receives the full query graph Q (i.e., there is no query decomposition). In order to answer a SPARQL query Q, each site Si computes the partial answers (called local partial matches) based on the known input Fi. Intuitively, a local partial match PMi (in fragment Fi) is an overlapping part between an unknown crossing match M and fragment Fi. "Unknown" means that we do not know M at the partial evaluation stage. Moreover, the unknown crossing match M may or may not exist, depending on the yet unavailable input d, i.e., the other fragments of G. Based on the known input alone, i.e., fragment Fi, we cannot judge whether or not M exists. For example, the subgraph induced by vertices 014, 007, 001 and 002 (shown in shaded vertices and red edges) in Figure 2 is a local partial match between M and F1.

DEFINITION 4.1. (Local Partial Match) Given a SPARQL query graph Q with n vertices {v1, ..., vn} and a connected subgraph PM with m vertices {u1, ..., um} (m ≤ n) in a fragment Fk = (Vk ∪ Vk^e, Ek ∪ Ek^c, Σk), PM is a local partial match in fragment Fk if and only if there exists a function f: {v1, ..., vn} → {u1, ..., um} ∪ {NULL}, where the following conditions hold:

1. If vi is not a variable, f(vi) and vi have the same URI or literal, or f(vi) = NULL.

2. If vi is a variable, f(vi) ∈ {u1, ..., um} or f(vi) = NULL.

3. If there exists an edge vi→vj in Q (1 ≤ i ≠ j ≤ n), there also exists an edge f(vi)→f(vj) in PM, unless f(vi) and f(vj) are both in Vk^e, or f(vi) = NULL ∨ f(vj) = NULL. Furthermore, f(vi)→f(vj) has the same property as vi→vj unless the label of vi→vj is a variable.

4. PM contains at least one crossing edge.

5. If f(vi) ∈ Vk (i.e., f(vi) is an internal vertex in Fk) and there exists an edge vi→vj ∈ Q (or vj→vi ∈ Q), then f(vj) ≠ NULL and f(vi)→f(vj) ∈ PM (or f(vj)→f(vi) ∈ PM). Furthermore, if vi→vj (or vj→vi) has a property p, then f(vi)→f(vj) (or f(vj)→f(vi)) has the same property p.

6. Any two vertices vi and vj (in query Q) whose images f(vi) and f(vj) are both internal vertices in PM are weakly connected (see Definition 4.2) in Q.

Vector [f(v1), ..., f(vn)] is a serialization of a local partial match.

EXAMPLE 2. Given the SPARQL query Q with six vertices in Figure 3, the subgraph induced by vertices 002, 001, 007 and 014 (shown in shaded circles and red edges) is a local partial match of Q in fragment F1. The function is {(v1, 002), (v2, 001), (v3, NULL), (v4, 007), (v5, NULL), (v6, 014)}. Actually, there are five different local partial matches in F1; we show them in Figure 4.

Definition 4.1 formally defines a local partial match, which is a subset of a complete SPARQL match. Therefore, some conditions in Definition 4.1 are analogous to those of a SPARQL match, with some subtle differences. In Definition 4.1, some vertices of query Q are not matched in a local partial match; they are allowed to match a special value NULL. For example, v3 and v5 in Example 2 are matched to NULL. As mentioned earlier, a local partial match is the overlapping part of an unknown crossing match and a fragment Fi. Therefore, it must have a crossing edge, i.e., Condition 4.

The basic intuition of Condition 5 is that if a vertex vi (in query Q) is matched to an internal vertex, all of vi's neighbors should be matched in this local partial match as well. The following example illustrates the intuition.

EXAMPLE 3. Let us recall the local partial match PM_1^2 of fragment F1 in Figure 4. An internal vertex 001 in fragment F1 is matched to vertex v2 in query Q. Assume that PM is an overlapping part between a crossing match M and fragment F1. Obviously, v2's neighbors, such as v1 and v4, should also be matched in M. Furthermore, the matching vertices should be 001's neighbors. Since 001 is an internal vertex in F1, 001's neighbors are also in fragment F1.

According to the above analysis, if a subgraph PM violates Condition 5, PM cannot be a subgraph of a crossing match. In other words, we are not interested in such subgraphs when finding local partial matches, since they do not contribute to any crossing match.

DEFINITION 4.2. Two vertices are weakly connected in a directed graph if and only if there exists a path between the two vertices when all directed edges are replaced with undirected edges. The path is called a weakly connected path between the two vertices.

Condition 6 will be used to prove the correctness of our algorithm in Theorems 4.1 and 4.2, which we discuss shortly. The following example shows all local partial matches in the running example.

EXAMPLE 4. Given the query Q in Figure 3 and the RDF graph G in Figure 2, Figure 4 shows all local partial matches and their serialization vectors in each fragment. A local partial match in fragment Fi is denoted as PM_i^j, where the superscript distinguishes different local partial matches in the same fragment. Furthermore, we underline all extended vertices in the serialization vectors.

The correctness of our method rests on the following properties, stated in the theorems below (the proofs of all theorems are in the appendix).

1. The overlapping part between any crossing match M and fragment Fi (i = 1, ..., k) must be a local partial match.

2. Missing any local partial match may lead to missing results. Thus, our algorithm should find all local partial matches in each fragment.

3. It is impossible to find two local partial matches M and M′ in a fragment F where M′ is a subgraph of M, i.e., each local partial match is maximal.

THEOREM 4.1. Given any crossing match M of a SPARQL query Q in an RDF graph G, if M overlaps with some fragment Fi (i = 1, ..., k), the overlapping part between M and fragment Fi, denoted as PM, must be a local partial match in fragment Fi.

Let us recall Example 4. There are some local partial matches that do not contribute to any crossing match, such as PM_1^5 in Figure 4. We call these local partial matches false positives. However, the partial evaluation stage depends only on the known input. If we do not know the structures of the other fragments, we cannot judge whether or not PM_1^5 is a false positive. Formally, we have the following theorem, which means that we have to find all local partial matches in each fragment Fi in the partial evaluation stage.

THEOREM 4.2. The results generated by the partial evaluation and assembly algorithm satisfy the no-false-negative requirement if and only if all local partial matches in each fragment are found in the partial evaluation stage.

Finally, we discuss another feature of a local partial match PMi in fragment Fi: PMi cannot be enlarged by introducing more vertices or edges to become a larger local partial match. The following theorem formalizes this.

THEOREM 4.3. Given a query graph Q and an RDF graph G, if PMi is a local partial match under function f in fragment Fi = (Vi ∪ Vi^e, Ei ∪ Ei^c), there exists no local partial match PM′i under a function f′ in Fi where f ⊂ f′.

4.2 Computing Local Partial Matches

Given a SPARQL query Q = (VQ, EQ) and a fragment Fi = (Vi ∪ Vi^e, Ei ∪ Ei^c), the goal of partial evaluation is to find all local partial matches (according to Definition 4.1) in Fi. The matching process consists of determining a function f that associates vertices of Q with vertices of Fi. The matches are expressed as a set of pairs (v, u) (v ∈ Q and u ∈ Fi). A pair (v, u) represents the matching of a vertex v of query Q with a vertex u of fragment Fi. The set of vertex pairs (v, u) constitutes the function f referred to in Definition 4.1.

A high-level description of finding local partial matches is outlined in Algorithm 1 and Function ComParMatch. According to Conditions 1 and 2 of Definition 4.1, each vertex v in query graph Q has a candidate list of vertices in fragment Fi. Since the function f is a set of vertex pairs (v, u) (v ∈ Q and u ∈ Fi), we start with an empty set. In each step, we introduce a candidate vertex pair (v, u) to expand the current function f, where vertex u (in fragment Fi) is a candidate of vertex v (in query Q).


[Figure omitted: the local partial matches of Q in each fragment, shown as subgraphs together with their serialization vectors (five local partial matches in F1, three in F2, and one each in F3 and F4); extended vertices are underlined in the serialization vectors.]

Figure 4: Local Partial Matches of Q in Each Fragment


Assume that we introduce a new candidate vertex pair (v′, u′) into the current function f to form another function f′. If f′ violates any condition of Definition 4.1 except Conditions 4 and 5, the new function f′ cannot lead to a local partial match (Lines 6-7 in Function ComParMatch). If f′ satisfies all conditions except Conditions 4 and 5, f′ can be further expanded (Lines 8-9 in Function ComParMatch). If f′ satisfies all conditions, then f′ specifies a local partial match and is reported (Lines 10-11 in Function ComParMatch).

Algorithm 1: Computing Local Partial Matches
Input: A fragment Fi and a query graph Q.
Output: The set of all local maximal partial matches in Fi, denoted as Ω(Fi).
1   Select one vertex v in Q
2   for each candidate vertex u with regard to v do
3       Initialize a function f with (v, u)
4       Call Function ComParMatch(f)
5   Return Ω(Fi)

Function ComParMatch(f)
1   if all vertices of query Q have been matched in the function f then
2       Return
3   Select an unmatched vertex v′ adjacent to a matched vertex v in f
4   for each candidate vertex u′ with regard to v′ do
5       f′ ← f ∪ {(v′, u′)}
6       if f′ violates any condition (except Conditions 4 and 5) of Definition 4.1 then
7           Continue
8       if f′ satisfies all conditions (except Conditions 4 and 5) of Definition 4.1 then
9           ComParMatch(f′)
10      if f′ satisfies all conditions of Definition 4.1 then
11          f′ specifies a local partial match PM that is inserted into the answer set Ω(Fi)

As shown in Algorithm 1, in each step we introduce a new candidate vertex pair (v′, u′), where v′ is a vertex in query graph Q. The order in which query vertices are selected can be arbitrary; however, QuickSI [29] proposes several heuristic rules to select an optimized order that speeds up the matching process. These rules are also used in our experiments.
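The following much-simplified sketch (ours, not the authors' code) shows the backtracking expansion that Section 4.2 describes: grow a function f as a set of pairs (v, u) while keeping it edge-consistent. As written it enumerates matches that lie entirely inside one fragment (inner matches); Algorithm 1 extends this skeleton by letting query vertices map to NULL and by enforcing Conditions 4-6 of Definition 4.1, so that local partial matches are reported instead.

    def edge_consistent(f, Q_edges, F_edges):
        # every query edge whose endpoints are both matched must exist in F
        return all((f[v1], p, f[v2]) in F_edges
                   for (v1, p, v2) in Q_edges if v1 in f and v2 in f)

    def expand(f, Q_edges, F_vertices, F_edges, out):
        q_vertices = {v for (a, _, b) in Q_edges for v in (a, b)}
        unmatched = [v for v in q_vertices if v not in f]
        if not unmatched:
            out.append(dict(f))                     # all query vertices matched
            return
        # pick an unmatched query vertex adjacent to a matched one, if any
        v_new = next((v for (a, _, b) in Q_edges for v in (a, b)
                      if v not in f and (a in f or b in f)), unmatched[0])
        for u in F_vertices:                        # try the candidate pair (v_new, u)
            f[v_new] = u
            if edge_consistent(f, Q_edges, F_edges):
                expand(f, Q_edges, F_vertices, F_edges, out)
            del f[v_new]

    F_edges = {("s1:dir1", "directed", "s1:fil1"),
               ("s1:fil1", "rdfs:label", "Sleeper")}
    F_vertices = {x for (s, _, o) in F_edges for x in (s, o)}
    Q_edges = [("?d", "directed", "?f2"), ("?f2", "rdfs:label", "?n2")]
    out = []
    expand({}, Q_edges, F_vertices, F_edges, out)
    print(out)   # [{'?d': 's1:dir1', '?f2': 's1:fil1', '?n2': 'Sleeper'}] (key order may vary)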

5. ASSEMBLY

Using Algorithm 1, each site Si finds all local partial matches in fragment Fi. The next step is to assemble partial matches to form crossing matches and compute the final result. We propose two assembly strategies: centralized and distributed (or parallel). In centralized assembly, all local partial matches are sent to a single site for assembly. For example, in a client/server system, all local partial matches may be sent to the server. In distributed (parallel) assembly, local partial matches are combined at a number of sites in parallel. In Section 5.1, we define a basic join operator for assembly. Then, we propose a centralized assembly algorithm in Section 5.2 using the join operator. In Section 5.3, we study how to assemble local partial matches in a distributed manner.

5.1 Join-based Assembly

We first define the conditions under which two partial matches are joinable. Obviously, crossing matches can only be formed by assembling partial matches from different fragments. If local partial matches from the same fragment could be assembled, this would result in a larger local partial match in the same fragment, which is contrary to Theorem 4.3.

DEFINITION 5.1. (Joinable) Given a query graph Q and two fragments Fi and Fj (i ≠ j), let PMi and PMj be two local partial matches over fragments Fi and Fj under functions fi and fj, respectively. PMi and PMj are joinable if and only if the following conditions hold:

1. There exist no internal vertices u and u′ in PMi and PMj, respectively, such that fi⁻¹(u) = fj⁻¹(u′).

2. There exists at least one crossing edge u→u′ such that u is an internal vertex and u′ is an extended vertex in Fi, while u is an extended vertex and u′ is an internal vertex in Fj. Furthermore, fi⁻¹(u) = fj⁻¹(u) and fi⁻¹(u′) = fj⁻¹(u′).

The first condition says that no two distinct vertices in the separate partial matches match the same query vertex. The second condition says that the two local partial matches share at least one common crossing edge that corresponds to the same query edge.


EXAMPLE 5. Let us recall the two local partial matches PM_1^2 and PM_2^2 in Figure 4. There do not exist two different vertices in the two local partial matches that match the same query vertex. Furthermore, they share a common crossing edge 002→001, where 002 and 001 match query vertices v1 and v2, respectively, in both local partial matches. Hence, they are joinable.

The join result of two joinable local partial matches is defined as follows.

DEFINITION 5.2. (Join Result) Given a query graph Q and two fragments Fi and Fj, i ≠ j, let PMi and PMj be two joinable local partial matches of Q over fragments Fi and Fj under functions fi and fj, respectively. The join of PMi and PMj is defined under a new function f (denoted as PM = PMi ⋈_f PMj), which is defined as follows for any vertex v in Q, where fj(v) = NULL means that vertex v is not matched in PMj (see Condition 2 of Definition 4.1) and "←" denotes assignment:

1. if fi(v) ≠ NULL ∧ fj(v) = NULL, f(v) ← fi(v);

2. if fi(v) = NULL ∧ fj(v) ≠ NULL, f(v) ← fj(v);

3. if fi(v) ≠ NULL ∧ fj(v) ≠ NULL, f(v) ← fi(v) (in this case, fi(v) = fj(v));

4. if fi(v) = NULL ∧ fj(v) = NULL, f(v) ← NULL.
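The sketch below (ours, not the authors' code) illustrates Definitions 5.1 and 5.2. A local partial match is kept as a dict from matched query vertices to data vertices (unmatched vertices, i.e. NULL, are simply absent), together with the set of data vertices that are internal to its fragment; the example values follow the running example.

    def joinable(fi, internal_i, fj, internal_j, crossing_edges):
        # Condition 1 of Def. 5.1: no query vertex is matched to an internal
        # vertex in both fragments.
        for v in fi.keys() & fj.keys():
            if fi[v] in internal_i and fj[v] in internal_j:
                return False
        # Condition 2: a shared crossing edge, internal on one side and
        # internal on the other, matched by the same query vertices.
        inv_i = {u: v for v, u in fi.items()}   # assumes fi is injective here
        inv_j = {u: v for v, u in fj.items()}
        for (x, _, y) in crossing_edges:
            for a, b in ((x, y), (y, x)):
                if (a in internal_i and b in internal_j
                        and inv_i.get(a) is not None and inv_i.get(a) == inv_j.get(a)
                        and inv_j.get(b) is not None and inv_j.get(b) == inv_i.get(b)):
                    return True
        return False

    def join(fi, fj):
        # Definition 5.2: take every assignment from either side; whenever a
        # query vertex is matched on both sides the values agree (case 3).
        f = dict(fj)
        f.update(fi)
        return f

    # The two sides (in F1 and F2) of the crossing match of the running example:
    f1 = {"v1": "002", "v2": "001", "v4": "007", "v6": "014"}   # found in F1
    f2 = {"v1": "002", "v2": "001", "v3": "009", "v5": "018"}   # found in F2
    internal_F1, internal_F2 = {"001", "007", "014"}, {"002", "009", "018"}
    crossing = {("002", "isMarriedTo", "001")}
    if joinable(f1, internal_F1, f2, internal_F2, crossing):
        print(join(f1, f2))   # the crossing match over vertices 001,002,007,009,014,018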

5.2 Centralized Assembly

In centralized assembly, all local partial matches are sent to a final assembly site. We propose an iterative join algorithm (Algorithm 2) to find all crossing matches. In each iteration, a pair of local partial matches is joined. When the join is complete (i.e., a match has been found), the result is returned (Lines 12-13 in Algorithm 2); otherwise, it is joined with other local partial matches in the next iteration (Lines 14-15). There are |V(Q)| iterations of Lines 4-16 in the worst case, since in the worst case each iteration introduces only a single new matching vertex and Q has |V(Q)| vertices. If no new intermediate results are generated in some iteration, the algorithm can stop early (Lines 5-6).

5.2.1 Partitioning-based Join Processing

The join space in Algorithm 2 is large, since we need to check whether every pair of local partial matches PMi and PMj is joinable. This subsection proposes an optimized technique to reduce the join space.

The intuition of our method is as follows. We partition all local partial matches such that two local partial matches in the same partition cannot be joinable; we only consider joining local partial matches from different partitions. The following theorem specifies which local partial matches can be put in the same partition.

THEOREM 5.1. Given two local partial matches PMi and PMj from fragments Fi and Fj with functions fi and fj, respectively, if there exists a query vertex v where both fi(v) and fj(v) are internal vertices of fragments Fi and Fj, respectively, then PMi and PMj are not joinable.

EXAMPLE 6. Figure 5 shows the serialization vectors (defined in Definition 4.1) of four local partial matches. For each local partial match, there is an internal vertex that matches v1 in the query graph. The underlines indicate the extended vertices in the local partial matches. According to Theorem 5.1, none of them are joinable.

Algorithm 2: Centralized Join-based Assembly
Input: Ω(Fi), i.e., the set of local partial matches in each fragment Fi, i = 1, ..., k
Output: The set RS of all crossing matches.
1   Each fragment Fi sends its set of local partial matches Ω(Fi) to a single site for assembly
2   Let Ω ← ⋃_{i=1,...,k} Ω(Fi)
3   Set MS ← Ω
4   for N ← 1 to |V(Q)| do
5       if |MS| = 0 then
6           Break
7       Set MS′ ← ∅
8       for each local partial match PM in MS do
9           for each local partial match PM′ in Ω do
10              if PM and PM′ are joinable then
11                  Set PM′′ ← PM ⋈ PM′
12                  if PM′′ is a complete match of Q then
13                      put PM′′ into RS
14                  else
15                      put PM′′ into MS′
16      MS ← MS′
17  Return RS

[Figure omitted: the serialization vectors of four local partial matches, each of which has an internal vertex that matches v1.]

Figure 5: Partitioning Local Partial Matches on v1

DEFINITION 5.3. (Local Partial Match Partitioning) Consider a SPARQL query Q with n vertices {v1, ..., vn}. Let Ω denote all local partial matches. P = {Pv1, ..., Pvn} is a partitioning of Ω if and only if the following conditions hold: (1) each partition Pvi (i = 1, ..., n) consists of a set of local partial matches, each of which has an internal vertex that matches vi; (2) Pvi ∩ Pvj = ∅, where 1 ≤ i ≠ j ≤ n; (3) Pv1 ∪ ... ∪ Pvn = Ω.

EXAMPLE 7. Let us consider all local partial matches of our running example in Figure 4. Figure 6 shows an example partitioning.

As mentioned earlier, we only need to consider joining local partial matches from different partitions. Given a partitioning P = {Pv1, ..., Pvn}, Algorithm 3 shows how to perform the partitioning-based join of local partial matches. Note that different partitionings and different join orders within a partitioning affect the performance of Algorithm 3. In Algorithm 3, we assume that the partitioning P = {Pv1, ..., Pvn} is given and that the join order is from Pv1 to Pvn, i.e., the order in P. Choosing a good partitioning and the optimal join order is discussed in Sections 5.2.2 and 5.2.3.

The basic idea of Algorithm 3 is to iterate the join process over each partition. First, we set MS ← Pv1 (Line 1 in Algorithm 3). Then, we try to join local partial matches PM in MS with local partial matches PM′ in Pv2 (the first iteration of Lines 3-13). If the join result is a complete match, it is inserted into the answer set RS (Lines 8-9). If the join result is an intermediate result, we insert it into a temporary set MS′ (Lines 10-11). We also need to insert PM′ into MS′, since the local partial match PM′ (in Pv2) will join local partial matches in later partitions (Line 12). At the end of the iteration, we insert all intermediate results (in MS′) into MS, which will join local partial matches in later partitions in the next iterations (Line 13). We iterate the above steps for each partition in the partitioning (Lines 3-13).


[Figure omitted: an example partitioning P = {Pv1, ..., Pv6} of all local partial matches in Figure 4.]

Figure 6: An Example Partitioning


Algorithm 3: Partitioning-based Joining of Local Partial Matches
Input: A partitioning P = {Pv1, ..., Pvn} of all local partial matches.
Output: The set RS of all crossing matches.
1   MS ← Pv1
2   for i ← 2 to n do
3       MS′ ← ∅
4       for each partial match PM in MS do
5           for each partial match PM′ in Pvi do
6               if PM and PM′ are joinable then
7                   Set PM′′ ← PM ⋈ PM′
8                   if PM′′ is a complete match then
9                       Put PM′′ into the answer set RS
10                  else
11                      Put PM′′ into MS′
12              Put PM′ into MS′
13      Insert MS′ into MS
14  Return RS
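A sketch (ours, not the authors' code) of the control flow of Algorithm 3: iterate over the partitions, joining the intermediate results with the local partial matches of the next partition. The joinable and join arguments stand for the checks of Definitions 5.1 and 5.2 (see the sketch after Definition 5.2); the toy stand-ins below are simplified placeholders used only to make the example runnable.

    def partitioned_assembly(partitions, query_vertices, joinable, join):
        RS = []                                    # complete (crossing) matches
        MS = list(partitions[0])                   # Line 1: MS <- Pv1
        for P in partitions[1:]:                   # Line 2: for i <- 2 to n
            MS_new = []
            for pm in MS:
                for pm2 in P:
                    if joinable(pm, pm2):
                        merged = join(pm, pm2)
                        if set(merged) == set(query_vertices):
                            RS.append(merged)      # complete match (Lines 8-9)
                        else:
                            MS_new.append(merged)  # intermediate result (Lines 10-11)
            MS_new.extend(P)                       # Line 12: keep Pvi's matches
            MS.extend(MS_new)                      # Line 13
        return RS

    # Toy stand-ins, NOT the paper's joinability test:
    toy_joinable = lambda a, b: bool(a.keys() & b.keys()) and \
        all(a[k] == b[k] for k in a.keys() & b.keys())
    toy_join = lambda a, b: {**a, **b}
    P = [[{"v1": "002", "v2": "001"}], [{"v2": "001", "v3": "009"}]]
    print(partitioned_assembly(P, ["v1", "v2", "v3"], toy_joinable, toy_join))
    # [{'v1': '002', 'v2': '001', 'v3': '009'}]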

5.2.2 Finding the Optimal Partitioning

Obviously, different partitionings lead to different join costs. Thus, we define a cost-based "best" partitioning in this section.

DEFINITION 5.4. (Join Cost) Given a query graph Q with n vertices {v1, ..., vn} and a partitioning P = {Pv1, ..., Pvn} over all local partial matches Ω, the join cost is defined as

    Cost(Ω) = O( ∏_{i=1}^{n} (|Pvi| + 1) )    (1)

where |Pvi| is the number of local partial matches in Pvi and 1 is introduced to avoid a "0" factor in the product.

Definition 5.4 assumes that each pair of local partial matches (from different partitions) is joinable, which is the worst case, so that we can quantify the worst-case performance. For example, the cost of the partitioning in Figure 6 is 6 × 3 × 4 = 72. Of course, more sophisticated and more realistic cost functions could be used instead, but determining the most appropriate cost function is a major research issue in itself and outside the scope of this paper. We leave it as an interesting direction for future research.

The key issue is how to find the optimal partitioning (i.e., the partitioning with the minimal join cost). A naive enumeration method has O(n^{|Ω|}) time complexity, where n is the number of vertices in query Q and Ω denotes all local partial matches. Obviously, the naive enumeration is not efficient, since |Ω| is very large in practice. Therefore, we need a better solution for finding the optimal partitioning.

In fact, the 0-1 integer programming problem (a classical NP-complete problem) can be reduced to finding the optimal partitioning. Thus, finding the optimal partitioning is an NP-complete problem.

THEOREM 5.2. Finding the optimal partitioning (i.e., the partitioning whose join cost, as defined in Definition 5.4, is minimal) is an NP-complete problem.

Even though finding the optimal partitioning is NP-complete, we can still design a much faster algorithm with time complexity O(2^n × |Ω|). Note that this is a fixed-parameter tractable algorithm (an algorithm is fixed-parameter tractable for a problem of size l, with respect to a parameter n, if it can be solved in time O(f(n)·g(l)), where f(n) can be any function but g(l) must be polynomial [7]), since n is small in practice. For example, n is no larger than 10 in most benchmark SPARQL queries (such as the LUBM benchmark [10]).

Our algorithm is based on the following characteristic of the optimal partitioning. Consider a query graph Q with n vertices v1, ..., vn. Let Uvi (i = 1, ..., n) denote all local partial matches (in Ω) that have internal vertices matching vi. Unlike the partitioning defined in Definition 5.3, Uvi and Uvj (1 ≤ i ≠ j ≤ n) may overlap. For example, PM_2^3 (in Figure 4) contains an internal vertex 002 that matches v1; thus, PM_2^3 is in Uv1. PM_2^3 also has an internal vertex 010 that matches v3; thus, PM_2^3 is also in Uv3. However, the partitioning defined in Definition 5.3 does not allow overlapping partitions.

If a partitioning Popt = {Pv1, ..., Pvn} is optimal, the following theorem holds.

THEOREM 5.3. Given a query graph Q with n vertices v1,...,vn and the set Ω of all local partial matches, let Uvi (i = 1, ..., n) be all local partial matches (in Ω) that have internal vertices matching vi. For the optimal partitioning Popt = Pv1, ..., Pvn, where Pvn has the largest partition size (i.e., the number of local partial matches in Pvn is maximum) in Popt, Pvn = Uvn.

Based on Theorem 5.3, we propose a dynamic programming algorithm. The intuition of the algorithm is as follows. Assume that the optimal partitioning is Popt = Pv1, Pv2, ..., Pvn. We re-order the partitions in Popt in non-increasing order of size, i.e., Popt = Pvk1, ..., Pvkn with |Pvk1| ≥ |Pvk2| ≥ ... ≥ |Pvkn|. According to Theorem 5.3, Pvk1 = Uvk1. Let us denote by Ωvk1 all local partial matches (in Ω) that do not contain internal vertices matching vk1, i.e., Ωvk1 = Ω − Uvk1. It is straightforward to obtain the following optimal substructure^6 in Equation 2.

Cost(\Omega)_{opt} = |P_{v_{k_1}}| \times Cost(\Omega_{v_{k_1}})_{opt} = |U_{v_{k_1}}| \times Cost(\Omega_{v_{k_1}})_{opt}    (2)

Since we do not know which vertex is vk1, we introduce the following optimal structure that is used in our dynamic programming algorithm.

Cost(\Omega)_{opt} = \min_{1 \le i \le n} \big( |P_{v_i}| \times Cost(\Omega_{v_i})_{opt} \big) = \min_{1 \le i \le n} \big( |U_{v_i}| \times Cost(\Omega_{v_i})_{opt} \big)    (3)

Obviously, it is easy to design a naive dynamic programming algorithm based on Equation 3. However, it can be further optimized by recording some intermediate results.

5 An algorithm is called fixed-parameter tractable for a problem of size l, with respect to a parameter n, if it can be solved in time O(f(n)g(l)), where f(n) can be any function but g(l) must be polynomial [7].
6 A problem is said to have optimal substructure if an optimal solution can be constructed efficiently from optimal solutions of its subproblems [6]. This property is often used in dynamic programming formulations.


Based on Equation 3, we can prove the following equation.

Cost(\Omega)_{opt} = \min_{1 \le i, j \le n,\ i \ne j} \big( |P_{v_i}| \times |P_{v_j}| \times Cost(\Omega_{v_i v_j})_{opt} \big) = \min_{1 \le i, j \le n,\ i \ne j} \big( |U_{v_i}| \times |U'_{v_j}| \times Cost(\Omega_{v_i v_j})_{opt} \big)    (4)

where Ωvivj denotes all local partial matches that do not contain internal vertices matching vi or vj, and U′vj denotes all local partial matches (in Ωvi) that contain internal vertices matching vertex vj.

However, if Equation 4 is used naively in the dynamic programming formulation, it results in repeated computations. For example, Cost(Ωv1v2)opt would be computed twice, in both |Uv1| × |U′v2| × Cost(Ωv1v2)opt and |Uv2| × |U′v1| × Cost(Ωv1v2)opt. To avoid this, we introduce a map that records the values Cost(Ω′) that have already been calculated (Line 16 in Function OptComCost), so that subsequent uses of Cost(Ω′) can be serviced directly by searching the map (Lines 8-10 in Function ComCost).

We can prove that there are at most 2^n items in the map (worst case), where n = |V(Q)|. Thus, the time complexity of the algorithm is O(2^n × |Ω|). Although the time complexity is still exponential, n (i.e., |V(Q)|) is small in practice. For example, n is no larger than 10 in most benchmark queries (such as the LUBM benchmarks [10]).
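The memoized recursion can be sketched in C++ as follows; this is an illustration under simplifying assumptions, not the OptComCost/ComCost routines themselves. Each local partial match is summarized only by the bitmask of query vertices that it matches with internal vertices, the recursion state is the set of query vertices already peeled off, and, following Equations 2-3, empty partitions contribute a factor of 1 and are skipped.

```cpp
// Memoized dynamic program over subsets of query vertices (sketch).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

using Mask = uint32_t;   // works for n = |V(Q)| <= 32; n <= 10 in practice

// masks[j] = set of query vertices that local partial match j matches with
// internal vertices; `removed` = query vertices already assigned a partition.
uint64_t optimalCost(int n, const std::vector<Mask>& masks, Mask removed,
                     std::unordered_map<Mask, uint64_t>& memo) {
    auto it = memo.find(removed);
    if (it != memo.end()) return it->second;          // reuse recorded Cost(Omega')

    std::vector<uint64_t> cnt(n, 0);                  // |U_{v_i}| within remaining matches
    uint64_t remaining = 0;
    for (Mask m : masks) {
        if (m & removed) continue;                    // match already covered earlier
        ++remaining;
        for (int i = 0; i < n; ++i)
            if (m & (Mask(1) << i)) ++cnt[i];
    }
    uint64_t best = 1;                                // base case: Omega empty
    if (remaining > 0) {
        best = UINT64_MAX;
        for (int i = 0; i < n; ++i) {
            if (removed & (Mask(1) << i)) continue;
            if (cnt[i] == 0) continue;                // empty partition: factor 1, skip
            uint64_t sub = optimalCost(n, masks, removed | (Mask(1) << i), memo);
            best = std::min(best, cnt[i] * sub);      // Equation 3
        }
        if (best == UINT64_MAX) best = 1;             // defensive; cannot happen here
    }
    memo[removed] = best;
    return best;
}

int main() {
    // Toy instance: 3 query vertices, 5 local partial matches.
    int n = 3;
    std::vector<Mask> masks = {0b001, 0b011, 0b010, 0b110, 0b100};
    std::unordered_map<Mask, uint64_t> memo;
    std::cout << "optimal join cost: " << optimalCost(n, masks, 0, memo) << "\n";
}
```

With at most 2^n distinct states and a linear scan of Ω per state, this sketch runs in O(2^n × |Ω| × n) time, i.e., it matches the fixed-parameter tractable bound up to the factor n.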

5.2.3 Join Order

When we determine the optimal partitioning of the local partial matches, the join order is also determined. If the optimal partitioning is Popt = Pvk1, ..., Pvkn and |Pvk1| ≥ |Pvk2| ≥ ... ≥ |Pvkn|, then the join order must be Pvk1 ⋈ Pvk2 ⋈ ... ⋈ Pvkn. The reasons are as follows.

First, changing the join order often cannot prune any intermediate results. Let us recall the example optimal partitioning Pv3, Pv2, Pv1 shown in Figure 6. The join order should be Pv3 ⋈ Pv2 ⋈ Pv1, and no change in the join order can prune intermediate results. For example, if we first join Pv2 with Pv1, we cannot prune the local partial matches in Pv2 that cannot join with any local partial match in Pv1. This is because there may be some local partial matches in Pv3 that have an internal vertex matching v1 and can join with local partial matches in Pv2. In other words, the result of Pv2 ⋈ Pv1 is not smaller than Pv2. Similarly, any other change of the join order of the partitioning has no effect.

Second, in some special cases, the join order may have an effect on the performance. Given a partitioning Popt = Pvk1, ..., Pvkn with |Pvk1| ≥ |Pvk2| ≥ ... ≥ |Pvkn|, if the set of the first n′ vertices, vk1, vk2, ..., vkn′, is a vertex cut of the query graph, the join order for the remaining n − n′ partitions has an effect. In the worst case, if the query graph is a complete graph, the join order has no effect on the performance.

5.3 Distributed Assembly

An alternative to centralized assembly is to assemble the local partial matches in a distributed fashion. We adopt the Bulk Synchronous Parallel (BSP) model [31] to design a synchronous algorithm for distributed assembly. A BSP system consists of processors connected by a communication network. A BSP computation proceeds in a series of global supersteps. A superstep consists of three components: local computation, communication and barrier synchronisation. In the following, we discuss how we apply this strategy to distributed assembly.

Local Computation. Each processor performs some computation based on the data stored in its local memory. The computations on different processors are independent in the sense that different processors perform the computation in parallel.

Consider the m-th superstep. For each fragment Fi, let Δ^m_in(Fi) denote all intermediate results received in the m-th superstep and Ω^m(Fi) denote all local partial matches and the intermediate results generated in the first (m − 1) supersteps. In the m-th superstep, we join the local partial matches in Δ^m_in(Fi) with the local partial matches in Ω^m(Fi) by Algorithm 4. The basic idea of Algorithm 4 is similar to Algorithm 2; it is an iterative algorithm. For each intermediate result PM, we check whether it can join with some local partial match PM′ in Ω^m(Fi) ∪ Δ^m_in(Fi). If the join result PM′′ = PM ⋈ PM′ is a complete crossing match, it is returned. If the join result PM′′ is an intermediate result, we check whether PM′′ can further join with another local partial match in Ω^m(Fi) ∪ Δ^m_in(Fi) in the next iteration. We also insert the intermediate result PM′′ into Δ^m_out(Fi), which will be sent to other fragments in the communication step discussed below. Of course, we could also use the partitioning-based solution (in Section 5.2.1) to optimize the join processing, but we do not discuss that due to space limitations.

Algorithm 4: Local Computation in Each Fragment Fi
Input: Ω^m(Fi), the local partial matches in fragment Fi.
Output: RS, the crossing matches found at this superstep; Δ^m_out(Fi), the intermediate results that will be sent.
1  Let Ω = Ω^m(Fi) ∪ Δ^m_in(Fi)
2  Set MS = Δ^m_in(Fi)
3  for N = 1 to |V(Q)| do
4      if |MS| = 0 then
5          Break
6      Set MS′ = ∅
7      for each local partial match PM in MS do
8          for each local partial match PM′ in Ω^m(Fi) ∪ Δ^m_in(Fi) do
9              if PM and PM′ are joinable then
10                 PM′′ ← PM ⋈ PM′
11                 if PM′′ is a SPARQL match then
12                     Put PM′′ into the answer set RS
13                 else
14                     Put PM′′ into MS′
15     Insert MS′ into Δ^m_out(Fi)
16     Clear MS and MS ← MS′
17 Ω^{m+1}(Fi) = Ω^m(Fi) ∪ Δ^m_out(Fi)
18 Return RS and Δ^m_out(Fi)

Communication. All processors exchange data among themselves. Consider the m-th superstep. The straightforward communication strategy works as follows: if an intermediate result PM in Δ^m_out(Fi) shares a crossing edge with fragment Fj, PM is sent from site Si to site Sj (assuming fragments Fi and Fj are stored at sites Si and Sj, respectively).

However, the above communication strategy may generate duplicate results. For example, as shown in Figure 4, we can assemble PM^4_1 (at site S1) and PM^1_3 (at site S3) to form a complete crossing match. According to the straightforward communication strategy, PM^4_1 is sent from S1 to S3, and we obtain PM^4_1 ⋈ PM^1_3 at S3. Analogously, we also send PM^1_3 from S3 to S1 and assemble them at site S1. In other words, we obtain the join result PM^4_1 ⋈ PM^1_3 at both sites S1 and S3. This wastes resources and increases the total evaluation time.

To avoid duplicate result computation, we introduce a "divide-and-conquer" approach. We define a total order over the fragments Fi (i = 1, ..., k) in non-descending order of |Ω(Fi)|, i.e., the number of local partial matches in fragment Fi found at the partial evaluation stage. Formally, the definition is as follows.

DEFINITION 5.5. Given any two fragments Fi and Fj (1 ≤ i, j ≤ n), Fi ≺ Fj if and only if |Ω(Fi)| ≤ |Ω(Fj)|.


Without loss of generality, we assume that F1 ≺ F2 ≺ ... ≺ Fn in the remainder. The basic idea of the divide-and-conquer approach is as follows. Assume that a crossing match M is formed by joining local partial matches from different fragments Fi1,...,Fim, where Fi1 ≺ Fi2 ≺ ... ≺ Fim (1 ≤ i1, ..., im ≤ n). The crossing match should only be generated at fragment site Sim, rather than at the other fragment sites.

For example, at site S2, we generate crossing matches by joining local partial matches from F1 and F2. The crossing matches generated at S2 should not contain any local partial match from F3 or larger fragments (such as F4,...,Fn). Similarly, at site S3, we should generate crossing matches by joining local partial matches from F3 and fragments smaller than F3; these crossing matches should not contain any local partial match from F4 or larger fragments (such as F5,...,Fn). As mentioned earlier, we define the total order of the fragments Fi according to the non-descending order of |Ω(Fi)|. The basic intuition is load balancing across the sites.

The above "divide-and-conquer" framework avoids duplicate results, since each crossing match can only be generated at a single site according to the divided search space. To enable the framework, we need to introduce some constraints over data communication. The transmission (of local partial matches) from fragment site Si to Sj is allowed only if Fi ≺ Fj.

Let us consider an intermediate result PM in Δ^m_out(Fi). Assume that PM is generated by joining intermediate results from m different fragments Fi1, ..., Fim, where Fi1 ≺ Fi2 ≺ ... ≺ Fim. We send PM to another fragment Fj if and only if the following two conditions hold: 1) Fj ≻ Fim; and 2) Fj shares common crossing edges with at least one of the fragments Fi1,...,Fim.
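This transmission rule can be summarized in a few lines of C++; the following is an illustrative sketch in which fragments are identified by their rank in the total order ≺ of Definition 5.5 and the crossing-edge information is assumed to come from an adjacency matrix of the fragmentation topology graph (Definition 5.6).

```cpp
// Sketch of the divide-and-conquer transmission rule for an intermediate result.
#include <algorithm>
#include <vector>

// sources  : ranks (in the total order) of the fragments that contributed to PM
// topology : topology[a][b] == true iff fragments a and b share a crossing edge
std::vector<int> destinations(const std::vector<int>& sources,
                              const std::vector<std::vector<bool>>& topology) {
    std::vector<int> dests;
    int maxSource = -1;
    for (int s : sources) maxSource = std::max(maxSource, s);
    const int numFragments = static_cast<int>(topology.size());
    for (int j = 0; j < numFragments; ++j) {
        if (j <= maxSource) continue;             // condition 1: F_j must follow F_im
        for (int s : sources) {
            if (topology[s][j]) {                 // condition 2: shared crossing edge
                dests.push_back(j);
                break;
            }
        }
    }
    return dests;
}
```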

Barrier Synchronisation. All communication in the m-th superstep should finish before entering the (m + 1)-th superstep.

We now discuss the initial state (i.e., the 0-th superstep) and the system termination condition.

Initial State. In the 0-th superstep, each fragment Fi has only the local partial matches in Fi, i.e., Ω(Fi). Since it is impossible to assemble local partial matches from the same fragment, the 0-th superstep requires no local computation. It enters the communication stage directly. Each site Si sends Ω(Fi) to other fragments according to the communication strategy discussed before.

System Termination Condition. A key problem in the BSP algorithm is the number of supersteps needed to terminate the system. In order to facilitate the analysis, we utilize a fragmentation topology graph.

DEFINITION 5.6. (Fragmentation Topology Graph) Given a fragmentation F = {F1, F2, ..., Fn} over an RDF graph G, the corresponding fragmentation topology graph T is defined as follows: 1) each node in T is a fragment Fi, i = 1, ..., n; 2) there is an edge between nodes Fi and Fj in T, 1 ≤ i ≠ j ≤ n, if and only if there is at least one crossing edge between Fi and Fj in the RDF graph G.

Let Dia(T) be the diameter of T. We need at most Dia(T) supersteps to transfer the local partial matches in one fragment Fi to any other fragment Fj. Hence, the number of supersteps in the BSP-based algorithm is Dia(T).
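Since T is small (one node per fragment), Dia(T) can be computed directly. The following C++ sketch, given under the assumption that T is connected and represented as an adjacency list, runs a BFS from every node of T and takes the largest eccentricity.

```cpp
// Sketch: compute the diameter of the fragmentation topology graph T by
// running a breadth-first search from every node and tracking the largest
// shortest-path length encountered.
#include <algorithm>
#include <queue>
#include <vector>

int diameter(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    int dia = 0;
    for (int src = 0; src < n; ++src) {
        std::vector<int> dist(n, -1);
        std::queue<int> q;
        dist[src] = 0;
        q.push(src);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int v : adj[u]) {
                if (dist[v] != -1) continue;
                dist[v] = dist[u] + 1;
                q.push(v);
                dia = std::max(dia, dist[v]);   // largest eccentricity so far
            }
        }
    }
    return dia;   // upper bound on the number of supersteps needed
}
```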

6. EXPERIMENTS

We evaluate our method over both real and synthetic RDF datasets, and compare our approach with state-of-the-art distributed RDF systems, including both federated SPARQL query systems (DARQ [27], FedX [28] and Q-Tree [26]) and RDF data partition-based methods (GraphPartition [15] and EAGRE [34]).

Setting. We use one synthetic dataset with different sizes and two real datasets in our experiments. Table 1 summarizes the statistics of these datasets. All sample queries are shown in the Appendix.

1) LUBM [10] is a benchmark that adopts an ontology for the university domain and can generate synthetic OWL data scalable to an arbitrary size. We vary the number of universities from 1000 to 10000; the number of triples ranges from 133 million to 1.33 billion. In these experiments, we partition the LUBM datasets according to the university identifiers. We use the 7 benchmark queries in [2] to evaluate our method.

2) BTC 2012 (http://km.aifb.kit.edu/projects/btc-2012/) is a dataset that serves as the basis for submissions to the Billion Triples Track of the Semantic Web Challenge. After eliminating all redundant triples, this dataset contains about 1 billion triples. We use METIS [19], a popular min-cut based graph partitioning algorithm, to partition it into several fragments. In addition, we use the 7 queries in [34].

3) LOD is a dataset generated from ten datasets in Linked Open Data to further evaluate our methods in a real federated database environment. We download ten datasets (DBpedia, LinkedMDB, Yago2, FreeBase, NYTimes, MusicBrainz, Diseasome, Geonames, Timbl, LinkedGeoData). By randomly selecting one as the primary and the others as copies of the primary, we merge entities in different datasets that are connected by the predefined property "owl:sameAs". Each dataset maps to a fragment. We also define six sample queries across different fragments.

| Dataset     | Number of Triples | RDF N3 File Size (KB) | Number of Entities |
| LUBM 1000   | 133,553,834       | 15,136,798            | 21,715,108         |
| LUBM 5000   | 667,292,653       | 76,522,983            | 108,498,627        |
| LUBM 10000  | 1,334,481,197     | 153,256,699           | 217,006,852        |
| BTC         | 1,056,184,911     | 238,970,296           | 183,835,054        |
| LOD         | 1,055,204,649     | 169,175,000           | 146,228,189        |

Table 1: Datasets

We conduct all experiments on a cluster of 10 machines running Linux, each of which has one CPU with four cores of 3.06 GHz, 16 GB memory and 500 GB disk storage. At each site, we install gStore [36] to find inner matches, since it supports the graph-based SPARQL evaluation paradigm. We revise gStore to find all local partial matches in each fragment, as discussed in Section 4. All implementations are in standard C++. We use the MPICH-3.0.4 library for communication.

Exp 1. Evaluating Each Stage's Performance. In this experiment, we study the performance of our system in each stage (i.e., partial evaluation and assembly) with regard to different queries in LUBM 1000. We report the running time of each stage and the number of local partial matches, inner matches and crossing matches for different queries in Table 2. We also compare the centralized and distributed assembly strategies. Note that we classify SPARQL queries into three categories according to the query graphs' structures: star, snowflake (several stars linked by a path) and complex (a combination of the above with complex structure).

Partial Evaluation: Table 2 shows that if there are some consistent entities (such as "http://www.Department0.University0.edu" in Q4 of LUBM) in the query, the partial evaluation is much faster than for the other queries. Our partial evaluation algorithm (Algorithm 1) is based on a state transformation, and the consistent entities reduce the search space. Furthermore, the running time also depends on the number of inner matches and local partial matches, as shown in Table 2. More inner matches and local partial matches lead to higher running time in the partial evaluation stage.

Assembly: In this experiment, we compare the centralized and distributed assembly approaches. Obviously, there is no assembly process for a star query; thus, we only study the performance of snowflake and complex queries. We find that distributed assembly can beat the centralized one when there are lots of local partial matches and crossing matches.


| Category  | Query    | Partial Evaluation Time | # of LPMs [2] | # of IMs [3] | Centralized Assembly | Distributed Assembly | # of CMs [4] | Partial Evaluation & Centralized Assembly (Total) | Partial Evaluation & Distributed Assembly (Total) | # of Matches [5] |
| Star      | Q2       | 1818  | 0      | 1081187 | 0      | 0     | 0    | 1818   | 1818  | 1081187 |
| Star      | Q4 √ [1] | 82    | 0      | 10      | 0      | 0     | 0    | 82     | 82    | 10      |
| Star      | Q5 √     | 8     | 0      | 10      | 0      | 0     | 0    | 8      | 8     | 10      |
| Snowflake | Q6 √     | 158   | 6707   | 110     | 164    | 125   | 15   | 322    | 283   | 125     |
| Complex   | Q1       | 52548 | 3033   | 2524    | 53     | 60    | 4    | 52601  | 52608 | 2528    |
| Complex   | Q3       | 920   | 3358   | 0       | 36     | 48    | 0    | 956    | 968   | 0       |
| Complex   | Q7       | 3945  | 167621 | 42479   | 211670 | 35856 | 1709 | 215615 | 39801 | 44190   |

[1] √ means that the query involves some consistent entities.
[2] "# of LPMs" means the number of local partial matches.
[3] "# of IMs" means the number of inner matches.
[4] "# of CMs" means the number of crossing matches.
[5] "# of Matches" means the number of matches.

Table 2: Evaluation of Each Stage (times in ms)

The reason is as follows: in centralized assembly, all local partial matches need to be sent to the server, where they are assembled. Obviously, if there are lots of local partial matches, the server becomes the bottleneck. In distributed assembly, however, we can take advantage of parallelization to speed up both the network communication and the assembly. For example, in Q7, there are 167621 local partial matches. It takes a long time to transfer the local partial matches to the server and assemble them there in centralized assembly, so distributed assembly outperforms the centralized one by a factor of six. However, if the number of local partial matches and the number of crossing matches are small, the barrier synchronisation cost dominates the total cost in distributed assembly. In this case, the advantage of distributed assembly is not clear. A quantitative comparison between distributed and centralized assembly needs more statistics about the network communication, CPU and other parameters. A sophisticated quantitative study is beyond the scope of this paper and is left as future work.

Exp 2: Evaluating Optimizations in Assembly. In this experiment, we evaluate two different optimization techniques in the assembly: the partitioning-based join strategy (Section 5.2.1) and the divide-and-conquer approach in distributed assembly (Section 5.3). Star queries Q2, Q4 and Q5 do not need the assembly process, and query Q3 does not have any match in RDF graph G. Therefore, we only use the three benchmark queries Q1, Q6 and Q7 in these experiments.

Partitioning-based Join. First, we compare the partitioning-based join (i.e., Algorithm 3) with naive join processing (i.e., Algorithm 2) in Table 3, which shows that the partitioning-based strategy can greatly reduce the join cost. Second, we evaluate the effectiveness of our cost model. Note that the join order depends on the partitioning strategy, which is based on our cost model as discussed in Section 5.2.2. In other words, once the partitioning is given, the join order is fixed. So, we use the cost model to find the optimal partitioning and report the running time of the assembly process in Table 3. We find that the assembly with the optimal partitioning is faster than that with a random partitioning, which confirms the effectiveness of our cost model.

| Query | Partitioning-based Join (Optimal Partitioning) | Partitioning-based Join (Random Partitioning) | Naive Join |
| Q1    | 53     | 55     | 3180     |
| Q6    | 164    | 177    | 15382    |
| Q7    | 211670 | 236012 | 10876991 |

Table 3: Running Time of Partitioning-based Join vs. Naive Join (in ms)

Divide-and-Conquer in Distributed Assembly. Table 4 shows that dividing the search space speeds up the distributed assembly. If we do not divide the search space, we may generate duplicate results, as discussed in Section 5.3. Dividing the search space therefore makes better use of parallelization to speed up distributed assembly.

| Query | Dividing | No Dividing |
| Q1    | 60       | 86          |
| Q6    | 125      | 137         |
| Q7    | 35856    | 38196       |

Table 4: Distributed Assembly Time, Dividing vs. No Dividing (in ms)

Exp 3: Scalability Test. In this experiment, we vary the RDF dataset size from 133 million triples (LUBM 1000) to 1.3 billion triples (LUBM 10000) to study the scalability of our methods. Table 5 shows the performance of the different queries for both centralized and distributed assembly.

From LUBM 1000 to LUBM 10000, the data size increases 10 times and the query response times increase 10-20 times. The increase is also affected by the query graph's shape. For star queries, the query response time increases proportionally to the data size, as shown in Table 5. For snowflake/complex queries, the query response times may grow faster than the data size, as shown in Table 5. Especially for Q7, the query response time increases 20 times while the data size increases 10 times. This is because the complex query graph shape causes more complex operations in query processing, such as joining and assembly. However, even for complex queries, the growth of the query response time is approximately linear in the RDF graph size.

Note that, as discussed in Experiment 1, the assembly times of star queries (Q2, Q4 and Q5) are 0. Therefore, the query response times for star queries with centralized and distributed assembly are the same. In contrast, for the other query shapes, local partial matches and crossing matches result in some differences between the performance of centralized and distributed assembly. Here, Q3 is a special case. Although it is a complex query, there are few local partial matches for Q3. Furthermore, the number of crossing matches for Q3 is 0 (see Table 2). Therefore, the assembly times for Q3 are so small that the query response times with centralized and distributed assembly are almost the same.

Exp 4: Comparing with State-of-the-Art Algorithms. In this experiment, we compare our approach with some state-of-the-art distributed RDF systems, including both federated SPARQL query systems (DARQ [27], FedX [28] and Q-Tree [26]) and RDF data partition-based methods (GraphPartition [15] and EAGRE [34]).

Comparing with Federated SPARQL Query Systems. To further evaluate our methods, we download ten datasets (DBpedia, LinkedMDB, Yago2, FreeBase, NYTimes, MusicBrainz, Diseasome, Geonames, Timbl, LinkedGeoData) from LOD to integrate a real LOD dataset.


| Query | Assembly Strategy                           | LUBM 1000 | LUBM 5000 | LUBM 10000 |
| Q1    | Partial Evaluation & Centralized Assembly   | 52601     | 274320    | 563037     |
| Q1    | Partial Evaluation & Distributed Assembly   | 52608     | 286320    | 588031     |
| Q2    | Partial Evaluation & Centralized Assembly   | 1818      | 10557     | 22647      |
| Q2    | Partial Evaluation & Distributed Assembly   | 1818      | 10557     | 22647      |
| Q3    | Partial Evaluation & Centralized Assembly   | 956       | 5196      | 10239      |
| Q3    | Partial Evaluation & Distributed Assembly   | 968       | 5261      | 10368      |
| Q4    | Partial Evaluation & Centralized Assembly   | 82        | 274       | 793        |
| Q4    | Partial Evaluation & Distributed Assembly   | 82        | 274       | 793        |
| Q5    | Partial Evaluation & Centralized Assembly   | 8         | 86        | 125        |
| Q5    | Partial Evaluation & Distributed Assembly   | 8         | 86        | 125        |
| Q6    | Partial Evaluation & Centralized Assembly   | 322       | 1428      | 3388       |
| Q6    | Partial Evaluation & Distributed Assembly   | 283       | 844       | 1914       |
| Q7    | Partial Evaluation & Centralized Assembly   | 215615    | 1618878   | 4437790    |
| Q7    | Partial Evaluation & Distributed Assembly   | 39801     | 226434    | 461239     |

Table 5: Scalability Test (in ms)

Each dataset maps to a fragment and is stored at a site. We also define six queries across different fragments. Two of these six queries (Q1 and Q2) are stars, two (Q3 and Q4) are snowflakes and two (Q5 and Q6) are complex queries. Furthermore, three queries (Q1, Q3 and Q5) contain selective triple patterns, while the others (Q2, Q4 and Q6) do not. We report these six sample queries in the Appendix.

We compare our method with FedX and DARQ on the LOD dataset in Figure 7, which shows that our method outperforms FedX and DARQ by almost one order of magnitude. Furthermore, FedX and DARQ cannot answer Q4 and Q6 because they run out of memory. FedX and DARQ decompose the SPARQL query into a set of subqueries and join the intermediate results of all subqueries to find the final results. When the intermediate results of two subqueries are joined, FedX and DARQ use a bound join: they first use the intermediate results to rewrite the subquery with bound join variables, and then send out and evaluate the rewritten queries at the corresponding sites. Therefore, when a SPARQL query (like Q2, Q4 and Q6) does not contain any selective triple pattern, the size of the intermediate results is so large that the evaluation of the bound joins takes a lot of time. In Q4 and Q6, the intermediate results exhaust the available memory. Q-Tree [26] optimizes the query process by encoding the intermediate results in bitsets and uses bitwise AND operations to filter out irrelevant intermediate results before the join. However, our method still outperforms Q-Tree by at least three times, except for Q1, as shown in Figure 7. Our method is faster than Q-Tree on Q1 by 60%, since Q1 is a simple star query with a selective triple pattern.

Comparing with RDF Data Partition-based Methods. In this experiment, we compare our method with data partition-based methods. Here we assume that all RDF data are downloaded into a server; then, we partition the whole RDF graph into several fragments that are stored in a cluster of machines. We consider two state-of-the-art algorithms, GraphPartition [15] and EAGRE [34], on the two RDF datasets with more than one billion triples, BTC and LUBM 10000. Figure 8 shows the performance of the different approaches on the two RDF datasets. We find that our method outperforms GraphPartition and EAGRE greatly in most cases.

Figure 7: Online Performance over LOD (query response time in ms, log scale, for DARQ, FedX, Q-Tree, Partial Evaluation and Centralized Assembly, and Partial Evaluation and Distributed Assembly over queries Q1-Q6; DARQ and FedX run out of memory on Q4 and Q6)

For example, in BTC, our method is faster than both GraphPartition and EAGRE by an order of magnitude for queries Q4, Q5 and Q6.

There are two reasons for this phenomenon. First, both GraphPartition and EAGRE are based on query decomposition: a query is decomposed into a set of subqueries and each subquery is evaluated locally on each fragment. They do not consider the whole query graph structure to prune the intermediate results. In our method, in contrast, each fragment receives the whole query graph and can utilize the query graph structure to prune more false-positive intermediate results. Second, both GraphPartition and EAGRE are implemented on Hadoop, while our system runs with MPI, a lightweight parallel library. According to our experiments, starting Hadoop jobs always takes 20-30 seconds.

In particular, for star queries (such as Q2, Q4 and Q5 of LUBM), whose diameters are not larger than 2, GraphPartition can answer the queries entirely without data communication in Hadoop. Therefore, the query response time of GraphPartition for these queries is as fast as that of our methods. However, for snowflake/complex queries whose diameters are larger than 2, GraphPartition needs to decompose the queries and join the intermediate results; then, the query response time grows greatly. The performance differences also depend on the intermediate result sizes. Q1 has more intermediate results; thus, the performance difference compared with GraphPartition and EAGRE is less pronounced.

Finally, our method does not depend on any specific graph partitioning strategy (finding a good partitioning is known to be NP-hard). The query algorithms of GraphPartition and EAGRE, however, depend on their own graph partitionings. Therefore, neither GraphPartition nor EAGRE can be used directly on virtually integrated distributed RDF graphs, such as LOD data, since there is no possibility of changing the partitioning of the data.

Figure 8: Online Performance Comparison (query response time in ms, log scale, for GraphPartition, EAGRE, Partial Evaluation and Centralized Assembly, and Partial Evaluation and Distributed Assembly over queries Q1-Q7): (a) LUBM 10000; (b) BTC

7. CONCLUSION

In this paper, we propose a distributed graph-based approach to executing SPARQL queries over a federated RDF repository such as the Linked Open Data. We treat the interconnected RDF repositories (in LOD) as a virtually integrated distributed database and adopt a partial evaluation approach. This is different from most


existing distributed SPARQL query algorithms, which require re-partitioning the whole RDF graph according to their own methods. In contrast, in LOD the RDF data are already distributed over different sites, so repartitioning and redistributing are usually not possible. Our experiments show that the proposed algorithms perform very well. Our approach does, however, require minor modifications to SPARQL endpoints so that they can evaluate partial queries.

There are a number of extensions we are currently working on. An important one is handling data at sites that do not have query capability (i.e., they are not SPARQL endpoints). Data at these sites need to be moved for processing, which will affect the algorithm and the cost functions. Furthermore, similarity search over LOD is also ongoing work. Users may not be able to specify exact SPARQL queries, because they do not have full knowledge of the underlying schema of the RDF graph in LOD.

8. REFERENCES
[1] D. J. Abadi, A. Marcus, S. Madden, and K. Hollenbach. SW-Store: a vertically partitioned DBMS for semantic web data management. VLDB J., 18(2):385–406, 2009.
[2] M. Atre, V. Chaoji, M. J. Zaki, and J. A. Hendler. Matrix "bit" loaded: a scalable lightweight join query processor for RDF data. In WWW, pages 41–50, 2010.
[3] P. Buneman, G. Cong, W. Fan, and A. Kementsietsidis. Using partial evaluation in distributed query evaluation. In VLDB, 2006.
[4] G. Cong, W. Fan, and A. Kementsietsidis. Distributed query evaluation with performance guarantees. In SIGMOD Conference, pages 509–520, 2007.
[5] G. Cong, W. Fan, A. Kementsietsidis, J. Li, and X. Liu. Partial evaluation for distributed XPath query processing and beyond. ACM Trans. Database Syst., 37(4):32, 2012.
[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (3rd ed.). MIT Press, 2009.
[7] R. G. Downey, M. R. Fellows, A. Vardy, and G. Whittle. The parametrized complexity of some fundamental problems in coding theory. SIAM J. Comput., 29(2):545–570, 1999.
[8] W. Fan, X. Wang, and Y. Wu. Performance guarantees for distributed reachability queries. PVLDB, 5(11):1304–1315, 2012.
[9] W. Fan, X. Wang, Y. Wu, and D. Deng. Distributed graph simulation: Impossibility and possibility. PVLDB, 7(12):1083–1094, 2014.
[10] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. J. Web Sem., 3(2-3):158–182, 2005.
[11] S. Gurajada, S. Seufert, I. Miliaraki, and M. Theobald. TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing. In SIGMOD Conference, pages 289–300, 2014.
[12] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K.-U. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In WWW, pages 411–420, 2010.
[13] O. Hartig, C. Bizer, and J.-C. Freytag. Executing SPARQL queries over the web of linked data. In Proc. Int. Conf. on Semantic Web, pages 293–309, 2009.
[14] O. Hartig and M. T. Özsu. Linked data query processing (tutorial). In Proc. 30th IEEE Int. Conf. on Data Engineering, pages 1286–1289, 2014.
[15] J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11):1123–1134, 2011.
[16] M. F. Husain, J. P. McGlothlin, M. M. Masud, L. R. Khan, and B. M. Thuraisingham. Heuristics-based query processing for large RDF graphs using cloud computing. IEEE Trans. Knowl. Data Eng., 23(9):1312–1327, 2011.
[17] N. D. Jones. An introduction to partial evaluation. ACM Comput. Surv., 28(3):480–503, 1996.
[18] Z. Kaoudi and I. Manolescu. RDF in the clouds: A survey. VLDB J., 2014.
[19] G. Karypis and V. Kumar. Analysis of multilevel graph partitioning. In SC, 1995.
[20] V. Khadilkar, M. Kantarcioglu, B. M. Thuraisingham, and P. Castagna. Jena-HBase: A distributed, scalable and efficient RDF triple store. In International Semantic Web Conference (Posters & Demos), 2012.
[21] K. Lee and L. Liu. Scaling queries over big RDF graphs with semantic hash partitioning. PVLDB, 6(14):1894–1905, 2013.
[22] K. Lee, L. Liu, Y. Tang, Q. Zhang, and Y. Zhou. Efficient and customizable data partitioning framework for distributed big RDF data processing in the cloud. In IEEE CLOUD, pages 327–334, 2013.
[23] S. Ma, Y. Cao, J. Huai, and T. Wo. Distributed graph pattern matching. In WWW, pages 949–958, 2012.
[24] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. PVLDB, 1(1):647–659, 2008.
[25] J. Pérez, M. Arenas, and C. Gutierrez. Semantics and complexity of SPARQL. ACM Trans. Database Syst., 34(3), 2009.
[26] F. Prasser, A. Kemper, and K. A. Kuhn. Efficient distributed query processing for autonomous RDF databases. In EDBT, pages 372–383, 2012.
[27] B. Quilitz and U. Leser. Querying distributed RDF data sources with SPARQL. In ESWC, pages 524–538, 2008.
[28] A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. FedX: A federation layer for distributed query processing on linked open data. In ESWC, pages 481–486, 2011.
[29] H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. PVLDB, 1(1):364–375, 2008.
[30] B. Shao, H. Wang, and Y. Li. Trinity: a distributed graph engine on a memory cloud. In SIGMOD Conference, pages 505–516, 2013.
[31] L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, 1990.
[32] K. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds. Efficient RDF storage and retrieval in Jena2. In SWDB, pages 131–150, 2003.
[33] K. Zeng, J. Yang, H. Wang, B. Shao, and Z. Wang. A distributed graph engine for web scale RDF data. PVLDB, 6(4):265–276, 2013.
[34] X. Zhang, L. Chen, Y. Tong, and M. Wang. EAGRE: Towards scalable I/O efficient SPARQL query evaluation on the cloud. In ICDE, pages 565–576, 2013.
[35] X. Zhang, L. Chen, and M. Wang. Towards efficient join processing over large RDF graph using MapReduce. In SSDBM, pages 250–259, 2012.
[36] L. Zou, M. T. Özsu, L. Chen, X. Shen, R. Huang, and D. Zhao. Answering pattern match queries in large graph databases via graph embedding. VLDB J., 2013.


APPENDIX
A. PROOFS

Theorem 4.1. Given any crossing match M of SPARQL query Q in an RDF graph G, if M overlaps with some fragment Fi (i = 1, ..., k), the overlapping part between M and fragment Fi, denoted as PM, must be a local partial match in fragment Fi.

PROOF. Since M is a SPARQL query match and the overlapping part is a subset of the SPARQL match, it is easy to see that Conditions 1-6 of Definition 4.1 hold.

Since M is a crossing match, the overlapping part between M and Fi must contain a crossing edge; thus, Condition 7 holds.

Since the overlapping part PM is a subgraph of match M, according to Claim 1, we can prove that Condition 8 holds.

If the overlapping part between M and fragment Fi has several weakly connected components, we consider each weakly connected component individually. Obviously, each weakly connected component satisfies Condition 9.

To summarize, the overlapping part between M and fragment Fi satisfies all conditions in Definition 4.1. Thus, Theorem 4.1 holds.

Theorem 4.2. The results generated by the partial-evaluation-and-assembly algorithm satisfy the no-false-negative requirement if and only if all local partial matches in each fragment are found in the partial evaluation stage.

PROOF. In two parts:

1) The "If" part (proven by contradiction). Assume that all local partial matches are found in each fragment Fi but a crossing match M is missed in the answer set. Since M is a crossing match, suppose that M overlaps with m fragments F1,...,Fm. According to Theorem 4.1, the overlapping part between M and Fi (i = 1, ..., m) must be a local partial match PMi in Fi. According to the assumption, these local partial matches have been found in the partial evaluation stage. Obviously, we can assemble these partial matches PMi (i = 1, ..., m) to form the complete crossing match M. In other words, M would not be missed if all local partial matches are found. This contradicts the assumption.

2) The "Only If" part (proven by contradiction). We assume that a local partial match PMi in fragment Fi is missed and the answer set can still satisfy the no-false-negative requirement. Suppose that PMi matches a part of Q, denoted as Q′. Assume that there exists another local partial match PMj in Fj that matches the complementary graph of Q′, denoted as Q̄′, i.e., Q′ ∪ Q̄′ = Q. In this case, we can obtain a complete match M by assembling the two local partial matches. If PMi in Fi is missed, then match M is missed. In other words, the answer set cannot satisfy the no-false-negative requirement. This also contradicts the assumption.

Theorem 4.3. Given a query graph Q and an RDF graph G, if PMi is a local partial match under function f in fragment Fi = (Vi ∪ Vi^e, Ei ∪ Ei^c), there exists no local partial match PM′i under function f′ in Fi, where f ⊂ f′.

PROOF. (by contradiction) Assume that there exists another local partial match PM′i of query Q in fragment Fi such that PMi is a subgraph of PM′i. Since PMi is a subgraph of PM′i, there must exist at least one directed edge e = ⟨u, u′⟩ with e ∈ PM′i and e ∉ PMi. Assume that ⟨u, u′⟩ matches the edge ⟨v, v′⟩ in query Q. Obviously, at least one endpoint of e is an internal vertex; we assume that u is an internal vertex. According to Condition (8) of Definition 4.1 and Claim (1), edge ⟨v, v′⟩ should also be matched in PMi, since PMi is a local partial match. However, edge ⟨u, u′⟩ (matching ⟨v, v′⟩) does not exist in PMi. This contradicts PMi being a local partial match. Thus, Theorem 4.3 holds.

Figure 9: Example Bipartite Graph (vertices aj for the local partial matches of Figure 4 and vertices bi for the query vertices vi)

Theorem 5.1. Given two local partial matches PMi and PMj from fragments Fi and Fj with functions fi and fj, respectively, if there exists a query vertex v where both fi(v) and fj(v) are internal vertices of fragments Fi and Fj, respectively, then PMi and PMj are not joinable.

PROOF. If fi(v) ≠ fj(v), then vertex v in query Q matches two different vertices in PMi and PMj, respectively. Obviously, PMi and PMj cannot be joinable.

If fi(v) = fj(v), then since fi(v) and fj(v) are both internal vertices, PMi and PMj are from the same fragment. As mentioned earlier, it is impossible to assemble two local partial matches from the same fragment (see the first paragraph of Section 5.1); thus, PMi and PMj cannot be joinable.

Theorem 5.2. Finding the optimal partitioning is an NP-complete problem.

PROOF. We can reduce the 0-1 integer programming problem (a classical NP-complete problem) to finding the optimal partitioning. We build a bipartite graph B, which contains two vertex groups B1 and B2. Each vertex aj in B1 corresponds to a local partial match PMj in Ω, j = 1, ..., |Ω|. Each vertex bi in B2 corresponds to a query vertex vi, i = 1, ..., n. We introduce an edge between aj and bi if and only if PMj has an internal vertex that matches query vertex vi. Let the variable xji denote the label of the edge ajbi. Let us recall all local partial matches in Figure 4; Figure 9 shows the corresponding example bipartite graph.

We define the 0-1 integer program as follows:

\min \prod_{i=1}^{n} \Big( \sum_{j} x_{ji} + 1 \Big) \quad \text{s.t.} \quad \forall j,\ \sum_{i} x_{ji} = 1

The constraint means that each local partial match is assigned to exactly one query vertex.

It is easy to see the equivalence between this 0-1 integer program and finding the optimal partitioning. The former is a classical NP-complete problem. Thus, the theorem holds.

Theorem 5.3. Given a query graph Q with n vertices v1,...,vn and the set Ω of all local partial matches, let Uvi (i = 1, ..., n) be all local partial matches (in Ω) that have internal vertices matching vi. For the optimal partitioning Popt = Pv1, ..., Pvn, where Pvn has the largest partition size (i.e., the number of local partial matches in Pvn is maximum) in Popt, Pvn = Uvn.

PROOF. (by contradiction) Assume that Pvn ≠ Uvn in the optimal partitioning Popt = Pv1, ..., Pvn. Then, there exists a local partial match PM ∉ Pvn with PM ∈ Uvn. We assume that PM ∈ Pvj, j ≠ n. The cost of Popt = Pv1, ..., Pvn is:

Cost(\Omega)_{opt} = \Big( \prod_{1 \le i < n,\ i \ne j} (|P_{v_i}| + 1) \Big) \times (|P_{v_j}| + 1) \times (|P_{v_n}| + 1)    (5)

Since PM ∈ Uvn, PM has an internal vertex matching vn. Hence, we can also put PM into Pvn. Then, we get a new partitioning P′ = {Pv1, ..., Pvj − {PM}, ..., Pvn ∪ {PM}}. The cost of the new partitioning is:

Cost(\Omega) = \Big( \prod_{1 \le i < n,\ i \ne j} (|P_{v_i}| + 1) \Big) \times |P_{v_j}| \times (|P_{v_n}| + 2)    (6)

Let C = \prod_{1 \le i < n,\ i \ne j} (|P_{v_i}| + 1), which appears in both Equations 5 and 6. Obviously, C > 0.

Cost(\Omega)_{opt} - Cost(\Omega) = C \times (|P_{v_n}| + 1) \times (|P_{v_j}| + 1) - C \times (|P_{v_n}| + 2) \times |P_{v_j}| = C \times (|P_{v_n}| + 1 - |P_{v_j}|)

Because Pvn is the largest partition in Popt, |Pvn| + 1 − |Pvj| > 0. Furthermore, C > 0. Hence, Cost(Ω)opt − Cost(Ω) > 0, meaning that the supposedly optimal partitioning has a larger cost than P′. Obviously, this cannot happen.

Therefore, in the optimal partitioning Popt, where |Pvn| is the largest, we cannot find a local partial match PM with PM ∉ Pvn and PM ∈ Uvn. In other words, Pvn = Uvn in the optimal partitioning.

B. QUERIES IN EXPERIMENTS

B.1 LUBM Queries

Q1: Select ?x,?y,?z where ?x rdf:type ub:GraduateStudent. ?y rdf:type ub:University. ?z rdf:type ub:Department. ?x ub:memberOf ?z. ?z ub:subOrganizationOf ?y. ?x ub:undergraduateDegreeFrom ?y.

Q2: Select ?x where ?x rdf:type ub:Course. ?x ub:name ?y.

Q3: Select ?x,?y,?z where ?x rdf:type ub:UndergraduateStudent. ?y rdf:type ub:University. ?z rdf:type ub:Department. ?x ub:memberOf ?z. ?z ub:subOrganizationOf ?y. ?x ub:undergraduateDegreeFrom ?y.

Q4: Select ?x where ?x ub:worksFor http://www.Department0.University0.edu. ?x rdf:type ub:FullProfessor. ?x ub:name ?y1. ?x ub:emailAddress ?y2. ?x ub:telephone ?y3.

Q5: Select ?x where ?x ub:subOrganizationOf http://www.Department0.University0.edu. ?x rdf:type ub:ResearchGroup.

Q6: Select ?x,?y where ?y rdf:type ub:Department. ?y ub:subOrganizationOf http://www.University0.edu. ?x ub:worksFor ?y. ?x rdf:type ub:FullProfessor.

Q7: Select ?x,?y,?z where ?x rdf:type ub:UndergraduateStudent. ?y rdf:type ub:FullProfessor. ?z rdf:type ub:Course. ?x ub:advisor ?y. ?x ub:takesCourse ?z. ?y ub:teacherOf ?z.

B.2 BTC Queries

Q1: Select ?lat,?long where ?a [] "Eiffel Tower"@en. ?a <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat. ?a <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long. ?a <http://dbpedia.org/ontology/location> <http://dbpedia.org/resource/France>.

Q2: Select ?b,?p,?bn where ?a [] "Tim Berners-Lee"@en. ?a <http://dbpedia.org/property/dateOfBirth> ?b. ?a <http://dbpedia.org/property/placeOfBirth> ?p. ?a <http://dbpedia.org/property/name> ?bn.

Q3: Select ?t,?lat,?long where ?a <http://dbpedia.org/property/wikiLinks> <http://dbpedia.org/resource/List_of_World_Heritage_Sites_in_Europe>. ?a <http://dbpedia.org/property/title> ?t. ?a <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat. ?a <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long. ?a <http://dbpedia.org/property/wikiLinks> <http://dbpedia.org/resource/Middle_Ages>.

Q4: Select ?l,?long,?lat where ?p <http://dbpedia.org/property/name> "Krebs, Emil"@en. ?p <http://dbpedia.org/property/deathPlace> ?l. ?c [] ?l. ?c <http://www.geonames.org/ontology#featureClass> <http://www.geonames.org/ontology#P>. ?c <http://www.geonames.org/ontology#inCountry> <http://www.geonames.org/countries/#DE>. ?c <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat. ?c <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long.

Q5: Select distinct ?l,?long,?lat where ?a [] "Barack Obama"@en. ?a <http://dbpedia.org/property/placeOfBirth> ?l. ?l <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat. ?l <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long.

Q6: Select distinct ?d where ?a <http://dbpedia.org/property/senators> ?c. ?a <http://dbpedia.org/property/name> ?d. ?c <http://dbpedia.org/property/profession> <http://dbpedia.org/resource/Veterinarian>. ?a <http://www.w3.org/2002/07/owl#sameAs> ?b. ?b <http://www.geonames.org/ontology#inCountry> <http://www.geonames.org/countries/#US>.

Q7: Select distinct ?a,?b,?lat,?long where ?a <http://dbpedia.org/property/spouse> ?b. ?a <http://dbpedia.org/property/wikiLinks> <http://dbpedia.org/property/actor>. ?b <http://dbpedia.org/property/wikiLinks> <http://dbpedia.org/property/actor>. ?a <http://dbpedia.org/property/placeofbirth> ?c. ?b <http://dbpedia.org/property/placeofbirth> ?c. ?c <http://www.w3.org/2002/07/owl#sameAs> ?c2. ?c2 <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?lat. ?c2 <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long.

B.3 LOD Queries

Q1: Select ?p where ?p <http://www.mpii.de/yago/resource/hasOfficialLanguage> <http://www.mpii.de/yago/resource/English_language>. ?p <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country>.

Q2: Select ?p where ?p <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.mpii.de/yago/resource/wordnet_scientist_110560637>. ?p <http://dbpedia.org/ontology/nationality> <http://dbpedia.org/resource/United_States>.

Q3: Select distinct ?a where ?a <http://www.mpii.de/yago/resource/wasBornIn> <http://www.mpii.de/yago/resource/Chicago>. ?a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.mpii.de/yago/resource/wordnet_actor_109765278>. ?f <http://dbpedia.org/ontology/starring> ?a. ?f <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.linkedmdb.org/resource/movie/film>.

Q4: Select ?a where ?a <http://dbpedia.org/ontology/residence> ?c. ?c <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Place>. ?a <http://www.mpii.de/yago/resource/directed> ?m. ?m <http://data.linkedmdb.org/resource/movie/actor> ?a.

Q5: Select ?x,?y where ?x <http://www.mpii.de/yago/resource/diedIn> <http://www.mpii.de/yago/resource/Athens>. ?x <http://www.mpii.de/yago/resource/influences> ?y. ?x <http://dbpedia.org/ontology/mainInterest> ?z. ?y <http://dbpedia.org/ontology/mainInterest> ?z. ?y <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Philosopher>.


Q6: Select distinct ?n1,?n2 where ?f <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Film>. ?f <http://dbpedia.org/ontology/starring> ?person2. ?person1 <http://www.mpii.de/yago/resource/isMarriedTo> ?person2. ?person2 <http://www.mpii.de/yago/resource/hasPreferredName> ?n2. ?f <http://data.linkedmdb.org/resource/movie/producer> ?person1. ?person1 <http://data.linkedmdb.org/resource/movie/producer_name> ?n1.

B.4 Queries with Different Diameters in Experiment 1

| Diameter | Query |
| 1 | Select ?x, ?y where ?x ub:undergraduateDegreeFrom http://www.University0.edu. |
| 2 | Select ?x, ?y where ?x ub:advisor ?a. ?x ub:undergraduateDegreeFrom http://www.University0.edu. |
| 3 | Select ?x, ?y where ?x ub:advisor ?a. ?x ub:undergraduateDegreeFrom ?y. ?y ub:name University0. |
| 4 | Select ?x, ?y where ?a ub:teacherOf ?c. ?x ub:advisor ?a. ?x ub:undergraduateDegreeFrom ?y. ?y ub:name University0. |

Table 6: Queries with Different Diameters
