27
Distrib Parallel Databases DOI 10.1007/s10619-014-7142-1 Approximate querying of RDF graphs via path alignment Roberto De Virgilio · Antonio Maccioni · Riccardo Torlone © Springer Science+Business Media New York 2014 Abstract A query over RDF data is usually expressed in terms of matching between a graph representing the target and a huge graph representing the source. Unfortunately, graph matching is typically performed in terms of subgraph isomorphism, which makes semantic data querying a hard problem. In this paper we illustrate a novel technique for querying RDF data in which the answers are built by combining paths of the underlying data graph that align with paths specified by the query. The approach is approximate and generates the combinations of the paths that best align with the query. We show that, in this way, the complexity of the overall process is significantly reduced and verify experimentally that our framework exhibits an excellent behavior with respect to other approaches in terms of both efficiency and effectiveness. Keywords Path · Graph · RDF · Approximate matching · Alignment 1 Introduction The Web 3.0 aims at turning the Web into a global knowledge base, where resources are identified by means of URIs, semantically described with RDF, and related through RDF statements. This vision is becoming a reality by the spread of Semantic Web Communicated by Haixun Wang and Jeffrey Xu Yu. R. De Virgilio (B ) · A. Maccioni · R. Torlone Università Roma Tre, Rome, Italy e-mail: [email protected]; [email protected] A. Maccioni e-mail: [email protected] R. Torlone e-mail: [email protected] 123

Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel DatabasesDOI 10.1007/s10619-014-7142-1

Approximate querying of RDF graphs via pathalignment

Roberto De Virgilio · Antonio Maccioni ·Riccardo Torlone

© Springer Science+Business Media New York 2014

Abstract A query over RDF data is usually expressed in terms of matching between agraph representing the target and a huge graph representing the source. Unfortunately,graph matching is typically performed in terms of subgraph isomorphism, which makessemantic data querying a hard problem. In this paper we illustrate a novel technique forquerying RDF data in which the answers are built by combining paths of the underlyingdata graph that align with paths specified by the query. The approach is approximateand generates the combinations of the paths that best align with the query. We showthat, in this way, the complexity of the overall process is significantly reduced andverify experimentally that our framework exhibits an excellent behavior with respectto other approaches in terms of both efficiency and effectiveness.

Keywords Path · Graph · RDF · Approximate matching · Alignment

1 Introduction

The Web 3.0 aims at turning the Web into a global knowledge base, where resources areidentified by means of URIs, semantically described with RDF, and related throughRDF statements. This vision is becoming a reality by the spread of Semantic Web

Communicated by Haixun Wang and Jeffrey Xu Yu.

R. De Virgilio (B) · A. Maccioni · R. TorloneUniversità Roma Tre, Rome, Italye-mail: [email protected]; [email protected]

A. Maccionie-mail: [email protected]

R. Torlonee-mail: [email protected]

123

Page 2: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

technology and the availability of more and more linked data sources. However, therapid increase of semantic data raises in this context severe data management issues [1,2]. Among them, a major problem lies in the difficulty for users to find the informationthey need in such huge and heterogeneous repository of semantic data.

In this scenario, approaches to approximate query processing are increasingly cap-turing the attention of researchers [3–7] since they relax the matching between queriesand data, and thus provide an effective support to non-expert users, who are usuallyunaware of the way in which data is organized. A support to approximate queryprocessing is particularly relevant in the context of linked data in which usually datado not strictly follow the ontology of reference and therefore queries posed againstthe schema may not retrieve valid answers.

Since semantic data have a natural representation in the form of a graph, this prob-lem has been often addressed in terms of approximate matching between a smallgraph Q representing the query and a very large graph G representing the database.The usual approach to this problem is based on searching all the subgraphs of G thatare isomorphic to Q. Unfortunately, this problem is known to be NP-complete [8]and the problem is even harder if the matching between query and data is approx-imate. For this reason, the various approaches to approximate query processing ongraph databases rely on heuristics, based on similarity or distance metrics, on theuse of specific indexing structures to reduce the complexity of the problem [4,6,9],and on fixing some threshold on the maximum number of hops (i.e., node/edge addi-tions/deletions needed to perfectly match the query graph with the underlying graphdatabase) that are allowed [5]. Moreover, given that the set of answers to a query ispotentially very large, a mechanism that aims to efficiently select the “best” k answersis desirable.

In this framework, we propose a novel technique for querying graph-shaped datain an approximate way that combines a strategy for building possible answers with aranking method for evaluating the relevance of the results as soon as they are computed.The goal is the generation of the best results in the first retrieved answers, thus avoidingthe computation of all the candidate answers. We focus in particular on Basic GraphPattern queries [7], which basically express conjunctive queries on graph data models,over RDF data. RDF is the “de-facto” standard language for the representation ofsemantic information: it encodes Web data as a labeled directed graph in which thenodes represent the resources and values (also called literals), and links representsemantic relationships between resources. A resource is uniquely identified in theSemantic Web with a URI.

Example 1 Let us consider the graph Gd depicted in Fig. 1, taken from [4]: it representsa simplified portion of the GovTrack

1, a database that stores events that occurredin the US Congress. In RDF graphs, nodes represent RDF classes, literals, or URIs,whereas edges represent RDF properties.

Assume that a user needs to know all amendments sponsored by Carla Bunes toa bill on the subject of Health Care that was originally sponsored by a male person.Queries Q1 and Q2 in Fig. 1 are two possible ways to express this information need.

1 http://www.govtrack.us.

123

Page 3: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

(a)

(b)

(c)

Fig. 1 An example of data and query graph

They only differ by the presence of an “optional” node and an “optional” edge. WhileQ1 has an exact matching over Gd , a perfect matching algorithm would retrieve anempty result for Q2 over Gd . Conversely, in the context of RDF data, it would bedesirable to provide an answer also to Q2.

Usually, different paths of the query graph denote different relationships betweennodes. For instance, the edges of Q1 indicate that Male is the gender of someonesponsoring something on the subject Health Care. This simple observation suggeststhat query answering can proceed as follows: first, the query is decomposed into a setof paths that start from a source and end into a sink, then those paths are matchedagainst the data graph, and finally the data paths that best match the query paths arecombined to generate the answer.

123

Page 4: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

In our example, this method would decompose Q1 in the following paths:

pq1 : Carla Bunessponsor−−−−→ ?v1

aT o−−→ ?v2subject−−−−→ Health Care

pq2 : ?v3sponsor−−−−→ ?v2

subject−−−−→ Health Care

pq3 : ?v3gender−−−−→ Male

from them, the following paths of Gd would be selected:

pd1 : Carla Bunessponsor−−−−→ A0056

aT o−−→ B1432subject−−−−→ Health Care

pd2 : Pierce Dickenssponsor−−−−→ B1432

subject−−−−→ Health Care

pd3 : Pierce Dickensgender−−−−→ Male

and those paths, suitably combined, form the answer to the query.Therefore, we tackle the problem of querying RDF graphs by finding the best

combinations of the paths of the data graph that best align with the paths of the querygraph. Note that in the example above the result is an exact answer to Q1, but the samestrategy can be adopted to generate approximate answers to queries with a suitablerelaxation of the notion of alignment between graph paths and data paths. Actually,by using this technique, the same answer of Q1 is returned to the query Q2 in Fig. 1,for which there is indeed no exact answer.

The query processing phase first extracts all the paths of data graph G that alignwith the paths of a query graph Q taking advantage of a special index structure that isbuilt off-line. During the construction, a score function evaluates the answers in termsof quality and conformity. The former measures how much the paths retrieved alignwith the paths in the query. The latter measures how much, in G, the combination ofpaths retrieved is similar to the combination of the paths in the query. Such strategyexhibits, in the worst case, a polynomial time complexity in the number of nodes ofthe data graph and our experiments show that the technique scales seamlessly with thesize of the input.

In order to test the feasibility of our approach, we have developed a system2 forquerying RDF data that implements the above described technique. A number ofexperiments over widely used benchmarks have shown that our technique outperformsother approaches, in terms of both effectiveness and efficiency.

The rest of the paper is organized as follows. In Sect. 2 we introduce some prelimi-nary notions and definitions. In Sect. 3 we illustrate our strategies for graph matchingover RDF data and in Sect. 5 we present the experimental results. In Sect. 6 we discussrelated work and finally, in Sect. 7, we draw some conclusions and sketch some futurework.

2 A prototype application is available at https://www.dropbox.com/sh/d5u1u24qnyqg18f/7Oefq8-qVa.

123

Page 5: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

2 Preliminary issues

This section states the problem we address in this paper and introduces some prelim-inary notions and terminology. We start with the definition of an answer to a queryin our context and then introduce the data structures and the scoring function that areused in our technique to build and rank the answers to queries.

2.1 Problem definition

A graph is a 4-tuple G = 〈N , E, L N , L E 〉 where N is a set of nodes, E ⊆ N × N isa set of ordered pairs of nodes, called edges, and L N and L E are injective functionsthat associate an element of a set of node labels�N to each node in N and an elementof a set of edge labels �E to each edge in E , respectively.

We focus our attention on (possibly large) RDF databases, which are conceptuallyconceived as labeled directed graphs in which the nodes represent either resources orvalues while the edges relate resources to resources and resources to values. We thenintroduce the following notion. Let U be a set of URIs and L be a set of literals.

Definition 1 (Data Graph) A data graph Q is a graph where �N = U ∪ L and�E = U .

Let VAR be a set of variables, denoted by the prefix “?”. A query graph Q is a datagraph in which the nodes can be labeled with variables.

Definition 2 (Query Graph) A query graph Q is a graph where �N = U ∪ L ∪ VARand �E = U ∪ VAR.

The evaluation of a query consists on retrieving portions of the data graph thatmatch with the query graph. This process can be relaxed by assuming that before thematch, the query graph can be slightly transformed, as formalized in the following.

A substitution for a query graph Q is a function that maps the variables in Q toeither URIs or literals. A transformation τ on a query graph is a sequence of thefollowing basic update operations: node and edge insertion, node and edge deletion,and labeling modification of both nodes and edges.

Definition 3 ( Query Answer) An (approximate) answer to a query graph Q over adata graph G is a subgraph G ′ of G for which there exists a substitution φ and atransformation τ such that G ′ = τ(φ(Q)). If τ is the identity function, G ′ is an exactanswer to Q.

In our implementation, the operation of labeling modification of the τ functionis based on standard libraries for testing the equality of values based on traditionaltechniques for full text search (such as stemming). This allows the matching betweenlabels such as fishing, fished, and fish. Since this aspect is outside the scope of thepaper, we will simply assume, hereinafter, that the operation of labeling modificationprovides a support for approximate matching between labels and we will use the termmatching between values in this sense, without discussing this aspect further.

123

Page 6: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

Intuitively, an answer a1 = τ1(φ1(Q)) is more relevant than another answer a2 =τ2(φ2(Q)) if τ1 contains a lower number of operations than τ2. Moreover, in the contextof RDF data in which nodes represent concepts and edges represent relationships, it isuseful to associate a weight of relevance to each basic update operation. For instance, itcould be reasonable that the modification of a label is less relevant than a node insertion,since the latter increases the semantic distance between concepts. Therefore, let ω bea function that associates a weight of relevance to each basic operation�. We say thatthe cost γ of a transformation τ = �1 ◦ . . . ◦ �z is γ (τ) = z ·∑z

i=1(ω(�i )).

Definition 4 (Relevance of an Answer) An answer a1 = τ1(φ1(Q)) is more relevantthan another answer a2 = τ2(φ2(Q)) if γ (τ1) < γ (τ2).

Then, given a data graph G and a query graph Q, we aim at finding the top-kanswers a1, . . . , ak of Q according to their relevance.

2.2 Paths and computed answers

We now introduce a number of notions that, in our approach, are used in the construc-tion of the answers to a query. Given a graph G, we call start nodes, the nodes of Gwith no in-going edges, and end nodes, the nodes of G with no out-going edges.

Basically, a path in a graph G is a sequence of labels from a start node to an endnode of G. In the case of cycles, a path ends, intuitively, just before the repetition ofa node label. Moreover, if there is no start node in G, a path starts from nodes whosedifference between the number of outgoing edges and the number of the incomingedges is maximal in G. We call these nodes hubs.

Definition 5 (Path) Given a data graph G = 〈N , E, L N , L E 〉, a path is a sequencep = ln1 − le1 − ln2 − · · · − lek−1 − lnk where: (i) lni = L N (ni ), lei = L E (ei ), andni ∈ N , ei = (ni , ni+1) ∈ E , (ii) n1 is either a start node or, if G has no start nodes, ahub, and (iii) nk is either an end node or a node such that there is no edge (nk, nk+1)

such that L N (nk+1) (the label of nk) already occurs in p.

In the following, give a path p = ln1 − · · · − lnk , we will call n1 and nk the sourceand the sink of p, respectively. The length of a path is the number of nodes occurringin the path, while the position of a node corresponds to its position among the nodesin the path.

Let us consider for instance the data graph G in Fig. 2. This graph has two loops,no start nodes, no end nodes, and one hub (n2). It follows that the paths of G are:p1 = n2 − e2 − n1, of length 2, and p2 = n2 − e3 − n3 − e4 − n1, of length 3.

Let us now consider the data graph in Fig. 1. It has seven start nodes (the double-marked nodes) and two end nodes (Health Care and Male, marked in gray). An exampleof path is:

pz = JR-sponsor-A1589-aTo-B0532-subject− -HC

where JR and HC denote Jeff Ryser and Health Care, respectively. pz has length 4 andthe node A1589 has position 2. The query Q1 in Fig. 1 has the following paths:

123

Page 7: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

Fig. 2 A data graph with loopsand without start nodes

q1 : CB-sponsor-?v1-aTo-?v2-subject-HC

q2 : ?v3-sponsor-?v2-subject-HC

q3 : ?v3-gender-Male

Our technique tries to generate answers to a query Q by applying substitutions andtransformations to paths of Q. This operation is called alignment.

Definition 6 (Alignment) Given a data graph G and a query graph Q an alignment isa substitution φ and a transformation τ of a path p of Q such that τ(φ(p)) is a pathof G.

We are ready to introduce our notion of computed answer. We say that a set P ofpaths of a graph G is a connected component of G if, for each pair of paths p1, p2 ∈ P ,there is a sequence of paths [p1, . . . , p2] in P in which each element has at least anode in common with the following element in the sequence.

Definition 7 (Computed Answer) Given a query graph Q, a computed answer of Qover a data graph G is a set of alignments of all the paths of Q that forms a connectedcomponent of G.

Note that a computed answer of Q over a data graph G is indeed an answer of Qover G (Definition 3). For this reason, in the following we will often not make anydistinction between answer and computed answer, when it is clear from the contextthe notion we are referring to.

2.3 Scoring function

The function score is an approximate implementation of the general notion of rele-vance (Definition 4) that can be computed in linear time on the size of the data. Thefunction score simulates the relevance of computed answers ai by taking into accounttwo different aspects, quality and conformity. The former measures how much thepaths retrieved align with the paths in the query. The latter measures how much, inai , the combination of paths retrieved is similar to the combination of the paths in thequery.

123

Page 8: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

The first aspect that score considers is the quality of alignment between paths of acomputed answer ai and paths of a query Q as follows:

�(ai , Q) =∑

q∈Q

(λ(p, q))

In this formula, q is a path of Q, p is the path of ai that originates from an alignmentτ ◦ φ of q (that is, p = τ(φ(q))), and λ is a function defined as follows:

λ(p, q) = (a · n−N + b · n�

N )+ c · n−E + d · n�

E ) (1)

In this expression: (i) n−N and n−E are, respectively, the number of nodes and edgesof p that are not present in q, and (ii) n�

N and n�

E are, respectively, the number ofnodes and edges inserted in q by τ . Finally, a, b, c and d are parameters that serve totake into account the weights of relevance of the operations in τ (see Definition 4).

The second aspect that the score function considers is the conformity between thecombination of the paths in the computed answer and the combination of the paths inthe query. This is evaluated as follows:

Ψ (ai , Q) =∑

qi ,q j∈Q

(ψ(qi , q j , pi , p j ))

In this equation, qi and q j are paths of Q, pi and p j are paths of ai that originatefrom alignments τi ◦ φi and τ j ◦ φ j of qi and q j respectively (that is, pi = τi (φi (qi ))

and p j = τ j (φ j (q j ))), and ψ is a function defined as follows:

ψ(qi , q j , pi , p j ) ={

e · |χ(qi ,q j )||χ(pi ,p j )| , if |χ(pi , p j )| > 0

e · |χ(qi , q j )|, if |χ(pi , p j )| = 0

where χ is a function that associates with each pair of paths (p1,p2) the set of nodes incommon between p1 and p2. It follows thatψ(qi , q j , pi , p j ) returns the ratio betweenthe sizes of χ(pi , p j ) and χ(qi , q j ). Finally, e is a parameter that serves to take intoaccount the weight of the conformity in score. This is very useful in those applicationswhere the topology and the interconnections within the answers are very important(e.g., in social network analysis), sometimes even more than the content of the datathemselves.

The final score function is then computed as:

score(ai , Q) = �(ai , Q)+ Ψ (ai , Q)

It turns out that, with a suitable choice of the parameters in Eq. (1) that considersthe weights of relevance assigned to the basic operations, score is coherent with thenotion of relevance of an answer, that is, for each pair of answers a1 and a2 for a queryQ such that a1 is more relevant than a2 we have that score(a1, Q) < score(a2, Q).

123

Page 9: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

Theorem 1 Given a query graph Q and a data graph G, for each pair of com-puted answers ai and a j for Q over G, if ai is more relevant than a j then we havescore(ai , Q) < score(a j , Q).

Proof Let ��

N , �−N and �×N be basic update operations of node insertion, nodedeletion, and labeling modification, respectively. Analogously, �−E ,��

E and �×Eare the respective operations on edges. On these operations, we fix the function ω:(i) ω(�−N ) = a, (ii) ω(��

N ) = b, (iii) ω(�−E ) = c and (iv) ω(��

E ) = d. We con-sider, as in other works [10], ω(�×N ) = 0 and ω(�×E ) = 0 because we do not want topenalize the case where the answer gathers more labels than Q.

Now, let us count the number of basic update operations in a transformation τi foran answer ai . In this case: n−N and n−E are, respectively, the number of nodes and edgesof ai that are inserted in Q, and n�

N and n�

E are, respectively, the number of nodesand edges updated in Q by τi .

The cost of γ (τi ) is n−N · a + n�

N · b + n−E · c + n�

E · d. Let a1 = τ1(φ1(Q)) anda2 = τ2(φ2(Q)) be two answers over a query Q. Considering, from Definition 4, thata1 = �1

1 ◦ . . . ◦ �1z is more relevant than a2 = �2

1 ◦ . . . ◦ �2y , we have that

γ (τ1) < γ (τ2) (2)

But for a path p ∈ ai we have that γ (τi ) = λ(p, Q). Then, if we generalize Eq. (2)to all the paths of the two answers a1 and a2 we obtain

�(a1, Q) < �(a2, Q) (3)

that satisfies the hypothesis for the first aspect of score. With score we have an upperbound of the notion of relevance because if a node has a mismatch in Q and it isin common among more than one path, then it gets counted more than once in n−Nand n�

N . The conformity Ψ (ai , Q) follows a similar trend than �(ai , Q). In fact,more are the mismatching elements (i.e. nodes and edges) and lower is the numberof common nodes. Consequently, the number of intersections between the paths in ananswer conforming to the intersections between the paths of the query is also lower.In our case we have that

Ψ (a1, Q) < Ψ (a2, Q). (4)

Given Eqs. (3) and (4), it follows that score(a1, Q) < score(a2, Q). ��

2.4 Computation of alignments

In order to measure� in score we have to compute alignments between paths in a andpaths in Q. Therefore, a and Q are first decomposed into a set of paths. Then, the pathsof a are aligned against paths of Q. In our example, this method would decomposeQ1 in the following paths:

123

Page 10: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

q1 : Carla Bunessponsor−−−−→ ?v1

aT o−−→ ?v2subject−−−−→ Health Care

q2 : ?v3sponsor−−−−→ ?v2

subject−−−−→ Health Care

q3 : ?v3gender−−−−→ Male

Now assume that a1 is extracted from Gd that is formed by the following paths:

p1 : Carla Bunessponsor−−−−→ A0056

aT o−−→ B1432subject−−−−→ Health Care

p2 : Pierce Dickenssponsor−−−−→ B1432

subject−−−−→ Health Care

p3 : Pierce Dickensgender−−−−→ Male

Note that in the example above the result is an exact answer to Q1, but the samestrategy can be adopted to compare paths that are not exactly matches and thus, witha lower degree of similarity.

We can now compute the quality of alignments between the paths p in a1 andthe paths q in Q. They are done by inserting, deleting and modifying nodes in qby proceeding with a scan contrary to the direction of the edges. For instance let usconsider q1 and q2 from the query graph Q1 and the path

p = CB-sponsor-A0056-aTo-B1432-subject-HC

from Gd . We evaluate the score of p with respect to both q1 and q2 as follows

q1 : CB-sponsor-?v1-aTo-?v2-subjec-HC

τ1(φ1(q1)) : CB︸︷︷︸ -sponsor- A0056︸ ︷︷ ︸ -aTo- B1432︸ ︷︷ ︸ -subject-HC

q2 : ?v3-sponsor-?v2-subject-HC

τ2(φ2(q2)) : CB︸︷︷︸ -sponsor- A0056︸ ︷︷ ︸ - aTo-B1432︸ ︷︷ ︸ -subject-HC

In this case q1 requires only a substitution φ on the variables while q2 employs atransformation τ to insert ATO-B1432 and a substitutionφ on the variables. In the formercase we have λ(p, q1) = (0+0)+ (0+0) = 0, since n−N = n�

N = n−E = n�

E = 0. Inthe latter case λ(p, q2) = (0+b)+ (0+d), since n−N = n−E = 0, and n�

N = n�

E = 1.If we set b = 0.5 and d = 1, we have λ(p, q2) = 1.5 (i.e. p has the best alignmentwith q1). In the same way, given

p′ = JR-sponsor-A1589-aTo-B0532-subject-HC

we have λ(p′, q1) = (a + 0) + (0 + 0), since n−V = 1 due to the mismatch betweenCB and JR. If we set a = 1, we have that λ(p′, q1) = 1 (i.e. q1 has a better alignmentwith p than p′).

It is straightforward to demonstrate that the time complexity of the alignment isO(I ) where I = |p| + |q| is the sum of the nodes and edges of the paths in p and q.

123

Page 11: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

3 Path-based query processing

3.1 Overview

Let G be a data graph and Q a query graph on it. The approach is composed of twomain phases: the indexing (done off-line), in which all the paths of G are indexed, andthe query processing (done on-the-fly), where the query evaluation takes place. Thefirst task will be described in more detail in Sect. 5.

In the second phase, the set PD of the paths of Q are first retrieved by exploitingthe index and then PD is used to retrieve the best answers by adopting a strategy thatguarantees a polynomial time complexity with respect to the size of PD. This task isperformed by means of the following main steps:

Preprocessing Given a query graph Q, in this step the set PQ of all paths of Q iscomputed on the fly by traversing Q. We exploit an optimized implementation ofthe breadth-first search (BFS) traversal. The elements of PQ are organized in thethe so-called intersection query graph (I G). Nodes of I G are paths of Q while anedge (qi , q j )means that qi and q j have nodes in common. For instance, referringto Example 1, PQ consists of the following paths.

q1 : Carla Bunes-sponsor-?v1-aTo-?v2-subject-Health Careq2 : ?v3-sponsor-?v2-subject-Health Careq3 : ?v3-gender-Male

The intersection query graph built from q1, q2 and q3 is depicted in Fig. 3. Thisdata structure keeps track of the fact that q1 and q2 have nodes in common (?v2and Health Care) and that q2 and q3 have also nodes in common (only ?v3).Clustering In the second step we build a cluster for each element q of PQ. Then,we group in the same cluster all the paths p of G having a sink that matches the sinkof q. If a variable occurs in the sink of q, we retrieve the last value v occurring in qand we group in the same cluster all the paths p of G containing a label matchingv. Before the insertion of a path p in the cluster for q, we evaluate the alignmentneeded to obtain p from q. This allows us to compute the score of p, i.e. λ(p, q).The paths in a clusters are ordered according to their score, with the lower comingfirst. Note that the same path p can be inserted in different clusters, possibly witha different score. As an example, given the data graph Gd and the query graph Q1of Fig. 1, we obtain the clusters shown in Fig. 4. In this case clusters cl1, cl2 andcl3 correspond to the paths q1, q2 and q3 of PQ, respectively; note the scores atthe right side of each path and in particular the path p1 occurring in both cl1 andcl2 with different scores, i.e. 7 in cl1 and 5 in cl2.

Fig. 3 An example of intersection query graph

123

Page 12: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

Fig. 4 An example of the clustering step

Fig. 5 Forest of paths

Search The last step aims at generating the most relevant answers by combiningthe paths in the clusters built in the previous step. This is done by picking andcombining the paths with lowest score from each cluster. The intersection querygraph allows us to verify efficiently if they form an answer. As an example, giventhe clusters in Fig. 4, the first answer is obtained by combining the paths p1, p10and p20 that are the elements with the greatest score in each corresponding clusterand provide the best alignment with the paths of PQ associated with the clusters.

The most tricky task of the whole process occurs in the third step above, where wetry to generate the top-k answers by minimizing the number of combinations betweenpaths. This is done by organizing the combinations of paths in a forest where nodesrepresent the retrieved paths, while edges between paths means that they have nodesin common. The label of each edge (pi , p j ) is 〈(qi , q j ) : [ψ(qi , q j , pi , p j )]〉 whereqi and q j are the paths corresponding to the clusters where pi and p j were included,respectively.

For instance, Fig. 5 reports the forest for the paths with the higher score extractedfrom the clusters in Fig. 4. The label on the edge (p10, p1) indicates that if p10 andp1 originate from q2 and q1, respectively, then ψ(q2, q1, p10, p1) is 1. Conversely,

123

Page 13: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

the label on the edge (p7, p1) indicates that ψ(q2, q1, p7, p1) is 0.5. Note that in theforest the edge (p7, p1) is dashed since the label is not 1. The tree in the forest withnodes p1, p10 and p20 yields the first answer.

The rest of the section describes in more detail the clustering and search steps ofthe approach.

3.2 Clustering

Given the intersection query graph built in the previous steps (i.e. also the set PQ)and a data graph G, we retrieve and group the paths from G ending into the sinks ofthe paths of PQ, as shown in Algorithm 1.

Algorithm 1: Clustering of pathsInput : The query paths PQ, the data graph G.Output: The set of clusters CL.

foreach q ∈ PQ do1cl← ∅ ;2PD← getPaths(G, q) ;3foreach p ∈ PD do4

cl.enqueue(p, λ(p, q))5

CL.put(q, cl) ;6

return CL ;7

The set CL of clusters is implemented as a map where the key is a path q from PQand the value is a cluster with all the paths p ending in the sink of q. Each clusteris implemented as a priority queue of paths, where the priority is based on the scoreassociated with each path (in ascending order). In this paper, we use an implementa-tion of priority queues to guarantee a constant time complexity for insertion/deletionoperations. For each q ∈ PQ [line 1] we extract the sink sk of q and we retrieve allp from G matching sk. Such task is performed by the function getPaths [line 3].Once we have obtained the set PD, we evaluate the score of each p ∈ PD withrespect to q and we insert p in the cluster cl [lines 4–5]. Finally, we insert cl in CL[line 8].

3.3 Search

Given the set of clusters CL, we retrieve the top-k answers by generating the connectedcomponents from the most promising paths in CL. Algorithm 2 summarizes the entireprocess of search aimed at retrieving the top-k answers.

As long as we did not generate k answers and the set of clusters is not empty [line1], we build a forest F [line 2] from the most promising paths in CL and we providethe top-k answers by visiting F [lines 4–8]. As we have said in Sect. 3.1, nodes of Fdenote paths in CL, while edges (pi , p j ) represent the fact that pi and p j have nodes

123

Page 14: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

Algorithm 2: SearchInput : number k, clusters CL, intersection query graph I G.Output: The set of answers S.

while (CL �= ∅) ∧ (|S| < k) do1F ← build( CL, I G );2roots← F .keys;3while (|S| < k) ∧ (roots �= ∅) do4

V← ∅;5r← getMax(roots );6roots← roots− {r};7a← Visit(r,getQPath(CL, r ),V,F .get(r ));8S.add(a );9

return S ;10

in common. The trees of F are then returned as answers to the query. In the followingwe describe in more detail the building and the visiting of F .

Building CL and the intersection query graph I G, first of all we have a buildingphase generating a forest of paths, as shown in Algorithm 3.

Algorithm 3: BuildInput : clusters CL, intersection query graph I G.Output: a forest F .

q ← maxCardinality(I G);1cl← CL.get(q);2PD← cl.dequeueTop();3V← {q};4F ← ∅;5foreach p ∈ PD do6

T← ∅;7T.N ← T.N ∪ {p};8treeBuild(p,T,CL, I G, q,V);9F .put(p,T);10

return F ;11

I G is used to evaluate the conformity of the answers while we build them. Algo-rithm 3 implements a BFS traversal. Therefore we start from the node q of I G withthe maximum cardinality [line 1]. For instance, referring to Q1 of Example 1, thealgorithm selects q2 as starting node. Then we select the cluster cl corresponding to q[line 2] and we dequeue the top paths PD from cl [line 3]. This task is supported bythe function dequeueTop. The path q is inserted in the set V of visited query paths[line 4]. Referring to our example the cluster cl2 corresponds to q2. In this case thetop paths to dequeue are p7, p8, p9 and p10. The paths PD extracted from the clusterrepresent the roots of the forest F . We implement F as a map where the keys are paths,i.e. the roots, and the values are the trees of F . Each tree T of F is modeled as a graph〈N , E〉 where the nodes in N are paths and each edge l(ni , n j ) ∈ E is described in

123

Page 15: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

terms of 〈ni , n j , l〉 where l is the label of the edge. For each p ∈ PD, we build a treerooted in p by using the procedure treeBuild [line 9]. The procedure is described indetails in Algorithm 4.

Algorithm 4: treeBuildInput : root r, tree T, clusters CL, query path q, intersection query graph I G, visited V.

L← ∅;1foreach (q, q ′) ∈ I G.E : q ′ �∈ V do2

cl← CL.get(q ′);3PD← cl.dequeueTop();4foreach p ∈ PD : p↔ r do5

T.V ← T.V ∪ {p};6e← 〈r, p, (q, q ′) : [ψ(q, q ′, r, p)]〉;7T.E ← T.E ∪ {e};8L← L ∪ {〈p, q ′〉};9

V← V ∪ {q ′};10

foreach 〈p, q ′〉 ∈ L do11treeBuild(p,T,CL, q ′, I G,V);12

The procedure treeBuild starts the navigation of I G from the input query path q.For each edge (q, q ′), where q ′ is not yet visited, we dequeue the top paths PD fromthe cluster cl corresponding to q ′ [lines 3–4]. Then for each path p ∈ PD having nodesin common with r (denoted by p↔ r), we build the edge (r, p) and the correspondinglabel 〈r, p, (q, q ′) : [ψ(q, q ′, r, p)]〉, [line 7], and we insert it in the set E of T [line8]. The pair 〈p, q ′〉 is added to the set L [line 9]. Finally, we include q ′ in the set Vof visited query paths [line 10] and we recursively call the procedure for each p [line12].

As an example, Fig. 6 illustrates the building of the forest F with respect to Exam-ple 1. Starting from p7, p8, p9 and p10 (i.e. the roots) extracted by the cluster cl2(i.e. q2), we traverse I G: we consider q1 and q3. Traversing q1 we dequeue the top

Fig. 6 Building of the forest

123

Page 16: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

paths from the cluster cl1: in this case we have only p1; p1 is the successor of allroots since it has nodes in common with them. Therefore, we add an edge betweeneach root and p1 in F . However each edge presents a different conformity: p9 and p10provide two nodes in common with p1 conforming exactly with the query graph Q1(i.e. 1); while p7 and p8 have one node in common p1 conforming partially with Q1(i.e. 0.5). Similarly, in the traversal of q3 we dequeue p17, p18, p19 and p20 from cl3.In this case only p17 and p20 have nodes in common with p7 and p10, respectively,with conformity 1. Once it has added the last edges to F , the procedure terminatessince it visited all the nodes of I G.

The building step (Algorithm 3) is the more involved of the entire process. Algo-rithm 3 requires I iterations, where I is the number of paths PD retrieved in line 3,which, in the worst case, are all the paths of the data graph. The most involved opera-tion in the body of the loop in Algorithm 3 is a call to Algorithm 4 whose goal is theconstruction of a tree T , where the nodes represent paths of the data graph G and theedges connect paths having a non-empty intersection. The algorithm is made of twoparts (lines [2–10] and [11–12]). In the first part there is a nested loop. The externalloop iterates over the query paths in the query graph, which are at most h. The internalloop iterates over the paths in the data graph that align with the paths in the query,which are at most I . It follows that the first part requires, at most, h× I steps. At eachstep, a pair 〈p, q ′〉, where p is a path of the data graph that aligns with the path q ′ inthe query graph, is added to the set L. The second part of the algorithm iterates overthe elements of L, which are at most I . In each iteration, there is a recursive call tothe algorithm. In the base case, the query graph consists in just one path and the costis constant since no iteration is required both in the first and in the second part of thealgorithm. In the general case, since at each call of the algorithm we consider onlythe paths of the query graph that have not been visited yet (line 2), we have that thecost C of the algorithm can be expressed as: C(h, I ) = (I × h) + I × C(h − 1, I ).Therefore, we have that the complexity of Algorithm 4 is the solution of the followingequation:

C(h, I ) ={

I × C(h − 1, I )+ I × h, h > 11, h = 1

From this equation, it follows that the complexity of Algorithm 4 is in O(I h−1)

and since Algorithm 3 requires I iterations of Algorithm 4, the overall complexity ofAlgorithm 3 is in O(I × I h−1) ∈ O(I h).

Visiting The last step consists in a visit of the forest F . Algorithm 2 starts the visitfrom the root with the maximum score. The visit implements a Depth-first search(DFS) traversal as shown in details in Algorithm 5.

As in the building step, we exploit the intersection query graph to explore F .Starting from a root p of F to which the path q is associated (i.e. p was in the clustercorresponding to q), for each (q, q ′) in I G we select the successor p′ of p to whichq ′ is associated. In particular, since we can have multiple p′ to which q ′ is associated,we select the most conforming path p′ with the getPathMaxConformity [line 5]. Wethen include p′ in the answer a and we call recursively the algorithm [line 7]. If p hasno successors, then we return p.

123

Page 17: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

Algorithm 5: VisitInput : path p, path q, visited V, tree T.Output: set of paths a.

if �〈p, p′, l〉 ∈ T.E then1return {p};2

else3foreach (q, q ′) ∈ I G.E : q ′ �∈ V do4

p′ ← getPathMaxConformity(T, p,q,q ′);5V← V ∪ {q ′};6return {p} ∪ Visit(p′,q ′,V,T );7

Table 1 Overall complexityClustering Building Visiting Search Overall

|Q| × O(I ) O(I h) O(hh−1 × I ) O(k × I h) O(k × I h)

Referring to Fig. 6, we start from roots = {p10, p7, p9, p8} (i.e. in order of priority).Then the first answer a1 dequeues p10. From p10, we add p20 and p1 to a1, that arethe most important paths to which q3 and q1 are associated. Finally, we have a1 ={p10, p1, p20}. Similarly we generate, in order, a2 = {p7, p1, p17}, a3 = {p9, p1}and a4 = {p8, p1}.

In the base case of Algorithm 5 [lines 1–2] either the query graph T has one nodeonly. In this case, the cost is constant since no iteration is performed. In the generalcase, as for the building, the visit explores the graph I G and iterates at most h times[lines 4–7]. In each iteration there is a call to function getPathMaxConformity, whichhas a cost in O(I ) since it checks the conformity of all the children of p in T, whichare at most I . Then, there is a recursive call to the algorithm and so, since at each callwe consider only the paths of the query graph that have not been visited yet [lines 6–7],the cost can be expressed, in the general case, as C(h, I ) = h × (I + C(h − 1, I )).Therefore, the computational time complexity of Algorithm 5 is the solution of thefollowing equation:

C(h, I ) ={

h × (I + C(h − 1, I )), h > 11, h = 1

From this equation, it follows that the complexity of Algorithm 5 is in O(hh−1 × I ).Overall complexity Table 1 summarizes all the discussed complexity. The overall

process consists of two main phases: clustering (Algorithm 1) and search (Algo-rithm 2). In turn, the search algorithm consists of two main parts: building (Algo-rithm 3) and visiting (Algorithm 5). As we have said above, search is the core ofthe overall process and iterates at most k times, where k is the number of answers toreturn. In each iteration, there is a call to building and, at most, I calls to visiting.Therefore O(search) ∈ k×O(building)+ k× I ×O(visiting) ∈ O(k× I h), whereI is the number of paths retrieved and h is the number of paths in Q. It turns out thatthe overall complexity is bounded by the complexity of the search phase. Note that the

123

Page 18: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

complexity is exponential in the size of the query, which is usually very limited withrespect to the size of data. On the other hand, we have that the maximum number ofpaths in a data graph G (see Definition 5) is proportional to n2 where n is the numberof nodes of G. This is because, in the worst case, we have n/2 sources and n/2 sinkswith edges between each source and each sink, and so we have n2/4 paths. It followsthat our technique exhibits a polynomial time complexity in the size of the data. Notethat, as soon as I tends to n2, the depth of the tree computed in the build phase tendsto 1, because the graph becomes strongly connected. In this case, the complexity ofsearch reduces to O(k × I ). We also point out that, in Sect. 5 we show that ourtechnique improves other approaches in terms of efficiency over real world data setsand it scales seamlessly with the size of the input.

4 Implementation

We have implemented our approach in Sama3, a Java system with a Web front end.

To build answers efficiently, we index the following information: vertices’ andedges’ labels of the data graph G (for element-to-element mapping) and the pathsending into sinks, since they bring information that might match the query. The firstinformation enables to locate vertices and edges matching the labels of the query graph,the second allows us to skip the expensive graph traversal at runtime. The indexingprocess is composed of three steps: (i) hashing of all vertices’ and edges’ labels, (ii)identification of sources and sinks, and (iii) computation of the paths. The first and thesecond step are relatively easy. The third step requires to traverse the graph startingfrom the sources and following the routes to the sinks. We have implemented anoptimized version of the BFS paradigm, where independently concurrent traversalsare started from each source. Similarly to [4], and differently from the majority ofrelated works (e.g., [5]), we assume that the graph cannot fit in memory and that canonly be stored on disk. Such algorithm allows to retrieve a polynomial number ofpaths; of course it is not the complete set of paths between sources and sinks, but asshown in our experimentations such set allows a good effectiveness. Specifically, westore the index in a GraphDB, that is HyperGraphDB [11] (HGDB) v. 1.1: it modelsdata in terms of hypergraphs. Let us recall an hypergraph H is a generalization ofa graph, where an edge can connect any number of vertices. Formally H = (X, E)where X is a set of nodes or vertices, and E is a set of non-empty subsets of X , calledhyperedges. In other words, E is a subset of the power set of X . This representationallows us to define indexes on both vertices and hyperedges: X = {xm |m ∈ M} andE = {e f | f ∈ F, e f ⊆ X}, where each vertex xm and edge e f are indexed by an indexm ∈ M and f ∈ F , respectively. Figure 7 shows an example of reference.

Physically HGDB implements several HGHandle to wrap and to index nodes andedges of G; each subset of nodes (i.e. hyperedge) is implemented into a HGEdgeLinkhaving several target links to the HGHandle of contained nodes (i.e. a hyperedge isimplemented as a list of cursors to the contained nodes). In our framework, each pathis modeled as a HGEdgeLink in HGDB. The matching is supported by standard IR

3 A prototype application is available at https://www.dropbox.com/sh/d5u1u24qnyqg18f/7Oefq8-qVa.

123

Page 19: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

Fig. 7 An example to represent a data graph G (left side) in a hypergraph H (right side)

engines (c.f. Lucene Domain index—LDi4) embedded into HGDB. In particular wedefine a LDi index on the labels of nodes and edges. In this way, given a label, HGDBretrieves all paths containing data elements matching the label in a very efficient way(i.e. exploiting the cursors). Further, semantically similar entries such as synonyms,hyponyms and hypernyms are extracted from WordNet [12], supported by LDi.

5 Experimental results

In our experiments we used several widely-accepted benchmarks for graph matchingevaluation. We have compared Sama with three representatives graph matching sys-tems: Sapper [6], Bounded [5] and Dogma [4]. Experiments were conducted on adual core 2.66 GHz Intel Xeon, running Linux RedHat, with 4 GB of memory and a2-disk 1Tbyte striped RAID array.

Indexing In our experiments we consider real RDF datasets, such as PBlog5, Gov-

Track , KEGG, IMDB [13], DBLP, and synthetic datasets, such as Berlin [14],LUBM [15] and UOBM [16]. Table 2 provides importing information for any dataset:number of triples, number of nodes (|H V | column) and number of generated hyper-edges (|H E | column) in HGDB, time to create the index on HGDB (t column) andmemory consumption on disk. In our case, building the index takes hours for largeRDF data graphs, due to the demanding traversal on the complete large graph, andrequires GB of memory resources on disk to store data and metadata. However ourframework benefits the high performance to retrieve data elements on HGDB, as shownin Table 3. The table illustrates the average response times to retrieve, given a label,a path p and all data elements (nodes and edges) associated with p. We performedcold-cache experiments (i.e. by dropping all file-system caches before restarting the

4 http://lucene.apache.org/.5 http://www-personal.umich.edu/~mejn/netdata/.

123

Page 20: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

Table 2 HyperGraphDBindexing

DG #Triples |H V | |H E | t Space

PBlog 50 K 1,5 K 96 K 1 sec 56 MB

GOV 1 M 280 K 330 K 4 min 340 MB

KEGG 1 M 300 K 606 K 7 min 700 MB

Berlin 1 M 320 K 700 K 10 min 910 MB

iMDB 6 M 900 K 3 M 47 min 1.2 GB

LUBM 12 M 1 M 15 M 102 min 12.9 GB

UOBM 12 M 1 M 15 M 102 min 12.9 GB

DBLP 26 M 4 M 17 M 441 min 23.6 GB

Table 3 Average time toretrieve a path

DG Cold (msec) Warm (msec)

IMDB 0.01 0.005

LUBM 0.04 0.008

DBLP 0.06 0.009

Table 4 Index maintenanceperformance

Insertion (ms) Deletion (ms) Update (ms)

Vertex 4.6 4.3 5.7Edge 11.4 22.1 6.3

various systems and running the queries) and warm-cache experiments (i.e. withoutdropping the caches).

On top of this index organization, to avoid to recompile the entire index on HGDB,we implemented also several procedures to support the maintenance: insertion, dele-tion and update of new vertices or edges in the data graph G. Such operations aredocumented in [17].

Update operations perform well, as demonstrated in Table 4. We inserted andupdated 100 vertices (edges) and then we deleted them. We measured the averageresponse time (ms) to insert/delete/update one vertex (edge). Once the index is builtthe first time, it can be maintained easily as the maintenance operations satisfy practicalscenarios of frequent update of the dataset.

Query execution In this experiment, for each indexed dataset we formulated 12queries in SPARQL of different complexities (i.e. number of nodes, edges and vari-ables). In the following we refer to the most huge datasets: in particular we will discussin depth results over LUBM dataset, that are the most representative. DBLP and iMDBprovide a very close behavior. Hence, we consider 12 queries from the LUBM bench-mark that provide results without involving reasoning.6 We ran the queries ten timesand we measured the average response time, in ms and logarithmic scale. Precisely,the total time of each query is the time for computing the top-10 answers, includ-ing any preprocessing, execution and traversal. We performed both cold-cache and

6 At https://www.dropbox.com/sh/d5u1u24qnyqg18f/7Oefq8-qVa you can find the complete set of queries.

123

Page 21: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

(a)

(b)

Fig. 8 Average response time on LUBM: bars refer each system, using different gray scales, i.e. fromSama , white bars, to Dogma , black bars

warm-cache experiments. To make a comparison with the other systems, we refor-mulated the 12 queries by using the input format of each competitor. In Sama weset the coefficients of the scoring function as follows: a = 1, b = 0.5, c = 2 andd = 1. Therefore we show the behavior of all systems in terms of number of triplesand complexity. The query run-times are shown in Fig. 8.

In general Bounded performs better than Dogma , while Sapper is less efficient.Sama performs very well with respect to all competitors: it is supported by the indexthat retrieves all needed data elements in an efficient way (i.e. skipping data graphtraversal at runtime and supporting parallel implementations). In particular Fig. 9shows the cumulative time percentage of each step (i.e. preprocessing, clustering,building and visiting) in our framework. In any query, the most amount of time isspent for the preprocessing step (i.e. 89 % of the cumulative amount of time in aver-age): we have to traverse the query graph Q and extract all paths, and to retrieve allpaths. Of course a decentralized environment could relevantly limit this time con-sumption. However time consumption percentages of the main steps show the effi-ciency of our search algorithm and demonstrate the feasibility of the approach: wecompute simple alignments between paths. The other datasets provide a very similarbehavior.

123

Page 22: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

Fig. 9 Cumulative time percentage of each step

Fig. 10 Flexibility of Sama on LUBM

Another aspect that we test is the scalability of our approach. In particular thisexperiment shows that the time of execution is quadratic with respect to size of dataand this is coherent with the polynomial time complexity of our technique (see Sect. 3).Figure 10 shows the flexibility of Sama with respect to both the number I of pathsextracted from G and the number h of paths extracted from Q. For each couple (I, h)we depict the average response time (in msec) referring to cold-cache experiments:the number is enclosed in a circle and are scaled proportionally to the number (i.e.warm-cache experiments follow the same trend). The size of each circle is perfectlylinear with the growth of both I and h: this tells us that our approach is almost linearwith respect to both the measures. Such aspect is analyzed in more detail by evaluatingthe scalability of Sama with respect to I , h, the number of nodes and the number ofvariables in Q, through distinct diagrams, as illustrated in Fig. 11 (i.e. it refers to cold-cache experiments). The diagram displays the trend-line and the interpolated equation:in any case, the behavior of Sama is polynomial with respect to the size of data; inparticular the trend-line always approximate a quadratic trend (i.e. we have the samefor warm-cache experiments).

123

Page 23: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

y = -6E-08x2 + 0,0113x + 173,19

300

350

400

450

500

550

600

650

700

1,4 2,4 3,4 4,4 5,4 6,4 7,4 8,4 9,4 10,4

msec

I = #extracted paths from G (unit 100.000)

(a)

y = -1,0242x2 + 34,57x + 349,73

300

350

400

450

500

550

600

650

700

0 2 4 6 8 10 12 14 16 18

mse

c

h = # extracted paths from Q

(b)

y = -0,6909x2 + 29,646x + 325,17

300

350

400

450

500

550

600

650

700

3 8 13 18 23

mse

c

# nodes in Q

(c)

y = -7,1786x2 + 92,679x + 346

300 350 400 450 500 550 600 650 700

1 2 3 4 5 6 7

mse

c

# variables in Q

(d)

Fig. 11 Scalability of Sama on LUBM with respect to a the number I of extracted paths from G, b thenumber h of extracted paths from Q, c the number of nodes in Q and d the number of variables in Q

Effectiveness The last experiment evaluates the effectiveness of Sama and of theother competitors. The first measure we used is the reciprocal rank (RR). For a query,RR is the ratio between 1 and the rank at which the first correct answer is returned;or 0 if no correct answer is returned. In any dataset, for all 12 queries we obtained

123

Page 24: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

Fig. 12 Effectiveness on LUBM: bars refer each system, using different gray scales, i.e. from Sama , whitebars, to Dogma , black bars

RR = 1. In this case the monotonicity is never violated. To make a comparison withthe other systems we inspected the matches found in terms of the answers returned.Figure 12 shows the effectiveness of all systems on LUBM, where we run the querieswithout imposing the number k of answers.

In this case Sama and Sapper always identify more meaningful matches than bothBounded and Dogma . This is due to the approximation operated by Sama and Sapper

with respect to the others. We remind that the evaluation of the matches was performedby experts of the domain (e.g., LUBM). Finally, to support the meaningfulness ofresults, we measured the interpolation between precision and recall, that is for eachstandard level r j of recall (i.e. 0.1, . . ., 1.0) we calculate the average max precision ofqueries in [r j , r j+1], i.e. P(r j ) = maxr j≤r≤r j+1 P(r). Figure 13 shows the results onLUBM: for Sama we depict three different trends with respect to the range of |Q|. Asis to be expected, queries with limited number of paths presents the highest quality(i.e. a precision in the range [0.5,0.8]). More complex queries decrease the quality ofresults, due to more data elements retrieved by the approximation, presenting goodquality though. Such result confirms the feasibility of our system. The effectivenesson the other datasets follows a similar trend. On the other hand, as to be expected,the precision of the other systems dramatically decreases for large values of recall:Bounded and Dogma do not exploit an imprecise matching, while Sapper introducesnoise (i.e. not interesting approximate results) in high values of recall.

6 Related work

Many research efforts have focused on graph similarities, specially from the fieldof graph matching [8]. In fact, a first category of works relies on subgraph isomor-phism [18]. However the well-known intractability of the problem inspired approxi-mate approaches to simplify the problem [8,19]. In particular, graph simulation tech-niques has been used to make graph matching tractable. A second category of worksfocuses on the adoption of special indexes. In particular, several approaches haveproposed in-memory structures for indexing the nodes of the data graph [20], while

123

Page 25: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

Fig. 13 Effectiveness on LUBM: precision and recall of Sama

Fig. 14 An example of boundedquery

others have proposed specific indexes for the efficient execution of SPARQL queriesand joins [21]. In addition, other proposals tackle the problem by indexing graph sub-structures (e.g., paths, frequent subgraphs, trees). Typically, these indexes are exploitedin problems dealing with graph matching, to filter out graphs that do not match theinput query. Approaches in this area can be classified in graph indexing and subgraphindexing. In graph indexing approaches, such as gIndex [22], TreePi [23], and FG-Index [24], the graph database consists of a set of small graphs. The indexing aims atfinding all the database graphs that contain or are contained in a given query graph.On the other hand, subgraph indexing approaches, such as DOGMA [4], TALE [25],GADDI [9], SAPPER [6], and Zeng et al. [26] aim at indexing large database graph,with the goal of finding efficiently all (or a subset of) the subgraphs that match a givenquery. Finally, there are works on reachability [27,28] and distance queries [29] basedon testing the existence of a path between two nodes in the graph and on the evaluationof the distance between them. An interesting approach is proposed in [5] where theauthors reformulate the query graph in terms of a bounded query in which an edgedenotes the connectivity of nodes within a predefined number of hops. For instance,we can represent Q1 of Example 1 in terms of a bounded query as shown in Fig. 6.This guarantees a cubic time complexity for the graph matching problem (Fig. 14).

Most of the mentioned works are focused on medical, chemical and proteinic net-works and they are usually not efficient over semantic and social data [10]. Therefore,specialized metrics where proposed [10,30]. GMO [30] introduces a structural metricbased on a bipartite graph, NESS [10] proposes a measure based on both topolog-ical and content information in the neighborhood of a node of the graph. All theseapproaches differ quite a lot from our method. Indeed, we tackle the problem using a

123

Page 26: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

technique that takes into account the structural constraints on how different relationsbetween nodes have to be correlated. It relies on the tractable problem of alignmentbetween paths.

7 Conclusion and future work

In this paper we have presented a novel approach to approximate querying of largeRDF data sets. The approach is based on a strategy for building the answers of a queryby selecting and combining paths of the underlying data graph that best align withpaths in the query. A ranking function is used in query answering for evaluating therelevance of the results as soon as they are computed. In the worst case our techniqueexhibits a polynomial computational cost with respect to the size of the input andexperimental results show that it behaves very well with respect to approaches interms of both efficiency and effectiveness. This work opens several directions of furtherresearch. From a conceptual point of view, we intend to introduce improvements on theconstruction of answers and on the on-line computation of the scoring function. Froma practical point of view, we intend to implement the approach in a Grid environment(for instance using Hadoop/Hbase) and develop optimization techniques to speed-upthe creation and the update of the index, as well as compression mechanisms forreducing the overhead required by its construction and maintenance.

References

1. De Virgilio, R., Giunchiglia, F., Tanca, L. (eds.): Semantic Web Information Management—A Model-Based Perspective. Springer, Berlin (2010)

2. De Virgilio, R., Guerra, F., Velegrakis, Y. (eds.): Semantic Search Over the Web. Springer, Berlin,Heidelberg (2012)

3. De Virgilio, R., Orsi, G., Tanca, L., Torlone, R.: Nyaya: A system supporting the uniform managementof large sets of semantic data. In: ICD., pp. 1309–1312. (2012)

4. Bröcheler, M., Pugliese, A., Subrahmanian, V.S.: Dogma: A disk-oriented graph matching algorithmfor rdf databases. In: ISWC, pp. 97–113. (2009)

5. Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: from intractable to polynomialtime. Proc. VLDB Endow. 3(1), 264–275 (2010)

6. Zhang, S., Yang, J., Jin, W.: Sapper: subgraph indexing and approximate matching in large graphs.Proc. VLDB Endow. 3(1), 1185–1194 (2010)

7. Wood, P.T.: Query languages for graph databases. SIGMOD Rec. 41(1), 50–60 (2012)8. Gallagher, B.: Matching structure and semantics : A survey on graph-based pattern matching. In:

Artificial Intelligence, pp. 45–53. (2006)9. Zhang, S., Li, S., Yang, J.: Gaddi: distance index based subgraph matching in biological networks. In:

EDBT, pp. 192–203. (2009)10. Khan, A., Li, N., Yan, X., Guan, Z., Chakraborty, S., Tao, S.: Neighborhood based fast graph search

in large networks. In: SIGMOD, pp. 901–912. (2011)11. Iordanov, B.: Hypergraphdb: A generalized graph database. In: WAIM Workshops, pp. 25–36. (2010)12. Fellbaum, C. (ed.): WordNet An Electronic Lexical Database. The MIT Press, Cambridge (1998)13. Hassanzadeh, O., Consens, M.P.: Linked movie data base (triplification challenge report). In:

I-SEMANTICS, pp. 194–196 (2008)14. Bizer, C., Schultz, A.: The Berlin sparql benchmark. Int. J. Semant. Web. Inf. Syst. 5(2), 1–24 (2009)15. Guo, Y., Pan, Z., Heflin, J.: Lubm: a benchmark for owl knowledge base systems. J. Web. Semant.

3(2–3), 158–182 (2005)

123

Page 27: Approximate querying of RDF graphs via path alignment › ~maccioni › files › dapd2014.pdf · graph Q representing the query and a very large graph G representing the database

Distrib Parallel Databases

16. Ma, L., Yang, Y., Qiu, Z., Xie, G.T., Pan, Y., Liu, S.: Towards a complete owl ontology benchmark.In: ESWC, pp. 125–139. (2006)

17. Cappellari, P., De Virgilio, R., Maccioni, A., Roantree, M.: A path-oriented rdf index for keywordsearch query processing. In: DEXA, pp. 366–380. (2011)

18. Zou, L., Chen, L., Özsu, M.T.: Distance-join: pattern match query in a large graph database. Proc.VLDB Endow. 2(1), 886–897 (2009)

19. Fan, W., Bohannon, P.: Information preserving xml schema embedding. ACM Trans. Database Syst.33(1) (2008)

20. Tran, T., Wang, H., Rudolph, S., Cimiano, P.: Top-k exploration of query candidates for efficientkeyword search on graph-shaped (rdf) data. In: ICDE Conference, pp. 405–416 (2009)

21. Neumann, T., Weikum, G.: x-rdf-3x: fast querying, high update rates, and consistency for rdf databases.Proc. VLDB Endow. 3(1), 256–263 (2010)

22. Yan, X., Yu, P.S., Han, J.: Graph indexing: a frequent structure-based approach. In: SIGMOD,pp. 335–346. (2004)

23. Zhang, S., Hu, M., Yang, J.: Treepi: A novel graph indexing method. In: ICDE, pp. 966–975. (2007)24. Cheng, J., Ke, Y., Ng, W., Lu, A.: Fg-index: towards verification-free query processing on graph

databases. In: SIGMOD, pp. 857–872. (2007)25. Tian, Y., Patel, J.M.: Tale: A tool for approximate large graph matching. In: ICDE, pp. 963–972. (2008)26. Zeng, Z., Tung, A.K.H., Wang, J., Feng, J., Zhou, L.: Comparing stars: on approximating graph edit

distance. Proc. VLDB Endow. 2(1), 25–36 (2009)27. Jin, R., Xiang, Y., Ruan, N., Fuhry, D.: 3-hop: a high-compression indexing scheme for reachability

query. In: SIGMOD, pp. 813–826. (2009)28. Poulovassilis, A., Wood, P.T.: Combining approximation and relaxation in semantic web path queries.

In: ISWC, pp. 631–646. (2010)29. Chan, E.P.F., Lim, H.: Optimization and evaluation of shortest path queries. VLDB J. 16(3), 343–369

(2007)30. Hu, W., Jian, N., Qu, Y., Wang, Y.: Gmo: A graph matching for ontologies. In: Integrating Ontologies.

(2005)

123