Querying Knowledge Graphs by Example Entity Tuples · Our work is the ﬁrst to query knowledge graphs by example entity tuples. The paradigm of query-by-example(QBE) has a long history

arX

iv:1

311.

2100

v1 [

cs.D

B]

8 N

ov 2

013

Querying Knowledge Graphs by Example EntityTuples

Nandish Jayaram† Arijit Khan§ Chengkai Li† Xifeng Yan§ Ramez Elmasri†

†University of Texas at Arlington,§University of California, Santa Barbara

Abstract—We witness an unprecedented proliferation of knowl-edge graphs that record millions of entities and their relation-ships. While knowledge graphs are structure-flexible and content-rich, they are difficult to use. The challenge lies in the gapbetween their overwhelming complexity and the limited databaseknowledge of non-professional users. If writing structured queriesover “simple” tables is difficult, complex graphs are only harderto query. As an initial step toward improving the usability ofknowledge graphs, we propose to query such data by exampleentity tuples, without requiring users to form complex graphqueries. Our system,GQBE (Graph Query By Example), au-tomatically derives a weighted hidden maximal query graphbased on input query tuples, to capture a user’s query intent.It efficiently finds and ranks the top approximate answer tuples.For fast query processing,GQBE only partially evaluates querygraphs. We conducted experiments and user studies on the largeFreebase and DBpedia datasets and observed appealing accuracyand efficiency. Our system provides a complementary approachto the existing keyword-based methods, facilitating user-friendlygraph querying. To the best of our knowledge, there was no suchproposal in the past in the context of graphs.

I. I NTRODUCTION

There is an unprecedented proliferation ofknowledgegraphsthat record millions of entities (e.g., persons, products,organizations) and their relationships. Fig.1 is an excerpt of aknowledge graph, in which the edge labeledfounded betweennodesJerry Yang and Yahoo! captures the fact that the person isa founder of the company. Examples of real-world knowledgegraphs include DBpedia [3], YAGO [25], Freebase [4] andProbase [31]. Users and developers are tapping into knowledgegraphs for numerous applications, including search, recom-mendation and business intelligence.

Both users and application developers are often over-whelmed by the daunting task of understanding and usingknowledge graphs. This largely has to do with the sheersize and complexity of such data. As of March 2012, theLinking Open Data community had interlinked over 52 billionRDF triples spanning over several hundred datasets. Morespecifically, the challenges lie in the gap between complex dataand non-expert users. Knowledge graphs are often stored inrelational databases, graph databases and triplestores (cf. [19]for a tutorial).

In retrieving data from these databases, the norm is oftento use structured query languages such as SQL, SPARQL,and those alike. However, writing structured queries requiresextensive experiences in query language and data model andgood understanding of particular datasets [12]. Graph dataisnot “easier” than relational data in either query language or

Fig. 1. An Excerpt of a Knowledge Graph

Fig. 2. Query Interface ofGQBE

data model. The fact it is schema-less makes it even moreintangible to understand and query.If querying “simple” tablesis difficult, aren’t complex graphs harder to query?

Motivated by the aforementioned usability challenge, webuild GQBE 1 (Graph Query by Example), a system thatqueries knowledge graphs by example entity tuples insteadof graph queries. Given a data graph and a query tupleconsisting of entities,GQBE finds similar answer tuples.Consider the data graph in Fig.1 and an scenario where aSilicon Valley business analyst is interested in finding en-trepreneurs who have founded technology companies head-quartered in California. Suppose she knows an example 2-entity query tuple such as〈Jerry Yang, Yahoo!〉 that satisfies herquery intent. As the query interface in Fig. 2 shows, enteringsuch an example tuple toGQBE is simple, especially withthe help of user interface tools such as auto-completion inidentifying the exact entities in the data graph. The answertuples can be〈Steve Wozniak, Apple Inc.〉 and 〈Sergey Brin, Google〉,which are founder-company pairs. If the query tuple consists

1 A description of GQBE’s user interface and demonstration scenar-ios can be found in [14]. An accompanying demonstration video is athttp://www.youtube.com/watch?v=4QfcV-OrGmQ.

1

http://arxiv.org/abs/1311.2100v1

http://www.youtube.com/watch?v=4QfcV-OrGmQ

Fig. 3. The Architecture and Components ofGQBE

of 3 or more entities (e.g.,〈Jerry Yang, Yahoo!, Sunnyvale〉), theanswers will be similar tuples of the same cardinality (e.g.,〈Steve Wozniak, Apple Inc., Cupertino〉).

Our work is the first to query knowledge graphs by exampleentity tuples. The paradigm ofquery-by-example(QBE) has along history in relational databases [35]. The idea is to expressqueries by filling example tables with constants and sharedvariables in multiple tables, which correspond to selectionand join conditions, respectively. Its simplicity and improveduser productivity make QBE an influential database querylanguage. By proposing to query knowledge graphs by exam-ple tuples, our premise is that the QBE paradigm will enjoysimilar advantages on graph data. The technical challengesand approaches are vastly different, due to the fundamentallydifferent data models.

Substantial progress has been made on query mechanismsthat help users construct query graphs or even do not requireexplicit query graphs. Such mechanisms include keywordsearch (e.g., [16]), keyword-based query formulation [24],[33], natural language questions [32], interactive and form-based query formulation [8], [13], and visual interface forquery graph construction [6], [15]. Little has been done oncomparison across these graph query mechanisms. Whilea usability comparison of these mechanisms andGQBE isbeyond the scope of this paper, we note that they all havepros and cons and thus complement each other.

Particularly, QBE and keyword-based methods are adequatefor different usage scenarios. Using keyword-based methods,a user has to articulate query keywords, e.g., “technologycompanies head-quartered in California and their founders”for the aforementioned analyst. Not only a user may findit challenging to clearly articulate a query, but also a querysystem might not return accurate answers, since it is non-trivialto precisely separate these keywords and correctly match themwith entities, entity types and relationships. This has been ver-ified through our own experience on a keyword-based systemadapted from SPARK [22]. In contrast, aGQBE user onlyneeds to know the names of some entities in example tuples,without being required to specify how exactly the entities arerelated. On the other hand, keyword-based querying is moreadequate when a user does not know a few sample answerswith respect to her query.

In the literature on graph query, the input to a querysystem in most cases is a structured query, which is often

graphically presented as a query graph or pattern. Such isnot what we refer to as query-by-example, because the querygraphs and patterns are formed by using structured query lan-guages or the aforementioned query mechanisms. For instance,PathSim [26] finds the top-k similar entities that are connectedto a query entity, based on a user-defined meta-path semanticsin a heterogeneous network. In [34], given a query graph asinput, the system finds structurally isomorphic answer graphswith semantically similar entity nodes. In contrast,GQBE onlyrequires a user to provide an entity tuple, without knowing theunderlying schema.

There are several challenges in buildingGQBE. Belowwe provide a brief overview of our approach in tacklingthese challenges. The ensuing discussion refers to the systemarchitecture and components ofGQBE, as shown in Fig. 3.

(1) With regard toquery semantics, since the input toGQBEis a query tuple instead of an explicit query graph, the systemmust derive a hidden query graph based on the query tuple,to capture user’s query intent. Thequery graph discoverycomponent (Sec.III) ofGQBE fulfills this requirement andthe derived graph is termed amaximal query graph(MQG).The edges in MQG, weighted by several frequency-based anddistance-based heuristics, represent important “features” of thequery tuple to be matched in answer tuples. More concretely,they capture how entities in the query tuple (i.e., nodes in adata graph) and their neighboring entities are related to eachother. Answer graphs matching the MQG are projected toanswer tuples, which consist of answer entities correspondingto the query tuple entities.GQBE further supports multiplequery tuples as input which collectively better capture theuserintent.

(2) With regard toanswer space modeling(Sec.IV), therecan be a large space of approximate answer graphs (tuples),since it is unlikely to find answer graphs exactly matchingthe MQG. GQBE models the space of answer tuples by aquery latticeformed by the subsumption relation between allpossible query graphs. Each query graph is a subgraph of theMQG and contains all query entities. Its answer graphs are alsosubgraphs of the data graph and are isomorphic to the querygraph. Given an answer graph, its entities corresponding tothe query tuple entities form an answer tuple. Thus the answertuples are essentially approximate answers to the MQG. Forranking answer tuples, their scores are calculated based onthe edge weights in their query graphs and the match between

2

nodes in the query and answer graphs.(3) The query lattice can be large. To obtain top-k ranked

answer tuples, the brute-force approach of evaluating all querygraphs in the lattice can be prohibitively expensive. Forefficient query processing(Sec.V), GQBE employs a top-klattice exploration algorithm that only partially evaluates thelattice nodes in the order of their corresponding query graphs’upper-bound scores.

We summarize the contributions of this paper as follows:• For better usability of knowledge graph querying systems,we propose a novel approach of querying by example entitytuples, which saves users the burden of forming explicit querygraphs. To the best of our knowledge, there was no suchproposal in the past.

• The query graph discovery component ofGQBE derives ahidden maximal query graph (MQG) based on input querytuples, to capture users’ query intent.GQBE models the spaceof query graphs (and thus answer tuples) by a query latticebased on the MQG.

• GQBE’s efficient query processing algorithm only partiallyevaluates the query lattice to obtain the top-k answer tuplesranked by how well they approximately match the MQG.

• We conducted extensive experiments and user study on thelarge Freebase and DBpedia datasets to evaluateGQBE’saccuracy and efficiency (Sec.VI). The comparison with astate-of-the-art graph querying frameworkNESS [18] (usingMQG as input) shows thatGQBE is twice as accurate asNESS and outperformsNESS on efficiency in most of thequeries.

II. PROBLEM FORMULATION

GQBE runs queries on knowledge data graphs. Adatagraph is a directed multi-graphG with node setV (G) andedge setE(G). Each nodev∈V (G) represents an entity andhas a unique identifierid(v). 2 Each edgee=(vi, vj)∈E(G)denotes a directed relationship from entityvi to entity vj . Ithas a label, denoted aslabel(e). Multiple edges can have thesame label. The user input and output ofGQBE are both entitytuples, calledquery tuples andanswer tuples, respectively. Atuple t=〈v1, . . . , vn〉 is an ordered list of entities (i.e., nodes)in G. The constituting entities of query (answer) tuples arecalled query (answer) entities. Given a data graphG and aquery tuplet, our goal is to find the top-k answer tuplest′

with the highest similarity scoresscoret(t′).

We definescoret(t′) by matching the inter-entity relation-

ships oft and that oft′, which entails matching two graphsconstructed fromt andt′, respectively. To this end, we definethe neighborhood graphfor a tuple, which is based on theconcept of undirected path. Anundirected pathis a path whoseedges are not necessarily oriented in the same direction. Unlessotherwise stated, we will refer to undirected path simply as“path”. We consider undirected path because an edge incidenton a node can represent an important relationship with anothernode, regardless of its direction. More formally, a pathp

2 Without loss of generality, we use an entity’s name as its identifier inpresenting examples, assuming entity names are unique.

Fig. 4. Neighborhood Graph for〈Jerry Yang, Yahoo!〉

is a sequence of edgese1, . . . , en and we say each edgeei ∈ p. The path connects two nodesv0 and vn throughintermediate nodesv1, . . . , vn−1, where eitherei=(vi−1, vi) orei=(vi, vi−1), for all 1≤i≤n. The length of the path,len(p), isn and the endpoints of the path,ends(p), are{v0, vn}. Notethat there is no undirected cycle in a path, i.e., the entitiesv0, . . . , vn are all distinct.

Definition 1 The neighborhood graph of query tuplet, de-notedHt, is the weakly connected subgraph3 of data graphG that consists of all nodes reachable from at least one queryentity by an undirected path ofd or less edges (including queryentities themselves) and the edges on all such paths. Thepathlength threshold, d, is an input parameter. More formally, thenodes and edges inHt are defined as follows:V (Ht) = {v|v ∈ V (G) and∃p s.t. ends(p)={vi, v} where

vi ∈ t, len(p) ≤ d};E(Ht) = {e|e ∈ E(G) and∃p s.t. ends(p)={vi, v} where

vi ∈ t, len(p) ≤ d, ande ∈ p}.

Example 1 (Neighborhood Graph) Given the data graph inFig.1, Fig.4 shows the neighborhood graph for query tuple〈Jerry Yang, Yahoo!〉 with path length thresholdd=2. The nodes indark color are the query entities.

Intuitively, the neighborhood graph, by capturing how queryentities and other entities in their neighborhood are related toeach other, represents “features” of the query tuple that areto be matched in query answers. It can thus be viewed as ahidden query graph derived for capturing user’s query intent.We are unlikely to find query answers that exactly matchthe neighborhood graph. It is however possible to find exactmatches to its subgraphs. Such subgraphs are all query graphsand their exact matches are approximate answers that matchthe neighborhood graph to different extents.

Definition 2 A query graph Q is a weakly connected sub-graph ofHt that contains all the query entities. We useQt to

3 A directed graph isweakly connectedif there exists an undirected pathbetween every pair of vertices.

3

Fig. 5. Two Query Graphs in Fig.4 Fig. 6. Two Answer Graphs for Fig.5(a) Fig. 7. Two Answer Graphs for Fig.5(b)

denote the set of all query graphs fort, i.e., Qt={Q|Q is aweakly connected subgraph ofHt s.t. ∀v ∈ t, v ∈ V (Q)}.

Continuing the running example, Fig.5 shows two querygraphs for the neighborhood graph in Fig.4.

Echoing the intuition behind neighborhood graph, the defi-nitions of answer graph and answer tuple are based on the ideathat an answer tuple is similar to the query tuple if their entitiesparticipate in similar relationships in their neighborhoods.

Definition 3 An answer graph A to a query graphQ is aweakly connected subgraph ofG that is isomorphic toQ.Formally, there exists a bijectionf :V (Q)→V (A) such that:

• For every edgee = (vi, vj) ∈ E(Q), there exists an edgee′ = (f(vi), f(vj)) ∈ E(A) such thatlabel(e) = label(e′);• For every edgee′ = (ui, uj) ∈ E(A), there existse = (f−1(ui), f

−1(uj)) ∈ E(Q) such thatlabel(e) =label(e′).

For a query tuplet=〈v1, . . . , vn〉, the answer tuple in A istA=〈f(v1), . . . , f(vn)〉. We also calltA the projectionof A.

We useAQ to denote the set of all answer graphs ofQ.We note that a query graph (tuple) trivially matches itself,therefore is not considered an answer graph (tuple).

Example 2 (Answer Graph and Answer Tuple) Fig.6 andFig.7 each show two answer graphs for query graphs Fig.5(a)and Fig.5(b), respectively. The answer tuples in Fig.6 are〈Steve Wozniak, Apple Inc.〉 and 〈Sergey Brin, Google〉. The answer tu-ples in Fig.7 are〈Bill Gates,Microsoft〉 and 〈Sergey Brin, Google〉.

Definition 4 The set of answer tuples for query tuplet are{tA|A∈AQ, Q∈Qt}. The score of an answer t′ is given by

scoret(t′) = max

A∈AQ,Q∈Qt

{scoreQ(A)|t′ = tA} (1)

The score of an answer graphA (scoreQ(A)) capturesA’ssimilarity to query graphQ. Its equation is given in Sec.IV-B.

The same answer tuplet′ may be projected from multipleanswer graphs, which can match different query graphs. Forinstance, Figs. 6(b) and 7(b), which are answers to differentquery graphs, have the same projection—〈Sergey Brin, Google〉.By Eq. (1), the highest score attained by the answer graphs isassigned as the score oft′, capturing how wellt′ matchest.

III. QUERY GRAPH DISCOVERY

A. Maximal Query Graph

The concept of neighborhood graphHt (Def.1) was formedto capture the features of a query tuplet to be matchedby answer tuples. Given a well-connected large data graph,Ht itself can be quite large, even under a small path lengththresholdd. For example, using Freebase as the data graph, thequery tuple〈Jerry Yang, Yahoo!〉 produces a neighborhood graphwith 800K nodes and900K edges, ford=2. Such a largeHt

makes query semantics obscure, because there might be onlyfew nodes and edges in it that capture important relationshipsin the neighborhood oft.GQBE’s query graph discovery component constructs a

weightedmaximal query graph(MQG) from the neighborhoodgraphHt. MQG is expected to be drastically smaller thanHt

and capture only important features of the query tuple. Wenow define MQG and discuss its discovery algorithm.

Definition 5 The maximal query graph MQGt, given aparameterm, is a weakly connected subgraph of the neigh-borhood graphHt that maximizes total edge weight

∑

e w(e)while satisfying (1) it contains all query entities int and (2) ithasm edges. The weight of an edgee in Ht, w(e), is definedin Sec.III-B.

There are two challenges in findingMQGt by directlygoing after the above definition. First, a weakly connectedsubgraph ofHt with exactlym edges may not exist for anarbitrarym. A trivial value ofm that guarantees the existenceof the correspondingMQGt is |E(Ht)|, becauseHt itselfis weakly connected. This value could be too large, which isexactly why we aim to makeMQGt substantially smaller thanHt. Second, even ifMQGt exists for anm, finding it requiresmaximizing the total edge weight, which is a hard problem asgiven in Theorem 1.

Theorem 1 The decision version of finding the maximalquery graphMQGt for anm is NP-hard.

Proof We prove the NP-hardness by reduction from the NP-hard constrained Steiner network (CSN) problem [20]. Givenan undirected connected graphG1=(V,E) with non-negativeweight w(e) for every e∈E, a subsetVn⊂V , and a positiveinteger m, the CSN problem is to find a connected sub-graphG′=(V ′, E′) with the smallest total edge weight, where

4

Algorithm 1: Discovering the Maximal Query GraphInput : neighborhood graphHt, query tuplet, an integerrOutput : maximal query graphMQGt

1 m← r|t|+1

; V (MQGt)← φ; E(MQGt)← φ; G ← φ;2 foreach vi ∈ t do3 Gvi ← use DFS to obtain the subgraph containing vertices (and

their incident edges) that connect to othervj in t only throughvi;4 G ← G ∪ {Gvi};

5 Gcore ← use DFS to obtain the subgraph containing vertices andedges on undirected paths between query entities;

6 G ← G ∪ {Gcore};7 foreach G ∈ G do8 step← 1; s1 ← 0; s← m;9 while s > 0 do

10 Ms ← the weakly connected component found from thetop-s edges ofG that contains all ofG’s query entities;

11 if Ms existsthen12 if |E(Ms)| = m then break13 if |E(Ms)| < m then14 s1 ← s;15 if step = −1 then break

16 if |E(Ms)| > m then17 if s1 > 0 then18 s← s1; break;

19 s2 ← s; step← −1;

20 s← s+ step;

21 if s = 0 then s← s222 V (MQGt)← V (MQGt) ∪ V (Ms);23 E(MQGt)← E(MQGt) ∪ E(Ms);

Vn⊆V ′ and |E′|=m. The polynomial-time reduction from theCSN problem to the MQG problem is by transformingG1 toG2, where each edgee is given an arbitrary direction and anew weightw′(e)=W−w(e), whereW=

∑

e∈E w(e). Let Vn

be the query tuple. The maximal query graphMQGVnfound

from G2 provides a CSN inG1, by ignoring edge direction.This completes the proof.

Based on the theoretical analysis, we present a greedymethod (Alg.1) to find a plausible sub-optimal graph of edgecardinalitycloseto a givenm. The value ofm is empiricallychosen to be much smaller than|E(Ht)|. Consider edges ofHt in descending order of weightw(e). We useGs to denotethe graph formed by the tops edges with the largest weights,which itself may not be weakly connected. We useMs todenote the weakly connected component ofGs containing allquery entities int, if it exists. Our method finds the smallestssuch that|E(Ms)|=m (Line 12). If such anMs does not exist,the method choosess1, the largests such that|E(Ms)|<m.If that still does not exist, it choosess2, the smallestssuch that|E(Ms)|>m, whose existence is guaranteed because|E(Ht)|>m. For eachs value, the method employs a depth-first search (DFS) starting from a query entity inGs, if present,to check the existence ofMs (Line 10).

The Ms found by this method may be unbalanced. Queryentities with more neighbors inHt likely have more prominentrepresentation in the resultingMs. A balanced graph shouldinstead have a fair number of edges associated with eachquery entity. Therefore, we further propose a divide-and-conquer mechanism to construct a balancedMQGt. The idea

is to breakHt into n+1 weakly connected subgraphs. Oneis the core graph, which includes all then query entities int and all undirected paths between query entities. Othernsubgraphs are for then query entities individually, where thesubgraph for entityvi includes all entities (and their incidentedges) that connect to other query entities only throughvi.The subgraphs are identified by a DFS starting from eachquery entity (Lines 4-6 of Alg.1). During the DFS fromvi,all edges on the undirected paths reaching any other queryentity within distanced belong to the core graph, and otheredges belong tovi’s individual subgraph. The method thenapplies the aforementioned greedy algorithm to findn+1weakly connected components, one for each subgraph, thatcontain the query entities in corresponding subgraphs. Sincethe core graph connects all query entities, then+1 componentsaltogether form a weakly connected subgraph ofHt, whichbecomes the finalMQGt. For an empirically chosen smallras the target size ofMQGt, we set the target size for eachindividual component to ber

n+1 , aiming at a balancedMQGt.

Complexity Analysis of Alg.1 In the aforementioned divide-and-conquer method, if on average there arer′= |E(Ht)|

n+1 edgesin each subgraph, finding the subgraph by DFS and sortingits r′ edges takesO(r′ log r′) time. Given the top-s edgesof a subgraph, checking if the weakly connected compo-nent Ms exists using DFS requiresO(s) time. Suppose onaveragec iterations are required to find the appropriates.Let m= r

n+1 be the average target edge cardinality of eachsubgraph. Since the method initializess with m, the largestvalue s can attain ism+c. So the time for discoveringMs

for each subgraph isO(r′ log r′+c×(m+c)). For all n+1subgraphs, the total time required to find the finalMQGt isO((n+1)×(r′ log r′+c×(m+c))). For the queries used in ourexperiments on Freebase, given an empirically chosen smallr=15, s≪|E(Ht)| and on averagec=22.

B. Edge Weighting

The definition ofMQGt (Def.5) depends on edge weights.There can be various plausible weighting schemes. We proposea weighting function based on several heuristic ideas. Theweight of an edgee in Ht, w(e), is proportional to its inverseedge label frequency (ief(e)) and inversely proportional to itsparticipation degree (p(e)), given by

w(e) = ief(e) / p(e) (2)

Inverse Edge Label Frequency Edge labels that appearfrequently in the entire data graphG are often less important.For example, edges labeledfounded (for a company’s founders)can be rare and more important than edges labelednationality

(for a person’s nationality). We capture this by theinverseedge label frequency.

ief(e) = log (|E(G)| / #label(e)) (3)

where|E(G)| is the number of edges inG, and#label(e) isthe number of edges inG with the same label ase.

5

Participation Degree The participation degreep(e) of anedgee=(u, v) is the number of edges inG that share the samelabel and one ofe’s end nodes. Formally,

p(e) = | {e′=(u′, v′) | label(e)=label(e′), u′=u ∨ v′=v} | (4)

While ief(e) captures the global frequencies of edge labels,p(e) measures their local frequencies—an edge is less impor-tant if there are other edges incident on the same node withthe same label. For instance,employment might be a relativelyrare edge globally but not necessarily locally to a company.Specifically, consider the edges representing theemployment

relationship between a company and itsmany employeesand the edges for theboard member relationship between thecompany and itsfew board members. The latter edges aremore significant.

Note thatief(e) andp(e) are precomputed offline, since theyare query-independent and only rely on the data graphG.

C. Preprocessing: Reduced Neighborhood Graph

The discussion so far focuses on discoveringMQGt fromHt. The neighborhood graphHt may have clearly unimportantedges. As a preprocessing step,GQBE removes such edgesfrom Ht before applying Alg.1. The reduced size ofHt notonly makes the execution of Alg.1 more efficient but also helpsprevent clearly unimportant edges from getting intoMQGt.

Consider the neighborhood graphHt in Fig.4, based on thedata graph excerpt in Fig.1. Edgee1=(Jerry Yang, Stanford) andlabel(e1)=education. Two other edges labelededucation, e2 ande3, are also incident on nodeStanford. The neighborhood graphfrom a complete real-world data graph may contain many suchedges for people graduated from Stanford University. Amongthese edges,e1 represents an important relationship betweenStanford and query entityJerry Yang, while other edges representrelationships betweenStanford and other entities, which aredeemed unimportant with respect to the query tuple.

We formalize the definition ofunimportant edgesas follows.Given an edgee=(u, v) ∈ E(Ht), e is unimportant if it isunimportant from the perspective of its either end,u or v, i.e.,if e ∈ UE(u) or e ∈ UE(v). Given a nodev ∈ V (Ht), E(v)denotes the edges incident onv in Ht. E(v) is partitionedinto three disjoint subsets—the important edgesIE(v), theunimportant edgesUE(v) and the rest—defined as follows:IE(v)={e ∈ E(v) | ∃vi∈t, p s.t. e∈p, ends(p)={v, vi}, len(p)≤d};

UE(v)={e ∈ E(v) | e/∈IE(v), ∃e′∈IE(v) s.t. label(e)=label(e′),(e=(u, v) ∧ e′=(u′, v)) ∨ (e=(v, u) ∧ e′=(v, u′))}.

An edgee incident onv belongs toIE(v) if there exists apath betweenv and any query entity in the query tuplet,throughe, with path length at mostd. For example, edgee1in Fig.4 belongs toIE(Stanford). An edgee belongs toUE(v)if (1) it does not belong toIE(v) (i.e., there exists no suchaforementioned path) and (2) there existse′ ∈ IE(v) such thate ande′ have the same label and they are both either incominginto or outgoing fromv. By this definition,e2 ande3 belong

to UE(v) in Fig.4, sincee1 belongs toIE(v). In the sameneighborhood graph,e4 is in neitherIE(v) nor UE(v).

All edges deemed unimportant by the above definition areremoved fromHt. The resulting graph may not be weaklyconnected anymore and may have multiple weakly connectedcomponents.4 Theorem 2 states that one of the components—called thereduced neighborhood graph, denotedH ′

t—containsall query entities int. In other words,H ′

t is the largest weaklyconnected subgraph ofHt containing all query entities andno unimportant edges. Alg.1 is applied onH ′

t (instead ofHt) to produceMQGt. Since the techniques in the ensuingdiscussion only operate onMQGt, the distinction betweenHt

andH ′t will not be further noted.

Theorem 2 Given the neighborhood graphHt for a querytuple t, the reduced neighborhood graphH ′

t always exists.

Proof We prove by contradiction. Suppose that, after removalof all unimportant edges,Ht becomes a disconnected graph,of which none of the weakly connected components containsall the query entities. The deletion of unimportant edges musthave disconnected at least a pair of query entities, say,viandvj . By Def. 1, before removal of unimportant edges,Ht

must have at least a pathp of length at mostd betweenviand vj . By the definition of unimportant edges, every edgee=(u, v) on p belongs to bothIE(u) and IE(v) and thuscannot be an unimportant edge. However, the fact thatviand vj become disconnected implies thatp consists of atleast one unimportant edge which is deleted. This presentsa contradiction and completes the proof.

D. Multi-tuple Queries

The query graph discovery component ofGQBE essentiallyderives a user’s query intent from input query tuples. Forthat, a single query tuple might not be sufficient. Whilethe experiment results in Sec.VI show that a single-tuplequery obtains excellent accuracy in many cases, the resultsalso exhibit that allowing multiple query tuples often helpsimprove query answer accuracy. This is because importantrelationships commonly associated with multiple query tuplesexpress the user intent more precisely. For instance, suppose auser has provided two query tuples together—〈Jerry Yang, Yahoo!〉and 〈Steve Wozniak, Apple Inc.〉. The query entities in both tuplesshare common properties such asplaces lived in San Jose andheadquartered in a city in California, as shown in Fig.1. This mightindicate that the user is interested in finding people from SanJose who founded technology companies in California.

Given a set of tuplesT , GQBE aims at finding top-kanswer tuples similar toT collectively. To accomplish this,one approach is to discover and evaluate the maximal querygraphs (MQGs) of individual query tuples. The scores of acommon answer tuple for multiple query tuples can then beaggregated. This has two potential drawbacks: (1) Our concernof not being able to well capture user intent still remains. If

4 A weakly connected component of a directed graph is a maximalsubgraphwhere an undirected path exists for every pair of vertices.

6

Fig. 8. Merging Maximal Query Graphs

k is not large enough, a good answer tuple may not appearin enough individual top-k answer lists, resulting in pooraggregated score. (2) It can become expensive to evaluatemultiple MQGs.

We approach this problem by producing a merged andre-weighted MQG that captures the importance of edgeswith respect to their presence across multiple MQGs. Themerged MQG is then processed by the same method forsingle-tuple queries.GQBE employs a simple strategy tomerge multiple MQGs. The individual MQG for a querytuple ti=〈vi1, v

i2, . . . , v

in〉∈T is denotedMti . A virtual MQG

M ′ti

is created for everyMti by replacing the query enti-ties vi1, v

i2, . . . , v

in in Mti with corresponding virtual entities

w1, w2, . . . , wn in M ′ti

. Formally, there exists a bijectivefunction g:V (Mti)→V (M ′

ti) such that (1)g(vij)=wj and

g(v)=v if v/∈ti, and (2)∀e=(u, v)∈E(Mti), there exists anedgee′=(g(u), g(v)) ∈E(M ′

ti) such thatlabel(e)=label(e′);

∀e′=(u′, v′)∈E(M ′ti), ∃e =(g−1(u′), g−1(v′))∈E(Mti) such

that label(e)=label(e′).The merged MQG is denotedMQGT . It is produced by

including vertices and edges in allM ′ti

, merging identicalvirtual and regular vertices, and merging identical edges thatbear the same label and the same vertices on both ends.Formally,

V (MQGT ) =⋃

ti∈T

V (M ′ti) andE(MQGT ) =

⋃

ti∈T

E(M ′ti).

The edge cardinality ofMQGT might be larger than the targetsize r. Thus Alg.1 proposed in Sec.III-A is also used to trimMQGT to a size close tor. In MQGT , the weight of anedgee is given byc∗wmax(e), wherec is the number ofM ′

ti

containinge and wmax(e) is its maximal weight among allsuchM ′

ti.

Example 3 (Merging Maximal Query Graphs) Let Figs. 8(a) and (b) be theMti for query tuples〈Steve Wozniak, Apple Inc.〉and 〈Jerry Yang, Yahoo!〉, respectively. Fig.8(c) is the mergedMQGT . Note that entitiesSteve Wozniak and Jerry Yang aremapped tow1 in their respectiveM ′

ti(not shown, for its

mapping fromMti is simple) and are merged intow1 inMQGT . Similarly, entities Apple Inc. and Yahoo! are mappedand merged intow2. The two founded edges, appearing in bothindividualMti and sharing identical vertices on both ends (w1

Fig. 9. Maximal Query Graph and Query Lattice

and w2) in the correspondingM ′ti

, are merged inMQGT .Similarly the twoplaces lived edges are merged. However, thetwo headquartered in edges are not merged, since they share onlyone end (w2) in M ′

ti. The edgesnationality and education, which

appear in only oneMti , are also present inMQGT . Thenumber next to each edge is its weight.

In comparison to evaluating a single-tuple query, the extraoverhead in handling a multi-tuple query includes creatingmultiple MQGs, which is |T | times the average cost ofdiscovering an individual MQG, and merging them, which islinear in the total edge cardinality of all MQGs.

IV. A NSWERSPACE MODELING

Given the maximal query graphMQGt for a tuple t, wemodel the space of possible query graphs by a lattice. Wefurther discuss the scoring of answer graphs by how well theymatch query graphs.

A. Query Lattice

Definition 6 The query lattice L is a partially ordered set(poset) (QGt, ≺), where≺ represents the subgraph-supergraphsubsumption relation andQGt is the subset of query graphs(Def.2) that are subgraphs ofMQGt, i.e., QGt={Q|Q ∈Qt andQ � MQGt}. The top element (root) of the poset isthusMQGt. When represented by a Hasse diagram, the posetis a directed acyclic graph, in which each node correspondsto a distinct query graph inQGt. Thus we shall use the termslattice nodeand query graphinterchangeably. Thechildren(parents) of a lattice nodeQ are its subgraphs (supergraphs)with one less (more) edge, as defined below.

Children(Q) = {Q′|Q′ ∈ QGt, Q′ ≺ Q, |E(Q)|−|E(Q′)|=1}

Parents(Q) = {Q′|Q′ ∈ QGt, Q ≺ Q′, |E(Q′)|−|E(Q)|=1}

The leaf nodes ofL constitute of theminimal query trees,which are those query graphs that cannot be made any simplerand yet still keep all the query entities connected.

Definition 7 A query graphQ is a minimal query treeif none of its subgraphs is also a query graph. In otherwords, removing any edge fromQ will disqualify it frombeing a query graph—the resulting graph either is not weaklyconnected or does not contain all the query entities. Note thatsuch aQ must be a tree.

7

Example 4 (Query Lattice and Minimal Query Tree)Fig.9(a) shows a maximal query graphMQGt, whichcontains two query entities in shaded circles and five edgesF,G,H,L, and P . Its corresponding query latticeL is inFig.9(b). The root node ofL, denotedFGHLP , representsMQGt itself. The two bottom nodes,F andHL, are the twominimal query trees. Each lattice node is a distinct subgraphof MQGt. For example, the nodeFLP represents a querygraph with only edgesF,L and P . Note that there is nolattice node forGLP , which is not a valid query graph sinceit is not connected.

The construction of the query lattice, i.e., the generationof query graphs corresponding to its nodes, is integratedwith its exploration. In other words, the lattice is built ina “lazy” manner—a lattice node is not generated until thequery algorithm (Sec.V) must evaluate it. The lattice nodesare generated in a bottom-up way. A node is generated byadding exactly one appropriate edge to the query graph forone of its children. The generation of bottom nodes, i.e., theminimal query trees, is described below.

By definition, a minimal query tree can only contain edgeson undirected paths between query entities. Hence, it must bea subgraph of the weakly connected componentMs foundfrom the core graph described in Sec.III-A. To generateall minimal query trees, our method enumerates all distinctspanning trees ofMs by the technique in [10] and then trimthem. Specifically, given one such spanning tree, all non-query entities (nodes) of degree one along with their edgesare deleted. The deletion is performed iteratively until there isno such node. The result is a minimal query tree. Only distinctminimal query trees are kept. Enumerating all spanning treesin a large graph is expensive. However, in our experiments onthe Freebase dataset, theMQGt discovered by the approachin Sec.III mostly contains less than15 edges. Hence, theMs

from the core graph is also empirically small, for which thecost of enumerating all spanning trees is negligible.

B. Answer Graph Scoring Function

The score of an answer graphA (scoreQ(A)) capturesA’ssimilarity to the query graphQ. It is defined below and is tobe plugged into Eq. (1) for defining answer tuple score.

scoreQ(A) = s score(Q) + c scoreQ(A)

s score(Q) =∑

e∈E(Q)

w(e)

c scoreQ(A) =∑

e=(u,v)∈E(Q)e′=(f(u),f(v))∈E(A)

match(e, e′)(5)

In Eq. (5), scoreQ(A) sums up two components—thestructure scoreof Q (s score(Q)) and thecontent scoreforA matchingQ (c scoreQ(A)). s score(Q) is the total edgeweight of Q. It measures the important structure inMQGt

that is captured byQ and thus byA. c scoreQ(A) is the totalextra credit for identical nodes among the matching nodes inAandQ given byf—the bijection betweenV (Q) andV (A) as

in Def.3. For instance, among the6 pairs of matching nodesbetween Fig.5(a) and Fig.6(a), the identical matching nodesareUSA, San Jose andCalifornia. The rationale for the extra creditis that although node matching is not mandatory, the morenodes are matched, the more similarA andQ are.

The extra credit is defined by the following functionmatch(e, e′). Note that it does not award an identical matchingnode excessively. Instead, only a fraction ofw(e) is awarded,where the denominator is either|E(u)| or |E(v)|. (E(u) arethe edges incident onu in MQGt.) This heuristic is based onthat, whenu andf(u) are identical, many of their neighborscan be also identical matching nodes.

match(e, e′)=

w(e)|E(u)| if u=f(u)w(e)|E(v)| if v=f(v)

w(e)min(|E(u)|,|E(v)|) if u=f(u), v=f(v)

0 otherwise

(6)

In discoveringMQGt from Ht by Alg.1, the weights ofedges inHt are defined by Eq. (2) which does not consider anedge’s distance from the query tuple. The rationale behind thedesign is to obtain a balancedMQGt which includes not onlyedges incident on query entities but also those in the largerneighborhood. For scoring answers by Eq. (5) and Eq. (6),however, our empirical observations show it is imperative todifferentiate the importance of edges inMQGt with respectto query entities, in order to capture how well an answer graphmatchesMQGt. Edges closer to query entities convey moremeaningful relationships than those farther away. Hence, wedefine edge depth (d(e)) as follows. The largerd(e) is, theless importante is.

Edge Depth The depthd(e) of an edgee=(u, v) is itssmallest distance to any query entityvi ∈ t, i.e.,

d(e) = minvi∈t

minu,v

{dist(u, vi), dist(v, vi)} (7)

Here,dist(., .) is the shortest length of all undirected paths inMQGt between the two nodes.

In summary,GQBE uses Eq. (2) as the definition ofw(e)in weighting edges inHt. After MQGt is discovered fromHt by Alg.1, it uses the following Eq. (8) as the definition ofw(e) in weighting edges inMQGt. Eq. (8) incorporatesd(e)into Eq. (2). The answer graph scoring functions Eq. (5) andEq. (6) are based on Eq. (8).

w(e) = ief(e) / (p(e)× d2(e)) (8)

V. QUERY PROCESSING

The query processing component ofGQBE takes the max-imal query graphMQGt (Sec.III) and the query latticeL(Sec.IV) and finds answer graphs matching the query graphsin L. Before we discuss howL is evaluated (Sec.V-B), weintroduce the storage model and query plan for processingone query graph (Sec.V-A).

8

A. Processing One Query Graph

The abstract data model of knowledge graph can be repre-sented by the Resource Description Framework (RDF)—thestandard Semantic Web data model. In RDF, a data graphis parsed into a set of triples, each representing an edgee=(u, v). A triple has the form (subject, property, object),corresponding to (u, label(e), v). Among different schemesof RDF data management, one important approach is to userelational database techniques to store and query RDF graphs.To store a data graph, we adopt this approach and, particularly,the vertical partitioning method [1]. This method partitions adata graph into multiple two-column tables. Each table is fora distinct edge label and stores all edges bearing that label.The two columns are (subj, obj), for the edges’ source anddestination nodes, respectively. For efficient query processing,two in-memory search structures (specifically, hash tables) arecreated on the table, usingsubj and obj as the hash keys,respectively. The whole data graph is hashed in memory bythis way, before any query comes in.

Given the above storage scheme, to evaluate a query graphis to process a multi-way join query. For instance, the querygraph in Fig.9(a) corresponds toSELECT F.subj, F.obj FROM

F,G,H,L,P WHERE F.subj=G.sbj AND F.obj=H.subj AND F.subj=L.subj

AND F.obj=P.subj AND H.obj=L.obj. We use right-deep hash-joins toprocess such a query. Consider the topmost join operator in ajoin tree for query graphQ. Its left operand is thebuild relationwhich is one of the two in-memory hash tables for an edgee.Its right operand is theprobe relationwhich is a hash table foranother edge or a join subtree forQ′=Q−e (i.e., the resultinggraph of removinge from Q). For instance, one possible jointree for the aforementioned query isG⊲⊳(F⊲⊳(P⊲⊳(H⊲⊳L))).With regard to its topmost join operator, the left operand isG’s hash table that usesG.sbj as the hash key, and the rightoperand is(F⊲⊳(P⊲⊳(H⊲⊳L))). The hash-join operator iteratesthrough tuples from the probe relation, finds matching tuplesfrom the build relation, and joins them to form answer tuples.

B. Best-first Exploration of Query Lattice

Given a query lattice, a brute-force approach is to evaluateall lattice nodes (query graphs) to find all answer tuples. Itsexhaustive nature leads to clear inefficiency, since we onlyseek top-k answers. Moreover, the potentially many queries areevaluated separately, without sharing of computation. Supposequery graphQ is evaluated by the aforementioned hash-joinbetween the build relation fore and the probe relation forQ′.By definition, Q′ is also a query graph in the lattice, ifQ′

is weakly connected and contains all query entities. In otherwords, in processingQ, we would have processed one of itschildren query graphQ′ in the lattice.

We propose Alg.2, which allows sharing of computation. Itexplores the query lattice in abottom-upway, starting withthe minimal query trees, i.e., the bottom nodes. After a querygraph is processed, its answers are materialized in files. Toprocess a queryQ, at least one of its childrenQ′=Q−e musthave been processed. The materialized results forQ′ form theprobe relation and a hash table one is the build relation.

While any topological order would work for the bottom-up exploration, Alg.2 employs abest-firststrategy that alwayschooses to evaluate the most promising lattice nodeQbest froma set of candidate nodes. The gist is to process the latticenodes in the order of their upper-bound scores andQbest isthe candidate with the highest upper-bound score (Line 3).If processingQbest does not yield any answer graph,Qbest

and all its ancestors are pruned (Line 6) and the upper-boundscores of other candidate nodes are recalculated (Line 7). Thealgorithm terminates, without fully evaluating all lattice nodes,when it has obtained at leastk answer tuples with scoreshigher than the highest possible upper-bound score among allunevaluated nodes (Line 10).

For an arbitrary query graphQ, its upper-bound score isgiven by the best possible scoreQ’s answer graphs can attain.Deriving such upper-bound score based onscoreQ(A) inEq. (5) leads to loose upper-bound.scoreQ(A) sums up thestructure score ofQ (s score(Q)) and the content score forAmatchingQ (c scoreQ(A)). While s score(Q) only dependson Q itself, c scoreQ(A) captures the matching nodes inAandQ. Without evaluatingQ to getA, we can only assumeperfect match(e, e′) in Eq. (5), which is clearly an over-optimism. Under such a loose upper-bound, it can be difficultto achieve an early termination of lattice evaluation.

To alleviate this problem,GQBE takes a two-stage ap-proach. Its query algorithm first finds the top-k′ answers(k′>k) based on the structure scores score(Q) only, i.e.,the algorithm uses a simplified answer graph scoring functionscoreQ(A) = s score(Q). In the second stage,GQBE re-ranks the top-k′ answers by the full scoring function Eq. (5)and returns the top-k answer tuples based on the new scores.Our experiments showed the best accuracy fork ranging from10 to 25 whenk′ was set to around100. Lesser values ofk′

lowered the accuracy and higher values increased the runningtime of the algorithm. In the ensuing discussion, we will notfurther distinctk′ andk.

Below we provide the algorithm details.

C. Details of the Best-first Exploration Algorithm

(1) Selecting Qbest

At any given moment during query lattice evaluation, thelattice nodes belong to three mutually-exclusive sets—theevaluated, the unevaluated and the pruned. A subset of theunevaluated nodes, denoted thelower-frontier (LF ), are can-didates for the node to be evaluated next. At the beginning,LFcontains only the minimal query trees (Line 1 of Alg.2). Aftera node is evaluated, all its parents are added toLF (Line 9).Therefore, the nodes inLF either are minimal query trees orhave at least one evaluated child:

LF = {Q| Q is not pruned,Children(Q)=∅ or(∃Q′ ∈ Children(Q) s.t.Q′ is evaluated)}.

To chooseQbest from LF , the algorithm exploits twoimportant properties, dictated by the query lattice’s structure.

9

Property 1 If Q1 ≺ Q2, then∀A2 ∈ AQ2, ∃A1 ∈ AQ1

s.t.A1 ≺ A2 and tA1

=tA2.

Proof If there exists an answer graphA2 for a query graphQ2, and there exists another query graphQ1 that is a subgraphof Q2, then there is a subgraph ofA2 that corresponds toQ1. By Definition 3, that corresponding subgraph ofA2 isan answer graph toQ1. Since the two answer graphs share asubsumption relationship, the projections of the two yieldthesame answer tuple.

Property 1 says, if an answer tupletA2is projected from

answer graphA2 to lattice nodeQ2, then every descendent ofQ2 must have at least one answer graph subsumed byA2 thatprojects to the same answer tuple. Putting it in an informalway, an answer tuple (graph) to a lattice node can always be“grown” from its descendant nodes and thus ultimately fromthe minimal query trees.

Property 2 If Q1≺Q2, thens score(Q1)<s score(Q2).

Proof If Q1 ≺ Q2, thenQ2 contains all edges inQ1 and atleast one more. Thus the property holds by the definition ofs score(Q) in Eq. (5).

Property 2 says that, if a lattice nodeQ2 is an ancestorof Q1, Q2 has a higher structure score. This can be directlyproved by referring to the definition ofs score(Q) in Eq. (5).

For each unevaluated candidate nodeQ in LF , we define anupper-bound score, which is the best scoreQ’s answer tuplescan possibly attain. The chosen node,Qbest, must have thehighest upper-bound score among all the nodes inLF . Bythe two properties, if evaluatingQ returns an answer graphA, A has the potential to grow into an answer graphA′ to anancestor nodeQ′, i.e.,Q≺Q′ andA≺A′. In such a case,A andA′ are projected to the same answer tupletA=tA′ . The answertuple always gets the better score fromA′, under the simplifiedanswer scoring functionscoreQ(A) = s score(Q), whichAlg.2 adopts as mentioned in Sec. V-B. Hence,Q’s upper-bound score depends on itsupper boundary— Q’s unprunedancestors that have no unpruned parents.

Definition 8 The upper boundary of a node Q in LF ,denotedUB(Q), consists of nodesQ′ in the upper-frontier(UF ) that subsume or equal toQ:

UB(Q) = {Q′| Q′ � Q,Q′ ∈ UF},where

UF is the set of unpruned nodes without unpruned parents:UF={Q| Q is not pruned, ∄Q′ ≻ Q s.t.Q′ is not pruned}.

Definition 9 The upper-bound score of a nodeQ is themaximum score of any query graph in its upper boundary:

U(Q) = maxQ′∈UB(Q)

s score(Q′) (9)

Example 5 (Lattice Evaluation) Consider the lattice inFig.10(a) where the lightly shaded nodes belong to theLF andthe darkly shaded node belongs toUF . At the beginning, onlythe minimal query trees belong to theLF and the maximal

Algorithm 2: Best-first Exploration of Query LatticeInput : query latticeL, query tuplet, and an integerkOutput : top-k answer tuples

1 lower frontierLF ← leaf nodes ofL; Terminate← false;2 while not Terminatedo3 Qbest ← node with the highest upper-bound score inLF ;4 AQbest

← evaluateQbest; (Sec.V-A)5 if AQbest

=∅ then6 pruneQbest and all its ancestors fromL;7 recompute upper-bound scores of nodes inLF (Alg. 3);8 else9 insertParents(Qbest) into LF ;

10 if top-k answer tuples found [Theorem 4]then Terminate←true

Fig. 10. Evaluating Lattice in Figure 9(b)

Fig. 11. Recomputing Upper Boundary of Dirty NodeFG

query graph belongs to theUF . If HL is chosen asQbest andevaluating it results in matching answer graphs, all its parents(GHL, HLP andFHL) are added toLF as shown in Fig.10(b).The evaluated nodeHL is represented in bold dashed node.

(2) Pruning and Lattice RecomputationA lattice node that does not have any answer graph is

referred to as anull node. If the most promising nodeQbest

turns out to be a null node after evaluation, all its ancestorsare also null nodes based on Property 3 below which followsdirectly from Property 1.

Property 3 (Upward Closure) If AQ1= ∅, then∀Q2 ≻ Q1,

AQ2= ∅.

Proof Suppose there is a query nodeQ2 such thatQ1 ≺ Q2

and AQ1= ∅, while AQ2

6= ∅. By Property 1, for everyanswer graphA in AQ2

, there must exist a subgraph ofAthat belongs toAQ1

. This is a contradiction and completesthe proof.

Based on Property 3, whenQbest is evaluated to be a nullnode, Alg.2 prunesQbest and its ancestors, which changes theupper-frontierUF . It is worth noting thatQbest itself may bean upper-frontier node, in which case onlyQbest is pruned. Ingeneral, due to the evaluation and pruning of nodes,LF and

10

Algorithm 3: Recomputing Upper-bound ScoresInput : query latticeL, null nodeQbest, and lower-frontierLFOutput : U(Q) for all Q in LF

1 foreach Q ∈ LF do2 NB ← φ; // set of new upper boundary candidates ofQ.3 foreach Q′ ∈ UB(Q) ∩ UB(Qbest) do4 UB(Q)← UB(Q) \ {Q′};5 UF ← UF \ {Q′};6 V (Q′′)← V (Q′);7 foreach e ∈ E(Qbest) \ E(Q) do8 E(Q′′)← E(Q′) \ {e};9 find Qsub, the weakly-connected component ofQ′′,

containing all query entities;10 NB ← NB ∪ {Qsub};

11 foreach Qsub ∈ NB do12 if Qsub ⊀ (any node inUF or NB) then13 UB(Q)← UB(Q) ∪ {Qsub}, UF ← UF ∪ {Qsub};

14 recomputeU(Q) using Eq. (9);

Q'

Q Qbest

Fig. 12. Venn Diagram of Edges

UF might overlap. For nodes inLF that have at least oneupper boundary node among the pruned ones, the change ofUF leads to changes in their upper boundaries and, sometimes,their upper-bound scores too. We refer to such nodes asdirtynodes. The rest of this section presents an efficient method(Alg. 3) to recompute the upper boundaries, and if changed,the upper-bound scores of the dirty nodes.

Consider all the pairs〈Q,Q′〉 such thatQ is a dirty node inLF , andQ′ is one of its pruned upper boundary nodes. Threenecessary conditions for a new candidate upper boundary nodeof Q are that it is (1) a supergraph ofQ, (2) a subgraphof Q′ and (3) not a supergraph ofQbest. The subsumptionrelationships among these graphs can be visualized in a Venndiagram, as shown in Fig.12. If there areq edges inQbest

but not inQ (the non-intersecting region ofQbest in Fig.12),we create a set ofq distinct graphsQ′′. EachQ′′ contains alledges inQ′ except exactly one of the aforementionedq edges(Line 8 in Alg. 3). For eachQ′′, we find Qsub which is theweakly connected component ofQ′′ containing all the queryentities (Lines 9-10). Lemma 1 and 2 show thatQsub must beone of the unevaluated nodes after pruning the ancestor nodesof Qbest from L.

Lemma 1 Qsub is a query graphand it does not belong tothe pruned nodes of latticeL.

Proof Qsub is a query graph because it is weakly connectedand it contains all the input entities. SupposeQsub is a newlygenerated candidate upper boundary node from pair〈Q,Q′〉and Qsub belongs to the pruned nodes of latticeL. This

can happen only due to one of the two reasons: 1) it isa supergraph of the current null nodeQbest or 2) it is analready pruned node. The former cannot happen since theconstruction mechanism ofQsub proposed ensures that it isnot a supergraph ofQbest. the latter implies thatQsub wasthe supergraph of an previously evaluated null node (orQsub

itself was a null node). In this case, sinceQsub ≺ Q′, Q′

would also have been pruned and thus could not have beenpart of the upper-boundary. Hence〈Q,Q′〉 cannot be a validpair for recomputing the upper boundary ifQsub is a prunednode. This completes the proof.

Lemma 2 Q � Qsub.

Proof Based on Alg. 3 described above,Q′′ is the result ofdeleting one edge fromQ′ and that edge does not belong toQ. Therefore,Q is subsumed byQ′′. By the same algorithm,Qsub is the weakly connected component ofQ′′ that containsall the query entities. SinceQ already is weakly connectedand contains all the query entities,Qsub must be a supergraphof Q.

If Qsub (a candidate new upper boundary node ofQ) is notsubsumed by any node in the upper-froniter or other candidatenodes, we addQsub to UB(Q) andUF (Lines 11-13). Finally,we recomputeQ’s upper-bound score (Line 14). Theorem 3justifies the correctness of the above procedure.

Theorem 3 If Qbest is evaluated to be a null node, then Alg.3identifies all new upper boundary nodes for every dirty nodeQ.

Proof For any dirty nodeQ, its original upper boundaryUB(Q) consists of two sets of nodes: (1) nodes that are notsupergraphs ofQbest and thus remain in the lattice, (2) nodesthat are supergraphs ofQbest and thus are pruned. By thedefinition of upper boundary node, no upper boundary nodeof Q can be a subgraph of any node in set (1). So any newupper boundary node ofQ must be a subgraph of a nodeQ′ inset (2). For every pruned upper boundary nodeQ′ in set (2),the algorithm enumerates all (specificallyq) possible childrenof Q′ that are not supergraphs ofQbest but are supergraphs ofQ. For each enumerated graphQ′′, the algorithm findsQsub,which is the weakly connected component ofQ′′ containingall query entities. Thus all new upper boundary nodes ofQare identified.

Example 6 (Recomputing Upper Boundary) Consider thelattice in Fig.11(a) where nodesHL and F are the evaluatednodes and the lightly shaded nodes belong to the newLF .If node GHL is the currently evaluated null nodeQbest andFGHLP is Q′, let FG be the dirty nodeQ whose upperboundary is to be recomputed. The edges inQbest that are notpresent inQ are H and L. A new upper boundary nodeQ′′

contains all edges inQ′ excepting exactly eitherH or L. Thisleads to two new upper boundary nodes,FGHP andFGLP, byremovingL and H from FGHLP, respectively. SinceFGHPandFGLP do not subsume each other and are not subgraphs

11

of any other upper-frontier node, they are now part ofUB(Q)and the newUF . Fig.11(b) shows the modified lattice wherethe pruned nodes are disconnected.FHLP is another node inUF that is discovered using dirty nodes such asFL andHLP.

Complexity Analysis of Alg.3 The query graphs corre-sponding to lattice nodes are represented using bit vectorssince we exactly know the edges involved in all the querygraphs. The bit corresponding to an edge is set if its presentin the query graph. Identifying the dirty nodes, null upperboundary nodes and building a new potential upper boundarynode using a pair of nodes〈Q,Q′〉, can be accomplishedusing bit operations and each step incursO(|E(MQGt)|)time. Finding the weakly connected component of a potentialupper boundary using DFS takesO(|E(Q′)|) time. If Ln isthe set of all null nodes encountered in the lattice and thereareDp such pairs for every null node andq is the average numberof potential new upper boundary nodes created per pair, theworst case time complexity of recomputing the upper-frontieris O(|Ln|×Dp × q × |E(MQGt)|). Our experimental resultsshow low average values of|Ln|, Dp and q with |Ln| beingonly 1% of |L|, Dp around 8 andq around 9. In practice, ourupper-frontier recomputation algorithm quickly computesthedynamically changing lattice.

(3) Termination

After Qbest is evaluated, its answer tuples are{tA|A∈AQbest

}. For a tA projected from answer graphA, the score assigned byQbest to A (and thus tA) iss score(Qbest), based on scoreQ(A)=s score(Q)—thesimplified scoring function adopted by Alg.2. IftA was alsoprojected from already evaluated nodes, it has a current score.By Def.4, the final score oftA will be from its best answergraph. Hence, ifs score(Qbest) is higher than its currentscore, then its score is updated. In this way, all found answertuples so far are kept and their current scores are maintainedto be the highest scores they have received. The algorithmterminates when the current score of thekth best answertuple so far is greater than the upper-bound score of the nextQbest chosen by the algorithm, by Theorem 4.

Theorem 4 If the score of the currentkth best answer tuple isgreater thanU(Qbest), then terminating the lattice evaluationguarantees that the current top-k answer tuples have scoreshigher thans score(Q) for any unevaluated query graphQ.

Proof Suppose, upon termination, there is an unevaluatedquery graphQ whoses score(Q) is greater than the scoreof the kth answer tuple. This implies that there exists somenode in the lower-frontier whose upper-bound score is at leasts score(Q) and is thus greater than the score of thekth

answer tuple. Since the termination condition precludes this,it is a contradiction to the initial assumption. We thus cannothave any unevaluated query graph whose structure score isgreater than thekth answer tuple’s score upon termination.

Query Query Tuple Table Size

F1 〈Donald Knuth, Stanford University, Turing Award〉 18F2 〈Ford Motor, Lincoln, Lincoln MKS〉 25F3 〈Nike, Tiger Woods〉 20F4 〈Michael Phelps, Sportsman of the Year〉 55F5 〈Gautam Buddha, Buddhism〉 621F6 〈Manchester United, Malcolm Glazer〉 40F7 〈Boeing, Boeing C-22〉 89F8 〈David Beckham, A. C. Milan〉 94F9 〈Beijing, 2008 Summer Olympics〉 41F10 〈Microsoft, Microsoft Office〉 200F11 〈Jack Kirby, Ironman〉 25F12 〈Apple Inc, Sequoia Capital〉 300F13 〈Beethoven, Symphony No. 5〉 600F14 〈Uranium, Uranium-238〉 26F15 〈Microsoft Office, C++〉 300F16 〈Dennis Ritchie, C〉 163F17 〈Steven Spielberg, Minority Report〉 40F18 〈Jerry Yang, Yahoo!〉 8349F19 〈C〉 1240F20 〈TomKat〉 16D1 〈Alan Turing, Computer Scientist〉 52D2 〈David Beckham, Manchester United〉 273D3 〈Microsoft, Microsoft Excel〉 300D4 〈Steven Spielberg, Catch Me If You Can〉 37D5 〈Boeing C-40 Clipper, Boeing〉 118D6 〈Arnold Palmer, Sportsman of the year〉 251D7 〈Manchester City FC, Mansour bin Zayed Al Nahyan〉 40D8 〈Bjarne Stroustrup, C++〉 964

TABLE IQUERIES AND GROUND TRUTH TABLE SIZE

VI. EXPERIMENTS

This section presents our experiment results on the accuracyand efficiency ofGQBE. The experiments were conducted ona double quad-core24 GB Memory2.0 GHz Xeon server.

Datasets The experiments were conducted over two largereal-world knowledge graphs, the Freebase [4] and DBpe-dia [3] datasets. We preprocessed the graphs so that thekept nodes are all named entities (e.g.,Stanford University) andabstract concepts (e.g.,Jewish people). The resulting Freebasegraph contains28M nodes,47M edges, and5, 428 distinctedge labels. The DBpedia graph contains759K nodes,2.6Medges and9, 110 distinct edge labels.

Methods Compared GQBE was compared with aBaselineand NESS [18]. NESS is a graph querying framework thatfinds approximate matches of query graphs with unlabelednodes which correspond to query entity nodes in MQG. Notethat, like other systems,NESS must take a query graph (insteadof a query tuple) as input. Hence, we feed the MQG discoveredby GQBE as the query graph toNESS. For each nodev inthe query graph, a set of candidate nodes in the data graph areidentified. Since,NESS does not consider edge-labeled graphs,we adapted it by requiring each candidate nodev′ of v to haveat least one incident edge in the data graph bearing the samelabel of an edge incident onv in the query graph. The score ofa candidatev′ is the similarity between the neighborhoods ofvandv′, represented in the form of vectors, and further refinedusing an iterative process. Finally, one unlabeled query nodeis chosen as the pivotp. The top-k candidates for multipleunlabeled query nodes are put together to form answer tuples,if they are within the neighborhood ofp’s top-k candidates.Similar to the best-first method (Sec.V),Baseline explores aquery lattice in a bottom-up manner and prunes ancestorsof null nodes. However, differently, it evaluates the lattice

12

Query Tuple Top-3 Answer Tuples

〈D. Knuth, Stanford, V. Neumann Medal〉〈Donald Knuth, Stanford, Turing Award〉〈J. McCarthy, Stanford, Turing Award〉

〈N. Wirth, Stanford, Turing Award〉〈David Filo, Yahoo!〉

〈Jerry Yang, Yahoo!〉〈Bill Gates, Microsoft〉〈Steve Wozniak, Apple Inc.〉

〈Java〉〈C〉〈C++〉

〈C Sharp〉

TABLE IICASE STUDY: TOP-3 RESULTS FORSELECTED QUERIES

by breadth-first traversal instead of in the order of upper-bound scores. There is no early-termination by top-k scores,asBaseline terminates when every node is either evaluated orpruned. We implementedGQBE andBaseline in Java and weobtained the source code ofNESS from the authors.

Queries and Ground Truth Two groups of queriesare used on the two datasets, respectively. The Freebasequeries F1– F20 are based on Freebase tables such ashttp://

www.freebase.com/view/computer/programming language designer?instances,except F1 and F6 which are from Wikipedia tables suchas http://en.wikipedia.org/wiki/List of English football club owners. TheDBpedia queries D1– D8 are based on DBpedia tablessuch as the values for propertyis dbpedia-owl:author of on pagehttp://dbpedia.org/page/Microsoft. Each such table is a collection oftuples, in which each tuple consists of one, two, or threeentities. For each table, we used one or more tuples asquery tuples and the remaining tuples as the ground truth forquery answers. All the28 queries and their correspondingtable sizes are summarized in Table I. They cover diversedomains, including people, companies, movies, sports,awards, religions, universities and automobiles.Sample Answers Table II only lists the top-3 results foundbyGQBE for 3 queries (F1, F18, F19), due to space limitations.(A) Accuracy Based on Ground Truth

We measured the accuracy ofGQBE and NESS by com-paring their query results with the ground truth. The accuracyon a set of queries is the average of accuracy on individualqueries. The accuracy on a single query is captured by threewidely-used measures [23], as follows.• Precision-at-k (P@k): the percentage of the top-k resultsthat belong to the ground truth.

• Mean Average Precision (MAP): The average precision of

the top-k results is AvgP=∑k

i=1P@i × reli

size of ground truth, wherereli equals1 if the result at ranki is in the ground truth and0 otherwise.MAP is the mean of AvgP for a set of queries.

• Normalized Discounted Cumulative Gain (nDCG):The cumulative gain of the top-k results isDCGk=rel1+

∑ki=2

relilog

2(i) . It penalizes the results if a

ground truth result is ranked low. DCGk is normalized byIDCGk, the cumulative gain for an ideal ranking of thetop-k results. Thus nDCGk= DCGk

IDCGk.

Fig.13 shows these measures for different values ofk onthe Freebase queries.GQBE has high accuracy. For instance,its P@25 is over0.8. The absolute value of MAP is not high,merely because Fig.13(b) only shows the MAP for at most top-25 results, while the ground truth size (i.e., the denominatorin

0 0.2 0.4 0.6 0.8

1

10 15 20 25

Precis

ion

-at-

K

K

GQBE NESS(a) P@k

0.1

0.2

0.3 0.35

10 15 20 25

MA

P

K

GQBE NESS(b) MAP

0 0.2 0.4 0.6 0.8

1

10 15 20 25

nD

CG

K

GQBE NESS(c) nDCG

Fig. 13. Accuracy ofGQBE andNESS on Freebase Queries

Query P@k nDCG AvgP Query P@k nDCG AvgP

D1 1.00 1.00 0.20 D2 1.00 1.00 0.04D3 1.00 1.00 0.03 D4 0.80 0.94 0.19D5 0.90 1.00 0.08 D6 1.00 1.00 0.04D7 0.90 0.98 0.22 D8 1.00 1.00 0.01

TABLE IIIACCURACY OFGQBE ON DBPEDIA QUERIES, k=10

calculating MAP) for many queries is much larger. Moreover,GQBE outperformsNESS substantially, as its accuracy in allthree measures is almost always twice as better. This is becauseGQBE gives priority to query entities and important edges inMQG, while NESS gives equal importance to all nodes andedges except the pivot. Furthermore, the wayNESS handlesedge labels does not explicitly require answer entities to beconnected by the same paths between query entities.

Table III further shows the accuracy ofGQBE on individualDBpedia queries atk=10. It exhibits high accuracy on allqueries, including perfect precision in several cases.

(B) Accuracy Based on User StudyWe conducted an extensive user study through Amazon

Mechanical Turk (MTurk,https://www.mturk.com/mturk/) to evaluateGQBE’s accuracy on Freebase queries, measured by PearsonCorrelation Coefficient (PCC). For each of the20 queries, weobtained the top-30 answers fromGQBE and generated50random pairs of these answers. We presented each pair to20MTurk workers and asked for their preference between the twoanswers in the pair. Hence, in total,20, 000 opinions wereobtained. We then constructed two value lists per query,XandY , which representGQBE and MTurk workers’ opinions,respectively. Each list has50 values, for the50 pairs. Foreach pair, the value inX is the difference between thetwo answers’ ranks given byGQBE, and the value inYis the difference between the numbers of workers favoringthe two answers. The PCC value for a query is(E(XY ) −E(X)E(Y ))/(

√

E(X2)− (E(X))2√

E(Y 2)− (E(Y ))2). Thevalue indicates the degree of correlation between the pairwiseranking orders produced byGQBE and the pairwise pref-erences given by MTurk workers. The value range is from−1 to 1. A PCC value in the ranges of [0.5,1.0], [0.3,0.5)and [0.1,0.3) indicates a strong, medium and small positivecorrelation, respectively [7]. PCC is undefined, by definition,whenX and/orY contain all equal values.

Table IV shows the PCC values for F1-F20. Out of the20queries,GQBE attained strong, medium and small positivecorrelation with MTurk workers on9, 5 and 3 queries,respectively. Only query F7 shows no correlation. Note thatPCC is undefined for F12 and F13, because all the top-30answer tuples have the same score and thus the same rank,

13

Query Tuple1 Tuple2 Combined (1,2) Tuple3 Combined (1,2,3)

P@k nDCG AvgP P@k nDCG AvgP P@k nDCG AvgP P@k nDCG AvgP P@k nDCG AvgPF1 0.36 0.76 0.32 0.36 1.00 0.50 0.12 0.38 0.02 0.36 0.73 0.22 0.12 0.49 0.02F2 0.76 1.00 0.79 0.00 0.00 0.00 0.80 1.00 0.80 0.12 0.70 0.05 0.80 1.00 0.91F4 0.32 0.73 0.09 0.40 0.65 0.08 1.00 1.00 0.45 N/A N/A N/A N/A N/A N/AF6 0.24 0.89 0.16 0.28 0.89 0.18 0.40 0.87 0.16 0.36 0.98 0.22 0.12 0.94 0.07F8 0.92 0.79 0.20 1.00 1.00 0.27 0.96 0.98 0.24 0.48 0.86 0.08 1.00 1.00 0.27F9 0.68 0.72 0.23 0.56 0.66 0.17 0.80 0.86 0.35 1.00 1.00 0.62 1.00 1.00 0.66F17 0.32 1.00 0.33 0.64 0.83 0.25 0.32 1.00 0.32 0.56 0.84 0.23 0.68 1.00 0.46

TABLE VACCURACY OFGQBE ON MULTI -TUPLE QUERIES, k=25

1

10

100

1000

F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20

Qu

ery

Pro

cess

ing

Tim

e (

secs

.)

Query

GQBE NESS Baseline

12 13 18 10 8 10 8 12 8 8 11 9 7 11 8 9 9710 7# edges

in MQG

Fig. 14. Query Processing Time

0

40

80

120

160

200

F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20

# o

f La

ttic

e N

od

es

Ev

alu

ate

d

Query

GQBE Baseline840

12 13 18 10 8 10 8 12 8 8 11 9 7 11 8 9 7 9710# edges

in MQG

Fig. 15. Lattice Nodes Evaluated

Query PCC Query PCC Query PCC Query PCCF1 0.79 F2 0.78 F3 0.60 F4 0.80F5 0.34 F6 0.27 F7 0.06 F8 0.26F9 0.33 F10 0.77 F11 0.58 F12 undefinedF13 undefined F14 0.62 F15 0.43 F16 0.29F17 0.64 F18 0.30 F19 0.40 F20 0.65

TABLE IVPEARSONCORRELATION COEFFICIENT(PCC)BETWEENGQBE AND

AMAZON MTURK WORKERS, k=30

resulting in all zero values inX , i.e.,GQBE’s list.

(C) Accuracy on Multi-tuple QueriesWe investigated the effectiveness of the multi-tuple querying

approach (Sec.III-D). In aforementioned single-tuple queryexperiment (A),GQBE attained perfect P@25 for 13 of the20Freebase queries. We thus focused on the remaining7 queries.For each query, Tuple1 refers to the query tuple in Table I,while Tuple2 and Tuple3 are two tuples from its ground truth.Table V shows the accuracy of top-25 GQBE answers forthe three tuples individually, as well as for the first two andthree tuples together by merged MQGs, which are denotedCombined(1,2) and Combined(1,2,3), respectively. F4 attainedperfect precision after Tuple2 was included. Therefore, Tuple3was not used for F4. The results show that, in most cases,Combined(1,2) had better accuracy than individual tuples andCombined(1,2,3) further improved accuracy.

(D) Efficiency ResultsWe compared the efficiency ofGQBE, NESS andBaseline

on Freebase queries. The total run time for a query tupleis spent on two components—query graph discovery andquery processing. Fig.14 compares the three methods’ queryprocessing time, in logarithmic scale. The figure shows thequery processing time for each of the20 Freebase queries,and the edge cardinality of the MQG for each of those isshown below the corresponding query id. The query cost doesnot appear to increase by edge cardinality, regardless of thequery method. ForGQBE andBaseline, this is because querygraphs are evaluated by joins and join selectivity plays a

Query MQG1 MQG2 Merge Query MQG1 MQG2 MergeF1 73.141 73.676 0.034 F2 0.049 0.029 0.006F3 12.566 4.414 0.024 F4 5.731 7.083 0.024F5 9.982 2.522 0.079 F6 6.082 4.654 0.039F7 0.152 0.107 0.007 F8 10.272 2.689 0.032F9 62.285 2.384 0.041 F10 2.910 5.933 0.030F11 59.541 65.863 0.032 F12 1.977 0.021 0.006F13 9.481 5.624 0.034 F14 0.038 0.015 0.004F15 0.154 5.143 0.021 F16 54.870 6.928 0.057F17 60.582 69.961 0.041 F18 58.807 75.128 0.053F19 0.224 0.076 0.003 F20 0.025 0.017 0.002

TABLE VIT IME FOR DISCOVERING AND MERGINGMQGS (SECS.)

more significant role in evaluation cost than number of edges.NESS finds answers by intersecting postings lists on featurevectors. Hence, in evaluation cost, intersection size mattersmore than edge cardinality.GQBE outperformedNESS on 17of the 20 queries and was more than3 times faster in10 ofthem. It finished within10 seconds on17 queries. However,it performed very poorly on F4 and F19, which have10 and7 edges respectively. This indicates that the edges in the twoMQGs lead to poor join selectivity.Baseline clearly suffered,due to its inferior pruning power compared to the best-firstexploration employed byGQBE. This is evident in Fig.15which shows the numbers of lattice nodes evaluated for eachquery.GQBE evaluated considerably less nodes in most casesand at least2 times less on11 of the 20 queries.

MQG discovery precedes the query processing step andis shared by all three methods. Column MQG1 in Table VIlists the time spent on discovering MQG for each Freebasequery. This time component varies across individual queries,depending on the sizes of query tuples’ neighborhood graphs.Compared to the values shown in Fig.14, the time taken todiscover an MQG in average is comparable to the time spentby GQBE in evaluating it.

Fig.16 shows the distribution of theGQBE’s query process-ing time, in logarithmic scale, on the merged MQGs of 2-tuplequeries in Table V, denoted by Combined(1,2). It also showsthe distribution of the total time for evaluating the two tuples’

14

1.0E-2

1.0E-1

1.0E+0

1.0E+1

1.0E+2

1.0E+3

1.0E+4

Qu

ery

Pro

cess

ing

Tim

e (

secs

.)

Combined (1,2) Tuple1+Tuple2

Fig. 16. Query Processing Time of2-tuple Queries

MQGs individually, denoted Tuple1+Tuple2. Combined(1,2)processes10 of the20 queries in less than a second while thefastest query for Tuple1+Tuple2 takes a second. This suggeststhat the merged MQGs gave higher weights to more selectiveedges, resulting in faster lattice evaluation. Meanwhile,theseselective edges are also more important edges common tothe two query tuples, leading to improved answer accuracyas shown in Table V. Table VI further shows the time takento discover MQG1 and MQG2 for the two tuples, along withthe time for merging them. The latter is negligible comparedto the former.

VII. R ELATED WORK

Lim et al. [21] use example tuples to find similar tuples indatabase tables that are coupled with ontologies. They do notdeal with graph data and the example tuples are not formedby entities.

The goal of set expansionis to grow a set of objectsstarting from seed objects. Example systems include [30],[11], and the now defunctGoogle Sets andSquared services(http://en.wikipedia.org/wiki/List of Google products). Chang et al. [5]identify top-k correlated keyword terms from an informationnetwork given a set of terms, where each term can be anentity. These systems, except [5], do not operate on datagraphs. Instead, they find existing answers within structuresin web pages such as HTML tables and lists. Furthermore, allthese systems exceptGoogle Squared and [11] take a set ofindividual entities as input.GQBE is more general in that eachquery tuple contains multiple entities.

Several works [28], [17], [9] identify the best sub-graphs/paths in a data graph to describe how several inputnodes are related. The query graph discovery component ofGQBE is different in important ways– (1) The graphs in [28]contain nodes of the same type and edges representing thesame relationship, e.g., social networks capturing friendshipbetween people. The graphs inGQBE and others [17], [9]have many different types of entities and relationships. (2)The paths discovered by their techniques only connect theinput nodes.REX [9] has the further limitation of allowingonly two input entities. Differently the maximal query graphin GQBE includes edges incident on individual query entities.(3) GQBE uses the discovered query graph to find answergraphs and answer tuples, which is not within the focus of theaforementioned works.

There are many studies on approximate/inexact subgraphmatching in large graphs, such asG-Ray [29], TALE [27] and

NESS [18]. GQBE’s query processing component is differentfrom them on several aspects. First,GQBE only requiresto match edge labels and matching node identifiers is notmandatory. This is equivalent to matching a query graphwith all unlabeled nodes and thereby significantly increasesthe problem complexity. Only a few previous methods (e.g.,NESS [18]) allow unlabeled query nodes. Second, inGQBE,the top-k query algorithm centers around query entities. Morespecifically, the weighting function gives more importanceto query entities and the minimal query trees mandate thepresence of entities corresponding to query entities. On thecontrary, previous methods give equal importance to all nodesin a query graph, since the notion of query entity does notexist there. Our empirical results show that this differencemakes NESS produce less accurate answers thanGQBE.Finally, although the query relaxation DAG proposed in [2]is similar toGQBE’s query lattice, the scoring mechanism oftheir relaxed queries is different and depends on XML-basedrelaxations.

VIII. C ONCLUSION

We introduceGQBE, a system that queries knowledgegraphs by example entity tuples. As an initial step towardbetter usability of graph query systems,GQBE saves usersthe burden of forming explicit query graphs. To the best ofour knowledge, there has been no such proposal in the past.Its query graph discovery component derives a hidden querygraph based on example tuples. The query lattice based onthis hidden graph may contain a large number of query graphs.GQBE’s query algorithm only partially evaluates query graphsfor obtaining the top-k answers. Experiments on Freebase andDBpedia datasets show thatGQBE outperforms the state-of-the-art systemNESS on both accuracy and efficiency.

IX. A CKNOWLEDGEMENTS

The authors would like to thank Mahesh Gupta for hisvaluable contributions in performing the experiments.

REFERENCES

[1] D. J. Abadi, A. Marcus, S. Madden, and K. J. Hollenbach. Scalablesemantic web data management using vertical partitioning.In VLDB’07.

[2] S. Amer-Yahia, N. Koudas, A. Marian, D. Srivastava, and D. Toman.Structure and content scoring for xml. InVLDB, 2005.

[3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak,and Z. Ives.DBpedia: A nucleus for a Web of open data. InISWC, 2007.

[4] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: acollaboratively created graph database for structuring human knowledge.In SIGMOD, pages 1247–1250, 2008.

[5] L. Chang, J. X. Yu, L. Qin, Y. Zhu, and H. Wang. Finding informationnebula over large networks. InCIKM, 2011.

[6] D. H. Chau, C. Faloutsos, H. Tong, J. I. Hong, B. Gallagher, andT. Eliassi-Rad. GRAPHITE: A visual query system for large graphs.In ICDM Workshops, pages 963–966, 2008.

[7] J. Cohen. Statistical Power Analysis for the Behavioral Sciences.Lawrence Erlbaum Associates, 1988.

[8] E. Demidova, X. Zhou, and W. Nejdl. FreeQ: an interactivequeryinterface for Freebase. InWWW, demo paper, 2012.

[9] L. Fang, A. D. Sarma, C. Yu, and P. Bohannon. REX: explainingrelationships between entity pairs. InPVLDB, pages 241–252, 2011.

[10] H. N. Gabow and E. W. Myers. Finding all spanning trees ofdirectedand undirected graphs.SIAM J. Comput., 7(3):280–287, 1978.

15

[11] R. Gupta and S. Sarawagi. Answering table augmentationqueries fromunstructured lists on the web. InVLDB, pages 289–300, 2009.

[12] H. V. Jagadish, A. Chapman, A. Elkiss, M. Jayapandian, Y. Li, A. Nandi,and C. Yu. Making database systems usable. InSIGMOD, 2007.

[13] M. Jarrar and M. D. Dikaiakos. A query formulation language for thedata web.TKDE, 24:783–798, 2012.

[14] N. Jayaram, M. Gupta, A. Khan, C. Li, X. Yan, and R. Elmasri. GQBE:Querying knowledge graphs by example entity tuples. InICDE (demodescription), 2014 (to appear).

[15] C. Jin, S. S. Bhowmick, X. Xiao, J. Cheng, and B. Choi. GBLENDER:Towards blending visual query formulation and query processing ingraph databases. InSIGMOD, pages 111–122, 2010.

[16] M. Kargar and A. An. Keyword search in graphs: Finding r-cliques.PVLDB, pages 681–692, 2011.

[17] G. Kasneci, S. Elbassuoni, and G. Weikum. MING: mining informativeentity relationship subgraphs. InCIKM, 2009.

[18] A. Khan, N. Li, X. Yan, Z. Guan, S. Chakraborty, and S. Tao. Neigh-borhood based fast graph search in large networks. InSIGMOD’11.

[19] A. Khan, Y. Wu, and X. Yan. Emerging graph queries in linked data.In ICDE, pages 1218–1221, 2012.

[20] Z. Li, S. Zhang, X. Zhang, and L. Chen. Exploring the constrainedmaximum edge-weight connected graph problem.Acta MathematicaeApplicatae Sinica, 25:697–708, 2009.

[21] L. Lim, H. Wang, and M. Wang. Semantic queries by example.In Proceedings of the 16th International Conference on ExtendingDatabase Technology (EDBT 2013), 2013.

[22] Y. Luo, X. Lin, W. Wang, and X. Zhou. Spark: top-k keywordquery inrelational databases. InSIGMOD, 2007.

[23] C. D. Manning, P. Raghavan, and H. Schtze.Introduction to InformationRetrieval. Cambridge University Press, NY, USA, 2008.

[24] J. Pound, I. F. Ilyas, and G. E. Weddell. Expressive and flexible accessto web-extracted data: a keyword-based structured query language. InSIGMOD, pages 423–434, 2010.

[25] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core ofsemanticknowledge unifying WordNet and Wikipedia. InWWW, 2007.

[26] Y. Sun, J. Han, X. Yan, P. S. Yu, , and T. Wu. PathSim: Meta path-basedtop-k similarity search in heterogeneous information networks. VLDB,2011.

[27] Y. Tian and J. M. Patel. TALE: A tool for approximate large graphmatching. InICDE, pages 963–972, 2008.

[28] H. Tong and C. Faloutsos. Center-piece subgraphs: Problem definitionand fast solutions. InKDD, pages 404–413, 2006.

[29] H. Tong, C. Faloutsos, B. Gallagher, and T. Eliassi-Rad. Fast best-effortpattern matching in large attributed graphs.KDD, 2007.

[30] R. C. Wang and W. W. Cohen. Language-independent set expansion ofnamed entities using the web. InICDM, pages 342–350, 2007.

[31] W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: a probabilistictaxonomy for text understanding. InSIGMOD, pages 481–492, 2012.

[32] M. Yahya, K. Berberich, S. Elbassuoni, M. Ramanath, V. Tresp, andG. Weikum. Deep answers for naturally asked questions on theweb ofdata. InWWW, demo paper, pages 445–449, 2012.

[33] J. Yao, B. Cui, L. Hua, and Y. Huang. Keyword query reformulation onstructured data.ICDE, pages 953–964, 2012.

[34] X. Yu, Y. Sun, P. Zhao, and J. Han. Query-driven discovery of seman-tically similar substructures in heterogeneous networks.In KDD’12.

[35] M. M. Zloof. Query by example. InAFIPS, 1975.

16

Documents

Querying Knowledge Graphs by Example Entity Tuples · Our work is the ﬁrst to query knowledge graphs by example entity tuples. The paradigm of query-by-example(QBE) has a long history