

Neighborhood Based Fast Graph Search in Large Networks

Arijit Khan
Dept. of Computer Science
University of California
Santa Barbara, CA 93106
arijitkhan@cs.ucsb.edu

Nan Li
Dept. of Computer Science
University of California
Santa Barbara, CA 93106
nanli@cs.ucsb.edu

Xifeng Yan
Dept. of Computer Science
University of California
Santa Barbara, CA 93106
xyan@cs.ucsb.edu

Ziyu Guan
Dept. of Computer Science
University of California
Santa Barbara, CA 93106
ziyuguan@cs.ucsb.edu

Supriyo Chakraborty
Dept. of Electrical Engineering
University of California
Los Angeles, CA 90095
supriyo@ee.ucla.edu

Shu Tao
IBM T. J. Watson
19 Skyline Drive
Hawthorne, NY 10532
shutao@us.ibm.com

ABSTRACT

Complex social and information network search becomes important with a variety of applications. At the core of these applications lies a common and critical problem: given a labeled network and a query graph, how to efficiently search the query graph in the target network. The presence of noise and the incomplete knowledge about the structure and content of the target network make it unrealistic to find an exact match. Rather, it is more appealing to find the top-k approximate matches.

In this paper, we propose a neighborhood-based similarity measure that could avoid costly graph isomorphism and edit distance computation. Under this new measure, we prove that subgraph similarity search is NP-hard, while graph similarity match is polynomial. By studying the principles behind this measure, we found an information propagation model that is able to convert a large network into a set of multidimensional vectors, where sophisticated indexing and similarity search algorithms are available. The proposed method, called Ness (Neighborhood Based Similarity Search), is appropriate for graphs with low automorphism and high noise, which are common in many social and information networks. Ness is not only efficient, but also robust against structural noise and information loss. Empirical results show that it can quickly and accurately find high-quality matches in large networks, with negligible cost.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Search process; I.2.8 [Problem Solving, Control Methods, and Search]: Graph and tree search strategies

General Terms

Algorithms, Performance

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD'11, June 12-16, 2011, Athens, Greece. Copyright 2011 ACM 978-1-4503-0661-4/11/06 ...$10.00.

Keywords

Graph Query, Graph Search, Graph Alignment, RDF

1. INTRODUCTION

Recent advances in social and information science have shown that linked data pervade our society and the natural world around us [36]. Graphs become increasingly important to represent complicated structures and schema-less data such as Wikipedia, Freebase [15] and various social networks. Given an attributed network and a small query graph, how to efficiently search the query graph in the target network is a critical task for many graph applications. It has been extensively studied in chemi-informatics, bioinformatics, XML and Semantic Web. SPARQL [27] is the state-of-the-art RDF query language for Semantic Web. SPARQL requires accurate knowledge about the graph structure to write a query, and it performs exact graph pattern matching. However, due to the noise and the incomplete information (structure and content) in many networks, it is not realistic to find exact matches for a given query. It is more appealing to find the top-k approximate matches.

Unfortunately, graph similarity measures such as subgraph isomorphism, maximum common subgraphs, graph edit distance, and missing edges, which are appropriate for chemical structures and biological networks, are not suitable for entity-relationship graphs and social networks. There are two challenging issues for these graph theoretic measures. First, entity-relationship graphs and social networks have quite different characteristics from physical networks. They are not governed by physical laws and are often full of noise, thus making strict topological similarity examination nearly impossible. How the entities are connected in these networks is not as important as how closely these entities are connected. Second, these graphs are very large and complex, with many associated attributes. If accuracy is to be ensured, the algorithms developed for edit distance and missing edges are not scalable. These two issues motivate us to invent new graph similarity measures that are less sensitive to structure changes and have scalable indexing and search solutions.

Figure 1(a) shows a graph query to "Find the athlete who is from Romania and won gold in 3000m and bronze in 1500m, both in the 1984 Olympics" (footnote 1). Comparing this query against a possible match in Freebase (Olympics) shown in Figure 1(b), it is observed that these two graphs are by no means similar under traditional graph similarity definitions. The graph edit distance between these two graphs is 7. The size of their maximum common subgraph is 3. The maximum number of missing edges for the query graph is 4. However, Maricica Puica in Figure 1(b) is a good match for the query shown in Figure 1(a), because she has all these attributes quite close to her in Figure 1(b). In practice, it is hard to come up with a query that exactly conforms with the graph structures in the target network due to the lack of schemas in linked data. However, it is easy to write a query like Figure 1(a), where a user connects entities with possible links. As long as the proximity between these entities is approximately maintained in a query graph, the system shall be able to deliver matches like Figure 1(b).

[Figure 1: Top-1 Match for Query (a) in FreeBase. Panels: (a) Query; (b) Match in Freebase. Node labels shown: Romania, 1984, Bronze, Gold, 1500m, 3000m, Maricica Puica.]

1 http://en.wikipedia.org/wiki/Maricica_Puica

The above approximate query form can serve as a primitive for many advanced graph operators such as RDF query answering, network alignment, subgraph similarity search, name disambiguation and database schema matching. For example, based on partial information related to one person, e.g., his friends, one can align his physical social circle with his cyber social network on Facebook. In many cases, nodes in social or information networks have incomplete information or even anonymized information. Nevertheless, the partial neighborhood information available from a query graph will be helpful to identify entities in the target network.

Clearly, there is a need to adopt approximate similarity search techniques to solve the above problem. In bioinformatics, approximate graph alignment has been extensively studied, e.g., PathBlast [21] and Saga [33]. These studies resort to strict approximation definitions such as graph edit distance, whose optimal solution is expensive to compute. Since they target relatively small biological networks with less than 10k nodes, it is difficult to apply them in social and information networks with thousands or even millions of nodes. As illustrated in NetAlign [23], in order to handle large graphs with 10k nodes, one has to sacrifice accuracy to achieve better query response time. Recently there have been other studies on approximate matching with large graphs, e.g., TALE [34], SIGMA [24] and G-Ray [35]. However, both TALE and SIGMA consider the number of missing edges as the qualitative measure of approximate matching and hence, the techniques cannot capture the notion of proximity among labels, as shown in Figure 1. G-Ray, on the other hand, tries to maintain the shape of the query by allowing some approximation in the match. Unfortunately, shape is not an important factor in entity-relationship graphs.

In this paper, we introduce a novel neighborhood-based similarity measure by vectorizing nodes according to the label distribution of their neighbors. We further extend the similarity notion to graphs by finding the embeddings in the target graph that maximize the sum of node matches. This graph matching technique avoids complicated subgraph isomorphism and graph edit distance calculation, which become infeasible for large graphs. It is observed that social/information networks usually have more diversified node labels and therefore less auto-isomorphic structure, but may contain more noise. Our objective function can provide better similarity semantics for graphs with various random noise. It simplifies the procedure of graph matching, leading to the development of an efficient graph search framework, called Ness (Neighborhood Based Similarity Search). With the introduction of scalable indices built on vectorized nodes and an intelligent query optimization technique, Ness can quickly and accurately find high-quality matches in large networks, with negligible time cost.

Our contributions. We propose a novel similarity search problem in graphs, neighborhood-based similarity search, which combines the topological structure and content information together during the search process. The similarity definition proposed in this work is able to avoid expensive isomorphism testing as much as possible. The principles to derive appropriate functions to fit this definition are carefully examined. We found that the information propagation model satisfies these principles: each node propagates a certain fraction of its labels to its neighbors, and thereby we can convert each node into a multidimensional vector, for which sophisticated indexing and similarity search algorithms are available. That is, we successfully turn a graph search problem into a high-dimensional indexing problem.
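As a rough illustration of the propagation idea described above, the following Python sketch turns each node of a toy graph into a label-strength vector. The function name `neighborhood_vectors`, the hop limit `h`, the decay factor `alpha`, the weighting of a label at distance i by `alpha ** i`, and the exclusion of a node's own labels are all assumptions made for illustration; this passage does not fix any of them.

```python
from collections import defaultdict, deque


def neighborhood_vectors(adj, labels, h=2, alpha=0.5):
    """Return {node: {label: strength}} for an undirected graph.

    Assumption (not stated in the text above): a label sitting i hops
    away from a node contributes alpha ** i to that node's vector, and
    only nodes within h hops are considered.
    """
    vectors = {}
    for u in adj:
        vec = defaultdict(float)
        dist = {u: 0}                      # BFS distances from u
        queue = deque([u])
        while queue:
            x = queue.popleft()
            if dist[x] == h:               # do not expand beyond h hops
                continue
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
                    for lab in labels[y]:
                        vec[lab] += alpha ** dist[y]
        vectors[u] = dict(vec)
    return vectors


# Hypothetical toy path graph: 1 - 2 - 3
adj = {1: [2], 2: [1, 3], 3: [2]}
labels = {1: {"a"}, 2: {"c"}, 3: {"b"}}
vectors = neighborhood_vectors(adj, labels)
```

Once every node is a vector of this kind, candidate matches for a query node can be looked up with ordinary vector indexing rather than structural comparison, which is the shift the paragraph describes.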

We first identify a set of rules to define approximate matches of nodes based on their neighborhood structure and labels. These rules are important since the query may not always have complete information about the exact neighborhood structure in the target graph. The approximate node match concept is further extended to subgraph similarity search, i.e., multiple node alignment for a given query graph. We prove that under this measure, subgraph similarity search is NP-hard. However, in comparison with graph isomorphism, which is known neither to be solvable in polynomial time nor to be NP-hard, graph similarity match is proved to be polynomial. We demonstrate that, without performing subgraph isomorphism testing, it is possible to prune unpromising nodes by iteratively propagating node information among a shrinking candidate set, which significantly reduces query execution time.

We further analyze how to index the vector structure as well as optimize query processing to speed up similarity search. The information propagation model and the neighborhood vectorization approach keep the index structure much simpler than the graph itself, thus making it easy to update dynamically for graph changes arising from node/edge insertion and deletion.

In summary, we propose a completely new graph similarity search framework, Ness, to define and determine approximate matches in massive graphs. As tested in real and synthetic networks, Ness is able to find high-quality matches efficiently in large scale networks.

2. PRELIMINARIES

A labeled graph $G = (V_G, E_G, L_G)$ has a label set $L_G$, and each node $u \in V_G$ is attached with a set of labels. The label set of a node $u$ in $G$ is denoted by $L(u) \subseteq L_G$. For the sake of simplicity, we assume there are no labels and weights on the edges. Nevertheless, the proposed techniques could be extended for graphs with labeled or weighted edges. Given two labeled graphs $G$ and $G'$, $G$ is called subgraph isomorphic to $G'$ if there exists a subgraph $H$ of $G'$ such that $G$ is isomorphic to $H$. Formally, we define subgraph isomorphism as follows.

DEFINITION 1 (SUBGRAPH ISOMORPHISM). A subgraph isomorphism is an injective function $f : V_G \to V_{G'}$, s.t., (1) $\forall u \in V_G$, $L(u) \subseteq L(f(u))$, and (2) $\forall (u, v) \in E_G$, $(f(u), f(v)) \in E_{G'}$.
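The two conditions of Definition 1 can be checked mechanically for a given candidate mapping. The sketch below is a minimal illustration, assuming undirected edges, label sets stored as Python sets, and a mapping `f` that is defined on every query node; the function name `is_subgraph_isomorphism` and the toy graphs are hypothetical.

```python
def is_subgraph_isomorphism(f, q_edges, q_labels, g_edges, g_labels):
    """Check whether the mapping f (dict: query node -> target node)
    satisfies Definition 1. Edges are treated as undirected (assumption)."""
    # f must be injective: no two query nodes share an image.
    if len(set(f.values())) != len(f):
        return False
    # Condition (1): every label of u also appears on f(u).
    if any(not q_labels[u] <= g_labels[f[u]] for u in q_labels):
        return False
    # Condition (2): every query edge maps onto a target edge.
    g_edge_set = {frozenset(e) for e in g_edges}
    return all(frozenset((f[u], f[v])) in g_edge_set for u, v in q_edges)


# Hypothetical toy data.
q_labels = {"v1": {"a"}, "v2": {"b"}}
q_edges = [("v1", "v2")]
g_labels = {1: {"a", "x"}, 2: {"b"}, 3: {"b"}}
g_edges = [(1, 2)]

exact = is_subgraph_isomorphism({"v1": 1, "v2": 2}, q_edges, q_labels, g_edges, g_labels)
broken = is_subgraph_isomorphism({"v1": 1, "v2": 3}, q_edges, q_labels, g_edges, g_labels)
```

Verifying one mapping is cheap; the expensive part of subgraph isomorphism is searching over all possible mappings, which this check does not attempt.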

DEFINITION 2 (EMBEDDING). Given a graph $G$ and a query graph $Q$, an embedding of $Q$ is an (injective) function $f : V_Q \to V_G$, such that $\forall v \in V_Q$, $L(v) \subseteq L(f(v))$, where $f(v) \in V_G$.

In this work, we only study one-to-one node matching for a query graph $Q$, and node labels are preserved in the embedding. However, our cost function and algorithms can be extended to include other matching and node label similarity scenarios. Given two graphs $G$ and $Q$, there might be many possible embeddings. Certainly, the quality of an embedding depends on whether it preserves the connections and labels in the query graph. Subgraph isomorphism actually defines an exact embedding, written as $f_e$. The quality of an embedding can be defined in various ways; e.g., for a given label-preserved embedding $f$, we can count the number of edge mismatches, $C_e = |\{(u, v) \in E_Q : (f(u), f(v)) \notin E_G\}|$, as the embedding's quality. In general, for a cost function $C : f \to \mathbb{R}$, we define the top-k graph similarity search problem as below.
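The edge mismatch count $C_e$ is simple to compute for a fixed embedding. The following sketch assumes undirected edges and a dictionary-based mapping; the name `edge_mismatch_cost` and the toy graph are hypothetical and serve only to make the definition concrete.

```python
def edge_mismatch_cost(f, q_edges, g_edges):
    """Count query edges (u, v) whose image (f(u), f(v)) is not an edge
    of the target graph. Edges are treated as undirected (assumption)."""
    g_edge_set = {frozenset(e) for e in g_edges}
    return sum(
        1 for u, v in q_edges
        if frozenset((f[u], f[v])) not in g_edge_set
    )


# Hypothetical toy data: the query is a path v1 - v2 - v3.
q_edges = [("v1", "v2"), ("v2", "v3")]
# In the target graph, nodes 2 and 3 are linked only through node 4.
g_edges = [(1, 2), (2, 4), (4, 3)]
f = {"v1": 1, "v2": 2, "v3": 3}

cost = edge_mismatch_cost(f, q_edges, g_edges)
```

Here the query edge between v2 and v3 has no direct counterpart, so it counts as one mismatch even though the matched nodes are only two hops apart in the target graph.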

PROBLEM STATEMENT 1. Given a graph $G$ and a query graph $Q$, find the top-k embeddings with respect to a cost function $C$.

The edge mismatch cost function $C_e$ has been studied in [38, 34, 24]. Unfortunately, it cannot differentiate the case where two nodes are close to each other but there is no direct edge between them.
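To make the problem statement concrete, the sketch below enumerates every label-preserving injective mapping by brute force and keeps the k cheapest under a caller-supplied cost function. This is only an illustration on toy inputs, not the search strategy proposed in this paper: the enumeration is exponential in the query size, and the name `top_k_embeddings` and the toy data are hypothetical.

```python
from heapq import nsmallest
from itertools import permutations


def top_k_embeddings(q_labels, g_labels, cost, k=1):
    """Brute-force top-k search over label-preserving embeddings.

    cost is any function mapping an embedding (dict) to a number;
    lower is better. Feasible only for very small graphs.
    """
    q_nodes = list(q_labels)
    candidates = []
    for image in permutations(g_labels, len(q_nodes)):
        # Keep only injective mappings that preserve node labels.
        if all(q_labels[v] <= g_labels[x] for v, x in zip(q_nodes, image)):
            f = dict(zip(q_nodes, image))
            candidates.append((cost(f), f))
    return nsmallest(k, candidates, key=lambda pair: pair[0])


# Hypothetical toy data, using edge mismatches as the cost.
q_labels = {"v1": {"a"}, "v2": {"b"}}
q_edges = [("v1", "v2")]
g_labels = {1: {"a"}, 2: {"b"}, 3: {"b"}}
g_edge_set = {frozenset((1, 2))}


def mismatch(f):
    return sum(frozenset((f[u], f[v])) not in g_edge_set for u, v in q_edges)


best = top_k_embeddings(q_labels, g_labels, mismatch, k=1)
```

Passing the cost function as a parameter mirrors the problem statement, which leaves $C$ open.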

[Figure 2: Problem with Edge Mismatch Cost Function. Two embeddings, f1 and f2, of a query graph Q (nodes v1, v2, v3 with labels a, b, c) in a target graph G.]

[Figure 3: Information Propagation Model]

Figure 2 shows one example. There are two label-preserved embeddings $f_1$ and $f_2$ of the query graph $Q$ in a target graph $G$. In both $f_1$ and $f_2$, there is no edge connecting a and b. Thus, $C_e$ will assign equal cost to both embeddings. On the other hand, the graph edit distance between $f_1$ and $Q$ is 2, whereas it is only 1 between $f_2$ and $Q$. Intuitively, however, $f_1$ is a better match than $f_2$, because the nodes with labels a and b are only 2 hops away in $f_1$, whereas they are disconnected in $f_2$. This observation inspires us to develop a neighborhood-based similarity measure that discounts how nodes are exactly connected and focuses instead on the proximity among the labels carried by these nodes. It needs to achieve the following two objectives: (1) the cost function should identify approximate embeddings, and (2) it must be easy to compute. In the next section, we define the neighborhood-based similarity cost function and analyze its complexity.
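The proximity argument above rests on hop distance between matched nodes rather than on the presence of a direct edge. The sketch below computes hop distance by breadth-first search on a hypothetical toy graph; it does not reproduce the exact graphs of Figure 2, and the node names and the function `hop_distance` are assumptions for illustration.

```python
from collections import deque


def hop_distance(adj, src, dst):
    """Return the number of hops on a shortest path from src to dst,
    or None if dst is unreachable from src."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        x = queue.popleft()
        if x == dst:
            return dist[x]
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return None


# Hypothetical target graph. In one embedding the nodes labeled a and b
# are matched to u1 and u2, which are linked through w; in the other
# they are matched to x1 and x2, which are not connected at all.
adj = {
    "u1": ["w"], "w": ["u1", "u2"], "u2": ["w"],
    "x1": [], "x2": [],
}

near = hop_distance(adj, "u1", "u2")
far = hop_distance(adj, "x1", "x2")
```

Neither pair has a direct edge, so an edge mismatch count treats them identically, while the hop distances differ.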

3. NEIGHBORHOOD-BASED GRAPH SIMILARITY

In order to solve the problem raised by the edge mismatch cost function, we define a novel neighborhood-based similarity measure by comparing the h-hop neighbors of a node, defined as follows.

DEFINITION 3 (h-HOP NEIGHBORS). Given a graph $G$ and a node $u \in V(G)$, the h-hop neighborhood of $u$ is the set of nodes $v$ whose distance from $u$ is less than or...