Neighborhood Based Fast Graph Search in Large propagation model that is able to convert a large net-work into a set of ... graph similarity measures such as subgraph ... and database schema matching

Embed Size (px)

Text of Neighborhood Based Fast Graph Search in Large propagation model that is able to convert a large...

  • Neighborhood Based Fast Graph Search in Large Networks

    Arijit KhanDept. of Computer Science

    University of CaliforniaSanta Barbara, CA 93106

    Nan LiDept. of Computer Science

    University of CaliforniaSanta Barbara, CA 93106

    Xifeng YanDept. of Computer Science

    University of CaliforniaSanta Barbara, CA 93106

    Ziyu GuanDept. of Computer Science

    University of CaliforniaSanta Barbara, CA 93106

    Supriyo ChakrabortyDept. of Electrical Engineering

    University of CaliforniaLos Angeles, CA 90095

    Shu TaoIBM T. J. Watson19 Skyline Drive

    Hawthorne, NY

    ABSTRACTComplex social and information network search becomes impor-tant with a variety of applications. In the core of these applications,lies a common and critical problem: Given a labeled network anda query graph, how to efficiently search the query graph in the tar-get network. The presence of noise and the incomplete knowledgeabout the structure and content of the target network make it unre-alistic to find an exact match. Rather, it is more appealing to findthe top-k approximate matches.

    In this paper, we propose a neighborhood-based similarity mea-sure that could avoid costly graph isomorphism and edit distancecomputation. Under this new measure, we prove that subgraph sim-ilarity search is NP hard, while graph similarity match is polyno-mial. By studying the principles behind this measure, we found aninformation propagation model that is able to convert a large net-work into a set of multidimensional vectors, where sophisticatedindexing and similarity search algorithms are available. The pro-posed method, called Ness (Neighborhood Based Similarity Search),is appropriate for graphs with low automorphism and high noise,which are common in many social and information networks. Nessis not only efficient, but also robust against structural noise and in-formation loss. Empirical results show that it can quickly and accu-rately find high-quality matches in large networks, with negligiblecost.

    Categories and Subject DescriptorsH.3.3 [Information Search and Retrieval]: Search process; I.2.8[Problem Solving, Control Methods, and Search]: Graph andtree search strategies

    General TermsAlgorithms, Performance

    Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.SIGMOD11, June 1216, 2011, Athens, Greece.Copyright 2011 ACM 978-1-4503-0661-4/11/06 ...$10.00.

    KeywordsGraph Query, Graph Search, Graph Alignment, RDF

    1. INTRODUCTIONRecent advances in social and information science have shown

    that linked data pervade our society and the natural world aroundus [36]. Graphs become increasingly important to represent com-plicated structures and schema-less data such as wikipedia, free-base [15] and various social networks. Given an attributed networkand a small query graph, how to efficiently search the query graphin the target network is a critical task for many graph applications.It has been extensively studied in chemi-informatics, bioinformat-ics, XML and Semantic Web. SPARQL [27] is the state-of-theRDF query language for Semantic Web. SPARQL requires ac-curate knowledge about the graph structure to write a query andalso it performs an exact graph pattern matching. However, due tothe noise and the incomplete information (structure and content) inmany networks, it is not realistic to find exact matches for a givenquery. It is more appealing to find the top-k approximate matches.

    Unfortunately, graph similarity measures such as subgraph iso-morphism, maximum common subgraphs, graph edit distance, miss-ing edges that are appropriate for chemical structures and biolog-ical networks, are not suitable for entity-relationship graphs andsocial networks. There are two challenging issues for these graphtheoretic measures. First, entity-relationship graphs and social net-works have quite different characteristics from physical networks.They are not governed by physical laws and often full of noise, thusmaking strict topological similarity examination nearly impossible.How the entities are connected in these networks are not as im-portant as how closely these entities are connected. Second, thesegraphs are very large and complex with a lot of attributes associ-ated. If accuracy is to be ensured, the algorithms developed foredit distance and missing edges are not scalable. These two is-sues motivate us to invent new graph similarity measures that areless sensitive to structure changes, and have scalable indexing andsearch solutions.

    Figure 1(a) shows a graph query to Find the athlete who is fromRomania and won gold in 3000m and bronze in 1500mboth in 1984 olympics.1. Compare this query against a possi-ble match in FreeBase (Olympics) shown in Figure 1(b), it is ob-served that these two graphs are by no means similar under tra-ditional graph similarity definitions. Graph edit distance between


  • Romania


    (a) Query

    (b) Match in Freebase

    Bronze Gold1500m 3000m


    Maricica Puica

    1984Bronze 1500m 3000m Gold

    Figure 1: Top-1 Match for Query (a) in FreeBase

    these two graphs is 7. The size of their maximum common graphis 3. The number of maximum missing edges for the query graphis 4. However, Maricica Purca in Figure 1(b) is a good match forthe query shown in Figure 1(a), because she has all these attributesquite close to her in Figure 1(b). In practice, it is hard to come upwith a query that exactly conforms with the graph structures in thetarget network due to the lack of schemas in linked data. However,it is easy to write a query like Figure 1(a), where a user connectsentities with possible links. As long as the the proximity betweenthese entities is approximately maintained in a query graph, thesystem shall be able to deliver matches like Figure 1(b).

    The above approximate query form can serve as a primitive formany advanced graph operators such as RDF query answering, net-work alignment, subgraph similarity search, name disambiguationand database schema matching. For example, based on partial in-formation related to one person, e.g. his friends, one can align hisphysical social circle with his cyber social network on Facebook. Inmany cases, nodes in social or information networks have incom-plete information or even anonymized information. Nevertheless,the partial neighborhood information available from a query graphwill be helpful to identify entities in the target network.

    Clearly, there is a need to adopt approximate similarity searchtechniques to solve the above problem. In bioinformatics, approxi-mate graph alignment has been extensively studied, e.g. PathBlast[21], Saga [33]. These studies resort to strict approximation defini-tion such as graph edit distance, whose optimal solution is expen-sive to compute. Since they are targeting a relatively small biolog-ical networks with less than 10k nodes, it is difficult to apply themin social and information networks with thousands or even millionsof nodes. As illustrated in NetAlign [23], in order to handle largegraphs with 10k nodes, one has to sacrifice accuracy to achieve bet-ter query response time. Recently there have been other studies onapproximate matching with large graphs, i.e., TALE [34], SIGMA[24] and G-Ray [35]. However, both TALE and SIGMA considerthe number of missing edges as the qualitative measure of approxi-mate matching and hence, the techniques cannot capture the notionof proximity among labels, as shown in Figure 1. G-Ray, on theother hand, tries to maintain the shape of the query by allowingsome approximation in the match. Unfortunately, shape is not animportant factor in entity-relationship graphs.

    In this paper, we introduce a novel neighborhood-based similar-ity measure by vectorizing nodes according to the label distributionof their neighbors. We further extend the similarity notion to graphby finding the embeddings in the target graph that maximize thesum of node matches. This graph matching technique avoids com-plicated subgraph isomorphism and graph edit distance calculation,which becomes infeasible for large graphs. It is observed that so-

    cial/information networks usually have more diversified node la-bels and therefore less auto-isomorphic structure, but may containmore noise. Our objective function can provide better similaritysemantics for graphs with various random noise. It simplifies theprocedure of graph matching, leading to the development of an ef-ficient graph search framework, called Ness (Neighborhood BasedSimilarity Search). With the introduction of scalable indices builton vectorized nodes and an intelligent query optimization tech-nique, Ness can quickly and accurately find high-quality matchesin large networks, with negligible time cost.

    Our contributions. We propose a novel similarity search problemin graphs, neighborhood-based similarity search, which com-bines the topological structure and content information togetherduring the search process. The similarity definition proposed inthis work is able to avoid expensive isomorphism testing as muchas possible. The principles to derive appropriate functions to fit thisdefinition are carefully examined. We found that the informationpropagation model satisfies these principles, where each node prop-agates a certain fraction of its labels to its neighbors, and therebywe could conver


View more >