Subgraph Matching with Set Similarity in a Large Graph ... Subgraph Matching with Set Similarity in a Large Graph Database Liang Hong, Lei Zou, Xiang Lian, Philip S. Yu Abstract—In real-world graphs such as social

  • View
    213

  • Download
    1

Embed Size (px)

Transcript

  • 1

    Subgraph Matching with Set Similarity in aLarge Graph Database

    Liang Hong, Lei Zou, Xiang Lian, Philip S. Yu

    AbstractIn real-world graphs such as social networks, Semantic Web and biological networks, each vertex usually containsrich information, which can be modeled by a set of tokens or elements. In this paper, we study a subgraph matching with setsimilarity (SMS2) query over a large graph database, which retrieves subgraphs that are structurally isomorphic to the querygraph, and meanwhile satisfy the condition of vertex pair matching with the (dynamic) weighted set similarity. To efficientlyprocess the SMS2 query, this paper designs a novel lattice-based index for data graph, and lightweight signatures for both queryvertices and data vertices. Based on the index and signatures, we propose an efficient two-phase pruning strategy includingset similarity pruning and structure-based pruning, which exploits the unique features of both (dynamic) weighted set similarityand graph topology. We also propose an efficient dominating-set-based subgraph matching algorithm guided by a dominating setselection algorithm to achieve better query performance. Extensive experiments on both real and synthetic datasets demonstratethat our method outperforms state-of-the-art methods by an order of magnitude.

    Index Termssubgraph matching, set similarity, graph database, index

    F

    1 INTRODUCTIONWith the emergence of many real applications such as socialnetworks, Semantic Web, biological networks, and so on[1, 2, 3, 4, 5, 6], graph databases have been widely usedas important tools to model and query complex graph data.Much prior work has extensively studied various types ofqueries over graphs, in which subgraph matching [7, 8,9, 10] is a fundamental graph query type. Given a querygraph Q and a large graph G, a typical subgraph matchingquery retrieves those subgraphs in G that exactly matchwith Q in terms of both graph structure and vertex labels[7]. However, in some real graph applications, each vertexoften contains a rich set of tokens or elements representingfeatures of the vertex, and the exact matching of vertexlabels is sometimes not feasible or practical.

    Motivated by the observation above, in this paper, wefocus on a variant of the subgraph matching query, calledsubgraph matching with set similarity (SMS2) query, inwhich each vertex is associated with a set of elements withdynamic weights instead of a single label. The weightsof elements are specified by users in different queriesaccording to different application requirements or evolvingdata. Specifically, given a query graph Q with n vertices ui(i = 1, ..., n), the SMS2 query retrieves all the subgraphsX with n vertices vj (j = 1, ..., n) in a large graph G,such that (1) the weighted set similarity between S(ui) andS(vj) is larger than a user specified similarity threshold,

    Liang Hong is with the State Key Laboratory of Software Engineeringand School of Information Management, Wuhan University, Wuhan,China, E-mail: hong@whu.edu.cn.

    Lei Zou is with the ICST, Peking University, Beijing, China, E-mail:zoulei@pku.edu.cn. The corresponding author of this paper is Lei Zou.

    Xiang Lian is with the Dept. of Computer Science, University of TexasPan American, TX, USA, E-mail: lianx@utpa.edu.

    Philip S. Yu is with the Dept. of Computer Science, University of Illi-nois, Chicago, Chicago, USA, and Institute for Data Science, TsinghuaUniversity, Beijing, China, E-mail: psyu@uic.edu.

    where S(ui) and S(vj) are sets associated with ui and vj ,respectively; (2) X is structurally isomorphic to Q with uimapping to vj . We discuss the following two examples todemonstrate the usefulness of SMS2 queries.

    Example 1 (Finding papers in DBLP)The DBLP1 computer science bibliography provides a ci-

    tation graph G (Fig. 1(b)) in which vertices represent papersand edges represent citation relations between papers. Eachpaper contains a set of keywords, in which each keyword isassigned a weight to measure the importance of a keywordwith regard to the paper. In reality, a researcher searchesfor similar papers from DBLP based on both citationrelationships and the content similarity of papers [11]. Forexample, a researcher wants to find papers on subgraphmatching that are cited by both social network papers andpapers on protein interaction network search. Furthermore,she/he requires papers on protein interaction network searchbeing cited by social network papers. Such query canbe modeled as an SMS2 query, which obtains subgraphmatches of the query graph Q (Fig. 1(a)) in G. Each paper(vertex) in Q and its matching paper in G should havesimilar set of keywords, and each citation relation (edge)exactly follows the researchers requirements.

    u2

    u3

    u4

    u1

    v1v8

    v9

    v6

    v7v2

    v5

    v3 v4{UML}

    {SQL}

    {C++}

    {UML}

    {C++,SQL}

    {SQL}

    {C++} {PHP,UML}

    {SQL}

    {PHP,SQL}{PHP,SQL}

    {PHP}

    {UML}

    {SocialNetwork}

    {ProteinInteractionNetwork,Search}

    {SubgraphMatching}

    {SocialNetwork,Search}

    {SubgraphMatching}

    {ProteinInteractionNetwork,Search}

    {SocialNetwork,SubgraphMatching}

    {SocialNetwork}

    {SubgraphMatching}

    u2

    u1

    u3

    v1

    v2

    v3

    v4

    v5

    v6

    (a) Query Graph Q

    u2

    u3

    u4

    u1

    v1v8

    v9

    v6

    v7v2

    v5

    v3 v4{UML}

    {SQL}

    {C++}

    {UML}

    {C++,SQL}

    {SQL}

    {C++} {PHP,UML}

    {SQL}

    {PHP,SQL}{PHP,SQL}

    {PHP}

    {UML}

    {SocialNetwork}

    {ProteinInteractionNetwork,Search}

    {SubgraphMatching}

    {SocialNetwork,Search}

    {SubgraphMatching}

    {ProteinInteractionNetwork,Search}

    {SocialNetwork,SubgraphMatching}

    {SocialNetwork}

    {SubgraphMatching}

    u2

    u1

    u3

    v1

    v2

    v3

    v4

    v5

    v6

    (b) Citation Graph G of DBLP

    Fig. 1. An Example of Finding Groups of Cited Papers inDBLP that Match with the Query Citation Graph

    1. http://www.informatik.uni-trier.de/ ley/db/

  • 2

    Example 2 (Querying DBpedia)DBPedia extracts entities and facts from Wikipedia and

    stores them in an RDF graph [12]. As shown in Fig. 2(b),in a DBPedia RDF graph G, each entity (i.e. vertex) hasan attribute dbpedia-owl:abstract that provides a human-readable description (i.e. a set of words) of the entity, andeach edge is a fact that indicates the relationship betweenentities. Typically, users issue SPARQL queries to find sub-graph matches of the query graph (i.e., a graph of connectedentities with known attribute values) by specifying exactquery criteria. However, in reality, a user may not know(or remember) the exact attribute values or the RDF schema(such as property names). For example, a user wants to findtwo physicists who both won Nobel prizes and are relatedto Denmark from DBpedia, while he/she does not knowthe schema of DBpedia data. In this case, the user canissue an SMS2 query Q, as shown in Fig. 2(a), in whicheach vertex is described by a short text. The answer tothe SMS2 query Q is Niels Bohr and Aage Bohr, becausethe subgraph match is structurally isomorphic to Q and thetext similarity (measured by the word set similarity) of thematching vertex pairs is high. Interestingly, we find thatNiels Bohr is the father of Aage Bohr.

    NobelPrizeinPhysics,awardedbytheRoyalSwedishAcademyofSciences.

    DanishPhysicistwhowonNobelPrize,andresearchonatomicstructureandquantummechanics.

    DanishnuclearphysicistandNobellaureate.

    The KingdomofDenmark.

    The Nobel Prize in Physics isawardedonceayearbytheRoyalSwedishAcademyofSciences.

    Niels Bohr was a Danish physicist who made foundationalcontributions to atomic structure and quantummechanics, forwhichhereceivedtheNobelPrizeinPhysicsin1922.

    AageNielsBohrwasaDanishnuclearphysicistandNobellaureate,andthesonofthefamousphysicistandNobellaureateNielsBohr.

    TheKingdomofDenmarkisa constitutional monarchyandsovereignstate.

    dbpediaowl:birthPlace

    http://dbpedia.org/resource/Nobel_Prize_in_Physics

    dcterms:subject

    dbpediaowl:birthPlace

    http://dbpedia.org/resource/Niels_Bohr

    http://dbpedia.org/resource/Aage_Bohr

    http://dbpedia.org/resource/Denmark

    dcterms:subject

    (a) Query Graph Q without the Schema of DBpedia Data

    NobelPrizeinPhysics,awardedbytheRoyalSwedishAcademyofSciences.

    DanishPhysicistwhowonNobelPrize,andresearchonatomicstructureandquantummechanics.

    DanishnuclearphysicistandNobellaureate.

    The KingdomofDenmark.

    The Nobel Prize in Physics isawardedonceayearbytheRoyalSwedishAcademyofSciences.

    Niels Bohr was a Danish physicist who made foundationalcontributions to atomic structure and quantummechanics, forwhichhereceivedtheNobelPrizeinPhysicsin1922.

    AageNielsBohrwasaDanishnuclearphysicistandNobellaureate,andthesonofthefamousphysicistandNobellaureateNielsBohr.

    TheKingdomofDenmarkisa constitutional monarchyandsovereignstate.

    dbpediaowl:birthPlace

    http://dbpedia.org/resource/Nobel_Prize_in_Physics

    dcterms:subject

    dbpediaowl:birthPlace

    http://dbpedia.org/resource/Niels_Bohr

    http://dbpedia.org/resource/Aage_Bohr

    http://dbpedia.org/resource/Denmark

    dcterms:subject

    (b) RDF Graph G of DBpedia

    Fig. 2. An Example of Querying Subgraph Matches from theRDF Graph of DBpedia

    The motivation examples show that SMS2 queries arevery useful in many real-world applications. To the best ofour knowledge, no prior work studied the subgraph match-ing problem under the semantic of structural isomorphismand set similarity with dynamic element weights (calleddynamic weighted set similarity). Traditional weighted setsimilarity [13] that focuses on fixed elem

Recommended

View more >