
Graph Pattern Mining, Search and OLAP

Xifeng Yan

November 21, 2012

1 Graph Pattern Mining

Graph patterns have become increasingly important in analyzing complex structures in many domains, such as information networks, social networks, and computer security. They can be used to index, search, classify, and cluster graphs, and to predict interactions and functions in graphs.

Frequent Graph Pattern. Among the various kinds of graph patterns, frequent subgraphs are the most basic patterns that can be discovered in a collection of graphs. There are two problem settings: multiple graphs and a single graph.

• Multiple Graphs: Given a set of graphs D = {G1, G2, ..., Gn}, a graph g is frequent if g is a subgraph of at least s graphs in D, where s is a user-specified threshold, 1 ≤ s ≤ n.

• Single Graph: Given a graph G, a graph g is frequent if g has more than s (disjoint) embeddings in G, where s is a user-specified threshold, 1 ≤ s ≤ |V(G)|.
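To make the multiple-graph definition concrete, the following is a minimal sketch of support counting. The graph representation (a pair of a label map and an edge set) and the helper names `edge`, `is_subgraph`, `support`, and `is_frequent` are assumptions introduced for illustration; the containment test is a brute-force search, not one of the mining algorithms discussed in this section.

```python
from itertools import permutations


def edge(u, v):
    """Undirected edge between two distinct vertices."""
    return frozenset((u, v))


def is_subgraph(pattern, graph):
    """True if `pattern` has a label-preserving (non-induced) embedding in `graph`.

    A graph is a pair (labels, edges): `labels` maps vertex -> label and
    `edges` is a set of edge(u, v) values. Brute force, exponential time.
    """
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[u] != g_labels[mapping[u]] for u in p_nodes):
            continue
        if all(edge(*(mapping[x] for x in e)) in g_edges for e in p_edges):
            return True
    return False


def support(pattern, database):
    """Number of graphs in the database that contain the pattern."""
    return sum(is_subgraph(pattern, g) for g in database)


def is_frequent(pattern, database, s):
    """Multiple-graph setting: frequent if contained in at least s graphs."""
    return support(pattern, database) >= s


# Toy database of three labeled graphs and a single-edge pattern A-B.
g1 = ({1: "A", 2: "B", 3: "C"}, {edge(1, 2), edge(2, 3)})
g2 = ({1: "A", 2: "B"}, {edge(1, 2)})
g3 = ({1: "A", 2: "C"}, {edge(1, 2)})
pattern_ab = ({"x": "A", "y": "B"}, {edge("x", "y")})
print(support(pattern_ab, [g1, g2, g3]))
```

The single-graph setting would instead count (disjoint) embeddings inside one graph, which this sketch does not attempt.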

Existing studies mostly focus on the multiple-graph scenario. With some modifications, the mining methodology can be extended to the single-graph scenario [30]. Washio and Motoda [56] conducted a survey on graph-based data mining. Holder et al. [21] proposed SUBDUE, which discovers subgraph patterns based on minimum description length and background knowledge. The most popular graph pattern mining algorithms adopt either an Apriori-based or a pattern-growth approach.

In an Apriori-based approach, the search for frequent subgraphs starts with graphs of small size and proceeds in a bottom-up manner. At each iteration, the size of newly discovered frequent subgraphs is increased by one node or edge. New candidates are generated by joining two similar but slightly different frequent subgraphs that have already been discovered. The frequency of the newly formed graphs is then checked. Typical Apriori-based frequent graph pattern mining algorithms include AGM [23], FSG [29], and an edge-disjoint path-join algorithm [55].

In a pattern-growth approach, a frequent graph is extended directly by adding a new node or edge in every possible position. A potential problem with this extension approach is that the same graph can be discovered many times. The gSpan [59] algorithm solves this problem by introducing a rightmost extension technique, in which extensions take place only on the rightmost path. Many other algorithms adopt a similar strategy, including MoFa [4], FFSM [22], and Gaston [42].

Graph Patterns with Constraints. Constraint-based graph pattern mining finds frequent graph patterns that satisfy user-specified constraints on properties such as degree, density, frequency, and size. Mining closed graph patterns was studied in [60]; the goal is to reduce the number of graph patterns by removing subgraph patterns that can be derived from other patterns. Techniques were also developed for pushing constraints as deep as possible into the mining process [65].

Approximate Graph Patterns. Due to the complexity of isomorphism testing and the inelastic pattern definition, frequent subgraphs are not able to capture approximate graph patterns. In [28], a proximity pattern is defined as a set of labels that co-occur frequently in neighborhoods. It relaxes the rigid structure constraint of frequent subgraphs while introducing connectivity to frequent itemsets. Empirical results show that this approach not only finds interesting patterns that are ignored by existing approaches, but also achieves high performance in finding proximity patterns in large-scale graphs.
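The idea of labels co-occurring in neighborhoods can be illustrated with a deliberately naive sketch. It treats each vertex's 1-hop neighborhood as a transaction and counts label pairs; this is an assumption made for illustration and is not the support measure or the algorithm of [28]. The function name `cooccurring_label_pairs` and the parameter `min_support` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurring_label_pairs(adj, labels, min_support):
    """Count label pairs that co-occur in 1-hop neighborhoods.

    `adj` maps vertex -> set of neighbors, `labels` maps vertex -> label.
    Each vertex contributes one 'transaction': its own label plus the
    labels of its direct neighbors.
    """
    counts = Counter()
    for u, nbrs in adj.items():
        hood = {labels[u]} | {labels[v] for v in nbrs}
        for pair in combinations(sorted(hood), 2):
            counts[pair] += 1
    # Keep only pairs seen in at least `min_support` neighborhoods.
    return {pair: c for pair, c in counts.items() if c >= min_support}


# Path graph 1 - 2 - 3 with labels a, b, c.
adj = {1: {2}, 2: {1, 3}, 3: {2}}
labels = {1: "a", 2: "b", 3: "c"}
print(cooccurring_label_pairs(adj, labels, min_support=2))
```

Note that the pair (a, c) is counted even though no edge joins the two labels directly, which is the sense in which the rigid structure constraint is relaxed.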

Because the set of frequent graph patterns is exponentially large, it is necessary to discover the most representative ones. Random sampling techniques have been developed to sample the pattern space uniformly [18]. By doing so, the mining time can be significantly reduced, and so can the number of similar patterns.

Discriminative Graph Patterns. Discriminative graph pattern mining finds significant graph patterns that can tell the difference between two sets of graphs, for example graphs with different class labels. The discovered discriminative graph patterns can be used as features for classification. [52] proposed an algorithm for mining minimal contrast subgraphs, which capture the structural differences between any two collections of graphs. LEAP [58] is a general approach that leverages structural proximity and frequency association to quickly skip parts of the pattern search space and find discriminative graph patterns with respect to an objective function given by the user.
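As an illustration of what a user-given objective function might look like, the sketch below scores a pattern by the information gain of its presence with respect to two classes of graphs. The choice of information gain and the names `entropy` and `information_gain` are assumptions for this example; the text does not prescribe a particular objective.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with `pos` and `neg` members."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(pos_with, neg_with, n_pos, n_neg):
    """Gain from splitting graphs by whether they contain the pattern.

    pos_with / neg_with: positive / negative graphs containing the pattern.
    n_pos / n_neg: total numbers of positive / negative graphs.
    """
    n = n_pos + n_neg
    n_with = pos_with + neg_with
    n_without = n - n_with
    h_before = entropy(n_pos, n_neg)
    h_after = (n_with / n) * entropy(pos_with, neg_with) + (
        n_without / n
    ) * entropy(n_pos - pos_with, n_neg - neg_with)
    return h_before - h_after


# A pattern found in all 5 positive graphs and no negative graph.
print(information_gain(5, 0, 5, 5))
```

A pattern that occurs equally often in both classes receives a gain of zero, so it would not be selected as a classification feature under this objective.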

2 Graph Search

Developing scalable methods for analyzing large graph data sets, including graphs built from knowledge bases and social networks, poses great challenges. At the core of many graph analysis applications lies a common and critical problem: how to search graphs efficiently. There are two problem settings: multiple graphs and a single graph.

• Multiple Graphs: Given a set of graphs D = {G1, G2, ..., Gn} and a query graph g, graph search returns an answer set Dg = {G | M(g, G) = 1, G ∈ D}, where M is a boolean function. M could be a function testing graph isomorphism (full graph search), subgraph isomorphism (subgraph search), approximate match (full graph similarity search), or subgraph approximate match (subgraph similarity search).

• Single Graph: Given a graph G and a query graph g, find all the embeddings of g in G.

For graph search, it is inefficient to perform a sequential scan of a graph database and check each graph to find answers to a query graph. A sequential scan is costly because one has to not only access the whole graph database but also check (sub)graph isomorphism, and subgraph isomorphism is known to be an NP-complete problem [9]. Ullmann’s backtracking method [54], VF2 [11], and SwiftIndex [45] are popular programs for subgraph isomorphism checking. High-performance graph indexing is therefore needed to quickly prune graphs, or regions of a graph, that obviously violate the query requirement.

The problem of graph search has been addressed in different domains, since it is a critical problem in many applications. In content-based image retrieval, [44] represented each graph as a vector of features and indexed graphs in a high-dimensional space using R-trees. [47] indexed graphs by a signature computed from the eigenvalues of adjacency matrices. Instead of casting a graph to a vector form, [3] proposed a metric indexing scheme that organizes graphs hierarchically according to their mutual distances.

In semistructured/XML databases, query languages built on path expressions have become popular. Efficient indexing techniques for path expressions were initially introduced in DataGuide [16] and 1-index [38]. A(k)-index [26] proposes k-bisimilarity to exploit local similarity existing in semistructured databases. Index Fabric [10] represents every path in a tree as a string and stores it in a Patricia trie.

For more complicated graph queries, Shasha et al. [46] (GraphGrep) extended the path-based technique to full-scale graph retrieval. GraphGrep is an example of feature-based graph indexing. Let $F$ be a feature set for a given graph database $D$. For any feature $f \in F$, $D_f$ is the set of graphs containing $f$, $D_f = \{G \mid f \subseteq G,\ G \in D\}$. Graph query processing has three steps: (1) Search, which enumerates all the features in a query graph $Q$ to compute the candidate query answer set $C_Q = \bigcap_{f \subseteq Q,\ f \in F} D_f$; each graph in $C_Q$ contains all of $Q$’s features, and therefore the answer set $D_Q$ is a subset of $C_Q$. (2) Fetching, which retrieves the graphs in the candidate answer set from disk. (3) Verification, which checks the graphs in the candidate answer set to verify whether they really satisfy the query, pruning false positives.
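The search and verification steps can be sketched as follows. This is a minimal illustration that assumes labeled edges as the feature set F (far simpler than the path features of GraphGrep or the subgraph features of other indices) and a brute-force verifier; the names `edge_features`, `build_index`, `candidates`, `contains`, and `search` are hypothetical. The fetching step is omitted because the toy database lives in memory.

```python
from itertools import permutations


def edge(u, v):
    """Undirected edge between two distinct vertices."""
    return frozenset((u, v))


def edge_features(graph):
    """Features of a graph: the sorted label pairs of its edges."""
    labels, edges = graph
    return {tuple(sorted(labels[x] for x in e)) for e in edges}


def build_index(database):
    """Inverted index: feature f -> D_f, the ids of graphs containing f."""
    index = {}
    for gid, graph in enumerate(database):
        for f in edge_features(graph):
            index.setdefault(f, set()).add(gid)
    return index


def candidates(query, index, n_graphs):
    """Search step: C_Q is the intersection of D_f over the query's features."""
    cand = set(range(n_graphs))
    for f in edge_features(query):
        # A feature absent from the index occurs in no graph at all.
        cand &= index.get(f, set())
    return cand


def contains(graph, query):
    """Verification step: brute-force labeled subgraph test."""
    g_labels, g_edges = graph
    q_labels, q_edges = query
    q_nodes = list(q_labels)
    for image in permutations(g_labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if all(q_labels[u] == g_labels[m[u]] for u in q_nodes) and all(
            edge(*(m[x] for x in e)) in g_edges for e in q_edges
        ):
            return True
    return False


def search(query, database, index):
    """Filter with the index, then verify each candidate."""
    cand = candidates(query, index, len(database))
    return sorted(gid for gid in cand if contains(database[gid], query))


path_abc = ({1: "A", 2: "B", 3: "C"}, {edge(1, 2), edge(2, 3)})
only_ab = ({1: "A", 2: "B"}, {edge(1, 2)})
only_ac = ({1: "A", 2: "C"}, {edge(1, 2)})
# Holds edges A-B and B-C, but on different B vertices: a false positive.
split = ({1: "A", 2: "B", 3: "B", 4: "C"}, {edge(1, 2), edge(3, 4)})
database = [path_abc, only_ab, only_ac, split]
index = build_index(database)
print(candidates(path_abc, index, len(database)), search(path_abc, database, index))
```

In this toy database the graph `split` survives filtering but fails verification, which illustrates why the candidate set is only a superset of the answer set.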

gIndex [61] introduces a pattern-based indexing technique that facilitates graph search in graph databases with thousands of instances; similar techniques can also be applied to indexing a single massive graph. The idea is to precompute features from a graph database and build indices based on these features. Various kinds of features can be used, including node/edge labels, paths, trees, and subgraph patterns. gIndex is a subgraph pattern-based approach, while GraphGrep is a path-based approach. FG-index [7] also builds its index using frequent subgraphs; however, it directly answers frequent graph queries without verification.

Zhao et al. [63] analyzed the effectiveness and efficiency of paths, trees, and graphs as indexing features from three aspects: feature size, feature selection cost, and pruning power. Like paths and graphs, tree features can be used effectively and efficiently as indexing features for graph databases. GString [25] combines three basic structures for graph search: path, star, and cycle.

GCoding [66] is another tree-based graph indexing approach. For each node u, it extracts a level-n path tree, which consists of all n-step simple paths from u in a graph. The node is then encoded with eigenvalues derived from this local tree structure. If a query graph Q is a subgraph of a graph G, then for each vertex u in Q there must exist a corresponding vertex u′ in G such that the local structure around u in Q is preserved around u′ in G. There is a partial order relationship between the eigenvalues of these two local structures. Based on this property, GCoding can quickly prune graphs that violate the order.

Closure-Tree [19] organizes graphs into a tree-based index structure using graph closures as the bounding boxes.

3 Graph Similarity Search

A common problem in graph search is: what if there is no match, or very few matches, for a given query graph? In this situation, a subsequent query refinement process has to be undertaken in order to find the structures of interest. Unfortunately, it is often too time-consuming for a user to refine the query manually. One solution is to ask the system to find graphs that approximately contain the query graph. This similarity search problem has been studied in various fields.

There have been numerous studies on inexact graph search in large graphs. Tong et al. [53] proposed best-effort pattern matching, which aims to maintain the shape of the query. Tian et al. [51] proposed an approximate subgraph search tool, called TALE, with efficient indexing. Mongiovì et al. introduced a set-cover-based inexact subgraph matching technique called SIGMA [39]. Both of these techniques use edge misses to measure the quality of a match. Other work on inexact subgraph matching (an incomplete list; see [15] for a survey) includes homomorphism-based subgraph matching [13], belief-propagation-based network alignment [2, 14], an edge-edit-distance-based subgraph indexing technique [62], subgraph matching in billion-node graphs [48], regular-expression-based graph pattern matching [1], schema matching [36] and unbalanced ontology matching [64], and a graph-partition-based subgraph identification scheme [5].

NESS [27] introduces a relaxed, computationally effective definition of approximate graph matching by changing strict subgraph isomorphism checking to proximity checking. Under this new measure, it was proved that subgraph similarity search is NP-hard, while graph similarity match is polynomial. NESS applies an information propagation model that converts a large network into a set of multidimensional vectors, for which sophisticated indexing and similarity search algorithms are available. NESS is appropriate for graphs with low automorphism and high noise, which are common in many social and information networks.
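The conversion of a network into per-vertex vectors can be sketched as follows. This is a simplified illustration of label propagation with distance weighting, not the exact model of NESS [27]: the decay factor `alpha`, the hop limit `hops`, and the function name `neighborhood_vector` are assumptions.

```python
from collections import defaultdict, deque


def neighborhood_vector(adj, labels, source, hops=2, alpha=0.5):
    """Label vector of `source`: each vertex v within `hops` steps adds
    alpha ** distance(source, v) to the entry for its label."""
    dist = {source: 0}
    queue = deque([source])
    vector = defaultdict(float)
    while queue:  # breadth-first search, so first visit = shortest distance
        u = queue.popleft()
        if dist[u] == hops:
            continue  # do not expand beyond the hop limit
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                vector[labels[v]] += alpha ** dist[v]
                queue.append(v)
    return dict(vector)


# Path graph a - b - c - d with labels X, Y, X, Z.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
labels = {"a": "X", "b": "Y", "c": "X", "d": "Z"}
print(neighborhood_vector(adj, labels, "b"))
```

Once every vertex is represented by such a vector, candidate matches for a query vertex can be found by comparing vectors instead of testing structure directly.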

There are several studies on simulation-based and bisimulation-based graph pattern matching, e.g., [37, 12, 34], which define subgraph matching as a relation between query nodes and target nodes.
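A minimal sketch of plain graph simulation on directed, node-labeled graphs is given below. It computes the largest relation in which a query node u may match a target node v only if their labels agree and every query edge leaving u can be followed by some target edge leaving v. The naive fixpoint loop and the name `max_simulation` are assumptions for illustration; the cited works study refined variants and faster algorithms.

```python
def max_simulation(q_adj, q_labels, g_adj, g_labels):
    """Return {query node: set of target nodes} for the maximum simulation.

    q_adj / g_adj map a node to its successors (missing key = no successors).
    """
    # Start from all label-compatible pairs, then remove violations.
    sim = {
        u: {v for v in g_labels if g_labels[v] == q_labels[u]}
        for u in q_labels
    }
    changed = True
    while changed:
        changed = False
        for u in q_labels:
            for u2 in q_adj.get(u, ()):
                # v survives only if some successor of v can match u2.
                keep = {
                    v
                    for v in sim[u]
                    if any(v2 in sim[u2] for v2 in g_adj.get(v, ()))
                }
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
    return sim


# Query: a(A) -> b(B). Target: 1(A) -> 2(B), plus isolated 3(A) and 4(B).
q_adj, q_labels = {"a": ["b"]}, {"a": "A", "b": "B"}
g_adj, g_labels = {1: [2]}, {1: "A", 2: "B", 3: "A", 4: "B"}
print(max_simulation(q_adj, q_labels, g_adj, g_labels))
```

The query is considered matched when every query node has a non-empty match set; the computation takes polynomial time, in contrast to subgraph isomorphism.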

4 Graph Query Language

In the area of graph databases, a few graph query languages have been proposed to query and manage graph data. GraphLog [8] represents both data and queries as graphs. Edges in queries represent edges or paths in the database, indicating a regular-expression style of query. In terms of expressive power, GraphLog was shown to be equivalent to stratified linear Datalog. Graph query languages were also introduced with object-oriented data models in GOOD [33], GraphDB [17], and GOQL [31].

GraphQL [20] is a new graph query language that treats graphs as the basic unit. It has an algebraic system similar to that of SQL, but the algebraic operators are defined directly on graphs. [40] proposed ego-centric pattern census queries, in which a given structural pattern is searched for in every node’s neighborhood and the counts are reported or used in further analysis. This kind of analysis is useful in opinion leader identification, node classification, link prediction, and role identification. The authors developed an SQL-based declarative language and a series of efficient query evaluation algorithms for it.
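An ego-centric census can be illustrated with a small sketch that fixes the pattern to a triangle and the neighborhood to direct neighbors, counting for each node the triangles that pass through it. Both choices, and the name `triangle_census`, are assumptions; the declarative language of [40] is not reproduced here.

```python
from itertools import combinations


def triangle_census(adj):
    """For each node, count the triangles through it, i.e. the number of
    edges among its direct neighbors. `adj` maps node -> set of neighbors."""
    return {
        node: sum(1 for u, v in combinations(sorted(nbrs), 2) if v in adj[u])
        for node, nbrs in adj.items()
    }


# Triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(triangle_census(adj))
```

The resulting per-node counts are the kind of feature that could feed the downstream tasks mentioned above, such as node classification.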

5 Graph OLAP

Graph OLAP aims to provide a model for performing composite structure and information analysis in heterogeneous networks. For example, in network intrusion analysis, apart from the topological structure encoded in the underlying network, multidimensional attributes are often specified and associated with nodes and edges, e.g., security software installed on computers, defense strategies, and access policies, forming so-called multidimensional networks. While studies on contemporary networks have been around for decades [41], and a plethora of algorithms and systems have been devised for multidimensional analysis in relational databases [24], none has taken both aspects into account in the multidimensional network scenario. Graph OLAP is the technique developed to fill this technology gap in multidimensional networks.

Graph OLAP performs discovery-driven OLAP operations for fast and accurate knowledge discovery through structure discovery, network summarization, aggregation, correlation, clustering, and classification. The concept of Graph OLAP was first introduced in [6]. Two kinds of OLAP were defined: Informational OLAP (abbr. I-OLAP) and Topological OLAP (abbr. T-OLAP). For roll-up in I-OLAP, the characterizing feature is that snapshots are just different observations of the same underlying network; thus, when they are all grouped into one cell in the cube, it is like overlaying multiple pieces of information without changing the objects whose interactions are being looked at. For roll-up in T-OLAP, the reorganization instead happens inside individual networks. Here, merging is performed internally, which zooms out the user’s focus to a generalized set of objects, and the new graph formed by such shrinking might greatly alter the original network’s topological structure.
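An I-OLAP roll-up can be sketched as overlaying snapshots that share the same node set. In this minimal illustration each snapshot is a dictionary from an edge to a weight and the aggregate is a plain sum; the representation, the choice of sum, and the name `overlay` are assumptions.

```python
from collections import Counter


def overlay(snapshots):
    """Roll up snapshots of one network by summing edge weights.

    Nodes are left untouched; only the information on edges is combined.
    """
    total = Counter()
    for snapshot in snapshots:
        total.update(snapshot)  # Counter.update adds weights per edge
    return dict(total)


# Two yearly snapshots over the same nodes a, b, c.
year_1 = {("a", "b"): 2, ("b", "c"): 1}
year_2 = {("a", "b"): 1, ("a", "c"): 3}
print(overlay([year_1, year_2]))
```

The node set is the same before and after the roll-up, which is the feature that distinguishes I-OLAP from T-OLAP.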

[50] introduced two operations to summarize graphs, a key step in T-OLAP. The first operation, called SNAP, produces a summary graph by grouping nodes based on user-selected node attributes and relationships. The second operation, called k-SNAP, further allows users to control the resolution of summaries and provides drill-down and roll-up abilities to navigate through summaries with different resolutions. [43] discussed how to efficiently compute T-OLAP using graph cubing techniques. It implemented Graph Cube by combining special characteristics of multidimensional networks with existing, well-studied data cube techniques.
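A T-OLAP style summary can be sketched by merging nodes that share an attribute value and aggregating the edges between the resulting groups. This simplification groups by a single node attribute only and ignores the relationship-based part of the grouping in SNAP [50]; the name `summarize` and the data layout are assumptions.

```python
from collections import Counter


def summarize(node_attr, edges):
    """Group nodes by attribute value and count edges between groups.

    `node_attr` maps node -> attribute value; `edges` is a list of
    undirected (u, v) pairs. Returns (group sizes, group-level edge counts).
    """
    group_sizes = Counter(node_attr.values())
    group_edges = Counter()
    for u, v in edges:
        key = tuple(sorted((node_attr[u], node_attr[v])))
        group_edges[key] += 1
    return dict(group_sizes), dict(group_edges)


node_attr = {1: "student", 2: "student", 3: "faculty"}
edges = [(1, 2), (1, 3), (2, 3)]
print(summarize(node_attr, edges))
```

Unlike the I-OLAP overlay, the summary graph has a different node set from the original network, with one node per attribute value.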

In addition to graph summarization, another important operation in graph OLAP is similarity search. Because large-scale heterogeneous information networks consist of multi-typed, interconnected objects, it is important to provide similarity measures in such networks. Intuitively, two objects are similar if they are linked by many paths in the network. However, the different semantic meanings behind paths should be taken into consideration. [57] studied similarity search defined among objects of the same type in heterogeneous networks, and introduced the concept of meta path-based similarity, where a meta path is a path consisting of a sequence of relations defined between different object types (i.e., a structural path at the meta level). Meta-path similarity turns out to be more meaningful in many scenarios than random-walk-based similarity measures.
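The following sketch computes a meta path-based similarity, assuming the PathSim measure of [57] for a symmetric meta path: twice the number of path instances between x and y, divided by the number of path instances from x to itself plus the number from y to itself. The formula is not stated in this text, and the author-venue data, the author-paper-venue-paper-author meta path, and the names `path_count` and `pathsim` are assumptions for illustration.

```python
def path_count(w, x, y):
    """Number of author-paper-venue-paper-author path instances from x to y.

    `w[a][v]` is the number of papers author `a` published in venue `v`.
    """
    return sum(count * w[y].get(venue, 0) for venue, count in w[x].items())


def pathsim(w, x, y):
    """PathSim-style similarity under the symmetric meta path above."""
    denom = path_count(w, x, x) + path_count(w, y, y)
    return 2 * path_count(w, x, y) / denom if denom else 0.0


w = {
    "mike": {"SIGMOD": 2, "VLDB": 1},
    "jim": {"SIGMOD": 50, "VLDB": 20},
    "mary": {"SIGMOD": 2, "KDD": 1},
}
print(pathsim(w, "mike", "mary"), pathsim(w, "mike", "jim"))
```

In this toy data the two authors with comparable publication counts come out as more similar to each other than either is to the highly prolific author, even though the latter shares more path instances with them.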

6 Vertex Programming

Vertex programming is adopted in several leading distributed graph computing platforms for clusters, such as Pregel [35] and GraphLab [32]. These platforms can be implemented using the bulk synchronous parallel model or asynchronous models. Vertex programming is suitable for graph algorithms that can be modified to store computation states in vertices, where these states can be distributed and shared among multiple vertices. Pregel and GraphLab have demonstrated their success in computing shortest paths, random walks, clustering, and belief propagation, which can support many machine learning algorithms. However, it is unknown whether an effective implementation of subgraph isomorphism exists using vertex programming. [49] proposed passing partial matches between machines in order to find a complete match. One can also implement a centralized algorithm that collects partial matches from different machines and assembles them on a central machine. Both algorithms have pros and cons. They are not compatible with vertex programming and need a special daemon process on each machine to coordinate the assembly of partial results. Our approximate graph search algorithms that use message passing between vertices, e.g., NESS [27], are suitable for vertex programming. NESS uses a vector representation of graphs. The neighborhood information of each vertex is computed by propagating the labels of its neighbors with distance weighting, and the result is encoded in that vertex. The best matches of each vertex can then be passed to its neighbors to find the best match for the entire vertex set. The structure of a vertex’s neighborhood is encoded through the neighbors’ distances to that vertex. When the number of distinct labels in a graph is high, NESS will likely find a good match in terms of subgraph isomorphism.
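The vertex-centric, bulk synchronous style can be illustrated with a single-process simulation of single-source shortest paths. Each vertex keeps its current distance as state, reads its incoming messages in a superstep, and sends updated distances to its out-neighbors only when its own state improves. This is a sketch of the programming model under stated assumptions, not the API of Pregel or GraphLab; the name `pregel_sssp` and the adjacency format are hypothetical.

```python
import math


def pregel_sssp(out_edges, source):
    """Simulate vertex-centric shortest paths in synchronous supersteps.

    `out_edges` maps every vertex to a list of (neighbor, weight) pairs.
    """
    dist = {v: math.inf for v in out_edges}  # per-vertex state
    inbox = {source: [0]}  # messages delivered in the next superstep
    while inbox:  # stop when no vertex receives a message
        outbox = {}
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:
                dist[v] = best
                # State improved: notify out-neighbors.
                for nbr, weight in out_edges[v]:
                    outbox.setdefault(nbr, []).append(best + weight)
            # Otherwise the vertex stays silent (votes to halt).
        inbox = outbox
    return dist


out_edges = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 2)],
    "b": [("c", 1)],
    "c": [],
}
print(pregel_sssp(out_edges, "s"))
```

All communication happens through messages along edges and all state lives in vertices, which is why algorithms that need to assemble partial matches across machines do not fit this model directly.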

References

[1] P. Barcelo, L. Libkin, and J. L. Reutter. Querying Graph Patterns. PODS, 2011.

[2] M. Bayati, M. Gerritsen, D. F. Gleich, A. Saberi, and Y. Wang. Algorithms for Large, Sparse Network Alignment Problems. ICDM, 2009.

[3] S. Beretti, A. Bimbo, and E. Vicario. Efficient matching and indexing of graph models in content based retrieval. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23:1089–1105, 2001.

[4] C. Borgelt and M. Berthold. Mining molecular fragments: Finding relevant substructures of molecules. In Proc. of 2002 Int. Conf. on Data Mining (ICDM’02), pages 211–218, 2002.

[5] M. Brocheler, A. Pugliese, and V. S. Subrahmanian. COSI: Cloud Oriented Subgraph Identification in Massive Social Networks. ASONAM, 2010.

[6] C. Chen, X. Yan, F. Zhu, J. Han, and P. S. Yu. Graph OLAP: Towards online analytical processing on graphs. In Proc. 2008 Int. Conf. on Data Mining, 2008.

[7] J. Cheng, Y. Ke, W. Ng, and A. Lu. FG-Index: Towards Verification-Free Query Processing on Graph Databases. SIGMOD, 2007.

[8] M. Consens and A. Mendelzon. Graphlog: a visual formalism for real life recursion. In PODS, 1990.

[9] S. Cook. The complexity of theorem-proving procedures. In Proc. of the 3rd ACM Symp. on Theory of Computing (STOC’71), pages 151–158, 1971.

[10] B. Cooper, N. Sample, M. Franklin, G. Hjaltason, and M. Shadmon. A Fast Index for Semistructured Data. VLDB, 2001.

[11] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub)graph Isomorphism Algorithm for Matching Large Graphs. IEEE Trans. Pattern Anal. and Machine Int., 2004.

[12] W. Fan, J. Li, S. Ma, N. Tang, Y. Wu, and Y. Wu. Graph Pattern Matching: From Intractable to Polynomial Time. PVLDB, 2010.

[13] W. Fan, J. Li, S. Ma, H. Wang, and Y. Wu. Graph Homomorphism Revisited for Graph Matching. PVLDB, 2010.

[14] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient Belief Propagation for Early Vision. Int. J. Comput. Vision, 70(1), 2006.

[15] B. Gallagher. Matching Structure and Semantics: A Survey on Graph-Based Pattern Matching. AAAI FS., 2006.

[16] R. Goldman and J. Widom. Dataguides: Enabling query formulation and optimization in semistructured databases. In Proc. of 1997 Int. Conf. on Very Large Data Bases (VLDB’97), pages 436–445, 1997.

[17] R. H. Guting. Graphdb: Modeling and querying graphs in databases. In VLDB, pages 297–308, 1994.

[18] M. A. Hasan and M. J. Zaki. Output space sampling for graph patterns. Proc. of the VLDB Endowment (35th Int. Conf. on Very Large Data Bases), 2(1):730–741, 2009.

[19] H. He and A. Singh. Closure-Tree: An Index Structure for Graph Queries. ICDE, 2006.

[20] H. He and A. Singh. Graphs-at-a-time: query language and access methods for graph databases. In Proc. of the 2008 ACM SIGMOD int. conf. on Management of data, SIGMOD’08, pages 405–418, 2008.

[21] L. B. Holder, D. J. Cook, and S. Djoko. Substructure Discovery in the Subdue System. KDD, 1994.

[22] J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraph in the presence of isomorphism. In Proc. of 2003 Int. Conf. on Data Mining (ICDM’03), pages 549–552, 2003.

[23] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Proc. of 2000 European Symp. Principle of Data Mining and Knowledge Discovery (PKDD’00), pages 13–23, 2000.

[24] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min. Knowl. Discov., 1(1):29–53, 1997.

[25] H. Jiang, H. Wang, P. Yu, and S. Zhou. GString: A Novel Approach for Efficient Search in Graph Databases. ICDE, 2007.

[26] R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting local similarity for efficient indexing of paths in graph structured data. In Proc. of 2002 Int. Conf. on Data Engineering (ICDE’02), pages 129–140, 2002.

[27] A. Khan, N. Li, X. Yan, Z. Guan, S. Chakraborty, and S. Tao. Neighborhood Based Fast Graph Search in Large Networks. SIGMOD, 2011.

[28] A. Khan, X. Yan, and K.-L. Wu. Towards Proximity Pattern Mining in Large Graphs. SIGMOD, 2010.

[29] M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM, 2001.

[30] M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. Data Mining and Knowledge Discovery, 11(3):243–271, 2005.

[31] L. Sheng, Z. M. Ozsoyoglu, and G. Ozsoyoglu. A graph query language and its query processing. In ICDE, 1999.

[32] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. Hellerstein. Distributed graphlab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716–727, 2012.

[33] M. Gyssens, J. Paredaens, and D. van Gucht. A graph-oriented object database model. In PODS, pages 417–424, 1990.

[34] S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo. Capturing Topology in Graph Pattern Matching. PVLDB, 2012.

[35] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. PREGEL: A System for Large-Scale Graph Processing. SIGMOD, 2010.

[36] S. Melnik, H. G.-Molina, and E. Rahm. Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. ICDE, 2002.

[37] R. Milner. Communication and Concurrency. Prentice Hall, 1989.

[38] T. Milo and D. Suciu. Index structures for path expressions. Lecture Notes in Computer Science, 1540:277–295, 1999.

[39] M. Mongiovì, R. Di Natale, R. Giugno, A. Pulvirenti, A. Ferro, and R. Sharan. SIGMA: A Set-Cover-Based Inexact Graph Matching Algorithm. J. Bioinfo. and Comp. Bio., 2010.

[40] W. Moustafa, A. Deshpande, and L. Getoor. Ego-centric graph pattern census. In ICDE, 2012.

[41] M. Newman. Networks: An Introduction. Oxford University Press, 2010.

[42] S. Nijssen and J. Kok. A quickstart in frequent structure mining can make a difference. In Proc. of 2004 ACM Int. Conf. on Knowledge Discovery in Data Mining (KDD’04), pages 647–652, 2004.

[43] P. Zhao, X. Li, D. Xin, and J. Han. Graph cube: On warehousing and olap multidimensional networks. In SIGMOD, 2011.

[44] E. Petrakis and C. Faloutsos. Similarity searching in medical image databases. Knowledge and Data Engineering, 9(3):435–447, 1997.

[45] H. Shang, Y. Zhang, X. Lin, and J. Yu. Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism. PVLDB, 2008.

[46] D. Shasha, J. T.-L. Wang, and R. Giugno. Algorithmics and Applications of Tree and Graph Searching. PODS, 2002.

[47] A. Shokoufandeh, S. Dickinson, K. Siddiqi, and S. Zucker. Indexing using a spectral encoding of topological structure. In Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR’99), pages 2491–2497, 1999.

[48] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient Subgraph Matching on Billion Node Graphs. PVLDB, 2012.

[49] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. Proc. VLDB Endow., 5(9):788–799, 2012.

[50] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient Aggregation for Graph Summarization. SIGMOD, 2008.

[51] Y. Tian and J. M. Patel. TALE: A Tool for Approximate Large Graph Matching. ICDE, 2008.

[52] R. Ting and J. Bailey. Mining minimal contrast subgraph patterns. In Proc. of 2006 SIAM Int. Conf. on Data Mining (SDM’06), 2006.

[53] H. Tong, C. Faloutsos, B. Gallagher, and T. Eliassi-Rad. Fast Best-Effort Pattern Matching in Large Attributed Graphs. KDD, 2007.

[54] J. R. Ullmann. An Algorithm for Subgraph Isomorphism. J. ACM, 1976.

[55] N. Vanetik, E. Gudes, and S. Shimony. Computing frequent graph patterns from semistructured data. In Proc. of 2002 Int. Conf. on Data Mining (ICDM’02), pages 458–465, 2002.

[56] T. Washio and H. Motoda. State of the art of graph-based data mining. SIGKDD Explorations, 5:59–68, 2003.

[57] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. In Proc. of 2011 Int. Conf. on Very Large Data Bases (VLDB’11), 2011.

[58] X. Yan, H. Cheng, P. S. Yu, and J. Han. Mining significant graph patterns by leap search. In Proc. of 2008 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD’08), pages 433–444, 2008.

[59] X. Yan and J. Han. gSpan: Graph-Based Substructure Pattern Mining. ICDM, 2002.

[60] X. Yan and J. Han. CloseGraph: Mining closed frequent graph patterns. In Proc. of 2003 Int. Conf. on Knowledge Discovery and Data Mining (KDD’03), pages 286–295, 2003.

[61] X. Yan, P. S. Yu, and J. Han. Graph Indexing: A Frequent Structure-Based Approach. SIGMOD, 2004.

[62] S. Zhang, J. Yang, and W. Jin. SAPPER: Subgraph Indexing and Approximate Matching in Large Graphs. PVLDB, 2010.

[63] P. Zhao, J. Yu, and P. Yu. Graph Indexing: Tree + Delta >= Graph. VLDB, 2007.

[64] Q. Zhong, H. Li, J. Li, G. Xie, J. Tang, L. Zhou, and Y. Pan. A Gauss Function Based Approach for Unbalanced Ontology Matching. SIGMOD, 2009.

[65] F. Zhu, X. Yan, J. Han, and P. S. Yu. gprune: a constraint pushing framework for graph pattern mining. In Proc. of the 11th Pacific-Asia conf. on Advances in knowledge discovery and data mining, pages 388–400, 2007.

[66] L. Zou, L. Chen, J. Yu, and Y. Lu. A Novel Spectral Coding in a Large Graph Database. EDBT, 2008.
