Indexing Sparse Graphs for Similarity Search

International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org

Volume 2, Issue 2, March – April 2013 ISSN 2278-6856


Abstract: Schemaless data, such as chemical compounds, can be modeled efficiently as graphs. This project focuses on indexing graph structures for similarity search. It includes an efficient indexing mechanism, obtained by decomposing graphs into k-Adjacent Tree patterns. Using these k-Adjacent Tree patterns and a lower-bound estimate of the edit distance between them, we filter the graph set to obtain a candidate set of graphs for further similarity search. The project uses a graph data set consisting of hydrocarbons. In this way the project helps to implement a better-optimized similarity search.

Keywords: Graph indexing, Similarity search, Adjacent tree

1. INTRODUCTION

In theory, whether two graphs are similar can be judged by the size of their maximum common subgraph. In practice this is expensive, since the subgraph isomorphism test has been proved to be NP-complete, and sequentially searching a large set of graphs incurs a huge computational cost. A filter-and-verification method is therefore usually employed to speed up graph similarity matching over a graph set: an index on the graph set filters out graphs that cannot match, reducing the number of candidates that must be verified. Indexing graphs using the k-adjacent tree structure is an efficient method of indexing.
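The filter-and-verification idea can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the function name `filter_and_verify`, the toy "graphs" (sets of edge names), the toy exact distance (size of the symmetric difference), and the toy lower bound (difference in edge counts). The only property the sketch relies on is that the lower bound never exceeds the exact distance, so no true answer is filtered out.

```python
def filter_and_verify(query, dataset, threshold, lower_bound, exact_distance):
    """Return (candidates, answers) for a similarity query.

    `lower_bound(query, g)` must never exceed `exact_distance(query, g)`,
    so discarding graphs whose bound is above the threshold is safe.
    """
    # Filtering: a cheap test that removes graphs that cannot match.
    candidates = [gid for gid, g in dataset.items()
                  if lower_bound(query, g) <= threshold]
    # Verification: the expensive test, run only on the survivors.
    answers = [gid for gid in candidates
               if exact_distance(query, dataset[gid]) <= threshold]
    return candidates, answers


# Toy stand-ins (assumptions): a "graph" is a set of edge names.
def toy_exact(a, b):
    return len(a ^ b)              # size of the symmetric difference

def toy_lower_bound(a, b):
    return abs(len(a) - len(b))    # never larger than toy_exact(a, b)

dataset = {
    "g1": {"ab", "bc"},
    "g2": {"ab", "bc", "cd", "de", "ef"},
    "g3": {"xy", "yz"},
}
query = {"ab", "bc", "cd"}

candidates, answers = filter_and_verify(query, dataset, 1, toy_lower_bound, toy_exact)
print(candidates)  # ['g1', 'g3']
print(answers)     # ['g1']
```

Here g2 is discarded by the cheap bound alone, and the expensive comparison runs only on g1 and g3.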

2. RELATED WORK

Cheng et al. proposed a nested inverted index called FG-index, which avoids candidate verification by exploiting frequent subgraphs and edges as indexing features. However, the method performs poorly on infrequent queries, because infrequent subgraphs are not incorporated into the FG-index.

Wang et al. proposed a technique for decomposing a graph into k-adjacent trees. However, the lemma stated there is ambiguous.

3. PROPOSED WORK

The focus is on removing redundancies in the k-adjacent tree method when indexing sparse graphs of hydrocarbons. The large number of C-H bonds in a hydrocarbon is not considered. For example, propane (C3H8) has the following structure:

Figure 1: Actual representation

Neglecting the C-H bonds, however, we store it as follows:

Figure 2: Our Representation
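The hydrogen-free representation can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `strip_hydrogens`, the vertex numbering, and the use of (u, v, bond) triples with bond label 1 for a single bond are all assumptions made for the example.

```python
def strip_hydrogens(vertex_labels, edges):
    """Drop every vertex labelled "H" together with the bonds that touch it."""
    kept = {v: label for v, label in vertex_labels.items() if label != "H"}
    kept_edges = [(u, v, bond) for u, v, bond in edges
                  if u in kept and v in kept]
    return kept, kept_edges


# Propane, C3H8: carbons 0-2, hydrogens 3-10, all single bonds (label 1).
labels = {i: "C" for i in range(3)}
labels.update({i: "H" for i in range(3, 11)})
bonds = [(0, 1, 1), (1, 2, 1)]              # C-C bonds
bonds += [(0, h, 1) for h in (3, 4, 5)]     # three H on the first carbon
bonds += [(1, h, 1) for h in (6, 7)]        # two H on the middle carbon
bonds += [(2, h, 1) for h in (8, 9, 10)]    # three H on the last carbon

skeleton_labels, skeleton_bonds = strip_hydrogens(labels, bonds)
print(skeleton_labels)  # {0: 'C', 1: 'C', 2: 'C'}
print(skeleton_bonds)   # [(0, 1, 1), (1, 2, 1)]
```

The stored graph shrinks from 11 vertices and 10 bonds to 3 vertices and 2 bonds.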

Also, when decomposing any given graph, the value of k is set to 1 to obtain the adjacent tree sets. For the above compound, the following two indexes are available in the 1-ATS:

Figure 3(a)

Figure 3(b)

Figure 3: 1-ATS of our representation

The frequency counts for the indexes generated above are as follows:

Indexing Sparse graphs for Similarity Search

Aditya Ojha1, Sagar Patil2, Arun Rajeevan3, Sourav Das4

1,2,3,4 K.K.Wagh College of Engineering & Research,

Nashik, Maharashtra 422003, India




Table 1: Frequency count

Index Structure    Frequency Count
CC1                2
CC1C1              1
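The decomposition that yields Table 1 can be sketched as follows. The string encoding is an assumption inferred from the index structures in the tables: the root vertex label, followed by one (neighbour label, bond label) pair per adjacent vertex, in sorted order. The function name `one_adjacent_trees` and the graph representation are hypothetical.

```python
from collections import Counter


def one_adjacent_trees(vertex_labels, edges):
    """Count the 1-adjacent trees of a labelled graph, one per vertex."""
    adjacency = {v: [] for v in vertex_labels}
    for u, v, bond in edges:
        adjacency[u].append((vertex_labels[v], bond))
        adjacency[v].append((vertex_labels[u], bond))

    counts = Counter()
    for v, label in vertex_labels.items():
        # Sorting the leaves makes the string independent of edge order.
        leaves = sorted(adjacency[v])
        key = label + "".join(f"{leaf}{bond}" for leaf, bond in leaves)
        counts[key] += 1
    return counts


# Propane without its C-H bonds: a chain of three carbons, single bonds.
propane_1ats = one_adjacent_trees({0: "C", 1: "C", 2: "C"},
                                  [(0, 1, 1), (1, 2, 1)])
print(dict(propane_1ats))  # {'CC1': 2, 'CC1C1': 1}
```

The two end carbons each produce CC1 and the middle carbon produces CC1C1, which agrees with the frequency counts in Table 1.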

The graphs are evaluated for similarity on the basis of the following lemma:

Edit Distance: The graph edit distance between two graphs G1 and G2 is the minimum number of graph edit operations (GEOs) needed to transform G1 into a graph isomorphic to G2. This definition gives us a measure that quantifies the difference between two graphs. A GEO can be one of the following six operations:

1. Delete an edge from the graph.
2. Insert an edge between two disconnected vertices.
3. Delete an isolated vertex from the graph.
4. Insert an isolated vertex into the graph.
5. Change the label of a vertex.
6. Change the label of an edge.

Consider the following two graphs:

Query Graph:

Figure 4: A sample Query Graph

Table 2: Frequency count of 1-ATS in Query Graph

Index Structure    Frequency Count
CC1                1
CC1C1              2
CC1C1O2            1
CC1N1              1

Graph in dataset:

Figure 5: A sample Graph from the dataset

Table 3: Frequency Count of 1-ATS in sample graph

Index Structure    Frequency Count
CC1O2              1
CC1C1              1
CC1N1              1

Here the number of 1-ATs common to the two graphs is 2, and |V(Q)| = 7. Hence, for a graph edit distance threshold of 3, the two graphs would be considered similar.

The following figure represents a basic block diagram for the proposed system.

Figure 6: Block Diagram
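The filtering check on the query graph and the sample graph (Tables 2 and 3) can be sketched as follows. The common-1-AT count is taken as the multiset intersection of the two frequency tables. The bound inside `passes_filter` is an assumption: the text does not state the lemma explicitly, so the sketch supposes that one edit operation changes at most `affected_per_edit` 1-ATs and uses the value 2, which is consistent with the numbers in the example. Both function names are hypothetical.

```python
from collections import Counter

query_1ats = Counter({"CC1": 1, "CC1C1": 2, "CC1C1O2": 1, "CC1N1": 1})  # Table 2
graph_1ats = Counter({"CC1O2": 1, "CC1C1": 1, "CC1N1": 1})              # Table 3


def common_count(a, b):
    """Size of the multiset intersection of two 1-AT frequency tables."""
    return sum((a & b).values())


def passes_filter(common, query_vertices, threshold, affected_per_edit):
    """Assumed bound: keep the graph if enough 1-ATs survive `threshold` edits."""
    return common >= query_vertices - affected_per_edit * threshold


common = common_count(query_1ats, graph_1ats)
print(common)                                             # 2
print(passes_filter(common, 7, 3, affected_per_edit=2))   # True
print(passes_filter(common, 7, 2, affected_per_edit=2))   # False
```

Under this assumed bound the sample graph stays in the candidate set for a threshold of 3 and is filtered out for a threshold of 2. A graph that passes the filter is only a candidate and still has to be verified.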

4. CONCLUSION

By decomposing graphs into small pieces (1-ATs) and pairing up these pieces, we evaluate the global similarity between them. To strike a compromise between frequent-subgraph-based indexing methods and graph-decomposition-based indexing methods, we use a redundant subtree structure, the 1-AT pattern, for index construction. A 1-AT records more structural information about each vertex than a normal graph-decomposition-based indexing method while maintaining the simple structure of a tree. By counting the 1-ATs that two graphs have in common, we can estimate the graph edit distance between them. This gives us a method for indexing and candidate filtering in a graph set for similarity matching.

ACKNOWLEDGEMENT

We would like to express our sincere gratitude and appreciation to our project guide, Prof. Rutuja Jadhav, for the patience, guidance, and help, and for being our greatest source of information during this project. We also thank Prof. Kamlapur S. for providing time and helpful comments on our work.

References

[1] T.H. Cormen, “NP Completeness,” Introduction to Algorithms, W. Yu, ed., second ed., vol. 7, pp. 620-630. China Machine Press, 2007.




[2] G. Wang, B. Wang, X. Yang, and G. Yu, “Efficiently Indexing Large Sparse Graphs for Similarity Search,” IEEE Trans. Knowledge and Data Eng., vol. 24, no. 3, Mar. 2012.

[3] R. Lafore, Data Structures and Algorithms in Java, second ed.

[4] G. Chartrand, Introductory Graph Theory.

[5] B. Eckel, Thinking in Java, fourth ed.

[6] M. Kuramochi and G. Karypis, “Frequent Subgraph Discovery,” Proc. 2001 IEEE Int’l Conf. Data Mining, pp. 313-320, 2001.

[7] S. Sarawagi and A. Kirpal, “Efficient Set Joins on Similarity Predicates,” Proc. ACM SIGMOD, pp. 743-754, 2004.

[8] D. Justice and A. Hero, “A Binary Linear Programming Formulation of the Graph Edit Distance,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1200-1214, Aug. 2006.

[9] O. Johansson, “Graph Decomposition Using Node Labels,” doctoral dissertation, Royal Inst. of Technology, 2001.

[10] Y. Tian and J.M. Patel, “Tale: A Tool for Approximate Large Graph Matching,” Proc. 24th Int’l Conf. Data Eng., pp. 963-972, 2008.

[11] H. Jiang, H. Wang, P.S. Yu, and S. Zhou, “Gstring: A Novel Approach for Efficient Search in Graph Databases,” Proc. 23rd Int’l Conf. Data Eng., pp. 566-575, 2007.

[12] L. Zou, L. Chen, J.X. Yu, and Y. Lu, “A Novel Spectral Coding in a Large Graph Database,” Proc. 11th Int’l Conf. Extending Database Technology, pp. 181-192, 2008.

[13] D.W. Williams, J. Huan, and W. Wang, “Graph Database Indexing Using Structured Graph Decomposition,” Proc. 23rd Int’l Conf. Data Eng., pp. 976-985, 2007.

[19] D. Shasha, J.T.-L. Wang, and R. Giugno, “Algorithmics and Applications of Tree and Graph Searching,” Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, pp. 39-52, 2002.

AUTHOR

Aditya Ojha is a U.G. student at KKWIEER, University of Pune. He is also an IBM Student Ambassador for TGMC and a member of CSI. His areas of interest are parallel and distributed systems, database theory, and networking.

Sagar Patil is a U.G. student at KKWIEER, University of Pune. He is a member of CSI. His areas of interest are graph theory, advanced operating systems, and analysis of algorithms.

Sourav Das is a U.G. student at KKWIEER, University of Pune. He is a member of CSI. His areas of interest are embedded software, P2P networks, and data quality.

Arun Rajeevan is a U.G. student at KKWIEER, University of Pune. His areas of interest are neural networks, databases, and digital signal processing.
