iGraph: A Framework for Comparisons of Disk-Based Graph Indexing Techniques
Jeffrey Xu Yu et al., VLDB '10
Presented by Tao Yu

Page 2: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Why I chose this paper

• Disk-based
• Implementation technique
• Graph database
• Application
• Dataset

Page 4: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Why they wrote this paper

• Provide a uniform test framework.
• Comparing wall-clock times of binary executables is not fair.
• Some algorithms are implemented in memory while others are implemented on disk.
• Obtain real disk I/O counts by bypassing the OS disk cache.
• Perform a large number of tests.

Page 6: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Background

• Application
  Graph isomorphism

• Stream
  A large number of small graphs

• Undirected labeled graph
  G = (V, E, L_v, L_e)
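As a rough illustration of the definition above, the following Python sketch stores an undirected labeled graph and runs a brute-force subgraph isomorphism test. It assumes that L_v and L_e assign labels to vertices and edges, and that the queries in this setting are subgraph isomorphism queries. The class and function names are hypothetical, and the exhaustive search is for illustration only; it is not the VF2 or QuickSI algorithm used in the paper.

```python
from itertools import permutations


class LabeledGraph:
    """Undirected graph with vertex labels and edge labels (illustrative)."""

    def __init__(self, vertex_labels, edges):
        # vertex_labels: list indexed by vertex id
        # edges: dict mapping (u, v) pairs to an edge label
        self.vertex_labels = list(vertex_labels)
        self.edge_labels = {frozenset(e): lab for e, lab in edges.items()}

    def size(self):
        # Graph size is measured as the number of edges.
        return len(self.edge_labels)


def is_subgraph_isomorphic(q, g):
    """Try every injective mapping of q's vertices into g's vertices."""
    n_q, n_g = len(q.vertex_labels), len(g.vertex_labels)
    for image in permutations(range(n_g), n_q):
        # Vertex labels must match under the mapping.
        if any(q.vertex_labels[i] != g.vertex_labels[image[i]] for i in range(n_q)):
            continue
        # Every query edge must exist in g with the same edge label.
        if all(
            g.edge_labels.get(frozenset(image[u] for u in e)) == lab
            for e, lab in q.edge_labels.items()
        ):
            return True
    return False


g = LabeledGraph(["C", "C", "O"], {(0, 1): "single", (1, 2): "double"})
q = LabeledGraph(["C", "O"], {(0, 1): "double"})
print(is_subgraph_isomorphic(q, g))
```

The exhaustive search is exponential in the query size, which is why index-based filtering before verification matters for large graph databases.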

Page 8: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Related work

• Mining-based approaches
• Non-mining-based approaches
• Graph size: number of edges

Page 10: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

FG-Index

• Indexing
  All frequent subgraphs
  All infrequent edges

• Query
  Enumerate a subset of subgraphs
  Verification-free strategy

gIndex

• Indexing
  All frequent subgraphs (maxL)
  A subset of infrequent subgraphs (maxL)
  Discriminative features

• Query
  Enumerate all subgraphs (maxL)
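Feature-based indexes such as these share a filtering step: each indexed feature maps to the list of graphs that contain it, and a query's features are used to intersect those lists into a candidate set. The Python sketch below shows that step under a strong simplification: the only features are single labeled edges, whereas FG-Index and gIndex mine larger subgraph features. All function names are hypothetical.

```python
from collections import defaultdict


def edge_features(vertex_labels, edges):
    """Return the set of (label, edge_label, label) features of one graph."""
    feats = set()
    for (u, v), edge_label in edges.items():
        a, b = sorted((vertex_labels[u], vertex_labels[v]))
        feats.add((a, edge_label, b))
    return feats


def build_index(database):
    """database: {graph_id: (vertex_labels, edges)} -> {feature: set of ids}."""
    index = defaultdict(set)
    for gid, (vlabels, edges) in database.items():
        for feature in edge_features(vlabels, edges):
            index[feature].add(gid)
    return index


def candidates(index, database, query):
    """Intersect the posting lists of every feature found in the query."""
    vlabels, edges = query
    result = set(database)
    for feature in edge_features(vlabels, edges):
        result &= index.get(feature, set())
    return result


db = {
    1: (["C", "C", "O"], {(0, 1): "s", (1, 2): "d"}),
    2: (["C", "N"], {(0, 1): "s"}),
    3: (["C", "C"], {(0, 1): "s"}),
}
idx = build_index(db)
print(candidates(idx, db, (["C", "C"], {(0, 1): "s"})))
```

Graphs that survive filtering are only candidates; in general each one still has to be verified with a subgraph isomorphism test, except where a verification-free strategy applies.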

Page 12: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

SwiftIndex

• Indexing
  All frequent trees of size up to maxL
  All discriminative trees of size up to maxL
  All infrequent edges

• Query
  Prefix
  QuickSI

Tree+Δ

• Indexing
  All frequent trees of size up to maxL – 1
  All infrequent edges
  Generates graph features on the fly

• Query
  Enumerate all subtrees (maxL)

Page 14: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

C-Tree

• Indexing
  A hierarchical tree of graph closures

• Query
  Pseudo subgraph isomorphism test

GraphGrep

• Indexing
  All paths (maxL)

• Query
  Enumerate all paths (maxL)
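A path-based index can be sketched by enumerating every simple path of at most maxL edges and counting the resulting label sequences. The Python sketch below is a simplified reading of that idea: it ignores edge labels, counts each path once per direction, and also records single vertices as zero-length paths. These choices, and the function name, are assumptions made for illustration rather than GraphGrep's actual layout.

```python
from collections import Counter


def label_paths(vertex_labels, adjacency, max_len):
    """Count label sequences of simple paths with at most max_len edges."""
    counts = Counter()

    def extend(path):
        # Record the label sequence of the current path.
        counts[tuple(vertex_labels[v] for v in path)] += 1
        if len(path) - 1 == max_len:
            return
        for nxt in adjacency[path[-1]]:
            if nxt not in path:  # keep the path simple
                extend(path + [nxt])

    for start in range(len(vertex_labels)):
        extend([start])
    return counts


# A three-vertex path graph A - B - C.
labels = ["A", "B", "C"]
adj = {0: [1], 1: [0, 2], 2: [1]}
for seq, n in sorted(label_paths(labels, adj, 2).items()):
    print(seq, n)
```

The same enumeration would be run over a query graph, and a data graph can be pruned when it contains fewer occurrences of some label path than the query does.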

Page 16: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Isomorphism Algorithms

• VF2
• QuickSI

gCode

• Indexing
  Vertex signature from neighbors
  Graph signature from vertex signatures
  GCode-Tree <signature, count>

• Query
  Index level (graph signature)
  Object level (vertex signature)

Page 18: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Implementation

• Graph
  A list of vertices and a list of edges

• If a graph is smaller than the page size
  Store it as a tuple in a heap page

• Else
  Store it as a BLOB

• B+-tree for all graphs by graph ID

• Other techniques
  CAM code to encode features
  djb2 hash function
  Mini-page
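For reference, djb2 is a widely used string hash: start from 5381 and, for each byte, multiply the running value by 33 and add the byte. The Python sketch below truncates to 32 bits, which is an assumption about the integer width; the slides do not say what exactly is hashed, so applying it to an encoded feature string and the `bucket` helper are both hypothetical.

```python
def djb2(data: bytes) -> int:
    """djb2 hash: h = h * 33 + byte, starting from 5381."""
    h = 5381
    for byte in data:
        # Mask to 32 bits to mimic unsigned overflow (width is an assumption).
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def bucket(feature_code: str, n_buckets: int) -> int:
    """Hypothetical helper: map an encoded feature to a hash bucket."""
    return djb2(feature_code.encode("utf-8")) % n_buckets


print(djb2(b"a"))
print(bucket("C-C-O", 16))
```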

Page 20: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Dataset

• Small sparse
  AIDS: 10,000 graphs
  25.42 vertices and 27.40 edges
  51 vertex labels and 4 edge labels

• Small dense
  GraphGen: 10,000 graphs
  7 vertices and 30 edges
  20 vertex labels and 20 edge labels

• Large
  PubChem: 1,000,000 graphs
  23.98 vertices and 25.76 edges
  81 vertex labels and 3 edge labels

Page 22: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Query sets

• For AIDS: the existing query sets Q4, Q8, ..., Q24 can be downloaded from [3].
• Each query set Qn contains 1000 graphs where each graph size is n.
• For the other datasets: First, randomly select 1000 graphs from each dataset whose size is larger than or equal to 24. Then, for each graph g, we remove edges from g, keeping it connected, until it contains 24 edges. This query set is called Q24.
• In order to generate Q20, we remove edges from each graph in Q24 until the remaining graph contains 20 edges. We repeat this process to generate the remaining query sets.
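The edge-removal procedure can be sketched as follows in Python. This is a minimal reading of the description, assuming that edges are removed at random, that a removal is accepted only if the remaining edges still form one connected component, and that vertices left without edges are dropped. The function names and the random choice of edges are assumptions, not details taken from the paper.

```python
import random


def is_connected(edges):
    """True if the edge list forms a single connected component."""
    if not edges:
        return True
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(adjacency)


def shrink_to(edges, target, rng):
    """Remove random edges, keeping the graph connected, until target remain."""
    edges = list(edges)
    while len(edges) > target:
        order = list(range(len(edges)))
        rng.shuffle(order)
        for i in order:
            rest = edges[:i] + edges[i + 1:]
            if is_connected(rest):
                edges = rest
                break
        else:
            raise ValueError("no removable edge keeps the graph connected")
    return edges


rng = random.Random(0)
graph = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3), (1, 4)]
q6 = shrink_to(graph, 6, rng)   # analogous to deriving Q24 from a data graph
q4 = shrink_to(q6, 4, rng)      # analogous to deriving Q20 from Q24
print(len(q6), len(q4))
```

Because each smaller query is derived from the next larger one, every query in a smaller set is a subgraph of a query in the larger set.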

Page 24: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Disk schedule

• LRU as the buffer replacement algorithm
• Page size: 8 KB
• FILE_FLAG_NO_BUFFERING to bypass the OS disk cache
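An LRU buffer pool that counts disk reads can be sketched in a few lines of Python. This is an illustrative model only: the class name, the `read_page` callback that stands in for the disk, and the in-memory page contents are all assumptions, and it does not reproduce iGraph's buffer manager.

```python
from collections import OrderedDict

PAGE_SIZE = 8 * 1024  # 8 KB pages, as on the slide


class LRUBufferPool:
    """Fixed number of page frames with least-recently-used eviction."""

    def __init__(self, capacity_pages, read_page):
        self.capacity = capacity_pages
        self.read_page = read_page    # callable: page_id -> bytes (the "disk")
        self.frames = OrderedDict()   # page_id -> page bytes, oldest first
        self.disk_reads = 0

    def get(self, page_id):
        if page_id in self.frames:
            self.frames.move_to_end(page_id)   # mark as most recently used
            return self.frames[page_id]
        self.disk_reads += 1                   # buffer miss: one disk I/O
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)    # evict the least recently used page
        page = self.read_page(page_id)
        self.frames[page_id] = page
        return page


pool = LRUBufferPool(2, lambda page_id: bytes(PAGE_SIZE))
for pid in [1, 2, 1, 3, 2]:
    pool.get(pid)
print(pool.disk_reads)
```

Counting misses in the framework's own buffer pool only reflects real disk I/O if the operating system does not cache the file as well, which is the reason for bypassing the OS disk cache.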

Page 26: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Experiment

• The database construction cost of gIndex is comparable to that of all feature selection methods such as Tree+Δ, FG-Index, and SwiftIndex.
• “We have communicated with Xifeng Yan who first ignored the edge label. He did that simply in order to ‘make the problem more difficult.’ Subsequent work imitated his setting without clear reason.”

Page 27: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Experiment

• More features, fewer candidates.
• gIndex performs the best.

Page 28: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Experiment

• For Q4, FG-Index performs the best since it exploits the verification-free strategy.
• gCode performs the worst: 1) more candidates; 2) lookups over the vertex signature dictionary need more buffering.

Page 29: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Experiment

• As for C-Tree, a larger buffer reduces the number of disk I/Os only slightly compared with a small buffer, since the database size of C-Tree is still larger than the buffer size, and tree traversal incurs the sequential flooding effect.
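Sequential flooding is the case where repeated scans over more pages than the buffer holds make LRU evict each page just before it is needed again, so a larger buffer barely helps until everything fits. The Python sketch below shows the effect on an idealized cyclic scan; whether C-Tree traversal follows exactly this access pattern is an assumption, and the function name is hypothetical.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count page misses for an access trace under LRU with `capacity` frames."""
    frames, misses = OrderedDict(), 0
    for page in trace:
        if page in frames:
            frames.move_to_end(page)          # hit: refresh recency
        else:
            misses += 1                       # miss: one disk I/O
            if len(frames) >= capacity:
                frames.popitem(last=False)    # evict least recently used
            frames[page] = True
    return misses


scan = list(range(10)) * 3   # three passes over the same 10 pages

for capacity in (5, 9, 10):
    print(capacity, lru_misses(scan, capacity))
```

With 5 or 9 frames every access misses, while 10 frames are enough to turn the second and third passes into hits.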

Page 30: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Experiment

• gIndex is slightly slower than FG-Index and SwiftIndex due to slow subgraph enumeration from a query.
• This fact indicates that the I/O cost must be carefully optimized to obtain good performance.

Page 31: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Experiment

• Only 37 frequent features. Almost all features in FG-Index, Tree+Δ, and SwiftIndex are infrequent features.
• gCode uses signatures.
• gIndex mines all infrequent and discriminative features of size up to 3.

Page 32: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Experiment

• Drastic changes to gCode (I), C-Tree (I), and Tree+Δ.
• Frequent feature space is small.
• Graph features reclaimed at small sizes are used for larger query sizes.

Page 33: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Experiment

• FG-Index does not outperform gIndex even for Q4 since there exist no frequent features of size 4.
• Queries in this dense synthetic dataset contain many cycles, and thus the cost of mining graph features on the fly is very high.

Page 34: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Experiment

• The number of index features used by FG-Index or SwiftIndex is much smaller than that of gIndex.
• This result indicates that having more features in the index does not by itself guarantee better performance.

Page 35: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Experiment

• The trends of all curves are consistent with those for the number of I/Os.
• gIndex shows the best performance in both cold and hot runs for a moderately dense dataset.

Page 36: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Experiment

• gCode performs the best for large query sizes with high density.
• gIndex performs comparatively better for a larger number of labels since its pruning is relatively more effective.

Page 37: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Results for Large Graph Database

• Since both SeqScan and C-Tree require prohibitive times to finish the experiments even with large buffer sizes, we exclude them from the large graph database experiments.
• As for gCode, we can run experiments with a 1 GByte buffer and hot run; with buffer sizes smaller than 1 GByte and cold run, we are unable to finish the experiments within a week.

Page 38: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Results for Large Graph Database

• FG-Index's pruning power is up to 13.09 times lower than gIndex's, since FG-Index uses a strategy to select a subset of features in its index to minimize the filtering cost.

Page 39: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Results for Large Graph Database

• For Q4, FG-Index performs the best due to its verification-free strategy.
• For Q8 to Q12, gIndex performs the best since its pruning power is the best.
• For Q16 to Q24, either SwiftIndex or FG-Index performs the best since their posting list intersection costs are the lowest.

Page 40: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Results for Large Graph Database

• Although gIndex performs worse than SwiftIndex and FG-Index in the number of I/Os for large query sizes, it performs the best for all query sizes except Q4 due to a good combination of the lowest number of candidates and low disk I/O costs.

Page 41: iGraph : A Framework for Comparisons of Disk-Based Graph Indexing Techniques

Conclusion

• Overall winner: gIndex.
• For large queries on dense graphs, we recommend gCode.
• Source code: http://www.igraph.or.kr/