Graph Substructure Search
Xuemin Lin
School of Computer Science and Engineering
University of New South Wales
Sydney, Australia
Applications of Graphs
• Chem-informatics: Chemical Compounds (Small Size)
• Bio-informatics: Protein Interaction Networks (Medium Size)
• Other Applications: Social Networks (Large Size), ...
Fundamental Problems in Graph Database
• Given a graph database D = {g1, ..., gn} of n data graphs and a query graph q:
Substructure Search: retrieve all data graphs that contain q.
Application: substructure identification in chemical compounds, etc.
Superstructure Search: retrieve all data graphs that are contained in q.
Application: molecule function prediction, etc.
Substructure Search
Similarity Search?
• Input mistakes
• Exploration queries
• ...
Outline
• Substructure Search (VLDB08)
• Substructure Similarity Search (SIGMOD10)
• Superstructure Search (SSDBM10)
• Superstructure Similarity Search (ICDE2010)
• Conclusions and Remarks
Substructure Search: gIndex (SIGMOD04)
Index a set F of features from D.
For each $f \in F$, sup(f): the set of ids of the graphs in D that contain f.
Filtering: $C_q = \bigcap_{f \subseteq q \,\wedge\, f \in F} \mathrm{sup}(f)$
Verification: verify each data graph in $C_q$.
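gIndex (SIGMOD04)
[Figure: a query q and data graphs g1, g2, g3. Feature A has ID-List {g1, g2}. Filtering: the graph outside the ID-List is pruned, while the two graphs in it pass on Feature A. Verification: one of the passing graphs is a false positive, the other is an answer.]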
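The filter-then-verify scheme above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names `filter_candidates` and `answer` are hypothetical, the set of indexed features contained in q is assumed to be computed elsewhere, and the exact containment test is passed in as a `contains` callback rather than implemented.

```python
def filter_candidates(features_in_q, sup, all_ids):
    """Candidate set C_q: intersect sup(f) over the indexed features f contained in q."""
    id_lists = [set(sup[f]) for f in features_in_q]
    if not id_lists:
        # No indexed feature occurs in q, so nothing can be pruned.
        return set(all_ids)
    return set.intersection(*id_lists)


def answer(q, features_in_q, sup, graphs, contains):
    """Filtering followed by verification of every candidate."""
    candidates = filter_candidates(features_in_q, sup, graphs.keys())
    return {gid for gid in candidates if contains(graphs[gid], q)}


# Toy data (an assumption for illustration): a "graph" is just a set of bond
# names and containment is the subset test. Real graphs need an isomorphism test.
graphs = {"g1": {"C-C", "C-N"}, "g2": {"C-C", "C-O"}, "g3": {"N-O"}}
sup = {"A": {"g1", "g2"}}            # Feature A is the bond "C-C"
q = {"C-C", "C-N"}
candidates = filter_candidates({"A"}, sup, graphs.keys())
result = answer(q, {"A"}, sup, graphs, lambda g, query: query <= g)
```

In this toy run g3 is pruned by Feature A, g2 survives filtering but fails verification (a false positive), and g1 is the answer.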
Tree+Delta (VLDB07)
Objective: reduce the cost of building the index.
Observation:
• tree-based features are cheaper to generate;
• most (95%) frequent subgraphs (in sparse graphs) are trees.
Tree+Delta:
• select frequent tree features, and then
• add a small number of effective subgraphs.
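Selecting tree features first requires telling trees apart from general subgraphs. A minimal sketch of that test, assuming a feature is given as a vertex collection plus an undirected edge list (the function names are hypothetical, and the frequency test itself is not shown):

```python
def is_tree(vertices, edges):
    """A graph is a tree iff it is connected and has exactly |V| - 1 edges."""
    vertices = set(vertices)
    if not vertices or len(edges) != len(vertices) - 1:
        return False
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # Depth-first search from an arbitrary vertex to check connectivity.
    stack, seen = [next(iter(vertices))], set()
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] - seen)
    return seen == vertices


def split_features(frequent):
    """Partition frequent features into tree features and the remaining subgraphs."""
    trees = [f for f in frequent if is_tree(*f)]
    others = [f for f in frequent if not is_tree(*f)]
    return trees, others


path = ([1, 2, 3], [(1, 2), (2, 3)])
triangle = ([1, 2, 3], [(1, 2), (2, 3), (1, 3)])
trees, others = split_features([path, triangle])
```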
FG Index (SIGMOD07)
Objective: index-only query processing.
q is a frequent subgraph:
• if q is indexed, then return sup(q);
• if a supergraph q' of q is indexed, no verification is needed for sup(q').
q is not a frequent subgraph:
• the number of verifications is bounded by a fraction of |D|.
QuickSI (VLDB08): our work
Objective: develop an efficient verification algorithm that speeds up both verification and filtering.
QuickSI: an efficient verification algorithm.
• Encode query graphs: terminate earlier.
• Enforce connectivity.
• Three novel pruning techniques.
Up to orders of magnitude speed-up.
QuickSI
[Figure, three animation frames: the query graph q is matched against the data graph g by depth-first traversal, alternating forwarding and backtracking steps.]
QuickSI
• Access infrequent labels as early as possible.
[Figure, two frames: q and g under depth-first traversal, then under synchronized depth-first traversal.]
QuickSI
• Access infrequent labels as early as possible.
• Retain connectivity.
[Figure, two frames on a sparse data graph g: the first frame counts 2 × 5 = 10 possible matching pairs; in the second, under depth-first traversal, only 2 possible matching pairs remain.]
QuickSI
• Access infrequent labels as early as possible.
• Retain connectivity.
• Effectively use degree information.
[Figure, two frames of the depth-first traversal: where a vertex with Deg=3 meets a vertex with Deg=2 the search stops; where it meets a vertex with Deg=3 the search continues.]
Determine the access order for q.
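The three ordering principles above can be illustrated with a small backtracking matcher. This is a simplified sketch in the spirit of those slides, not the QuickSI algorithm itself: it assumes a connected query, vertex labels only (no edge labels), mutually comparable vertex ids, and graphs given as a label dict plus an adjacency dict. The names `search_order` and `contains` are hypothetical.

```python
from collections import Counter


def search_order(q_labels, q_adj, g_labels):
    """Order query vertices: rarest label in g first, always staying connected."""
    freq = Counter(g_labels.values())
    start = min(q_labels, key=lambda v: (freq[q_labels[v]], v))
    order, parent, seen = [start], {start: None}, {start}
    while len(order) < len(q_labels):
        # Unvisited query vertices adjacent to an already ordered vertex.
        frontier = [(freq[q_labels[w]], w, u)
                    for u in order for w in q_adj[u] if w not in seen]
        _, w, u = min(frontier)
        order.append(w)
        parent[w] = u
        seen.add(w)
    return order, parent


def contains(q_labels, q_adj, g_labels, g_adj):
    """True if q is (non-induced) subgraph-isomorphic to g."""
    if not q_labels:
        return True
    order, parent = search_order(q_labels, q_adj, g_labels)
    mapping, used = {}, set()

    def extend(i):
        if i == len(order):
            return True
        u, p = order[i], parent[order[i]]
        # Retain connectivity: only neighbours of the parent's image are tried.
        candidates = g_labels if p is None else g_adj[mapping[p]]
        for v in candidates:
            if v in used or g_labels[v] != q_labels[u]:
                continue
            if len(g_adj[v]) < len(q_adj[u]):      # degree pruning
                continue
            if any(w in mapping and mapping[w] not in g_adj[v] for w in q_adj[u]):
                continue                           # an already matched edge is missing
            mapping[u] = v
            used.add(v)                            # forwarding
            if extend(i + 1):
                return True
            del mapping[u]                         # backtracking
            used.discard(v)
        return False

    return extend(0)


g_labels = {1: "C", 2: "C", 3: "N"}
g_adj = {1: {2}, 2: {1, 3}, 3: {2}}
path_q = ({0: "C", 1: "C", 2: "N"}, {0: {1}, 1: {0, 2}, 2: {1}})
triangle_q = ({0: "C", 1: "C", 2: "N"}, {0: {1, 2}, 1: {0, 2}, 2: {0, 1}})
```

With this data the path query is found in g, while the triangle query is rejected at the first vertex because the only "N" vertex of g has degree 1.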
Experimental Results
Settings
Notation   Filtering                    Verification
GSI        gIndex (SIGMOD ’04)          QuickSI
SSI        Swift Index (This Paper)     QuickSI
FG         FG Index (SIGMOD ’07)
Dataset: AIDS Antiviral dataset, a popular benchmark with 43k chemical compounds; “C”, “N” and “O” are the most frequent labels.
The data sets and query sets are the same as in gIndex and FG Index.
Experiments – Response Time
Real dataset
       Construction Time   # of Features   Index Size
FG     167                 1641            12.5M
GSI    146.6               3276            13M
SSI    26.6                462             5.5
Large real dataset
       Construction Time   # of Features   Index Size
FG     2188                7100            53.8M
GSI    306.2               4394            13M
SSI    170                 922             11.8
Substructure Similarity Search – Grafil (SIGMOD05)
Maximum Common Subgraph (MCS): given g1 and g2, the common subgraph of g1 and g2 with the maximum number of edges, mcs(g1, g2).
Grafil: find all g in D s.t. (|q| - |mcs(q, g)|) ≤ σ
(|q| - |mcs(q, g)|): the number of missing edges
Some variants...
Substructure Similarity Search
Subgraph similarity?
Connected Substructure Similarity Search: our work (SIGMOD10)
Maximum Connected Common Subgraph (MCCS): given g1 and g2, the connected common subgraph of g1 and g2 with the maximum number of edges, mccs(g1, g2).
dis(q, g) := |q| - |mccs(q, g)|
Goal: find all g in D s.t. dis(q, g) ≤ σ
NP-complete.
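For very small graphs, dis(q, g) can be computed by brute force, which makes the role of connectivity concrete. The sketch below is exponential and is meant only as an illustration of the definition; it assumes vertex labels only, edges given as vertex pairs, and hypothetical helper names `connected`, `embeds` and `dis`.

```python
from itertools import combinations, permutations


def connected(edges):
    """True if the given edge set forms one connected piece (empty set counts)."""
    edges = list(edges)
    if not edges:
        return True
    seen, changed = set(edges[0]), True
    while changed:
        changed = False
        for a, b in edges:
            if (a in seen) != (b in seen):
                seen |= {a, b}
                changed = True
    return all(a in seen and b in seen for a, b in edges)


def embeds(edges, q_labels, g_labels, g_edges):
    """Brute-force test: does the subgraph of q spanned by `edges` occur in g?"""
    verts = sorted({v for e in edges for v in e})
    g_edge_set = {frozenset(e) for e in g_edges}
    for image in permutations(g_labels, len(verts)):
        m = dict(zip(verts, image))
        if all(q_labels[v] == g_labels[m[v]] for v in verts) and \
           all(frozenset((m[a], m[b])) in g_edge_set for a, b in edges):
            return True
    return False


def dis(q_labels, q_edges, g_labels, g_edges):
    """|q| - |mccs(q, g)|, counting edges, by trying the largest edge subsets first."""
    for k in range(len(q_edges), -1, -1):
        for sub in combinations(q_edges, k):
            if connected(sub) and embeds(sub, q_labels, g_labels, g_edges):
                return len(q_edges) - k


# q is the path C-N-O-S; g has the bonds C-N and O-S but no N-O bond.
q_labels = {"a": "C", "b": "N", "c": "O", "d": "S"}
q_edges = [("a", "b"), ("b", "c"), ("c", "d")]
g_labels = {1: "C", 2: "N", 3: "O", 4: "S", 5: "P"}
g_edges = [(1, 2), (3, 4), (2, 5), (5, 3)]
```

Here the largest common subgraph has two edges (C-N and O-S) but is disconnected, so the connected measure keeps only one edge and dis(q, g) = 3 - 1 = 2.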
Filtering: triangular inequality?
If dis(Q,D) + dis(D,F) ≥ dis(Q,F) holds, then dis(Q,D) ≥ dis(Q,F) – dis(D,F).
[Figure: a query Q, a feature F and a data graph D with dis(Q,D) = 1, dis(D,F) = 2 and dis(Q,F) = 2.]
dis(Q,D) + dis(D,F) ≥ dis(Q,F): holds here.
[Figure: a second example with a query Q, a feature F and a data graph D, labelled with dis(Q,D), dis(D,F) and dis(Q,F).]
dis(Q,D) + dis(D,F) ≥ dis(Q,F): fails here.
The triangular inequality does not always hold.
Connectivity Dominance
The connectivity of mccs(g1, g2) dominates the connectivity of g2 if there is a subgraph-isomorphic mapping F from mccs(g1, g2) to g2 such that, whenever removing a set S of edges disconnects mccs(g1, g2), removing F(S) also disconnects g2.
Theorem. Given three graphs g1, g2, and g3, if the connectivity of mccs(g1, g2) dominates g2 or the connectivity of mccs(g2, g3) dominates g2, then dis(g1, g3) ≤ dis(g1, g2) + dis(g2, g3).
Remark: linear-time algorithm, given the embedding.
Validation Rule 1 (index only), from dis(Q,F) + dis(F,D) ≥ dis(Q,D):
dis(Q,F) + dis(F,D) ≤ σ => dis(Q,D) ≤ σ (if mccs(Q, F) dominates F or mccs(F, D) dominates F)
Pruning Rule 1, from dis(Q,D) + dis(D,F) ≥ dis(Q,F):
dis(Q,F) - dis(D,F) > σ => dis(Q,D) > σ (if mccs(D, F) dominates D)
Pruning Rule 2, from dis(F,Q) + dis(Q,D) ≥ dis(F,D):
dis(F,D) - dis(F,Q) > σ => dis(Q,D) > σ (if mccs(F, Q) dominates Q)
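A sketch of how the three rules might be combined for one (query, feature, data graph) triple. Because dis is not symmetric (dis(x, y) = |x| - |mccs(x, y)|), all four directed distances are passed separately. The function name `classify`, its parameters and the three dominance flags are assumptions for illustration; how dominance is checked is not shown.

```python
def classify(dis_qf, dis_fq, dis_fd, dis_df, sigma,
             validation_dominance=False,
             pruning1_dominance=False,
             pruning2_dominance=False):
    """Return 'answer', 'pruned' or 'verify' for one feature F.

    validation_dominance: mccs(Q, F) dominates F or mccs(F, D) dominates F
    pruning1_dominance:   mccs(D, F) dominates D
    pruning2_dominance:   mccs(F, Q) dominates Q
    """
    # Validation Rule 1: D is an answer without running verification.
    if validation_dominance and dis_qf + dis_fd <= sigma:
        return "answer"
    # Pruning Rule 1: dis(Q,F) - dis(D,F) > sigma implies dis(Q,D) > sigma.
    if pruning1_dominance and dis_qf - dis_df > sigma:
        return "pruned"
    # Pruning Rule 2: dis(F,D) - dis(F,Q) > sigma implies dis(Q,D) > sigma.
    if pruning2_dominance and dis_fd - dis_fq > sigma:
        return "pruned"
    # No rule applies: D must go through the verification algorithm.
    return "verify"


validated = classify(1, 0, 1, 0, sigma=2, validation_dominance=True)
pruned = classify(5, 0, 0, 1, sigma=2, pruning1_dominance=True)
undecided = classify(5, 0, 0, 1, sigma=2)
```

When the dominance condition of a rule is not known to hold, the rule is simply skipped, so the sketch never prunes or validates on an inequality that may fail.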
Verification Algorithm
Basic idea:
1. Enumerate sub-spanning trees of the query graph such that the # of missing edges ≤ σ; try to terminate the algorithm as early as possible.
2. Share the enumeration costs in two ways:
   a. do not enumerate everything from scratch;
   b. once enumerated, keep the enumerated spanning trees, organized in a binary tree to reduce storage space.
3. Extend QuickSI [VLDB08].
Experiments
cIndex [VLDB’07]: superstructure search
• Filtering-Verification Framework
  Filter false results by a feature-based index (exclusion based).
  Verify each candidate against the query graph.
[Figure: database graphs (ga), (gb), (gc), index features fa, fb, fc and a query q. Filtering removes two of the data graphs; the remaining one is a candidate and, after verification, an answer.]
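Exclusion-based filtering relies on a simple implication: if a feature f is not contained in q, no data graph containing f can be contained in q. A minimal sketch, assuming the set of indexed features contained in q is already known and that `sup` maps each feature to the ids of the data graphs containing it (all names hypothetical):

```python
def superstructure_candidates(features_in_q, sup, all_ids):
    """Drop every data graph that contains a feature missing from q."""
    excluded = set()
    for feature, ids in sup.items():
        if feature not in features_in_q:
            excluded |= set(ids)
    return set(all_ids) - excluded


# Toy index shaped like the figure: each data graph contains one feature.
sup = {"fa": {"ga"}, "fb": {"gb"}, "fc": {"gc"}}
candidates = superstructure_candidates({"fc"}, sup, {"ga", "gb", "gc"})
```

Only the surviving candidates then need a subgraph isomorphism test against q.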
GPTree [EDBT’09]
• Enhanced Filtering-Verification Framework
  Share test cost in filtering and verification, respectively.
[Figure: data graphs (ga), (gb), (gc) and features (fa), (fb) over vertex labels a, b, c, annotated with three questions: sharing across groups? sharing between two phases? sharing across suffixes?]
PrefIndex [SSDBM10]
• Sharing-based Filtering-Verification Framework
  Share test cost in filtering and verification, respectively.
[Figure: the same data graphs (ga), (gb), (gc) and features (fa), (fb), annotated: sharing across groups; sharing between two phases; no sharing across suffixes.]
Computation Sharing Cost Model
• Cost Gain (Computation Sharing Benefits)
  Given k master groups of data graphs, assume that
  1) all data graphs in each group Gi contain a master feature fi (1 ≤ i ≤ k);
  2) the subgraph isomorphism test from fi to a query graph q costs cost_fi.
  The total cost gain (computation sharing benefits) from each master group Gi can be represented as follows:
  [Formula: total cost gain over the master groups G1, ..., Gk.]
• Maximized Gain
  Cluster the database into a disjoint set of master groups such that the total gain is maximized (NP-hard).
Efficiency Test
• Database and Query Sets
  Database: AIDS10K; Query Sets: Q20, Q40, Q60, Q80, Q80+
Superstructure Similarity Search: our work (ICDE10)
Given a q and a g, dis(q, g) = |g| − |mccs(q, g)|.
Superstructure Similarity Search: find all g from D such that dis(q, g) ≤ σ.
Note: dis(q, g) = |q| − |mccs(q, g)| in substructure similarity search.
Observations:
• the filtering framework in SIGMOD10 is immediately applicable;
• the techniques in SIGMOD10 may not be effective for a nearly “super-containment” relationship;
• sharing is possible, just like in PrefIndex.
SG-Enum Index (ICDE2010)
Key Ideas:
1. For a g, enumerate all subgraphs with at most σ edges removed (σ-missing subgraphs).
2. dis(q, g) ≤ σ if and only if q contains a σ-missing subgraph.
Key issues:
• Automorphic subgraphs?
• Prefix-sharing? In one g and among different data graphs.
• Query processing?
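Enumerating the σ-missing subgraphs of one data graph can be sketched as follows. This is a naive illustration only: a graph is reduced to its edge set, connectivity of the remaining edges is not checked, and isomorphic duplicates (the "automorphic subgraphs" issue above) are not merged. The function name is hypothetical.

```python
from itertools import combinations


def sigma_missing_subgraphs(edges, sigma):
    """Yield every edge subset obtained by removing at most sigma edges."""
    edges = sorted(edges)
    full = frozenset(edges)
    for removed_count in range(min(sigma, len(edges)) + 1):
        for removed in combinations(edges, removed_count):
            yield full - frozenset(removed)


triangle = [("a", "b"), ("b", "c"), ("a", "c")]
subgraphs = list(sigma_missing_subgraphs(triangle, 1))
```

For a triangle and σ = 1 this yields the triangle itself plus its three two-edge paths; in a labelled graph those paths may coincide up to isomorphism, which is why duplicate handling matters for the index size.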
SG-Enum Index (ICDE2010)
Top-down Construction:
1. Enumerate all σ-missing subgraphs.
2. Iteratively, choose an edge as follows:
   a) always select an edge contained in the most σ-missing subgraphs;
   b) split the group into two: one whose subgraphs contain the edge and one whose subgraphs do not.
Bottom-up Construction:
1. Generate a sequence for each σ-missing subgraph.
2. Merge the prefixes by chance.
Bottom-up among data graphs.
Query algorithm: extends QuickSI.
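The top-down construction can be read as building a binary decision tree over edges. The sketch below follows that reading under strong simplifications: subgraphs are plain edge sets, edges are named by strings, ties are broken alphabetically, and the dictionary-based tree layout is an assumption rather than the index structure of the paper.

```python
def build(group, candidates):
    """Split a group of edge sets on the edge that most of them contain."""
    if len(group) <= 1 or not candidates:
        return {"leaf": list(group)}
    # Step a): the candidate edge contained in the most subgraphs of this group.
    edge = max(sorted(candidates), key=lambda e: sum(e in s for s in group))
    rest = candidates - {edge}
    # Step b): one child for subgraphs with the edge, one for those without.
    return {
        "edge": edge,
        "with": build([s for s in group if edge in s], rest),
        "without": build([s for s in group if edge not in s], rest),
    }


def leaves(tree):
    """Collect the leaf groups of a tree returned by build()."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return leaves(tree["with"]) + leaves(tree["without"])


all_edges = frozenset({"ab", "bc", "ca"})
group = [all_edges, frozenset({"ab", "bc"}), frozenset({"ab", "ca"}), frozenset({"bc", "ca"})]
tree = build(group, all_edges)
```

Subgraphs that share the chosen edges share the path from the root, which is where the saving in the index comes from.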
Experiments – Query Response Time
Conclusion and Remarks
• Substructure search and its similarity search
• Superstructure search and its similarity search
(VLDB08, ICDE10, SIGMOD10, SSDBM10)
Issues: Similarity measures? Large data graphs?