Graph Substructure Search Xuemin Lin School of Computer Science and Engineering University of New South Wales Sydney, Australia


Page 1

Graph Substructure Search

Xuemin Lin
School of Computer Science and Engineering
University of New South Wales
Sydney, Australia

Page 2

Applications of Graphs

• Chem-informatics: Chemical Compounds (Small Size)

• Bio-informatics: Protein Interaction Networks (Medium Size)

• Other Applications: Social Networks (Large Size), ...

Page 3

Fundamental Problems in Graph Databases

• Given a graph database D = {g1, ..., gn} of n data graphs and a query graph q:

Substructure Search: retrieve all data graphs that contain q.

Application: identification of substructures in chemical compounds, etc.

Superstructure Search: retrieve all data graphs that are contained in q.

Application: molecule function prediction, etc.
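The two problems differ only in the direction of the containment test. The sketch below illustrates this on tiny labeled graphs, assuming that "contains" means subgraph isomorphism with matching vertex labels. The helpers `graph`, `contains`, `substructure_search`, and `superstructure_search` are hypothetical names introduced for illustration, and the brute-force test is only practical for very small graphs.

```python
from itertools import permutations

def graph(labels, edges):
    """Build a tiny labeled undirected graph: (vertex -> label, set of edges)."""
    return labels, {frozenset(e) for e in edges}

def contains(g, q):
    """True if q is subgraph-isomorphic to g. Brute force: tiny graphs only."""
    g_labels, g_edges = g
    q_labels, q_edges = q
    q_nodes = list(q_labels)
    # Try every injective assignment of q's vertices to g's vertices.
    for image in permutations(g_labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        labels_ok = all(q_labels[v] == g_labels[m[v]] for v in q_nodes)
        edges_ok = all(frozenset(m[v] for v in e) in g_edges for e in q_edges)
        if labels_ok and edges_ok:
            return True
    return False

def substructure_search(db, q):
    """Ids of data graphs that contain q."""
    return [gid for gid, g in db.items() if contains(g, q)]

def superstructure_search(db, q):
    """Ids of data graphs that are contained in q."""
    return [gid for gid, g in db.items() if contains(q, g)]

db = {
    "g1": graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    "g2": graph({0: "C", 1: "O"}, [(0, 1)]),
    "g3": graph({0: "N", 1: "N"}, [(0, 1)]),
}
q = graph({0: "C", 1: "O"}, [(0, 1)])
print(substructure_search(db, q))    # ['g1', 'g2']
print(superstructure_search(db, q))  # ['g2']
```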

Page 4

Substructure Search

Page 5

Similarity Search?

Input Mistake

Exploration Queries

......

Page 6

Outline

• Substructure Search (VLDB08)
• Substructure Similarity Search (SIGMOD10)
• Superstructure Search (SSDBM10)
• Superstructure Similarity Search (ICDE2010)
• Conclusions and Remarks

Page 7

Substructure Search: gIndex (SIGMOD04)

Index a set F of features from D.

For each f ∈ F, sup(f): the set of ids of graphs in D that contain f.

Filtering: $C_q = \bigcap_{f \subseteq q,\ f \in F} \mathrm{sup}(f)$

Verification: verify each data graph in $C_q$.
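The filtering step is an intersection of ID-lists. A minimal sketch, assuming the index is a dictionary from a feature key to sup(f) and that the features contained in q have already been extracted; `filter_candidates`, the feature keys "A" and "B", and the ID-list of "B" are hypothetical.

```python
def filter_candidates(index, query_features, all_ids):
    """Intersect the ID-lists sup(f) of every indexed feature f contained in q."""
    candidates = set(all_ids)
    for f in query_features:
        if f in index:  # only indexed features can prune
            candidates &= index[f]
    return candidates

index = {"A": {"g1", "g2"}, "B": {"g2", "g3"}}
all_ids = {"g1", "g2", "g3"}
print(sorted(filter_candidates(index, ["A"], all_ids)))       # ['g1', 'g2']
print(sorted(filter_candidates(index, ["A", "B"], all_ids)))  # ['g2']
```

Every graph left in the candidate set still has to be verified, because sharing all indexed features with q does not guarantee that the graph contains q.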

Page 8

gIndex (SIGMOD04)

[Figure: query q and data graphs g1, g2, g3. Feature A has ID-list {g1, g2}. Filtering: one data graph is pruned and two pass on Feature A. Verification: of the two that pass, one is a false positive and one is an answer.]

Page 9

Tree+Delta (VLDB07)

Objective: reduce the cost of building the index.

Observations:
• tree-based features are cheaper to generate;
• most (95%) frequent subgraphs (in sparse graphs) are trees.

Tree+Delta:
• select frequent tree features, and then
• add a small number of effective subgraphs.

Page 10

FG Index (SIGMOD07)

Objective: index-only query processing.

q is a frequent subgraph:
• if q is indexed, then return sup(q);
• if a supergraph q’ of q is indexed, no verification is needed for sup(q’).

q is not a frequent subgraph: the # of verifications is bounded in terms of |D|.

Page 11

QuickSI (VLDB08): our work

Objective: develop an efficient verification algorithm; speed up both verification and filtering.

QuickSI: an efficient verification algorithm.
• Encode query graphs: terminate earlier.
• Enforce connectivity.
• Three novel pruning techniques.

Up to orders of magnitude speed-up.

Page 12

QuickSI

Depth-First Traversal

[Figure: query q and data graph g with vertices numbered 1 to 7; the traversal steps are marked Forwarding and Backtracking.]

Page 13

QuickSI

Depth-First Traversal

[Figure: next frame of the traversal of query q over data graph g (vertices numbered 1 to 7), with Forwarding and Backtracking steps.]

Page 14

QuickSI

Depth-First Traversal

[Figure: query q and data graph g with vertices numbered 1 to 7; only Forwarding steps are marked.]

Page 15

QuickSI

Depth-First Traversal

Access infrequent labels as early as possible

[Figure: query q and data graph g.]

Page 16

QuickSI

Depth-First Traversal

Access infrequent labels as early as possible

[Figure: query q and data graph g, with every vertex marked 1.]

Page 17

QuickSI

Synchronized Depth-First Traversal

Access infrequent labels as early as possible

Retain connectivity

[Figure: query q and data graph g with vertices marked 1 and 2. Sparse graph! 2x5=10 possible matching pairs.]

Page 18

QuickSI

Depth-First Traversal

Access infrequent labels as early as possible

Retain connectivity

[Figure: query q and data graph g with vertices marked 1 and 2. Sparse graph! ONLY 2 possible matching pairs.]

Page 19

QuickSI

Depth-First Traversal

Access infrequent labels as early as possible

Retain connectivity

Effectively use degree information

[Figure: query q and data graph g with vertices numbered 1 to 7. Annotations: Deg=3; Deg=2, stop here.]

Page 20

QuickSI

Depth-First Traversal

Access infrequent labels as early as possible

Retain connectivity

Effectively use degree information

Determine the access order for q.

[Figure: query q and data graph g with vertices numbered 1 to 7. Annotations: Deg=3; Deg=2, stop here; Deg=3, continue.]
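The three heuristics on these slides (access infrequent labels early, retain connectivity, use degree information) can be combined in a small backtracking matcher. This is only an illustrative sketch, not the QuickSI algorithm from the paper: the function names `access_order` and `quick_match`, the adjacency-dictionary representation, and the use of label counts in g as the frequency estimate are all assumptions, and the query is assumed to be connected.

```python
from collections import Counter

def access_order(q_labels, q_adj, label_freq):
    """Order query vertices: rarest label first, always adjacent to earlier ones."""
    start = min(q_labels, key=lambda v: (label_freq[q_labels[v]], v))
    order, seen = [start], {start}
    while len(order) < len(q_labels):
        frontier = {w for v in order for w in q_adj[v] if w not in seen}
        nxt = min(frontier, key=lambda v: (label_freq[q_labels[v]], v))
        order.append(nxt)
        seen.add(nxt)
    return order

def quick_match(q_labels, q_adj, g_labels, g_adj):
    """True if connected query q is subgraph-isomorphic to g."""
    order = access_order(q_labels, q_adj, Counter(g_labels.values()))
    mapping, used = {}, set()

    def extend(i):
        if i == len(order):
            return True
        u = order[i]
        matched = [mapping[w] for w in q_adj[u] if w in mapping]
        # Connectivity: after the first vertex, only look at neighbours
        # of an already matched vertex instead of scanning all of g.
        pool = g_adj[matched[0]] if matched else g_labels
        for x in pool:
            if x in used or g_labels[x] != q_labels[u]:
                continue
            if len(g_adj[x]) < len(q_adj[u]):  # degree pruning: stop here
                continue
            if all(x in g_adj[y] for y in matched):
                mapping[u] = x
                used.add(x)              # forwarding
                if extend(i + 1):
                    return True
                del mapping[u]
                used.discard(x)          # backtracking
        return False

    return extend(0)

g_labels = {0: "C", 1: "C", 2: "N"}
g_adj = {0: {1}, 1: {0, 2}, 2: {1}}
q_labels = {0: "N", 1: "C"}
q_adj = {0: {1}, 1: {0}}
print(quick_match(q_labels, q_adj, g_labels, g_adj))  # True
```

Starting from the rare label "N" leaves a single candidate vertex in g, and the connectivity rule then restricts the second query vertex to that vertex's neighbours.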

Page 21

Experimental Results

Settings

Notation   Filtering                   Verification
GSI        gIndex (SIGMOD ’04)         QuickSI
SSI        Swift Index (This Paper)    QuickSI
FG         FG Index (SIGMOD ’07)

Dataset: AIDS Antiviral dataset, a popular benchmark, 43k chemical bonds; “C”, “N”, and “O” are the most frequent labels. The data sets and query sets are the same as in gIndex and FG Index.

Page 22

Experiments – Response Time

Real dataset

        Construction Time   # of Features   Index Size
FG      167                 1641            12.5M
GSI     146.6               3276            13M
SSI     26.6                462             5.5

Large real dataset

        Construction Time   # of Features   Index Size
FG      2188                7100            53.8M
GSI     306.2               4394            13M
SSI     170                 922             11.8

Page 23

Substructure Similarity Search – Grafil (SIGMOD07)

Maximum Common Subgraph (MCS): given g1 and g2, mcs(g1, g2) is the common subgraph of g1 and g2 with the maximal number of edges.

Grafil: find all g in D s.t. (|q| - |mcs(q, g)|) ≤ σ

(|q| - |mcs(q, g)|): number of missing edges

Some variants...

Page 24

Substructure Similarity Search

Subgraph Similarity?

Page 25

Connected Substructure Similarity Search: our work (SIGMOD10)

Maximum Connected Common Subgraph (MCCS): given g1 and g2, mccs(g1, g2) is the connected common subgraph of g1 and g2 with the maximal number of edges.

dis(q, g) := |q| - |mccs(q, g)|

Goal: find all g in D s.t. dis(q, g) ≤ σ

NP-Complete.
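To make the definition concrete, the sketch below computes dis(q, g) by brute force, taking |q| as the number of edges of q (consistent with the "number of missing edges" reading on the Grafil slide). It enumerates connected edge subsets of q from largest to smallest and stops at the first one that embeds in g. The function names are hypothetical, and the exhaustive search is only feasible for toy graphs, in line with the NP-completeness remark.

```python
from itertools import combinations, permutations

def is_connected(edges):
    """True if the (non-empty) edge list forms one connected component."""
    nodes = {v for e in edges for v in e}
    seen, stack = set(), [edges[0][0]]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        for a, b in edges:
            if a == v:
                stack.append(b)
            elif b == v:
                stack.append(a)
    return seen == nodes

def embeds(sub_edges, q_labels, g_labels, g_edge_set):
    """True if the edge-subgraph of q given by sub_edges maps into g."""
    nodes = sorted({v for e in sub_edges for v in e})
    for image in permutations(g_labels, len(nodes)):
        m = dict(zip(nodes, image))
        if all(q_labels[v] == g_labels[m[v]] for v in nodes) and \
           all(frozenset((m[a], m[b])) in g_edge_set for a, b in sub_edges):
            return True
    return False

def dis(q_labels, q_edges, g_labels, g_edges):
    """|q| - |mccs(q, g)|, with sizes counted in edges."""
    g_edge_set = {frozenset(e) for e in g_edges}
    for k in range(len(q_edges), 0, -1):  # largest subsets first
        for sub in combinations(q_edges, k):
            if is_connected(sub) and embeds(sub, q_labels, g_labels, g_edge_set):
                return len(q_edges) - k
    return len(q_edges)

labels = {0: "A", 1: "B", 2: "C", 3: "D"}
q_edges = [(0, 1), (1, 2), (2, 3)]   # path A-B-C-D
g_edges = [(0, 1), (2, 3)]           # A-B and C-D, no B-C edge
print(dis(labels, q_edges, labels, g_edges))  # 2
```

In this example the largest common subgraph has two edges, but it is disconnected, so the connected measure counts two missing edges rather than one.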

Page 26

Filtering: triangular inequality?

dis(Q,D) + dis(D,F) ≥ dis(Q,F)
dis(Q,D) ≥ dis(Q,F) – dis(D,F)

[Figure: triangle over Query (Q), Feature (F), and Data (D), with sides dis(Q,D), dis(Q,F), and dis(D,F).]

Page 27

Filtering: triangular inequality?

dis(Q,D) + dis(D,F) ≥ dis(Q,F) ?

[Figure: Query (Q) and Data (D); dis(Q,D) = 1.]

Page 28

Similarity Search (triangular inequality)

dis(Q,D) + dis(D,F) ≥ dis(Q,F) ?

[Figure: Feature (F) and Data (D); dis(F,D) = 2.]

Page 29

Filtering: triangular inequality?

dis(Q,D) + dis(D,F) ≥ dis(Q,F): holds!

[Figure: Query (Q) and Feature (F); dis(Q,F) = 2. With the values 1, 2, and 2 from the three pairs, 1 + 2 ≥ 2.]

Page 30

Triangular inequality: does not always hold

dis(Q,D) + dis(D,F) ≥ dis(Q,F): fails (X)

[Figure: Query (Q), Feature (F), and Data (D), with pairwise values 0, 1, and 3 on dis(Q,D), dis(D,F), and dis(Q,F), for which the inequality fails.]

Page 31

Connectivity Dominance

The connectivity of mccs(g1, g2) dominates the connectivity of g2 if there is a subgraph isomorphic mapping F from mccs(g1, g2) to g2 such that, whenever removing a set S of edges disconnects mccs(g1, g2), removing F(S) always disconnects g2.

Theorem. Given three graphs g1, g2, and g3, if the connectivity of mccs(g1, g2) dominates g2 or the connectivity of mccs(g2, g3) dominates g2, then dis(g1, g3) ≤ dis(g1, g2) + dis(g2, g3).

Remark: linear algorithm, given the embedding.

Page 32

dis(Q,F) + dis(F,D) ≥ dis(Q,D)

Validation Rule 1 – index only: dis(Q,F) + dis(F,D) ≤ σ => dis(Q,D) ≤ σ (if mccs(Q, F) dominates F or mccs(F, D) dominates F)

dis(Q,D) + dis(D,F) ≥ dis(Q,F)

Pruning Rule 1: dis(Q,F) - dis(D,F) > σ => dis(Q,D) > σ (if mccs(D, F) dominates D)

dis(F,Q) + dis(Q,D) ≥ dis(F,D)

Pruning Rule 2: dis(F,D) - dis(F,Q) > σ => dis(Q,D) > σ (if mccs(F, Q) dominates Q)
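The three rules can be read as a small decision procedure for one (query, feature, data graph) triple, with σ the threshold from the search goal dis(q, g) ≤ σ. The sketch below assumes that the four directed distances and the dominance conditions have already been computed; `apply_rules` and its parameter names are hypothetical. The measure is not symmetric, so dis(Q,F) and dis(F,Q) are passed separately.

```python
def apply_rules(d_qf, d_fq, d_fd, d_df, sigma, *,
                validation_ok=False, prune1_ok=False, prune2_ok=False):
    """Classify a data graph D using one feature F, without touching D itself.

    d_qf = dis(Q,F), d_fq = dis(F,Q), d_fd = dis(F,D), d_df = dis(D,F).
    validation_ok: mccs(Q,F) dominates F or mccs(F,D) dominates F.
    prune1_ok:     mccs(D,F) dominates D.
    prune2_ok:     mccs(F,Q) dominates Q.
    """
    # Validation Rule 1: D is an answer, no verification needed.
    if validation_ok and d_qf + d_fd <= sigma:
        return "answer"
    # Pruning Rule 1: dis(Q,F) - dis(D,F) > sigma implies dis(Q,D) > sigma.
    if prune1_ok and d_qf - d_df > sigma:
        return "pruned"
    # Pruning Rule 2: dis(F,D) - dis(F,Q) > sigma implies dis(Q,D) > sigma.
    if prune2_ok and d_fd - d_fq > sigma:
        return "pruned"
    return "verify"

print(apply_rules(1, 0, 1, 0, sigma=2, validation_ok=True))  # answer
print(apply_rules(5, 0, 0, 1, sigma=2, prune1_ok=True))      # pruned
print(apply_rules(5, 0, 0, 1, sigma=2))                      # verify
```

When the required dominance condition is not known to hold, the corresponding rule is skipped and the data graph falls through to verification.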

Page 33

Verification Algorithm

Basic idea:

1. Enumerate sub-spanning trees of the query graph such that the # of missing edges ≤ σ; try to terminate the algorithm as early as possible.
2. Share the enumeration costs in two ways:
   a. do not enumerate everything from scratch;
   b. once enumerated, keep the enumerated spanning trees, organized in a binary tree to reduce storage space.
3. Extend QuickSI [VLDB08].

Page 34

Experiments

Page 35

Experiments

Page 36

cIndex [VLDB’07]: superstructure search

• Filtering-Verification Framework
  Filter false results by a feature-based index: exclusion based.
  Verify each candidate against the query graph.

[Figure: Database (ga, gb, gc), Index (fa, fb, fc), and Query q. Filtering: two data graphs are filtered and one becomes a candidate. Verification: the candidate is confirmed as an answer.]

Page 37

GPTree [EDBT’09]

• Enhanced Filtering-Verification Framework
  Share test cost in filtering and verification, respectively.

[Figure: data graphs (ga), (gb), (gc) and features (fa), (fb) over vertex labels a, b, c. Questions marked on the figure: sharing across groups? sharing between two phases? sharing across suffixes?]

Page 38

PrefIndex [SSDBM10]

• Sharing-based Filtering-Verification Framework
  Share test cost in filtering and verification, respectively.

[Figure: data graphs (ga), (gb), (gc) and features (fa), (fb) over vertex labels a, b, c. Marked on the figure: sharing across groups; sharing between two phases; no sharing across suffixes.]

Page 39

Computation Sharing Cost Model

• Cost Gain (Computation Sharing Benefits)
  Given k master groups of data graphs, assume that
  1) all data graphs in each group Gi contain a master feature fi (1 ≤ i ≤ k);
  2) the subgraph isomorphism test from fi to a query graph q costs cost_fi.

  The total cost gain (computation sharing benefits) from each master group Gi can be represented as follows:

  [Equation: total cost gain of master group Gi.]

• Maximized Gain
  Cluster the database into a disjoint set of master groups such that the total gain is maximized (NP-hard).

Page 40

Efficiency Test

• Database and Query Sets
  Database: AIDS10K
  Query Sets: Q20, Q40, Q60, Q80, Q80+

Page 41

Superstructure Similarity Search: our work (ICDE10)

Given a q and a g, dis(q, g) = |g| − |mccs(q, g)|.

Superstructure Similarity Search: find all g from D such that dis(q, g) ≤ σ.

Note: dis(q, g) = |q| − |mccs(q, g)| in substructure similarity search.

Observations:
• the filtering framework in SIGMOD10 is immediately applicable;
• techniques in SIGMOD10 may not be effective for a nearly “super-containment” relationship;
• sharing is possible, just like PrefIndex.

Page 42

SG-Enum Index (ICDE2010)

Key ideas:
1. For a g, enumerate all subgraphs with at most σ edges removed: the σ-missing subgraphs.
2. dis(q, g) ≤ σ if and only if q contains a σ-missing subgraph.

Key issues:

Automorphic subgraphs?

Prefix-sharing? in one g; among different data graphs

Query processing?
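The first key idea is a plain enumeration over removed edge sets. A minimal sketch, assuming a data graph is given as a list of edges; `sigma_missing_subgraphs` is a hypothetical name, and the sketch neither filters out disconnected results nor merges isomorphic copies, which is exactly the "automorphic subgraphs" issue listed above.

```python
from itertools import combinations

def sigma_missing_subgraphs(edges, sigma):
    """All edge subsets of g obtained by removing at most sigma edges."""
    edges = [tuple(sorted(e)) for e in edges]
    result = []
    for r in range(min(sigma, len(edges)) + 1):
        for removed in combinations(edges, r):
            result.append(frozenset(edges) - frozenset(removed))
    return result

triangle = [(0, 1), (1, 2), (0, 2)]
subs = sigma_missing_subgraphs(triangle, 1)
print(len(subs))  # 4: the triangle itself plus three 2-edge paths
```

If the three vertices of the triangle carry the same label, the three 2-edge paths are isomorphic to each other, so an index would want to store only one of them.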

Page 43

SG-Enum Index (ICDE2010)

Top-down Construction:
1. Enumerate all σ-missing subgraphs.
2. Iteratively, choose an edge as follows:
   a) Always select an edge contained in the most σ-missing subgraphs.
   b) Split the group into two: one that contains the edge and one that does not.

Bottom-up Construction:
1. Generate a sequence for each σ-missing subgraph.
2. Merge the prefixes by chance.

Bottom-up among data graphs.

Query algorithm: extends QuickSI.
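The top-down construction can be sketched as a recursive split on the most common edge. This is an illustrative reading of the two steps above, not the index structure from the paper: σ-missing subgraphs are represented as frozensets of edges, the tree is a nested dictionary, and `build_top_down`, `leaves`, and the tie-breaking rule are assumptions.

```python
from collections import Counter

def build_top_down(subgraphs, decided=frozenset()):
    """Split a group on the edge contained in the most subgraphs, recursively."""
    counts = Counter(e for s in subgraphs for e in s if e not in decided)
    if len(subgraphs) <= 1 or not counts:
        return {"leaf": list(subgraphs)}
    # Most frequent undecided edge; ties broken by the larger edge tuple.
    edge = max(counts, key=lambda e: (counts[e], e))
    return {
        "edge": edge,
        "with": build_top_down([s for s in subgraphs if edge in s],
                               decided | {edge}),
        "without": build_top_down([s for s in subgraphs if edge not in s],
                                  decided),
    }

def leaves(node):
    """Collect the subgraphs stored at the leaves of the tree."""
    if "leaf" in node:
        return list(node["leaf"])
    return leaves(node["with"]) + leaves(node["without"])

group = [
    frozenset({(0, 1), (1, 2), (0, 2)}),
    frozenset({(1, 2), (0, 2)}),
    frozenset({(0, 1), (0, 2)}),
    frozenset({(0, 1), (1, 2)}),
]
tree = build_top_down(group)
print(tree["edge"], len(leaves(tree)))  # (1, 2) 4
```

Subgraphs that end up on the same "with" path share the tests for the edges chosen along that path, which is where the sharing comes from.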

Page 44

Experiments – Query Response Time

Page 45

Conclusion and Remarks

Substructure search and its similarity search.
Superstructure search and its similarity search.
(VLDB08, ICDE10, SIGMOD10, SSDBM10)

Issues: Similarity measures? Large data graphs?