gSpan: Graph-based substructure pattern mining

Authors: Xifeng Yan and Jiawei Han

Presented by: Ahmed R. Nabhan

University of Vermont

Copyright note:

This presentation was originally provided by Prof. Xifeng Yan upon request from the student.

Citation: Xifeng Yan and Jiawei Han. gSpan: graph-based substructure pattern mining. In IEEE International Conference on Data Mining (ICDM), 2002.

Outline

Background

Problem Definition

Authors' Contribution

Concepts behind gSpan

Experimental Result

Conclusion

Background

Frequent subgraph mining is an extension of existing frequent pattern mining algorithms.

A major challenge is to count how many instances of a pattern are in the dataset.

Counting instances might be easy for sets, but it is subtle for graphs.

Recall the graph isomorphism problem.

Background

[Figure: two isomorphic graphs, (a) G1=(V1,E1,L1) and (b) G2=(V2,E2,L2), each with five vertices numbered 1 to 5 and labeled X, W, U, Y, V, together with their mapping function (c).]

Two graphs are isomorphic if one can find a one-to-one mapping of the nodes of the first graph onto the nodes of the second graph such that edges and the labels on nodes and edges are preserved.

(c) Mapping function:

f(V1.1) = V2.2

f(V1.2) = V2.5

f(V1.3) = V2.3

f(V1.4) = V2.4

f(V1.5) = V2.1
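The definition above can be checked mechanically once a candidate mapping is given. The following is a minimal sketch, not part of the original slides: the vertex labels and edges of the example graph are invented for illustration (the figure's actual edges are not recoverable), the helper names `normalize` and `is_isomorphism` are hypothetical, and only the mapping f is the one listed in (c).

```python
def normalize(edges):
    """Key each undirected edge by the unordered pair of its endpoints."""
    return {frozenset((u, v)): label for (u, v), label in edges.items()}


def is_isomorphism(g1, g2, f):
    """Check that mapping f (dict: V1 -> V2) is a label-preserving isomorphism."""
    vlabels1, edges1 = g1
    vlabels2, edges2 = g2
    # f must be a bijection between the two vertex sets.
    if set(f) != set(vlabels1) or len(set(f.values())) != len(f):
        return False
    if set(f.values()) != set(vlabels2):
        return False
    # Vertex labels must be preserved.
    if any(vlabels1[v] != vlabels2[f[v]] for v in vlabels1):
        return False
    # Every labeled edge of g1 must map onto a labeled edge of g2, and vice versa.
    mapped = {frozenset((f[u], f[v])): label for (u, v), label in edges1.items()}
    return mapped == normalize(edges2)


# Hypothetical example: a labeled 5-cycle. Labels and edges are assumptions;
# only the mapping f is taken from the slide.
g1 = ({1: "X", 2: "W", 3: "U", 4: "Y", 5: "V"},
      {(1, 2): "e", (2, 3): "e", (3, 4): "e", (4, 5): "e", (5, 1): "e"})
f = {1: 2, 2: 5, 3: 3, 4: 4, 5: 1}
# Build g2 by renaming the vertices of g1 through f.
g2 = ({f[v]: lab for v, lab in g1[0].items()},
      {(f[u], f[v]): lab for (u, v), lab in g1[1].items()})

print(is_isomorphism(g1, g2, f))                      # True
print(is_isomorphism(g1, g2, {v: v for v in g1[0]}))  # False: labels differ
```

Verifying a given mapping is cheap; the hard part, and the reason counting is subtle for graphs, is that no mapping is given in advance and one may have to search over many candidates.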

Problem: Finding Frequent Subgraphs

Problem setting: similar to finding frequent itemsets for association rule discovery

Input: Database of graph transactions

Undirected simple graphs (no self-loops, no multiple edges)

Each graph transaction has labeled edges and vertices.

Transactions need not be connected.

Minimum support threshold

Output: Frequent subgraphs that satisfy the support threshold, where each frequent subgraph is connected.
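To make the problem setting concrete, here is a small sketch of support counting over a database of graph transactions. It is an illustration only: the brute-force containment test is not how gSpan counts support, and the names `make_graph`, `contains`, `support`, and the three toy transactions are assumptions introduced here.

```python
from itertools import permutations


def make_graph(vlabels, edges):
    """Return (vertex labels, edge labels); undirected edges are keyed by frozenset."""
    return vlabels, {frozenset((u, v)): lab for u, v, lab in edges}


def contains(graph, pattern):
    """Brute force: does `pattern` occur in `graph` as a labeled subgraph?"""
    g_vl, g_el = graph
    p_vl, p_el = pattern
    p_nodes = list(p_vl)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_vl, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if any(p_vl[v] != g_vl[m[v]] for v in p_nodes):
            continue
        if all(g_el.get(frozenset(m[v] for v in e)) == lab
               for e, lab in p_el.items()):
            return True
    return False


def support(db, pattern):
    """Number of graph transactions that contain the pattern at least once."""
    return sum(contains(g, pattern) for g in db)


# Toy database of three graph transactions (invented for illustration).
db = [
    make_graph({0: "X", 1: "Y", 2: "Z"}, [(0, 1, "a"), (1, 2, "b")]),
    make_graph({0: "X", 1: "Y"}, [(0, 1, "a")]),
    make_graph({0: "X", 1: "Z"}, [(0, 1, "a")]),
]
p1 = make_graph({0: "X", 1: "Y"}, [(0, 1, "a")])
p2 = make_graph({0: "X", 1: "Y", 2: "Z"}, [(0, 1, "a"), (1, 2, "b")])

min_sup = 2
print(support(db, p1), support(db, p1) >= min_sup)  # 2 True
print(support(db, p2), support(db, p2) >= min_sup)  # 1 False
```

Support is counted per transaction, so a pattern occurring several times inside one graph still contributes one. With a minimum support of 2, only the one-edge pattern is frequent in this toy database.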

Finding Frequent Subgraphs

Authors' Contribution

Representing graphs as strings (like TreeMiner)

No candidate generation!

“It combines the growing and checking of frequent subgraphs into one procedure, thus accelerates the mining process.”

Very fast; it is still a standard baseline that most competing systems are compared against.

Concepts behind gSpan

The idea is to produce a Depth-First Search (DFS) code for each graph, with one entry per edge.

Edges are sorted according to the lexicographic order of the codes.

Yan and Han proved that graph isomorphism can be tested by comparing the minimum DFS codes of two graphs.

Starting with small graph patterns containing one edge, patterns are expanded systematically by depth-first search.

gSpan employs the anti-monotonic property of graph frequency.

Anti-Monotonicity of graph frequency

The frequency of a super-pattern is less than or equal to the frequency of a sub-pattern.

Copyright SIGMOD’08
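The property can be written as an inequality. The notation is an assumption introduced here, not taken from the slides: support(g) is the number of graph transactions containing g, g ⊆ g' means g is a subgraph of g', and min_sup is the minimum support threshold.

```latex
g \subseteq g' \;\Longrightarrow\; \mathrm{support}(g') \le \mathrm{support}(g)
\qquad\text{hence}\qquad
\mathrm{support}(g) < \mathit{min\_sup} \;\Longrightarrow\; \mathrm{support}(g') < \mathit{min\_sup}
```

The second implication is what justifies pruning: once a pattern is infrequent, none of its super-patterns needs to be examined.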

Lexicographic Ordering in Graph

It can tell us the order of two graphs.

The design can help us build a similar hierarchy.

The design should make it easy to grow from one level to the next lower level and to roll up from a lower level to a higher one.

For graphs, it may be difficult to find a design in which no two nodes of this tree represent the same graph.

It can tell us whether a graph has already been discovered.

Most importantly, if a graph has been discovered, all its child nodes in the hierarchy must have been discovered as well.

Lexicographic Ordering in Graph

[Figure: hierarchy of graph patterns arranged by level: 1-edge, 2-edge, 3-edge, ...]

DFS code and Minimum DFS code

Depth First Tree and Forward/Backward Edge Set

DFS code and Minimum DFS code

We use a 5-tuple (vi, vj, l(vi), l(vj), l(vi,vj)) to represent an edge. (It may be redundant, but it is much easier to understand.)

Turn a graph into a sequence whose basic element is this 5-tuple. Form the sequence in the following order:

1. To extend the graph by one new node, add the forward edge that connects a node in the old graph to this new node.

2. Add all backward edges that connect this new node to other nodes in the old graph.

3. Repeat this procedure.
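The procedure above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function name `dfs_code` and the rule "visit neighbors in increasing vertex id" are assumptions that fix one particular depth-first traversal, and the example graph is the five-vertex graph whose edges appear in the DFS code slides (vertex labels x, y, x, z, z).

```python
def dfs_code(vlabel, edges, start):
    """Return one DFS code: a list of 5-tuples (i, j, l(vi), l(vj), l(vi,vj))."""
    lab = {frozenset((u, w)): l for u, w, l in edges}
    adj = {v: set() for v in vlabel}
    for u, w, _ in edges:
        adj[u].add(w)
        adj[w].add(u)

    idx = {start: 0}   # discovery index of each visited vertex
    code = []

    def visit(v):
        for w in sorted(adj[v]):          # assumed tie-break: smallest id first
            if w in idx:
                continue
            idx[w] = len(idx)
            # Step 1: forward edge from the old graph to the new node.
            code.append((idx[v], idx[w], vlabel[v], vlabel[w],
                         lab[frozenset((v, w))]))
            # Step 2: all backward edges from the new node to older nodes.
            back = sorted((u for u in adj[w] if u in idx and u != v),
                          key=idx.get)
            for u in back:
                code.append((idx[w], idx[u], vlabel[w], vlabel[u],
                             lab[frozenset((w, u))]))
            # Step 3: repeat from the new node.
            visit(w)

    visit(start)
    return code


vlabel = {0: "x", 1: "y", 2: "x", 3: "z", 4: "z"}
edges = [(0, 1, "a"), (1, 2, "b"), (2, 0, "a"),
         (2, 3, "c"), (3, 1, "b"), (1, 4, "d")]

for i, e in enumerate(dfs_code(vlabel, edges, start=0)):
    print(f"e{i}: {e}")
```

Starting from a different vertex, or breaking ties differently, yields a different code for the same graph.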

DFS code

[Figure: example graph with vertices v0 (X), v1 (Y), v2 (X), v3 (Z), v4 (Z) and edge labels a, b, c, d, shown with its DFS code built up edge by edge.]

e0: (0,1,x,y,a)

e1: (1,2,y,x,b)

e2: (2,0,x,x,a)

e3: (2,3,x,z,c)

e4: (3,1,z,y,b)

e5: (1,4,y,z,d)

Minimum DFS code

Each graph may have many DFS codes (why?): the lexicographically smallest one is its minimum DFS code.

Edge no.  (B)          (C)          (D)

0         (0,1,x,y,a)  (0,1,y,x,a)  (0,1,x,x,a)

1         (1,2,y,x,b)  (1,2,x,x,a)  (1,2,x,y,b)

2         (2,0,x,x,a)  (2,0,x,y,b)  (2,0,y,x,a)

3         (2,3,x,z,c)  (2,3,x,z,c)  (2,3,y,z,b)

4         (3,1,z,y,b)  (3,0,z,y,b)  (3,1,z,x,c)

5         (1,4,y,z,d)  (0,4,y,z,d)  (2,4,y,z,d)
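One way to see why a graph has many DFS codes is to enumerate them: each choice of start vertex and of the next unvisited neighbor gives a different traversal. The sketch below does this by brute force for small connected graphs and takes the smallest code. Two caveats, both assumptions of this sketch: it compares codes with Python's plain tuple ordering as a simplified stand-in for the DFS lexicographic order defined in the paper, and the names `all_dfs_codes` and `min_dfs_code` are hypothetical. Because the set of all DFS codes depends only on the graph's structure and labels, the minimum under any fixed total order is the same for isomorphic graphs.

```python
def all_dfs_codes(vlabel, edges):
    """Yield every DFS code of a small connected labeled graph (brute force)."""
    lab = {frozenset((u, w)): l for u, w, l in edges}
    adj = {v: set() for v in vlabel}
    for u, w, _ in edges:
        adj[u].add(w)
        adj[w].add(u)

    def grow(code, idx, path):
        if len(idx) == len(vlabel):       # all vertices (hence all edges) coded
            yield tuple(code)
            return
        path = list(path)
        # Backtrack to the deepest vertex that still has an unvisited neighbor.
        while all(w in idx for w in adj[path[-1]]):
            path.pop()
        v = path[-1]
        for w in sorted(adj[v]):
            if w in idx:
                continue
            new_idx = dict(idx)
            new_idx[w] = len(idx)
            # Forward edge to the new vertex, then its backward edges.
            step = [(idx[v], new_idx[w], vlabel[v], vlabel[w],
                     lab[frozenset((v, w))])]
            for u in sorted((u for u in adj[w] if u in idx and u != v),
                            key=idx.get):
                step.append((new_idx[w], idx[u], vlabel[w], vlabel[u],
                             lab[frozenset((w, u))]))
            yield from grow(code + step, new_idx, path + [w])

    for start in vlabel:
        yield from grow([], {start: 0}, [start])


def min_dfs_code(vlabel, edges):
    # Simplification: plain tuple comparison, not the paper's exact ordering.
    return min(all_dfs_codes(vlabel, edges))


vlabel = {0: "x", 1: "y", 2: "x", 3: "z", 4: "z"}
edges = [(0, 1, "a"), (1, 2, "b"), (2, 0, "a"),
         (2, 3, "c"), (3, 1, "b"), (1, 4, "d")]

# The same graph with its vertices renumbered (an isomorphic copy).
p = {0: 3, 1: 0, 2: 4, 3: 1, 4: 2}
vlabel_h = {p[v]: l for v, l in vlabel.items()}
edges_h = [(p[u], p[w], l) for u, w, l in edges]

codes = set(all_dfs_codes(vlabel, edges))
print(len(codes) > 1)                                                  # True
print(min_dfs_code(vlabel, edges) == min_dfs_code(vlabel_h, edges_h))  # True
```

Column (B) of the table is one of the enumerated codes; it is the one obtained by starting at v0 and always visiting the lowest-numbered unvisited neighbor.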

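Graph Parent and its Children

[Figure: example graph with vertices labeled X, Y, X, Z, Z and edge labels a, b, c, with candidate one-edge extensions marked "?".]

Given a DFS code c0 = (e0, e1, …, en) and a code c1 = (e0, e1, …, en, ex) that extends it by one edge: if c0 < c1, then c0 is c1’s parent and c1 is c0’s child.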

DFS Code Tree

[Figure: DFS code tree with levels for 1-edge, 2-edge, 3-edge, ... codes]

Theorem

1. Given two graphs G0 and G1, G0 is isomorphic to G1 iff min_dfs_code(G0) = min_dfs_code(G1).

2. The DFS Code Tree covers all graphs, although some tree nodes may represent the same graph. (Covering)

3. Given a node in the DFS Code Tree, if its DFS code is not its minimum DFS code, pruning this node and all its descendants won’t change “Covering”.

Algorithm

Experimental Result

Conclusion

No candidate generation and no false-positive testing

Space savings from depth-first search (DFS)

Good performance: with a “memory pool” and one major counting improvement, performance appears likely to improve by a further factor of 5 (but this needs more testing).

Questions?

Exam Questions

Q1) Compare gSpan to Apriori-based algorithms. Answer:

Unlike Apriori-based algorithms, gSpan does not generate candidate patterns and then test them to prune false positives. This makes gSpan efficient in both time and space. Apriori-based algorithms must generate a candidate and then test it for isomorphism against the graph dataset to calculate its support. This test is costly. gSpan, on the other hand, does not perform this separate isomorphism test!

Q2) What are the main concepts behind gSpan? Answer:

- Using Depth-First-Search (DFS) codes to label graph edges

- Employing the anti-monotonic property of subgraph frequency

- Pattern growth and pruning

Exam Questions (cont.)

Q3) Please list the similarities and differences between gSpan and TreeMiner. Answer:

- Both algorithms employ string representations of graphs

- TreeMiner generates candidate patterns and then finds their support, while gSpan expands frequent patterns directly

- gSpan is more generally applicable (it can handle both trees and graphs)