Frequent Subgraph Mining - NC State Computer Science

Frequent Subgraph Mining

Frequent Subgraph Mining (FSM) Outline

FSM Preliminaries

FSM Algorithms

gSpan: complete FSM on labeled graphs

SUBDUE: approximate FSM on labeled graphs

SLEUTH: FSM on trees

Review

FSM In a Nutshell

Discovery of graph structures that occur a significant number of times across a set of graphs

Ex.: Common occurrences of the hydroxyl group (O-H)

Other instances:

Finding common biological pathways among species.

Recurring patterns of human interaction during an epidemic.

Highlighting similar substructures to characterize a data set as a whole.

[Figure: molecular graphs of sulfuric acid, acetic acid, carbonic acid, and ammonia; the first three all contain the O-H substructure.]

FSM Preliminaries

Support is a threshold, given either as an absolute count or a relative frequency.

Frequent subgraphs occur at least the support threshold number of times.

Ex.: O-H is present in 3 of the 4 input molecules, so it is frequent if support <= 3.
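As a toy illustration of support counting (patterns simplified to single labeled edges; real FSM must test subgraph isomorphism, and the molecule encodings below are hypothetical simplifications, not from the slides):

```python
# Toy support count: each graph is a set of labeled edges (label1, bond, label2).

def support(pattern, graphs):
    """Number of graphs containing the pattern edge (in either direction)."""
    fwd = pattern
    rev = (pattern[2], pattern[1], pattern[0])
    return sum(1 for g in graphs if fwd in g or rev in g)

# Hypothetical, simplified edge sets for the four example molecules
sulfuric = {("S", "double", "O"), ("S", "single", "O"), ("O", "single", "H")}
acetic   = {("C", "single", "C"), ("C", "double", "O"),
            ("O", "single", "H"), ("C", "single", "H")}
carbonic = {("C", "double", "O"), ("C", "single", "O"), ("O", "single", "H")}
ammonia  = {("N", "single", "H")}

graphs = [sulfuric, acetic, carbonic, ammonia]
print(support(("O", "single", "H"), graphs))  # 3: O-H occurs in 3 of 4 molecules
```

With an absolute support threshold of 3 or less, the O-H pattern is frequent; with a threshold of 4 it is not.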

What Makes FSM So Hard?

Isomorphic graphs have the same structural properties even though they may look different.

Subgraph isomorphism problem: does a graph contain a subgraph isomorphic to another graph?

FSM algorithms encounter this problem while building candidate graphs.

This problem is known to be NP-complete!

[Figure: two four-vertex graphs, drawn differently, that are isomorphic under an A, B, C, D labeling.]

Pattern Growth Approach

Underlying strategy of both traditional frequent pattern mining and frequent subgraph mining

General Process:

candidate generation: which patterns will be considered? For FSM, the number of candidate subgraphs (like subsets) grows exponentially with size!

candidate pruning: if a candidate is not a viable frequent pattern, can we exploit that to prevent unnecessary work?

support counting: how many occurrences of a given pattern exist?

These algorithms work breadth-first or depth-first: they join smaller frequent patterns into larger ones, then check the frequency of the larger patterns.

Pattern Growth Approach Apriori

Apriori principle: if an itemset is frequent, then all of its subsets are also frequent. Ex.: if itemset {A, B, C, D} is frequent, then {A, B} is frequent.

Simple proof: every occurrence of a set is also an occurrence of each of its subsets, thus frequency of subset >= frequency of set. The same property applies to (sub)graphs!

The Apriori algorithm exploits this to prune huge sections of the search space!

[Figure: itemset lattice over {A, B, C}: level 1 is A, B, C; level 2 is AB, AC, BC; level 3 is ABC. If A is infrequent, no supersets containing A can be frequent!]
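The level-wise join-and-prune process can be sketched for itemsets as follows (a minimal sketch; the transaction data is illustrative, not from the slides):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining with Apriori pruning."""
    items = sorted({i for t in transactions for i in t})
    frequent = []
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # support counting: keep candidates appearing in >= minsup transactions
        survivors = [c for c in candidates
                     if sum(1 for t in transactions if c <= t) >= minsup]
        frequent.extend(survivors)
        # candidate generation: join survivors into (k+1)-sets, then prune any
        # candidate that has an infrequent k-subset (the Apriori principle)
        k = len(candidates[0]) + 1
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in survivors
                             for s in combinations(c, k - 1))]
    return frequent

baskets = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C", "D"}]
print(sorted(sorted(s) for s in apriori(baskets, minsup=2)))
# → [['A'], ['A', 'B'], ['A', 'C'], ['B'], ['B', 'C'], ['C']]
```

Note how {D} is dropped at level 1, so no candidate containing D is ever counted, and {A, B, C} is generated but rejected at the counting step.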

FSM Algorithms Discussed

gSpan

complete frequent subgraph mining

improves performance over straightforward apriori extensions to graphs through DFS Code representation and aggressive candidate pruning

SUBDUE

approximate frequent subgraph mining

uses graph compression as the metric for determining a frequently occurring subgraph

SLEUTH

complete frequent subgraph mining

built specifically for trees

FSM R package

The R package for FSM is called subgraphMining

To install: install.packages("subgraphMining"); to load: library(subgraphMining)

Package contains: gSpan, SUBDUE, SLEUTH.

Also contains the following data sets:

cslogs

metabolicInteractions

To load the data, use the following code:

# The cslogs data set
data(cslogs)
# The metabolicInteractions data set
data(metabolicInteractions)

FSM Outline

FSM Preliminaries

FSM Algorithms

gSpan

SUBDUE

SLEUTH

Review

gSpan: Graph-Based Substructure Pattern Mining

Written by Xifeng Yan & Jiawei Han in 2002.

Form of pattern-growth mining algorithm.

Adds edges to candidate subgraphs, also known as edge extension

Avoids cost-intensive problems like:

Redundant candidate generation

Isomorphism testing

Uses two main concepts to find frequent subgraphs

DFS lexicographic order

minimum DFS code

gSpan Inputs

Set of graphs, support threshold

Each graph of the form G = (V, E, L_V, L_E)

V, E: vertex and edge sets

L_V: vertex labels

L_E: edge labels

label sets need not be one-to-one with V, E

[Figure: carbonic acid as a labeled graph, with L_V = {C, O, H} and L_E = {single bond, double bond}.]
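A minimal way to hold such a labeled graph in code (a sketch of the G = (V, E, L_V, L_E) form, not the R package's internal format), using carbonic acid as the example:

```python
# A labeled graph: vertices carry labels, and each undirected edge carries a
# label. Labels need not be unique across vertices or edges.
carbonic_acid = {
    "vertices": {0: "C", 1: "O", 2: "O", 3: "O", 4: "H", 5: "H"},
    "edges": {                            # frozenset vertex pair -> edge label
        frozenset({0, 1}): "double bond",
        frozenset({0, 2}): "single bond",
        frozenset({0, 3}): "single bond",
        frozenset({2, 4}): "single bond",
        frozenset({3, 5}): "single bond",
    },
}

# The label sets L_V and L_E are just the distinct labels in use
print(sorted(set(carbonic_acid["vertices"].values())))  # ['C', 'H', 'O']
print(sorted(set(carbonic_acid["edges"].values())))     # ['double bond', 'single bond']
```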

gSpan Components

Depth-first Search (DFS) Code: structured graph representation for building and comparing subgraphs

DFS Lexicographic Order: canonical comparison of graphs

Minimal DFS code: selection and pruning of subgraphs

Strategy: build frequent subgraphs bottom-up, using DFS code as a regularized representation

eliminate redundancies via minimal DFS codes, based on the lexicographic ordering of codes

Depth First Search Primer

Todo?

gSpan: DFS codes

Edge #  Code
0       (0,1,X,a,Y)
1       (1,2,Y,b,X)
2       (2,0,X,a,X)
3       (2,3,X,c,Z)
4       (3,1,Z,b,Y)
5       (1,4,Y,d,Z)

Format: (i, j, L_i, L_(i,j), L_j)

i, j: vertices numbered by time of discovery; L_i, L_j: vertex labels of i, j

L_(i,j): edge label between i, j

i < j: forward edge; i > j: back edge

[Figure: example graph with vertex discovery times 0-4 (vertex labels X, Y, X, Z, Z) and edge labels a, b, c, d; each row of the code table is an edge of this graph.]

DFS Code: the sequence of edges traversed during a DFS
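Emitting a DFS code can be sketched as follows (a simplified recursive DFS; neighbor visit order here just follows dict order, whereas gSpan enumerates traversals systematically):

```python
def dfs_code(vertices, edges, start):
    """Emit DFS-code 5-tuples (i, j, L_i, L_(i,j), L_j).

    vertices: {vertex: label}; edges: {frozenset({u, v}): label}
    (undirected, no self-loops). Discovery times are assigned as
    vertices are first reached.
    """
    disc = {start: 0}            # vertex -> discovery time
    code, seen = [], set()

    def visit(u):
        for e, elabel in edges.items():
            if u not in e or e in seen:
                continue
            v = next(w for w in e if w != u)
            seen.add(e)
            if v not in disc:    # forward edge: v newly discovered
                disc[v] = len(disc)
                code.append((disc[u], disc[v], vertices[u], elabel, vertices[v]))
                visit(v)
            else:                # back edge to an already-discovered vertex
                code.append((disc[u], disc[v], vertices[u], elabel, vertices[v]))

    visit(start)
    return code

# Triangle with vertex labels X, Y, X and edge labels a, b, a:
verts = {0: "X", 1: "Y", 2: "X"}
edgs = {frozenset({0, 1}): "a", frozenset({1, 2}): "b", frozenset({0, 2}): "a"}
print(dfs_code(verts, edgs, 0))
# → [(0, 1, 'X', 'a', 'Y'), (1, 2, 'Y', 'b', 'X'), (2, 0, 'X', 'a', 'X')]
```

Starting the traversal at a different vertex, or visiting neighbors in a different order, produces a different code for the same graph.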

DFS Code: Edge Ordering

Edges in the code are ordered in a very specific manner, corresponding to the DFS process.

Let e1 = (i1, j1) and e2 = (i2, j2); e1 < e2 means e1 appears before e2 in the code.

Ordering rules:

1. if i1 = i2 and j1 < j2

from the same source vertex, e1 is traversed before e2 in the DFS

2. if i1 < j1 and j1 = i2

e1 is a forward edge, and e2 is traversed as a result of e1's traversal

3. if e1 < e2 and e2 < e3, then e1 < e3

the ordering is transitive

DFS Code: Edge Ordering Example

Rule applications by edge #:

0 < 1 (Rule 2)

1 < 2 (Rule 2)

0 < 2 (Rule 3)

2 < 3 (Rule 1)

Exercise: what others?


Edge ordering can be recorded easily during the DFS!
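The two base rules translate directly into a predicate (a sketch; edges are given as (i, j) discovery-time pairs, and Rule 3, transitivity, follows by chaining):

```python
def precedes(e1, e2):
    """True if edge e1 = (i1, j1) directly precedes e2 = (i2, j2)
    in DFS-code order, by Rule 1 or Rule 2."""
    (i1, j1), (i2, j2) = e1, e2
    if i1 == i2 and j1 < j2:   # Rule 1: same source vertex, e1 traversed first
        return True
    if i1 < j1 and j1 == i2:   # Rule 2: e1 forward, e2 grows from e1's target
        return True
    return False

# (i, j) pairs of the example code's six edges
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 1), (1, 4)]
print(precedes(edges[0], edges[1]))  # True  (Rule 2)
print(precedes(edges[1], edges[2]))  # True  (Rule 2)
print(precedes(edges[2], edges[3]))  # True  (Rule 1)
print(precedes(edges[0], edges[2]))  # False directly; holds via Rule 3 through edge 1
```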

Graphs have multiple DFS Codes!

[Figure: three different DFS traversals of the same example graph, each assigning different discovery times 0-4 and thus producing a different DFS code.]

Exercise: write the DFS codes of the two rightmost graphs.

Solution to redundant DFS codes: lexicographic ordering and the minimal code!

DFS Lexicographic Ordering vs. DFS Code

DFS code: the ordered edge sequence of one particular DFS

E.g., DFSs that start at different vertices may produce different DFS codes

Lexicographic ordering: ordering between different DFS codes

DFS Lexicographic Ordering

Given a lexicographic ordering of the label set L,

Given graphs G_α, G_β (over equivalent label sets),

Given DFS codes

α = code(G_α) = (a_0, a_1, ..., a_m)

β = code(G_β) = (b_0, b_1, ..., b_n)

(assume m ≤ n)

α ≤ β iff either of the following is true:

∃ t, 0 ≤ t ≤ min(m, n), such that a_k = b_k for k < t and a_t < b_t

a_k = b_k for 0 ≤ k ≤ m (α is a prefix of β)

DFS Lex. Ordering: Edge Comparison

Given DFS codes

= code , = ,, ,

= code , = ,, ,

(assume )

Given such that = for <

Given = , , ,, , ,

= , , ,, , ,

if one of the following cases

Case 1: Both forward edges, AND

Case 2: Both back edges, AND

Case 3: back, forward

Edge Comparison: Case 1 (both forward)

Both forward edges, AND one of the following:

i_b < i_a (a_t starts from a later-visited vertex)

Why is this (think about the DFS process)?

i_a = i_b AND the labels of a_t are lexicographically less than the labels of b_t, in tuple order

Ex.: labels are strings, a_t = (_, _, m, e, x), b_t = (_, _, m, u, x): m = m, then e < u, so a_t < b_t

Note: if both are forward edges, then j_a = j_b. Reasoning: all previous edges are equal, so the target vertex discovery times are the same.

Edge Comparison: Case 2 (both back)

Both back edges, AND one of the following:

j_a < j_b (a_t refers to an earlier vertex)

j_a = j_b AND the edge label of a_t is lexicographically less than that of b_t

Note: given that all previous edges are equal, the vertex labels must also be equal

Note: if both are back edges, then i_a = i_b. Reasoning: all previous edges are equal, so the source vertex discovery times are the same.

Edge #  Code (A)      Code (B)      Code (C)
0       (0,1,X,a,Y)   (0,1,Y,a,X)   (0,1,X,a,X)
1       (1,2,Y,b,X)   (1,2,X,a,X)   (1,2,X,a,Y)
2       (2,0,X,a,X)   (2,0,X,b,Y)   (2,0,Y,b,X)
3       (2,3,X,c,Z)   (2,3,X,c,Z)   (2,3,Y,b,Z)
4       (3,1,Z,b,Y)   (3,1,Z,b,X)   (3,0,Z,c,X)
5       (1,4,Y,d,Z)   (0,4,Y,d,Z)   (2,4,Y,d,Z)

[Figure: the three DFS traversals of the example graph that produce codes (A), (B), and (C).]
