Frequent Subgraph Mining - NC State Computer Science ?· Frequent Subgraph Mining (FSM) Outline ...…

  • View
    212

  • Download
    0

Embed Size (px)

Transcript

  • Frequent SubgraphMining

  • Frequent Subgraph Mining (FSM) Outline

    FSM Preliminaries

    FSM Algorithms

    gSpan complete FSM on labeled graphs

    SUBDUE approximate FSM on labeled graphs

    SLEUTH FSM on trees

    Review

  • FSM In a Nutshell

    Discovery of graph structures that occur a significant number of times across a set of graphs

    Ex.: Common occurrences of hydroxide-ion

    Other instances:

    Finding common biological pathways among species.

    Recurring patterns of humans interaction during an epidemic.

    Highlighting similar data to reveal data set as a whole.

    OO

    SS

    OO

    OO OO

    HHHHHH

    CC

    HH

    HH CC

    OO

    OO HH

    CC

    OO

    OO OO

    HH HH

    NN

    HH

    HH HH

    Sulfuric Acid Acetic Acid

    Carbonic Acid

    Ammonia

  • FSM Preliminaries

    Support is some integer or frequency

    Frequent graphs occur more than support number of times.

    O-H present in inputs frequent if support

  • What Makes FSM So Hard?

    Isomorphic graphs have same structural properties even though they may look different.

    Subgraph isomorphism problem: Does a graph contain a subgraphisomorphic to another graph?

    FSM algorithms encounter this problem while buildings graphs.

    This problem is known to be NP-complete!

    Isomorphic under A,B,C,D labeling

    BBAA

    DDCC

    BBAA

    CCDD

  • Pattern Growth Approach

    Underlying strategy of both traditional frequent pattern mining and frequent subgraph mining

    General Process:

    candidate generation: which patterns will be considered? For FSM,

    candidate pruning: if a candidate is not a viable frequent pattern, can we exploit the pattern to prevent unnecessary work?

    subgraphs and subsets exponentiate as size increases!

    support counting: how many of a given pattern exist?

    These algorithms work in a breadth-first or depth-first way. Joins smaller frequent sets into larger ones. Checks the frequency of larger sets.

  • Pattern Growth Approach Apriori

    Apriori principle: if an itemset is frequent, then all of its subsets are also frequent. Ex. if itemset {A, B, C, D} is frequent, then {A, B} is frequent. Simple proof: With respect to frequency, all sets trivially contain their subsets,

    thus frequency of subset >= frequency of set. Same property applies to (sub)graphs!

    Apriori algorithm exploits this to prune huge sections of the search space!

    A B C

    AB AC BC

    ABC

    If A is infrequent, nosupersets with A canbe frequent!

  • FSM Algorithms Discussed

    gSpan

    complete frequent subgraph mining

    improves performance over straightforward apriori extensions to graphs through DFS Code representation and aggressive candidate pruning

    SUBDUE

    approximate frequent subgraph mining

    uses graph compression as metric for determining a frequently occuring subgraph

    SLEUTH

    complete frequent subgraph mining

    built specifically for trees

  • FSM R package

    R package for FSM is called subgraphMining

    To import: install.packages(subgraphMining)

    Package contains: gSpan, SUBDUE, SLUETH.

    Also contains the following data sets:

    cslogs

    metabolicInteractions.

    To load the data, use the following code:# The cslogs data set

    data(cslogs)# The matabolicInteractions datadata(metabolicInteractions)

  • FSM Outline

    FSM Preliminaries

    FSM Algorithms

    gSpan

    SUBDUE

    SLEUTH

    Review

  • gSpan: Graph-Based Substructure PatternMining

    Written by Xifeng Yan & Jiawei Han in 2002.

    Form of pattern-growth mining algorithm.

    Adds edges to candidate subgraph

    Also known as, edge extension

    Avoid cost intensive problems like

    Redundant candidate generation

    Isomorphism testing

    Uses two main concepts to find frequent subgraphs

    DFS lexicographic order

    minimum DFS code

  • gSpan Inputs

    Set of graphs, support

    Graph of form = (,, , )

    , vertex and edge sets

    vertex labels

    edge labels

    label sets need not be one-to-one

    CC

    OO

    OO OO

    HH HH = {,,}

    = {singlebond,doublebond}

  • gSpan Components

    Depth-first Search (DFS) Code

    DFS Lexicographic Order

    minimal DFS code

    structuredgraph representation for building, comparing

    canonical comparisonof graphs

    selection, pruning of subgraphs

    Strategy: build frequent subgraphs bottom-up, using DFS code as regularized representation

    eliminate redundancies via minimal DFS codes based on code lexicographic ordering

  • Depth First Search Primer

    Todo?

  • gSpan: DFS codes

    0 (0,1,X,a,Y)1 (1,2,Y,b,X)2 (2,0,X,a,X)3 (2,3,X,c,Z)4 (3,1,Z,b,Y)5 (1,4,Y,d,Z)

    Format: (, ,,(, ),)

    , vertices by time of discovery,- vertex labels of ,

    (, )

    edge label between ,

    < : forward edge > : back edge

    Edge # Code

    Vertex discoverytimes

    X

    X

    Z

    Y

    Z

    a

    b

    c

    d

    a

    b

    0

    1

    2

    3

    4

    DFS Code: sequence of edgestraversed during DFS

  • DFS Code: Edge Ordering

    Edges in code ordered in very specific manner, corresponding to DFS process

    = (, ), = (, )

    appears before in code

    Ordering rules:

    1. if = and <

    from same source vertex, traversed before in DFS

    2. if < and =

    is a forward edge and traversed as result of traversal

    3. if and ,

    ordering is transitive

  • DFS Code: Edge Ordering Example

    Rule applications by edge #

    0 1(Rule 2)

    1 2(Rule 2)

    0 2(Rule 3)

    2 3(Rule 1)

    Exercise: what others?

    X

    X

    Z

    Y

    Z

    a

    b

    c

    d

    a

    b

    0

    1

    2

    3

    4

    0 (0,1,X,a,Y)1 (1,2,Y,b,X)2 (2,0,X,a,X)3 (2,3,X,c,Z)4 (3,1,Z,b,Y)5 (1,4,Y,d,Z)

    Edge # Code

    Edge ordering can be recorded easily during the DFS!

  • Graphs have multiple DFS Codes!

    X

    X

    Z

    Y

    Z

    a

    b

    c

    d

    a

    b X

    X

    Z

    Y

    Z

    a

    b

    c

    d

    a

    b

    0

    1

    2

    3

    4X

    Y

    ZX

    Z

    a

    a

    c

    d

    b

    b

    0

    1

    2

    3

    4

    Y

    X

    Z

    X

    Z

    a

    a

    b d

    b

    c

    0

    1

    2

    34

    Exercise: Write the 2 rightmost graphs using DFS code

    solution to redundant DFS codes: lexical ordering, minimal code!

  • DFS Lexicographic Ordering vs. DFS Code

    DFS code: Ordering of edge sequence of a particular DFS

    E.g. DFSs that start at different vertices may have different DFS codes

    Lexicographic ordering: ordering between different DFS codes

  • DFS Lexicographic Ordering

    Given lexicographic ordering of label set ,

    Given graphs , (equivalent label sets).

    Given DFS codes

    = code , = ,, ,

    = code , = ,, ,

    (assume )

    iffeitherofthefollowingaretrue: , 0 min , suchthat

    = for < and

    = 0

  • DFS Lex. Ordering: Edge Comparison

    Given DFS codes

    = code , = ,, ,

    = code , = ,, ,

    (assume )

    Given such that = for <

    Given = , , ,, , ,

    = , , ,, , ,

    if one of the following cases

    Case 1: Both forward edges, AND

    Case 2: Both back edges, AND

    Case 3: back, forward

  • Edge Comparison: Case 1 (both forward)

    Both forward edges, AND one of the following:

    < (edge starts from a later visited vertex)

    Why is this (think about DFS process)?

    = AND labels of lexicographically less than labels of , in order of tuple.

    Ex: Labels are strings, = __, __, m, e, x , = (__, __,m, u,x)

    m = m, e < u

    Note: if both forward edges, then = Reasoning: all previous edges equal, target vertex discovery times are the same

  • Edge Comparison: Case 2 (both back)

    Both back edges, AND one of the following:

    < (edge refers to earlier vertex)

    = AND edge label of lexicographically less than

    Note: given that all previous edges equal, vertex labels must also be equal

    Note: if both back edges, then = Reasoning: all previous edges equal, source vertex discovery times are the same.

  • 0 (0,1,X,a,Y)1 (1,2,Y,b,X)2 (2,0,X,a,X)3 (2,3,X,c,Z)4 (3,1,Z,b,Y)5 (1,4,Y,d,Z)

    Edge # Code (A)

    (0,1,Y,a,X)(1,2,X,a,X)(2,0,X,b,Y)(2,3,X,c,Z)(3,1,Z,b,X)(0,4,Y,d,Z)

    Code (B)

    (0,1,X,a,X)(1,2,X,a,Y)(2,0,Y,b,X)(2,3,Y,b,Z)(3,0,Z,c,X)(2,4,Y,d,Z)

    Code (C)

    X

    X

    Z

    Y

    Z

    a

    b

    c

    d

    a

    b

    0

    1

    2

    3

    4

    X

    Y

    ZX

    Z

    a

    a

    c

    d

    b

    b

    0

    1

    2

    3

    4

    Y

    X

    Z

    X

    Z

    a

    a

    b d

    b

    c

    0

    1

    2

    34

  • 0 (0,1,X,a,Y)1 (1,2,Y,b,X)2 (2,0,X,a,X)3 (2,3,X,c,Z)4 (3,1,Z,b,Y)5 (1,4,Y,d,Z)

    (0,1,Y,a,X)(1,2,X,a,X)(2,0,X,b,Y)(2,3,X,c,Z)(3,1,Z,b,X)(0,4,Y,d,Z)

    (0,1,X,a,X)(1,2,X,a,Y)(2,0,Y,b,X)(2,3,Y,b,Z)(3,0,Z,c,X)(2,4,Y,d,Z)

    X

    X

    Z

    Y

    Z

    a

    b

    c

    d

    a

    b

    0

    1

    2

    3