Mining Usage Patterns from a Repository of Scientific Workflow

Preview:

DESCRIPTION

Full paper: http://boole.diiga.univpm.it/paper/sac2012.pdf In many experimental domains, especially e-Science, workflow management systems are gaining increasing attention to design and execute in-silico experiments involving data analysis tools. As a by-product, a repository of workflows is generated, that formally describes experimental protocols and the way different tools are combined inside experiments. In this paper we describe the use of the SUBDUE graph clustering algorithm to discover sub-workflows from a repository. Since sub-workflows represent significant usage patterns of tools, the discovered knowledge can be exploited by scientists to learn by-example about design practices, or to retrieve and reuse workflows. Such a knowledge, ultimately, leverages the potential of scientific workflow repositories to become a knowledge-asset. A set of experiments is conducted on the myExperiment repository to assess the effectiveness of the approach.

Citation preview

Mining Usage Patterns from a Repositoryof Scientific Workflows

Claudia Diamantini, Domenico Potena, Emanuele Storti

DII, Universita Politecnica delle Marche, Ancona, Italy

SAC 2012, Riva del Garda, March 26-30

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 1 / 21

Introduction

Process mining techniques allow for extracting knowledge from eventlogs, i.e. traces of running processes:

audit trails of a workflow management systemtransaction logs of an enterprise resource planning system

Main applications:process discovery: what is really happening?conformance check: are we doing what was agreed upon?process extension: how can we redesign the process?

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 2 / 21

Introduction

Process mining techniques allow for extracting knowledge from eventlogs, i.e. traces of running processes:

audit trails of a workflow management systemtransaction logs of an enterprise resource planning system

Main applications:process discovery: what is really happening?conformance check: are we doing what was agreed upon?process extension: how can we redesign the process?

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 2 / 21

Introduction

Process mining techniques allow for extracting knowledge from eventlogs, i.e. traces of running processes:

audit trails of a workflow management systemtransaction logs of an enterprise resource planning system

Main applications:process discovery: what is really happening?conformance check: are we doing what was agreed upon?process extension: how can we redesign the process?

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 2 / 21

Motivation

Little work about how to extract knowledge from a set of models (i.e.:process schemas)

understanding of common patterns of usage (best/worstpractices)support in process designprocess integration (from multiple sources)process optimization/re-engineeringindexing of processes and their retrieval

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 3 / 21

Methodology

Processes schemas have an inherent graph structure: manygraph-mining tools available

Approach1 representation of original processes into graphs2 usage of graph-mining techniques to extract SUBs

Which representation form?Which specific technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecificity)

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21

Methodology

Processes schemas have an inherent graph structure: manygraph-mining tools available

Approach1 representation of original processes into graphs2 usage of graph-mining techniques to extract SUBs

Which representation form?Which specific technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecificity)

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21

Methodology

Processes schemas have an inherent graph structure: manygraph-mining tools available

Approach1 representation of original processes into graphs2 usage of graph-mining techniques to extract SUBs

Which representation form?Which specific technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecificity)

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21

Methodology

Processes schemas have an inherent graph structure: manygraph-mining tools available

Approach1 representation of original processes into graphs2 usage of graph-mining techniques to extract SUBs

Which representation form?Which specific technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecificity)

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 4 / 21

SUBDUE

SUBDUE [Joyner, 2000]graph-based hierarchical clustering algorithmsuited for discrete-valued and structured datasearches for substructures (i.e., subgraphs) that best compressthe input graph, according to MDL

Comments:to compress G with a SUB: to replace every occurrence of SUB inG with a single new nodeSUB best compresses G: num. of bits needed to represent G afterthe compression is minimum

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 5 / 21

SUBDUE

SUBDUE [Joyner, 2000]graph-based hierarchical clustering algorithmsuited for discrete-valued and structured datasearches for substructures (i.e., subgraphs) that best compressthe input graph, according to MDL

Comments:to compress G with a SUB: to replace every occurrence of SUBin G with a single new nodeSUB best compresses G: num. of bits needed to represent G afterthe compression is minimum

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 5 / 21

SUBDUE

SUBDUE [Joyner, 2000]graph-based hierarchical clustering algorithmsuited for discrete-valued and structured datasearches for substructures (i.e., subgraphs) that best compressthe input graph, according to MDL

Comments:to compress G with a SUB: to replace every occurrence of SUB inG with a single new nodeSUB best compresses G: num. of bits needed to represent Gafter the compression is minimum

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 5 / 21

SUBDUE (cont’d)

1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)

2 compresses the graph by using the best SUB

Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21

SUBDUE (cont’d)

1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)

2 compresses the graph by using the best SUB

Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21

SUBDUE (cont’d)

1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)

2 compresses the graph by using the best SUB

Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21

SUBDUE (cont’d)

1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)

2 compresses the graph by using the best SUB

Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21

SUBDUE (cont’d)

1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)

2 compresses the graph by using the best SUB

Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21

SUBDUE (cont’d)

1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)

2 compresses the graph by using the best SUB

Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21

SUBDUE (cont’d)

1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)

2 compresses the graph by using the best SUB

Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 6 / 21

SUBDUE (cont’d)

Traditional cluster evaluation is not applicable to hierarchic domains

quality = interClusterDistanceintraClusterDistance

Desirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better defined concepts)

New indexescompleteness: % of original nodes/edges still present in the finallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignificance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and application

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 7 / 21

SUBDUE (cont’d)

Traditional cluster evaluation is not applicable to hierarchic domains

quality = interClusterDistanceintraClusterDistance

Desirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better defined concepts)

New indexescompleteness: % of original nodes/edges still present in the finallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignificance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and application

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 7 / 21

SUBDUE (cont’d)

Traditional cluster evaluation is not applicable to hierarchic domains

quality = interClusterDistanceintraClusterDistance

Desirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better defined concepts)

New indexescompleteness: % of original nodes/edges still present in the finallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignificance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and application

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 7 / 21

SUBDUE (cont’d)

Traditional cluster evaluation is not applicable to hierarchic domains

quality = interClusterDistanceintraClusterDistance

Desirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better defined concepts)

New indexescompleteness: % of original nodes/edges still present in the finallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignificance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and application

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 7 / 21

Representation

Approach1 representation of original processes into graphs2 usage of graph-mining techniques to extract SUBs

Common operators:sequencesplit-AND (parallel split), split-XOR (exclusive choice)join-AND (synchronization), join-XOR (simple merge)

How to map the process to the graph? Several choicesEmanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 8 / 21

Representation models: A rule

lowest level of compactnessevery operator is represented by a nodecalled operator, linked to another nodespecifying its type (AND/OR/SEQ) and toits operandsjoin/split are distinguishable by thenumber of in/out-going arcs

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 9 / 21

Representation models: B rule

medium level of compactnessthe closest to the original representationjoin/split are replaced by different nodes,one for each kind of operator

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 10 / 21

Representation models: C rule

highest level of compactnessimplicit representation of operatorsmultiple alternative graphs in case ofSPLIT-XOR or JOIN-XORarcs are labeled to preserve informationabout domain/range nodes

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 11 / 21

Evaluation

Commentsthe three representations hold the same information, with no losseasy extension with more attributes (actor name/role, info abouttime/resources, ...):

attribute name → edgevalue → a node

Need of experimentations to assess which one performs best w.r.t.the quality of clustering.

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 12 / 21

Evaluation

Commentsthe three representations hold the same information, with no losseasy extension with more attributes (actor name/role, info abouttime/resources, ...):

attribute name → edgevalue → a node

Need of experimentations to assess which one performs best w.r.t.the quality of clustering.

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 12 / 21

Evaluation

Commentsthe three representations hold the same information, with no losseasy extension with more attributes (actor name/role, info abouttime/resources, ...):

attribute name → edgevalue → a node

Need of experimentations to assess which one performs best w.r.t.the quality of clustering.

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 12 / 21

myExperiment

Dataset:258 processes (XScufl language for Taverna framework)e-Science WF for distributed computation over scientific dataapplications: local tools, Web Services, scripts

Setting:WF extraction/parsingpreprocessingtranslation into graphs (A,B,Crules)SUBDUE execution

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 13 / 21

Experimental results

Output lattice (fragment)

Red boxes = top level SUBs

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 14 / 21

Experimental results

A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03

Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:

B model: higher representativeness → indexingC model: higher significance → usage pattern discovery

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21

Experimental results

A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03

Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:

B model: higher representativeness → indexingC model: higher significance → usage pattern discovery

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21

Experimental results

A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03

Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:

B model: higher representativeness → indexingC model: higher significance → usage pattern discovery

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21

Experimental results

A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03

Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:

B model: higher representativeness → indexingC model: higher significance → usage pattern discovery

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21

Experimental results

A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03

Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:

B model: higher representativeness → indexingC model: higher significance → usage pattern discovery

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21

Experimental results

A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03

Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:

B model: higher representativeness → indexingC model: higher significance → usage pattern discovery

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21

Experimental results

A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03

Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:

B model: higher representativeness → indexingC model: higher significance → usage pattern discovery

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21

Experimental results

A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03

Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:

B model: higher representativeness → indexingC model: higher significance → usage pattern discovery

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 15 / 21

Use Case

Scenario: a scientist wants to use web service X for DNA alignment

Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?

New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21

Use Case

Scenario: a scientist wants to use web service X for DNA alignment

Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?

New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21

Use Case

Scenario: a scientist wants to use web service X for DNA alignment

Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?

New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21

Use Case

Scenario: a scientist wants to use web service X for DNA alignment

Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?

New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21

Use Case

Scenario: a scientist wants to use web service X for DNA alignment

Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?

New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21

Use Case

Scenario: a scientist wants to use web service X for DNA alignment

Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?

New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 16 / 21

Use case (cont’d)

Applications:reference during process design (learn-by-example)process retrieval: find every myExperiment WF containing such apattern

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 17 / 21

Use case (cont’d)

Applications:reference during process design (learn-by-example)process retrieval: find every myExperiment WF containing such apattern

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 17 / 21

Discussion

A graph-based clustering technique (SUBDUE) to recognize the mostfrequent/common subprocesses among a set of workflows/processschemas.

Commentsseveral representation alternatives for application of SUBDUEevaluations on specialized e-Science domainuseful to find typical patterns and schemas of usage for tools/tasks

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 18 / 21

Conclusion

Applications:recognition of reference processes (common/best/worstpractices)organize a process repository (by indexing processes throughsubstructures)enterprise integration: find similarities (differences, overlapping,complementariness) in BP in different companies

Future work:extending experimentations, especially with (real) BusinessProcessesmanagement of heterogeneity (syntactic/semantic, level ofgranularity)

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 19 / 21

Mining Usage Patterns from a Repositoryof Scientific Workflows

Claudia Diamantini, Domenico Potena, Emanuele Storti

DII, Universita Politecnica delle Marche, Ancona, Italy

SAC 2012, Riva del Garda, March 26-30

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 20 / 21

SUBDUE

Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph

1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21

SUBDUE

Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph

1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21

SUBDUE

Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph

1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21

SUBDUE

Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph

1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21

SUBDUE

Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph

1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21

SUBDUE

Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph

1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21

SUBDUE

Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph

1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21

SUBDUE

Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph

1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints

2 Compress the graph with the best SUB

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21

SUBDUE

Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph

1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints

2 Compress the graph with the best SUB

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21

SUBDUE

Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph

1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints

2 Compress the graph with the best SUB

Subsequent SUBs may be defined in terms of previously defined SUBs

Iterations of this basic step results in a lattice

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 21 / 21

Recommended