Mining Usage Patterns from a Repository of Scientific Workflow

Mining Usage Patterns from a Repositoryof Scientific Workflows

Claudia Diamantini, Domenico Potena, Emanuele Storti

DII, Universita Politecnica delle Marche, Ancona, Italy

SAC 2012, Riva del Garda, March 26-30

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 1 / 21

Introduction

Process mining techniques allow for extracting knowledge from eventlogs, i.e. traces of running processes:

audit trails of a workflow management systemtransaction logs of an enterprise resource planning system

Main applications:process discovery: what is really happening?conformance check: are we doing what was agreed upon?process extension: how can we redesign the process?

Introduction

Motivation

Little work about how to extract knowledge from a set of models (i.e.:process schemas)

understanding of common patterns of usage (best/worstpractices)support in process designprocess integration (from multiple sources)process optimization/re-engineeringindexing of processes and their retrieval

Methodology

Processes schemas have an inherent graph structure: manygraph-mining tools available

Approach1 representation of original processes into graphs2 usage of graph-mining techniques to extract SUBs

Which representation form?Which specific technique?Graph clustering techniques (sub-processes are clusters) with ahierarchical approach (various levels of abstractions, more/lessspecificity)

Methodology

SUBDUE

SUBDUE [Joyner, 2000]graph-based hierarchical clustering algorithmsuited for discrete-valued and structured datasearches for substructures (i.e., subgraphs) that best compressthe input graph, according to MDL

Comments:to compress G with a SUB: to replace every occurrence of SUB inG with a single new nodeSUB best compresses G: num. of bits needed to represent G afterthe compression is minimum

SUBDUE

Comments:to compress G with a SUB: to replace every occurrence of SUBin G with a single new nodeSUB best compresses G: num. of bits needed to represent G afterthe compression is minimum

SUBDUE

Comments:to compress G with a SUB: to replace every occurrence of SUB inG with a single new nodeSUB best compresses G: num. of bits needed to represent Gafter the compression is minimum

SUBDUE (cont’d)

1 searches for the best SUB (i.e., that minimizes the DescriptionLength of the graph)

2 compresses the graph by using the best SUB

Subsequent SUBs may be defined in terms of previously definedSUBs. Iterations of this basic step results in a lattice

SUBDUE (cont’d)

Traditional cluster evaluation is not applicable to hierarchic domains

quality = interClusterDistanceintraClusterDistance

Desirable features of a good clustering:few and big clusters (large coverage, good generality)minimal/no overlap (better defined concepts)

New indexescompleteness: % of original nodes/edges still present in the finallatticerepresentativeness (of a SUB): % of input graphs holding theSUB at least oncesignificance: qualitative measure of the meaningfulness of acluster w.r.t. the domain and application

SUBDUE (cont’d)

Representation

Common operators:sequencesplit-AND (parallel split), split-XOR (exclusive choice)join-AND (synchronization), join-XOR (simple merge)

How to map the process to the graph? Several choicesEmanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 8 / 21

Representation models: A rule

lowest level of compactnessevery operator is represented by a nodecalled operator, linked to another nodespecifying its type (AND/OR/SEQ) and toits operandsjoin/split are distinguishable by thenumber of in/out-going arcs

Representation models: B rule

medium level of compactnessthe closest to the original representationjoin/split are replaced by different nodes,one for each kind of operator

Representation models: C rule

highest level of compactnessimplicit representation of operatorsmultiple alternative graphs in case ofSPLIT-XOR or JOIN-XORarcs are labeled to preserve informationabout domain/range nodes

Evaluation

Commentsthe three representations hold the same information, with no losseasy extension with more attributes (actor name/role, info abouttime/resources, ...):

attribute name → edgevalue → a node

Need of experimentations to assess which one performs best w.r.t.the quality of clustering.

Evaluation

myExperiment

Dataset:258 processes (XScufl language for Taverna framework)e-Science WF for distributed computation over scientific dataapplications: local tools, Web Services, scripts

Setting:WF extraction/parsingpreprocessingtranslation into graphs (A,B,Crules)SUBDUE execution

Experimental results

Output lattice (fragment)

Red boxes = top level SUBs

A rule B rule C ruleSize (nodes) 5618 2871 2414SUBs (num) 671 611 580top SUBs 2.24% 52.37% 52.93%Completeness 95.41% 90.90% 90.27%Representativeness (first 10 top SUBs) 99.22% 31.78% 23.26%Significance of top level clusters – + ++Execution time (hh:mm) 03:38 00:06 00:03

Considerations:A model: low significance, high execution timeNo definitively best choice between B and C:

B model: higher representativeness → indexingC model: higher significance → usage pattern discovery

Use Case

Scenario: a scientist wants to use web service X for DNA alignment

Baseline: browse myExperiment to find every workflow with X:too many resultsgiven a workflow: common practice or exception?

New approach: search for X in the lattice’s SUBs:1 find all usage patterns for X (i.e., top SUBs containing X)2 analyse their representativeness3 choose one of them or browse more in depth the lattice

Use Case

Use case (cont’d)

Applications:reference during process design (learn-by-example)process retrieval: find every myExperiment WF containing such apattern

Use case (cont’d)

Applications:reference during process design (learn-by-example)process retrieval: find every myExperiment WF containing such apattern

Discussion

A graph-based clustering technique (SUBDUE) to recognize the mostfrequent/common subprocesses among a set of workflows/processschemas.

Commentsseveral representation alternatives for application of SUBDUEevaluations on specialized e-Science domainuseful to find typical patterns and schemas of usage for tools/tasks

Conclusion

Applications:recognition of reference processes (common/best/worstpractices)organize a process repository (by indexing processes throughsubstructures)enterprise integration: find similarities (differences, overlapping,complementariness) in BP in different companies

Future work:extending experimentations, especially with (real) BusinessProcessesmanagement of heterogeneity (syntactic/semantic, level ofgranularity)

Mining Usage Patterns from a Repositoryof Scientific Workflows

Claudia Diamantini, Domenico Potena, Emanuele Storti

DII, Universita Politecnica delle Marche, Ancona, Italy

SAC 2012, Riva del Garda, March 26-30

SUBDUE

Subdue uses a variant of beam search (heuristic)Best SUB minimizes the Description Length of the graph

1 Search the best SUBInit : set of SUBs consisting of all uniquely labeled verticesextend each SUB in all possible ways by a single edge and a vertexorder SUBs according to MDL principlekeep only the top n SUBsEnd : exhaustion of search space or user constraints

SUBDUE

2 Compress the graph with the best SUB

SUBDUE

Subsequent SUBs may be defined in terms of previously defined SUBs

Iterations of this basic step results in a lattice

Mining Usage Patterns from a Repository of Scientific Workflow

Technology

Creating Schemas with the Repository Creation Utility The Creating Schemas with the Repository Creation Utility book contains overview information and usage instructions for Oracle

World Digital Library OSI | WEB SERVICES User Response and Development Priorities User comments Website usage statistics Production workflow

Extending the OAuth2 Workflow to Audit Data Usage for ... · Extending the OAuth2 Workflow to Audit Data Usage for Users and Service Providers in a Cooperative Scenario Marius Politze,

WMS: Democratizing Data WMS Overview Grace Agnew WMS Workflow Management System for Rucore Rutgers Community Repository

RepoMMan: using Web Services and BPEL to facilitate workflow interaction with a digital repository

OpenAIRE Metrics Service: Usage Statistics (webinar for repository managers)

Repository Creation Utility User’s Guide 11g …ix Preface The Oracle Fusion Middleware Repository Creation Utility User's Guide contains overview information and usage instructions

Institutional Repository Usage Statistics · 2019-03-28 · Background Jisc began work in repository usage statistics area in 2009 –with requirements gathering and feasibility testing

Eclipse JWT – Java Workflow Tooling · Eclipse JWT – Java Workflow Tooling Title of this document Workflow Editor (WE): Installation and Usage Tutorial Document information last

Embedding the repository in the researchers workflow Martin van Luijt Head of Innovation and Development

SAPTips: Workflow Troubleshooting and Debugging › wp-content › uploads › 2009 › 08 › workflow... · some object orientated concepts (use of the Business Object Repository),

Secure Workflow Repository for Askalon

Staffing and Workflow of a Maturing Institutional Repository

Improving workflow and resource usage in construction ... · construction schedules through location-based management system (LBMS) ... Improving workflow and resource usage in construction

IMS 12 System and Repository 8573 - SHARE · IMS 12 System Enhancements and the IMS Repository ... • DRD usage of the IMS repository function ... routing codes & descriptors

“ABAP Workbench Usage” Contents Index The Authors · In this chapter, authors Puneet Asthana and David Haslam review ABAP Workbench usage, including the repository browser,

Workflow Usage Best Practices 1226349559334573 8

Niso nasig usage webinar op-usage in the workflow

Workflow Guide P-Card Workflow - tamuk.edutamuk.edu/ssgs/Forms/LFWFGuidePCard.pdf · Workflow Guide – P-Card Workflow . OPEN LASERFICHE CLIENT OR . 1. CLICK ON THE TAMUK REPOSITORY

Oracle Fusion Middleware · The Creating Schemas with the Repository Creation Utility book contains overview information and usage instructions for Oracle Repository Creation Utility