Towards New Models and Languages for Data Mining and Integration

Peter Brezany

Institute of Scientific Computing, University of Vienna, Austria

Presentation at the NeSC, Edinburgh, August 13, 2008



Outline

- Introduction
- CRISP-DM Model and Methodology
  - What is CRISP-DM
  - Why update it
  - From CRISP-DM to CRISP-DMI
- Impact of CRISP-DMI on the DMI Workflow Language
  - State of the Art in Language Design
  - Discussion of the First Language Design Ideas
- Conclusions and Future Work


What is CRISP-DM?

Phases of the CRoss Industry Standard Process for Data Mining


CRISP-DM Phases

Business Understanding: the process of understanding the project objectives from a business perspective

Data Understanding: the process of collecting and becoming familiar with data

Data Preparation: the process of selecting and cleansing the data that will be fed into the modeling tools

Modeling: the process of applying modeling techniques to the prepared data so that conclusions can be drawn

Evaluation: the process of evaluating the model and its conclusions

Deployment: the process of applying the conclusions to a business


Why Update CRISP-DM?

- Support for large-scale data mining
  - many distributed, heterogeneous and large datasets (primary data, derived data, background data, catalogs): from data to a “space of data”
  - data integration is of great importance
  - new actors (domain expert, data analyst, data publisher, system administrator)
  - support by new components (e.g. provenance)
  - etc.

Our approach: from CRISP-DM to CRISP-DMI (Cross Research & Industry Standard Process for Data Mining and Integration)


CRISP-DMI Model


Space of Data and Services

Author: Ibrahim Elsayed


TCM Workflow


Subworkflow Targeted by Provenance


Visualization of Provenance Data

Authors: Y. Han & F.A. Khan


Use case

The fields in the data are:

- Age
- Sex: M or F
- BP: blood pressure (High, Normal, or Low)
- Cholesterol: blood cholesterol level (Normal or High)
- Na: blood sodium concentration
- K: blood potassium concentration
- Drug: the drug to which this patient responded

The business question: Can we find which drug is appropriate for any future patient?

(from P. Caron, C. Shearer, Interactive Visual Workflow: The Key to Streamlining the Data Mining Process)
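The record layout of the use case can be written down as a small typed structure. The sketch below is only an illustration in Python and is not part of the presentation: the class name `PatientRecord`, the attribute names, the numeric types chosen for Age, Na and K, and the sample values are all assumptions.

```python
from dataclasses import dataclass

# Allowed category values, taken from the field list of the use case.
SEX = {"M", "F"}
BP = {"High", "Normal", "Low"}
CHOLESTEROL = {"Normal", "High"}


@dataclass(frozen=True)
class PatientRecord:
    """One row of the use-case data (names and types are assumptions)."""
    age: int
    sex: str
    bp: str
    cholesterol: str
    na: float   # blood sodium concentration
    k: float    # blood potassium concentration
    drug: str   # the drug this patient responded to (the mining target)

    def __post_init__(self):
        # Reject values outside the categories listed in the use case.
        if self.sex not in SEX:
            raise ValueError(f"unknown sex: {self.sex!r}")
        if self.bp not in BP:
            raise ValueError(f"unknown BP level: {self.bp!r}")
        if self.cholesterol not in CHOLESTEROL:
            raise ValueError(f"unknown cholesterol level: {self.cholesterol!r}")


# Hypothetical record, made up purely to show the structure.
example = PatientRecord(age=40, sex="F", bp="Normal", cholesterol="High",
                        na=0.7, k=0.05, drug="drugX")
```

In these terms, the business question amounts to predicting `drug` from the other six fields for a record whose `drug` is not yet known.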


DmiFlow: DMI Workflow Language

Emerging DMI applications create demand for a powerful DMI workflow language

On top of it interactive GUIs can be developed

It should enable optimized implementation of language processors


DMI Process to be Composed by DmiFlow

[Figure: composition of a DMI process. Labels: Space of Source and Destination Data and Services; DMI Process; Composition]


A Possible Position of DmiFlow in the Workflow Management Systems

[Figure: position of DmiFlow among workflow representations. Labels: User; Visualisation; UML; High-level workflow composition (e1, e2, e3); Textual representation; Sub-workflow for e2 (e21, e22, e23, e24); BPEL or other language; System support; Feedback; User-relevant information flow; System-relevant information flow]


Principles for DMI Language Design

Programmer Responsibilities

- Identification of parallelism
- Specifying the communication mode between workflow components
- Providing hints (sometimes based on domain knowledge) enabling advanced optimization

Language Desiderata

- High abstraction level, not too complex (high productivity)
- Advanced compositional features
- Execution of data mining queries (support for the inductive database model)
- Extensibility
- Efficient implementation (high performance)


Related Work

- Low-level workflow notations:
  - XML-based: BPEL4WS, DSCL, WSFL, etc.
  - Other: Scufl (Taverna), MoML (Kepler), etc.
- High-level languages (only for workflows integrating business processes):
  - Workflow Prolog
  - Valmont: includes a process model, an information model, and an organization model (which registers organizational structure and resources)
  - C & Co: a C-based language
  - F#: functional workflow specification at a script level (Microsoft development)
  - Martlet: functional workflow specification
- Compositional languages (Strand, PCN, etc.)


Workplan for the Language Design

Phase 1 (ongoing): proposing semantic structure and outlining compositional structure of programs while leaving open some aspects of their concrete representations as strings of symbols.

Phase 2: finalizing the first version of the language definition.


Basic Features of DmiFlow

- Code modules: managing complexity
- Activities: their types, parameters, locations
- Virtual communication channels between activities, which can be represented by
  - Persistent explicit datasets
  - Internal datasets (implementation dependent)
  - Ports used for streaming data
- Control structures: parallel and sequential statements, loop statements, conditional statements
- Embedded data mining query execution


Declaration of Activities and Datasets

activity activity_name: ActivityType at (activity_location);

ActivityType – predefined (type of parameters and semantics)

activity_location ∊ {url, discover, default}; the location is optional

dataset dataset_name represents (source = source_spec, hints_list);

source_spec ∊ {url, internal, port}

hint ∊ {org = dataset_organization, size = estimated_size, …}

dataset_organization ∊ {set, sequence, bag, …}
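The two declaration forms can be mirrored as plain data structures, which may help when thinking about what a DmiFlow processor has to record. The Python sketch below is an assumption-laden illustration, not a DmiFlow implementation: the class names, the `location_kind` and `source_kind` helpers, and the choice to treat an omitted location as `default` are all hypothetical.

```python
from dataclasses import dataclass, field

# Keywords from the declaration syntax; anything else is treated as a URL.
LOCATION_KEYWORDS = {"discover", "default"}
SOURCE_KEYWORDS = {"internal", "port"}


@dataclass
class Activity:
    name: str
    activity_type: str
    # The location is optional; falling back to "default" is an assumption.
    location: str = "default"

    def location_kind(self) -> str:
        """Classify the location as 'discover', 'default', or 'url'."""
        return self.location if self.location in LOCATION_KEYWORDS else "url"


@dataclass
class Dataset:
    name: str
    source: str
    # Hints such as org or size; the hint list is open-ended, so no validation.
    hints: dict = field(default_factory=dict)

    def source_kind(self) -> str:
        """Classify the source as 'internal', 'port', or 'url'."""
        return self.source if self.source in SOURCE_KEYWORDS else "url"


# Declarations modelled on the DmiFlow example in this presentation.
dt = Activity("dt", "decisionTreeActType",
              "/serverB/dmi/services/decisionTreeService1")
normalise = Activity("normalise", "NormalisForNNActType")
cleaned = Dataset("cleaned", "internal", {"org": "set"})
```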


Basic Control Structures

Concurrent execution:

    cobegin {
        activity_1(…);
        …
        activity_n(…);
    }

Sequential execution:

    block {
        activity_1(…);
        …
        activity_n(…);
    }

Data mining query execution:

    exec dmq (arguments) byactivity (activity_name) {
        dmq_query_specification
    }
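The intended difference between `cobegin` and `block` can be sketched with two small Python combinators. This is a hedged illustration of the semantics as described here (all branches of `cobegin` may run at the same time, the steps of `block` run in order); the use of threads, the function names, and the `activity` helper are assumptions and say nothing about how a DmiFlow processor would actually execute workflows.

```python
from concurrent.futures import ThreadPoolExecutor


def block(*steps):
    """Return a step that runs the given steps one after another."""
    def run():
        for step in steps:
            step()
    return run


def cobegin(*steps):
    """Return a step that starts all given steps concurrently and waits for all."""
    def run():
        if not steps:
            return
        with ThreadPoolExecutor(max_workers=len(steps)) as pool:
            futures = [pool.submit(step) for step in steps]
            for future in futures:
                future.result()  # re-raise any exception from a branch
    return run


log = []  # records the order in which the hypothetical activities ran


def activity(name):
    """Hypothetical activity: it only records its own name."""
    return lambda: log.append(name)


workflow = block(
    activity("a"),
    cobegin(activity("b"), activity("c")),
    activity("d"),
)
workflow()
```

After the call, `log` starts with `a` and ends with `d`; the relative order of `b` and `c` is not fixed, which is exactly the freedom `cobegin` grants.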


Workflow Example – Graphical Form


DmiFlow Example (1)

module WorkflowExample {

    const replaceMethod = "average",
          splittingMethod = "gini",   // hint
          url1 = "/serverA/dmi/services/integrationService1",
          url2 = "/serverB/dmi/services/decisionTreeService1",
          url3 = "/serverB/dmi/services/neuralNetworkService3";

    activity integrDS: dataIntegrationActType at (url1),
             missVals: MissingValuesActType at (discover),
             normalise: NormalisForNNActType at (default),
             dt: decisionTreeActType at (url2),
             nn: NeuralNetworkActType at (url3);

dataset ….


DmiFlow Example (2)

    dataset ds1 represents (source = "http://www.myproject/d1.dat", org = set, size = [1.5, 2.0]),
            ds2 represents (source = "http://www.myproject/d2.dat", type = set),
            intConf represents (source = "/server/dmi/config/integr.conf"),
            outIntegr represents (source = internal, org = set),
            cleaned represents (source = internal, org = set),
            normalised represents (source = internal, org = set),
            nnConf represents (source = "/server/dmi/configs/nn.conf"),
            nnMod represents (source = "/server/dmi/models/nn.pmml"),
            dtMod represents (source = "/server/dmi/models/dt.pmml");

defworkflow { . . . }


DmiFlow Example (3)

    defworkflow main () {
        integrDS (in ds1, ds2, intConf; out outIntegr);
        missVals (in outIntegr, replaceMethod; out cleaned);
        cobegin {
            block {
                normalise (in cleaned; out normalised);
                nn (in normalised, nnConf; out nnMod);
            }
            dt (in cleaned, splittingMethod; out dtMod);
        }
    }
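The `in` and `out` lists of the example workflow already determine which activities depend on which. As a rough cross-check, the Python sketch below rebuilds that dependency graph and groups the activities into stages whose members could run concurrently. It uses the activity names from the module's `activity` declaration; the table layout and the function names are assumptions made for this illustration, and nothing here is claimed to be how DmiFlow itself would schedule the workflow.

```python
from graphlib import TopologicalSorter

# activity -> (inputs, outputs), transcribed from the example workflow.
ACTIVITIES = {
    "integrDS":  (["ds1", "ds2", "intConf"], ["outIntegr"]),
    "missVals":  (["outIntegr", "replaceMethod"], ["cleaned"]),
    "normalise": (["cleaned"], ["normalised"]),
    "nn":        (["normalised", "nnConf"], ["nnMod"]),
    "dt":        (["cleaned", "splittingMethod"], ["dtMod"]),
}


def dependency_graph(activities):
    """Map each activity to the set of activities that produce its inputs."""
    producer = {out: name
                for name, (_, outs) in activities.items()
                for out in outs}
    return {name: {producer[i] for i in ins if i in producer}
            for name, (ins, _) in activities.items()}


def execution_stages(activities):
    """Group activities into stages; members of one stage are independent."""
    sorter = TopologicalSorter(dependency_graph(activities))
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(ACTIVITIES)
```

For this workflow the third stage contains both `dt` and `normalise`, which agrees with the explicit `cobegin`: the decision tree branch can run alongside the normalise/neural network branch once `cleaned` is available.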


Future Work

- Extend language functionality
- Investigate the DmiFlow execution model for the ADMIRE architecture
- Define a functional specification of the DmiFlow language processor
- Specify the concrete language syntax