39
Advisor: Prof.ssa Claudia Diamantini Curriculum supervisor: Prof. Sauro Longhi Università Politecnica delle Marche Scuola di dottorato in Scienze dell’Ingegneria Curriculum in Ingegneria Informatica, Gestionale e dell’Automazione 28 Febbraio 2012 rocess Design in Collaborative and Distributed Environments Emanuele Storti

Kdd Process Design in Collaborative and Distributed Environments

Embed Size (px)

DESCRIPTION

Full thesis (open access): http://openarchive.univpm.it/jspui/handle/123456789/494 Abstract ------------- Knowledge Discovery in Databases (KDD), as well as scientific experimentation in e-Science, is a complex and computationally intensive process aimed at gaining knowledge from a huge set of data. Often performed in distributed settings, KDD projects usually involve a deep interaction among heterogeneous tools and several users with specific expertise. Given the high complexity of the process, such users need effective support to achieve their goal of knowledge extraction. This work presents the Knowledge Discovery in Databases Virtual Mart (KDDVM), a user- and knowledge-centric framework aimed at supporting the design of KDD processes in a highly distributed and collaborative scenario, in which computational resources and actors dynamically interoperate to share and elaborate knowledge. The contribution of the work is two-fold: firstly, a conceptual systematization of the relevant knowledge is provided, with the aim to formalize, through semantic technologies, each element taking part in the design and execution of a KDD process, including computational resources, data and actors; secondly, we propose an implementation of the framework as an open, modular and extensible Service-Oriented platform, in which several services are available both to perform basic operations of data manipulations and to support more advanced functionalities. Among them, the management of deployment/activation of computational resources, service discovery and their composition to build KDD processes. Since the cooperative design and execution of a distributed KDD process typically require several skills, both technical and managerial, collaboration can easily become a source of complexity if not supported by any kind of coordination. For such reasons, a set of functionalities of the platform is specifically addressed to support collaboration within a distributed team, by providing an environment in which users can work on the same project and share processes, results and ideas.

Citation preview

Page 1: Kdd Process Design in Collaborative and Distributed Environments

Advisor: Prof.ssa Claudia DiamantiniCurriculum supervisor: Prof. Sauro Longhi

Università Politecnica delle MarcheScuola di dottorato in Scienze dell’IngegneriaCurriculum in Ingegneria Informatica, Gestionale e dell’Automazione

28 Febbraio 2012

KDD Process Design in Collaborative and Distributed Environments

Emanuele Storti

Page 2: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Summary

I. Introduction• Background & Motivation• Related Work• Research Question

II. ApproachIII. Knowledge LayerIV. Platform

• Service Discovery• Process Composition• Collaboration features• Formal Verification of processes

V. Conclusion

Page 3: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Introduction

“Knowledge is the common wealth of humanity”*

Access to vast collections of data is the key to understanding and responding to complex problems

* A. Samassekou, UN World Summit on the Information Society

Growing capability in data production: Data Deluge• availability of cost-effective technologies• enhancement of communication infrastructures

• global problems ask for data-intensive computation (genomics, physics, climatology)• global effort through world-wide scientific collaboration

Need of tools for supporting distribution and collaboration in data analysis

Not only organizations, also e-Science

1

Page 4: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Background

Knowledge Discovery in Databases (KDD)Process of identifying valid, novel, potentially useful patterns in data [Fayyad, 1996]

• knowledge refinement• many steps, several iterations• strong user interaction • need of user knowledge

Support to:• decision making in organizations• e-Science experimentations

2

Page 5: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Related Work

Early proposals (Intelligent Data Analysis systems):

Distribution of tools & computational aspects:

• local frameworks• single-user• predefined set of tools (little extensibility)

Distribution of users & evolution of organizations:

• SOA [Ali, 2005] [Kumar, 2005]• Grid [Kgrid, 2003]• OGSA: SOA + Grid

• user support in advanced activities [Bernstein, 2005] [Žáková, 2011]• collaboration [myExperiment, 2009]

3

Page 6: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Motivation

Main issues in open, distributed and collaborative scenarios

• Localization of many available distributed tools

• Integration of heterogeneous tools (interfaces, programming languages, OSs, transfer protocols,…)

• Complexity in their usage (data preparation, I/O interpretation, precondition

satisfaction, process design), many possible combinations

• Coordination in cooperative work (source of complexity)

4

Page 7: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Research Question

How to support a community of users in the design of a KDD project in an open, heterogeneous, distributed and collaborative scenario?

1. Which kind of support and functionalities should a platform offer?2. Which principles should the platform be based on?3. Which resources are involved in a KDD process and how to

represent them?

5

Page 8: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Functional Requirements

1. Support functionalities that should be offered by the platform:

• importing new heterogeneous tools for several KDD tasks• retrieving tools useful for certain purposes• designing a process by connecting tools together

• understanding whether such connections are meaningful• suggesting tool’s sequences effective for a given goal

• execution of the process• supporting collaboration all through the design process

• co-operative design• communication among team’s members

6

Page 9: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Non-functional Requirements

2. Principles on which the platform should be based

Requirements • Interoperability• Flexibility• Modularity• Reusability• Transparency• Usability

Main issues • heterogeneity• complexity• distribution• coordination

7

Page 10: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Resources in a KDD project

3. Identification of resources involved in a KDD process

• Computational units (gathering, preparation, modeling, visualization, …)• Data and Models (dataset, input parameters, intermediate results, final model)• Actors (domain experts, DB/DWH administrators, DM and KDD experts) • Computational processes

8

Page 11: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Knowledge- and user-centric approach

Collaborative functionalities• co-operation• communication facilities

Knowledge LayerData model: systematization of knowledge regarding each KDD resource, from both a theoretical and an operational perspective

Basic ServicesServices for every KDD phase. • tools are wrapped as services, • deployed on a server,• published in a common repository

Support Services• back-end functionalities• advanced functionalities

Service-Oriented platform

tool servicerunning as

9

Page 12: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Conceptual: abstract representation

Knowledge Layer

Methodology • degrees of abstraction for description of resources• semantic technologies for their representation

• formal ontology building methodology [Fernandez, 1997] [Staab, 2001]• quality requirements [Gruber, 1995]

Domain ontologies

Concrete: actual resources of the platform

Execution: logs and traces

Semantically annotated XML descriptors

10

Page 13: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Knowledge Layer > Computational units

tool

service

algorithm

running as

implemented as

KDDONTO

Algorithm

Task

Method

Phase

Data

Performance Index

• algorithms arranged in a taxonomy• method, task, KDD phase, performances• data arranged in a taxonomy• algorithms’ interfaces (I/O, pre/post-conditions)• …

Implemented in OWL-DL (ALCOIF expressivity)

[IDA 2009] [ECML 2009]

11

Page 14: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Knowledge Layer > Computational units

SAWSDL: Semantic annotations for WSDL (w3c)

Descriptors are stored in a UDDI registry

<location><I/O><algorithm><performance><author>

eSAWSDL

• location (<xmls:impl>)• I/O <message>: type, syntax

Further extensions:• implemented <algorithm>• <performance>• <author>• …

tool

service

algorithm

running as

implemented as

UDDI

Algorithm alg

SERVICE

WSDL urlQos val

12

Page 15: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Knowledge Layer > Computational units

Mappings between abstraction levels

Labeled Dataset

abc

C4.5algorithm

Remove missing valuesalgorithm

KDDONTO

• KDDONTO provides a shared vocabulary which services refer to• such mappings can support process composition

<location><output><algorithm><performance><author>

<location><input><algorithm><performance><author>

eSAWSDL2eSAWSDL1

13

Page 16: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Knowledge Layer > Actors

TeamONTO is a formal ontology devoted to represent details of actors [CTS 2012]

Person

Project

Algorithm

WebService

Publication

Domain

Organization

about

memberOf

worksIn

writes

hasSkill

hasSkill

hasSkill

TeamONTO

14

Page 17: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Knowledge Layer > Processes

<location><I/O><algorithm><performance><author>

eSAWSDL 1

<location><I/O><algorithm><performance><author>

eSAWSDL 2

<location><I/O><algorithm><performance><author>

eSAWSDL 3

<location><I/O><algorithm><performance><author>

eSAWSDL 4

XML-descriptor for concrete level processes• <services> included in the process• <connection> among their I/O interfaces• <users> in charge of settings service parameters• metadata: creation <date>/<time>, <author>, <comments>

Process descriptors are stored in a Process Repository

ProcRep index

project

15

Page 18: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Knowledge Layer > Processes

ProcRepindex

• extracts the most frequent and common subprocesses• arrange them in a lattice

Process Repository indexing method [SEBD 2011][SAC 2012]

Application of graph-based hierarchical clustering [Jonyer, 2002]

hierarchical clustering

16

Page 19: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Knowledge Layer > Processes

ProcRepUser query index

graph matching

• extracts the most frequent and common subprocesses• arrange them in a lattice

Process Repository indexing method [SEBD 2011][SAC 2012]

Application of graph-based hierarchical clustering [Jonyer, 2002]

Support for process retrieval: subgraph isomorphism on the index is more efficient than on the whole repository

16

Page 20: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Knowledge Layer > Overview

KDDONTO

ProcRep index

eSAWSDL 1

Process

Person

Project

Algorithm

WebService

Algorithm

WebServiceDataPerformance

<location><I/O><algorithm><performance><author>

UDDI

Algorithm alg

SERVICE

WSDL urlQos val

TeamONTO

eSAWSDL 2

<location><I/O><algorithm><performance><author>

project

Advantages: loose-coupling, modularity, reusability, support to advanced functions

17

Page 21: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

KDDVM Platform

Service-oriented platform for KDD process design

support services

collaborative functionalities

[CTS 2010][CTS 2011][IFS]

18

Page 22: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

KDDVM Platform > KDDDesigner

• integration point of other support services• platform front-end

19

Page 23: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

KDDONTOAlgorithm

GoalData

UDDI

Algorithm

SERVICE

WSDL urlQos val

alg

Retrieval of KDD services satisfying user requirements• syntactic service discovery (service name)• semantic service discovery (functionalities, goal, interface,…)

1. Search, in KDDONTO, for algorithm satisfying the requirements2. Search, inside UDDI, for services implementing such algorithms

<location><I/O><algorithm><performance><author>

eSAWSDL

Service Discovery

20

Page 24: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Service Discovery

Retrieval of KDD services satisfying user requirements• syntactic service discovery (service name)• semantic service discovery (functionalities, goal, interface,…)

1. Search, in KDDONTO, for algorithm satisfying the requirements2. Search, inside UDDI, for services implementing such algorithms

21

Page 25: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Process Composition

Verification of I/O data compatibility

22

Page 26: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Process Composition: matchmaking

Checking the validity of the match [IDA 2009] [ITAIS 2009] [ECAI 2010]

abc

KDDONTO

Same format?Same syntax?

• syntactic compatibility: comparison between service descriptors

<location><I/O><algorithm><performance><author>

<location><I/O><algorithm><performance><author>

eSAWSDL1 eSAWSDL2

23

Page 27: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Process Composition: matchmaking

Checking the validity of the match [IDA 2009] [ITAIS 2009] [ECAI 2010]

abc

data2data1

KDDONTO

same concept?subconcept?part-of concept?

Output:match cost

match evaluationfunction:

• syntactic compatibility: comparison between service descriptors• semantic compatibility: comparison between ontological annotations of the services (kind of match between I/O, preconditions/postconditions...)

<location><I/O><algorithm><performance><author>

<location><I/O><algorithm><performance><author>

eSAWSDL1 eSAWSDL2• kind of match• preconditions• etc…

23

Page 28: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Process Composition: matchmaking

Usage of Matchmaking:

• provide the user with information about the validity of the match• find all services compatible with a given one• support an advanced functionality for semi-automatic process

composition

24

Page 29: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Process Composition: semi-automatic procedure

Procedure to generate conceptual processes [IDA 2009] [ITAIS 2009] [ECAI 2010]

• planning technique: goal-driven, backwards strategy • basic step based on matchmaking• pruning criteria, stop criteria, user constraints• results: abstract processes useful as templates, ranked according to a metric

25

Page 30: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Collaborative Features

Collaborative features for process design within a team 1. Team building: to find users with certain skills & build a team2. Multi-user Process design and Versioning3. Communication4. Task assignment

Retrieval of users with certain competencies, inside TeamONTO

[CTS 2011]

26

Page 31: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Collaborative Features

Collaborative features for process design within a team1. Team building: to find users with certain skills & build a team2. Multi-user Process design and Versioning3. Communication4. Task assignment

Asynchronous edit of a process, and its storage as a new version in the repository

[CTS 2011]

26

Page 32: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Collaborative Features

Collaborative features for process design within a team1. Team building: to find users with certain skills & build a team2. Multi-user Process design and Versioning3. Communication4. Task assignment

Talk page & comments attached to versions

[CTS 2011]

26

Page 33: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Collaborative Features

Collaborative features for process design within a team1. Team building: to find users with certain skills & build a team2. Multi-user Process design and Versioning3. Communication4. Task assignment

Assignment to a user of the team the parameter setting for the service execution

[CTS 2011]

26

Page 34: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Formal Verification of Processes

1. Process representation through REO (CWI, Amsterdam) [Arbab, 2004]

• language for representation of interaction among services• formal and compositional semantics

2. Specification of the desider behavior through mCRL2 [Groote, 2006]

3. Model-checking to verify whether the process satisfies the behavior

27

Page 35: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Formal Verification of Processes

1. Process representation through REO (CWI, Amsterdam) [Arbab, 2004]

• language for representation of interaction among services• formal and compositional semantics

2. Specification of the desider behavior through mCRL2 [Groote, 2006]

3. Model-checking to verify whether the process satisfies the behavior

27

Page 36: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Formal Verification of Processes

1. Process representation through REO (CWI, Amsterdam) [Arbab, 2004]

• language for representation of interaction among services• formal and compositional semantics

2. Specification of the desider behavior through mCRL2 [Groote, 2006]

3. Model-checking to verify whether the process satisfies the behavior

Purposes [SEBD 2010]

• formal verification of KDD processes at design-time• reuse of typical pre-verified KDD subprocesses

27

Page 37: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Conclusion

KDD process design in distributed and collaborative environments

• Distributed and heterogeneous settings:• systematic use of semantic information all through the process• loosely-coupled service architecture

• Community-centered approach:• support functionalities • users with various degrees of experience

• General-purpose attitude:• domain-indipendence• generalization to experimental processes in e-Science

28

Page 38: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

References

[Fayyad, 1996] “From data mining to knowledge discovery: an overview”, American Association for Artificial Intelligence.

[Ali, 2005] “Web services composition for distributed data mining”, Parallel Processing[Kgrid, 2003] “The knowledge grid”, Comm. ACM[Bernstein, 2005] “Towards Intelligent Assistance for a Data Mining Process: An Ontology

Based Approach for Cost-Sensitive Classification”, IEEE TKDE [Žáková, 2011] “Automating Knowledge Discovery Workflow Composition Through

Ontology-Based Planning”, IEEE T. Automation Science and Engineering[myExperiment, 2006] “The Design and Realisation of the Virtual Research Environment

for Social Sharing of Workflows”, FGCS[Fernandez, 1997] “METHONTOLOGY: from Ontological Art towards Ontological

Engineering”, Proc. of the AAAI97 Spring Symposium.[Staab, 2001] “Knowledge Processes and Ontologies”, IEEE Intelligent Systems.[Gruber, 1995] “Toward principles for the design of ontologies used for knowledge

sharing”, Int. J. Hum.-Comput. Stud.[Jonyer, 2002] “Graph-based hierarchical conceptual clustering”, J. Mach. Learn. Res.[Groote, 2006] “The formal specification language mcrl2”, in MMOSS.[Arbab, 2004] “Reo: a channel-based coordination model for component composition”,

Mathematical. Structures in Comp. Sci.

Page 39: Kdd Process Design in Collaborative and Distributed Environments

Emanuele Storti04/14/23 KDD Process Design in Collaborative and Distributed Environments

Publications

[IDA 2009] “Ontology-driven KDD Process Composition”. LNCS, Springer, 2009.[ECML 2009] “KDDONTO: an Ontology for Discovery and Composition of KDD

Algorithms”. ECML/PKDD09 Workshop.[ITAIS 2009] “Automatic Definition of KDD Prototype Process by Composition”.

“Management of Interconnected World”, Springer.[CTS 2010] “Semantic-Driven Design and Management of KDD Processes”. CTS 2010,

IEEE.[SEBD 2010] “Towards Coordination Patterns for Complex Experimentations in Data

Mining”. SEBD 2010.[ECAI 2010] “Supporting Users in KDD Process Design. A Semantic Similarity Matching

Approach”. Planning To Learn Workshop in ECAI2010.[CTS 2011] “Semantic-Aided Designer for Knowledge Discovery”. CTS 2011, IEEE.[SEBD 2011] “Clustering of process schema by graph mining techniques”. SEBD 2011.[SAC 2012] “Mining Usage Patterns from a Repository of Scientific Workflows”. SAC2012,

ACM.[CTS 2012] “Semantically-supported Team Building in a KDD Virtual Environment”. CTS

2012, IEEE.[ISF] “A Virtual Mart for Knowledge Discovery in Databases”. Information System

Frontiers, Springer, accepted.