DESCRIPTION
Full thesis (open access): http://openarchive.univpm.it/jspui/handle/123456789/494
Abstract
-------------
Knowledge Discovery in Databases (KDD), like scientific experimentation in e-Science, is a complex and computationally intensive process aimed at gaining knowledge from a huge set of data. Often performed in distributed settings, KDD projects usually involve deep interaction among heterogeneous tools and several users with specific expertise. Given the high complexity of the process, such users need effective support to achieve their goal of knowledge extraction. This work presents the Knowledge Discovery in Databases Virtual Mart (KDDVM), a user- and knowledge-centric framework aimed at supporting the design of KDD processes in a highly distributed and collaborative scenario, in which computational resources and actors dynamically interoperate to share and elaborate knowledge.
The contribution of the work is two-fold. First, a conceptual systematization of the relevant knowledge is provided, with the aim of formalizing, through semantic technologies, each element taking part in the design and execution of a KDD process, including computational resources, data and actors. Second, we propose an implementation of the framework as an open, modular and extensible Service-Oriented platform, in which several services are available both to perform basic data manipulation operations and to support more advanced functionalities. These include managing the deployment and activation of computational resources, service discovery, and service composition to build KDD processes.
Since the cooperative design and execution of a distributed KDD process typically require several skills, both technical and managerial, collaboration can easily become a source of complexity if it is not supported by some form of coordination. For this reason, a set of platform functionalities is specifically aimed at supporting collaboration within a distributed team, by providing an environment in which users can work on the same project and share processes, results and ideas.
Advisor: Prof. Claudia Diamantini
Curriculum supervisor: Prof. Sauro Longhi
Università Politecnica delle Marche
Doctoral School in Engineering Sciences
Curriculum in Computer, Management and Automation Engineering
28 February 2012
KDD Process Design in Collaborative and Distributed Environments
Emanuele Storti
Emanuele Storti, 04/14/23, KDD Process Design in Collaborative and Distributed Environments
Summary
I. Introduction
• Background & Motivation
• Related Work
• Research Question
II. Approach
III. Knowledge Layer
IV. Platform
• Service Discovery
• Process Composition
• Collaboration features
• Formal Verification of processes
V. Conclusion
Introduction
“Knowledge is the common wealth of humanity”*
Access to vast collections of data is the key to understanding and responding to complex problems
* A. Samassekou, UN World Summit on the Information Society
Growing capability in data production: Data Deluge
• availability of cost-effective technologies
• enhancement of communication infrastructures
Not only organizations, also e-Science
• global problems call for data-intensive computation (genomics, physics, climatology)
• global effort through world-wide scientific collaboration
Need for tools supporting distribution and collaboration in data analysis
Background
Knowledge Discovery in Databases (KDD)
Process of identifying valid, novel, potentially useful patterns in data [Fayyad, 1996]
• knowledge refinement
• many steps, several iterations
• strong user interaction
• need for user knowledge
Support to:
• decision making in organizations
• e-Science experimentation
Related Work
Early proposals (Intelligent Data Analysis systems):
• local frameworks
• single-user
• predefined set of tools (little extensibility)
Distribution of tools & computational aspects:
• SOA [Ali, 2005] [Kumar, 2005]
• Grid [Kgrid, 2003]
• OGSA: SOA + Grid
Distribution of users & evolution of organizations:
• user support in advanced activities [Bernstein, 2005] [Žáková, 2011]
• collaboration [myExperiment, 2009]
Motivation
Main issues in open, distributed and collaborative scenarios
• Localization of many available distributed tools
• Integration of heterogeneous tools (interfaces, programming languages, OSs, transfer protocols,…)
• Complexity in their usage (data preparation, I/O interpretation, precondition satisfaction, process design), many possible combinations
• Coordination in cooperative work (source of complexity)
Research Question
How to support a community of users in the design of a KDD project in an open, heterogeneous, distributed and collaborative scenario?
1. Which kind of support and functionalities should a platform offer?
2. Which principles should the platform be based on?
3. Which resources are involved in a KDD process and how to represent them?
Functional Requirements
1. Support functionalities that should be offered by the platform:
• importing new heterogeneous tools for several KDD tasks
• retrieving tools useful for certain purposes
• designing a process by connecting tools together
   • understanding whether such connections are meaningful
   • suggesting tool sequences effective for a given goal
• executing the process
• supporting collaboration throughout the design process
   • co-operative design
   • communication among team members
Non-functional Requirements
2. Principles on which the platform should be based
Requirements
• Interoperability
• Flexibility
• Modularity
• Reusability
• Transparency
• Usability
Main issues
• heterogeneity
• complexity
• distribution
• coordination
Resources in a KDD project
3. Identification of resources involved in a KDD process
• Computational units (gathering, preparation, modeling, visualization, …)
• Data and Models (dataset, input parameters, intermediate results, final model)
• Actors (domain experts, DB/DWH administrators, DM and KDD experts)
• Computational processes
Knowledge- and user-centric approach
Collaborative functionalities
• co-operation
• communication facilities
Knowledge Layer
Data model: systematization of knowledge regarding each KDD resource, from both a theoretical and an operational perspective
Basic Services
Services for every KDD phase:
• tools are wrapped as services
• deployed on a server
• published in a common repository
Support Services
• back-end functionalities
• advanced functionalities
Service-Oriented platform
[Figure: a tool running as a service]
Knowledge Layer
Methodology
• degrees of abstraction for the description of resources
   • Conceptual: abstract representation (domain ontologies)
   • Concrete: actual resources of the platform (semantically annotated XML descriptors)
   • Execution: logs and traces
• semantic technologies for their representation
   • formal ontology building methodology [Fernandez, 1997] [Staab, 2001]
   • quality requirements [Gruber, 1995]
Knowledge Layer > Computational units
[Figure: an algorithm is implemented as a tool, which runs as a service]
KDDONTO
Main classes: Algorithm, Task, Method, Phase, Data, Performance Index
• algorithms arranged in a taxonomy
• method, task, KDD phase, performances
• data arranged in a taxonomy
• algorithms’ interfaces (I/O, pre/post-conditions)
• …
Implemented in OWL-DL (ALCOIF expressivity)
[IDA 2009] [ECML 2009]
Knowledge Layer > Computational units
SAWSDL: Semantic Annotations for WSDL (W3C)
Descriptors are stored in a UDDI registry
eSAWSDL
• location (<xmls:impl>)
• I/O <message>: type, syntax
Further extensions:
• implemented <algorithm>
• <performance>
• <author>
• …
[Figure: an eSAWSDL descriptor (<location>, <I/O>, <algorithm>, <performance>, <author>) and the corresponding UDDI entry (SERVICE: WSDL url, QoS val, Algorithm alg), shown next to the algorithm / tool / service levels ("implemented as", "running as")]
Knowledge Layer > Computational units
Mappings between abstraction levels
• KDDONTO provides a shared vocabulary which services refer to
• such mappings can support process composition
[Figure: two service descriptors, eSAWSDL1 and eSAWSDL2, implementing the algorithms "Remove missing values" and "C4.5"; the <output> of one and the <input> of the other are mapped to the same KDDONTO concept, Labeled Dataset]
Knowledge Layer > Actors
TeamONTO is a formal ontology devoted to representing details of actors [CTS 2012]
[Figure: TeamONTO classes (Person, Project, Algorithm, WebService, Publication, Domain, Organization) linked by the properties about, memberOf, worksIn, writes and hasSkill]
Knowledge Layer > Processes
[Figure: a project's process connecting four service descriptors, eSAWSDL 1 to eSAWSDL 4 (each with <location>, <I/O>, <algorithm>, <performance>, <author>), and indexed in ProcRep]
XML descriptor for concrete-level processes
• <services> included in the process
• <connection> among their I/O interfaces
• <users> in charge of setting service parameters
• metadata: creation <date>/<time>, <author>, <comments>
Process descriptors are stored in a Process Repository
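A process descriptor of this kind could be assembled as in the minimal sketch below. Only the element names <services>, <connection>, <users>, <date>, <author> and <comments> come from the slide; the root element `process`, all attribute names, and the service and user names are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET


def build_process_descriptor(services, connections, users, author, date, comments=""):
    """Build a hypothetical concrete-level process descriptor as an XML string."""
    root = ET.Element("process")  # root element name is an assumption

    # <services> included in the process
    services_el = ET.SubElement(root, "services")
    for name in services:
        ET.SubElement(services_el, "service", name=name)

    # <connection> among the I/O interfaces of two services
    for (src, output), (dst, input_) in connections:
        ET.SubElement(
            root,
            "connection",
            attrib={"from": src, "output": output, "to": dst, "input": input_},
        )

    # <users> in charge of setting the parameters of a given service
    users_el = ET.SubElement(root, "users")
    for user, service in users.items():
        ET.SubElement(users_el, "user", name=user, service=service)

    # metadata
    ET.SubElement(root, "author").text = author
    ET.SubElement(root, "date").text = date
    ET.SubElement(root, "comments").text = comments
    return ET.tostring(root, encoding="unicode")


descriptor = build_process_descriptor(
    services=["RemoveMissingValues", "C45"],
    connections=[(("RemoveMissingValues", "dataset"), ("C45", "trainingSet"))],
    users={"alice": "C45"},
    author="alice",
    date="2012-02-28",
    comments="first draft",
)
print(descriptor)
```

Keeping connections as explicit elements, separate from the service list, means the same descriptor can be read as a graph (services as nodes, connections as edges), which is the form the repository indexing works on.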
Knowledge Layer > Processes
Process Repository indexing method [SEBD 2011][SAC 2012]
Application of graph-based hierarchical clustering [Jonyer, 2002]
• extracts the most frequent and common subprocesses
• arranges them in a lattice
Support for process retrieval: subgraph isomorphism on the index is more efficient than on the whole repository
[Figure: hierarchical clustering builds an index over the Process Repository (ProcRep); a user query is answered by graph matching against the index]
Knowledge Layer > Overview
[Figure: overview of the Knowledge Layer, linking KDDONTO (Algorithm, Data, Performance), TeamONTO (Person, Project, Algorithm, WebService), the eSAWSDL service descriptors (<location>, <I/O>, <algorithm>, <performance>, <author>) registered in UDDI (SERVICE: WSDL url, QoS val, Algorithm alg), and the processes of a project indexed in ProcRep]
Advantages: loose-coupling, modularity, reusability, support to advanced functions
KDDVM Platform
Service-oriented platform for KDD process design
support services
collaborative functionalities
[CTS 2010][CTS 2011][ISF]
KDDVM Platform > KDDDesigner
• integration point of other support services
• platform front-end
Service Discovery
Retrieval of KDD services satisfying user requirements
• syntactic service discovery (service name)
• semantic service discovery (functionalities, goal, interface, …)
1. Search, in KDDONTO, for algorithms satisfying the requirements
2. Search, inside UDDI, for services implementing such algorithms
[Figure: requirements on Algorithm, Goal and Data are resolved in KDDONTO; services are then looked up in UDDI (SERVICE: WSDL url, QoS val, Algorithm alg) through their eSAWSDL descriptors]
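The two-step semantic discovery can be sketched as follows. This is only an illustration under strong assumptions: plain dictionaries stand in for KDDONTO and for the UDDI registry, and the algorithm annotations, service names and URLs are all hypothetical. The real platform would query an OWL ontology and a registry instead.

```python
# Toy stand-in for KDDONTO: each algorithm is annotated with a task and an input type.
ONTOLOGY = {
    "C4.5": {"task": "classification", "input": "LabeledDataset"},
    "NaiveBayes": {"task": "classification", "input": "LabeledDataset"},
    "KMeans": {"task": "clustering", "input": "UnlabeledDataset"},
}

# Toy stand-in for the UDDI registry: service name -> (implemented algorithm, WSDL URL).
REGISTRY = {
    "c45-service": ("C4.5", "http://example.org/c45?wsdl"),
    "kmeans-service": ("KMeans", "http://example.org/kmeans?wsdl"),
}


def find_algorithms(ontology, **requirements):
    """Step 1: algorithms whose annotations satisfy every requirement."""
    return sorted(
        name
        for name, props in ontology.items()
        if all(props.get(key) == value for key, value in requirements.items())
    )


def find_services(registry, algorithms):
    """Step 2: services that implement one of the given algorithms."""
    wanted = set(algorithms)
    return sorted(
        service for service, (algorithm, _url) in registry.items() if algorithm in wanted
    )


def discover(ontology, registry, **requirements):
    """Semantic discovery: ontology lookup followed by registry lookup."""
    return find_services(registry, find_algorithms(ontology, **requirements))


print(discover(ONTOLOGY, REGISTRY, task="classification"))
```

Separating the two steps means an algorithm can satisfy the requirements even when no service implementing it has been published yet; in the sketch, NaiveBayes matches in step 1 but yields no service in step 2.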
Process Composition
Verification of I/O data compatibility
Process Composition: matchmaking
Checking the validity of the match [IDA 2009] [ITAIS 2009] [ECAI 2010]
• syntactic compatibility: comparison between service descriptors (same format? same syntax?)
• semantic compatibility: comparison between ontological annotations of the services (kind of match between I/O, preconditions/postconditions, ...): same concept? subconcept? part-of concept?
Match evaluation function:
• kind of match
• preconditions
• etc.
Output: match cost
[Figure: data1 and data2, the I/O data of the descriptors eSAWSDL1 and eSAWSDL2, are compared through their KDDONTO annotations]
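A match evaluation of this kind might be sketched as below. The three kinds of match (same concept, subconcept, part-of) come from the slide; the toy taxonomy, the direction of the part-of test (the required input is a part of the produced output), and the numeric costs are assumptions made for illustration, not the function defined in the cited papers.

```python
# Hypothetical fragment of a data taxonomy: child concept -> parent concept.
IS_A = {
    "LabeledDataset": "Dataset",
    "UnlabeledDataset": "Dataset",
}

# Hypothetical part-of relation: part -> whole.
PART_OF = {
    "VQModel": "LabeledVQModel",
}


def subconcept_distance(concept, ancestor, is_a):
    """Number of is-a steps from concept up to ancestor, or None if unrelated."""
    steps = 0
    current = concept
    while current is not None:
        if current == ancestor:
            return steps
        current = is_a.get(current)
        steps += 1
    return None


def match_cost(output_concept, input_concept, is_a=IS_A, part_of=PART_OF,
               subconcept_step_cost=1.0, part_of_cost=2.0):
    """Cost of feeding an output into an input; None means the match is invalid."""
    distance = subconcept_distance(output_concept, input_concept, is_a)
    if distance is not None:
        # 0.0 for the same concept, growing with each is-a step for subconcepts.
        return distance * subconcept_step_cost
    if part_of.get(input_concept) == output_concept:
        # The required input is contained in the produced output.
        return part_of_cost
    return None


print(match_cost("LabeledDataset", "LabeledDataset"))
print(match_cost("LabeledDataset", "Dataset"))
print(match_cost("LabeledVQModel", "VQModel"))
print(match_cost("Dataset", "LabeledDataset"))
```

Returning a cost rather than a boolean lets approximate matches be accepted but ranked below exact ones, which is what a composition procedure needs when several candidate services are available.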
Process Composition: matchmaking
Usage of Matchmaking:
• provide the user with information about the validity of the match
• find all services compatible with a given one
• support an advanced functionality for semi-automatic process composition
Process Composition: semi-automatic procedure
Procedure to generate conceptual processes [IDA 2009] [ITAIS 2009] [ECAI 2010]
• planning technique: goal-driven, backwards strategy
• basic step based on matchmaking
• pruning criteria, stop criteria, user constraints
• results: abstract processes useful as templates, ranked according to a metric
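The goal-driven, backwards strategy can be illustrated with the sketch below. It is a simplification under stated assumptions: the algorithm catalogue and concept names are hypothetical, the basic step uses exact concept equality in place of the matchmaking function, the only pruning criterion is a maximum chain length, and processes are ranked by length as a placeholder for the actual metric.

```python
# Hypothetical catalogue: algorithm name -> (input concept, output concept).
ALGORITHMS = {
    "RemoveMissingValues": ("RawDataset", "CleanDataset"),
    "Discretize": ("CleanDataset", "DiscreteDataset"),
    "C4.5": ("CleanDataset", "ClassificationModel"),
    "NaiveBayes": ("DiscreteDataset", "ClassificationModel"),
}


def compose_backwards(goal, available, algorithms, max_length=4):
    """Enumerate algorithm chains turning `available` into `goal`, starting from the goal."""
    complete = []
    # Each partial plan: (concept that still has to be produced, chain built so far).
    frontier = [(goal, [])]
    while frontier:
        needed, chain = frontier.pop()
        if needed == available:
            complete.append(chain)  # stop criterion: the available data is reached
            continue
        if len(chain) >= max_length:
            continue  # pruning criterion: discard overly long chains
        for name, (input_concept, output_concept) in algorithms.items():
            # Basic step: prepend any algorithm whose output provides what is needed.
            if output_concept == needed and name not in chain:
                frontier.append((input_concept, [name] + chain))
    # Placeholder ranking metric: shorter processes first.
    return sorted(complete, key=len)


plans = compose_backwards("ClassificationModel", "RawDataset", ALGORITHMS)
for plan in plans:
    print(" -> ".join(plan))
```

Because the search starts from the goal, only algorithms that contribute to producing the requested model are ever expanded; the resulting chains are abstract processes in the sense of the slide, since they name algorithms rather than concrete services.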
Collaborative Features
Collaborative features for process design within a team [CTS 2011]
1. Team building: to find users with certain skills & build a team
   Retrieval of users with certain competencies, inside TeamONTO
2. Multi-user process design and versioning
   Asynchronous editing of a process, and its storage as a new version in the repository
3. Communication
   Talk page & comments attached to versions
4. Task assignment
   Assigning the parameter settings for a service execution to a member of the team
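The team-building step (retrieving users with certain competencies from TeamONTO) can be sketched with a list of subject-predicate-object triples. The property names hasSkill and memberOf appear in the TeamONTO diagram; the people, skills and project below are hypothetical, and a real implementation would query the ontology rather than a Python list.

```python
# Hypothetical TeamONTO-style facts as (subject, property, object) triples.
TRIPLES = [
    ("alice", "hasSkill", "C4.5"),
    ("alice", "hasSkill", "Genomics"),
    ("bob", "hasSkill", "Genomics"),
    ("carol", "hasSkill", "C4.5"),
    ("carol", "memberOf", "ProjectX"),
]


def users_with_skills(triples, required_skills):
    """Return the users that have every one of the required skills."""
    skills_by_user = {}
    for subject, predicate, obj in triples:
        if predicate == "hasSkill":
            skills_by_user.setdefault(subject, set()).add(obj)
    required = set(required_skills)
    return sorted(user for user, skills in skills_by_user.items() if required <= skills)


print(users_with_skills(TRIPLES, ["C4.5"]))
print(users_with_skills(TRIPLES, ["C4.5", "Genomics"]))
```

In TeamONTO a skill can refer to an algorithm, a web service or a domain, so the same query shape covers both technical and domain expertise.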
Formal Verification of Processes
1. Process representation through Reo (CWI, Amsterdam) [Arbab, 2004]
• language for representation of interaction among services
• formal and compositional semantics
2. Specification of the desired behavior through mCRL2 [Groote, 2006]
3. Model-checking to verify whether the process satisfies the behavior
Purposes [SEBD 2010]
• formal verification of KDD processes at design-time
• reuse of typical pre-verified KDD subprocesses
Conclusion
KDD process design in distributed and collaborative environments
• Distributed and heterogeneous settings:
   • systematic use of semantic information throughout the process
   • loosely-coupled service architecture
• Community-centered approach:
   • support functionalities
   • users with various degrees of experience
• General-purpose attitude:
   • domain independence
   • generalization to experimental processes in e-Science
References
[Fayyad, 1996] “From data mining to knowledge discovery: an overview”, American Association for Artificial Intelligence.
[Ali, 2005] “Web services composition for distributed data mining”, Parallel Processing.
[Kgrid, 2003] “The knowledge grid”, Comm. ACM.
[Bernstein, 2005] “Towards Intelligent Assistance for a Data Mining Process: An Ontology-Based Approach for Cost-Sensitive Classification”, IEEE TKDE.
[Žáková, 2011] “Automating Knowledge Discovery Workflow Composition Through Ontology-Based Planning”, IEEE T. Automation Science and Engineering.
[myExperiment, 2009] “The Design and Realisation of the Virtual Research Environment for Social Sharing of Workflows”, FGCS.
[Fernandez, 1997] “METHONTOLOGY: from Ontological Art towards Ontological Engineering”, Proc. of the AAAI97 Spring Symposium.
[Staab, 2001] “Knowledge Processes and Ontologies”, IEEE Intelligent Systems.
[Gruber, 1995] “Toward principles for the design of ontologies used for knowledge sharing”, Int. J. Hum.-Comput. Stud.
[Jonyer, 2002] “Graph-based hierarchical conceptual clustering”, J. Mach. Learn. Res.
[Groote, 2006] “The formal specification language mCRL2”, in MMOSS.
[Arbab, 2004] “Reo: a channel-based coordination model for component composition”, Mathematical Structures in Comp. Sci.
Publications
[IDA 2009] “Ontology-driven KDD Process Composition”. LNCS, Springer, 2009.
[ECML 2009] “KDDONTO: an Ontology for Discovery and Composition of KDD Algorithms”. ECML/PKDD09 Workshop.
[ITAIS 2009] “Automatic Definition of KDD Prototype Process by Composition”. “Management of Interconnected World”, Springer.
[CTS 2010] “Semantic-Driven Design and Management of KDD Processes”. CTS 2010, IEEE.
[SEBD 2010] “Towards Coordination Patterns for Complex Experimentations in Data Mining”. SEBD 2010.
[ECAI 2010] “Supporting Users in KDD Process Design. A Semantic Similarity Matching Approach”. Planning To Learn Workshop in ECAI2010.
[CTS 2011] “Semantic-Aided Designer for Knowledge Discovery”. CTS 2011, IEEE.
[SEBD 2011] “Clustering of process schema by graph mining techniques”. SEBD 2011.
[SAC 2012] “Mining Usage Patterns from a Repository of Scientific Workflows”. SAC2012, ACM.
[CTS 2012] “Semantically-supported Team Building in a KDD Virtual Environment”. CTS 2012, IEEE.
[ISF] “A Virtual Mart for Knowledge Discovery in Databases”. Information Systems Frontiers, Springer, accepted.