Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach. Claudia Diamantini, Domenico Potena, Emanuele Storti. UNIVERSITA’ POLITECNICA DELLE MARCHE, Dipartimento di Ingegneria Informatica, Gestionale e dell’Automazione, Ancona, Italy. PlanLearn 2010, Lisbon, August 17

Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach


DESCRIPTION

Full paper: http://boole.diiga.univpm.it/paper/planlearn2010.pdf

Data Mining has reached a mature and sophisticated stage, with a plethora of techniques for complex data analysis tasks. In contrast, the capability of users to fully exploit these techniques has not increased proportionately. For this reason, the definition of methods and systems supporting users in Knowledge Discovery in Databases (KDD) activities is gaining increasing attention among researchers. The present work follows this line of research, proposing a methodology and a related system to support users in composing tools into valid and useful KDD processes. The cornerstone of the methodology is a similarity matching technique devised to recognize valid algorithmic sequences on the basis of their input/output pairs. Similarity is based on a semantic description of algorithms, their properties and interfaces, and is measured by an evaluation function. This allows candidate processes to be ranked, so that users are provided with a criterion for choosing the most suitable process with respect to their requests.


Page 1: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Claudia Diamantini, Domenico Potena, Emanuele Storti

UNIVERSITA’ POLITECNICA DELLE MARCHE
Dipartimento di Ingegneria Informatica, Gestionale e dell’Automazione
Ancona, Italy

PlanLearn 2010, Lisbon, August 17

Page 2: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Outline

I. Introduction
    a) Aim of the work
    b) Scenario

II. Methodology
    a) General approach
    b) KDD ontology
    c) Algorithm Matchmaking
    d) Process Composition

III. Applications
    a) Our framework
    b) Software & services

IV. Conclusion & Future Work

Emanuele Storti, PlanLearn 2010, Lisbon, August 17

Page 3: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Aim of the work

How to automate the Data Mining process? (Yang et al., 10 Challenging Problems for Data Mining Research, ICDM 2005)
• filling the gap between the knowledge hidden in data and the know-how needed for its extraction

Examples: KD for enterprises, e-science projects

New scenario: collaboration/distribution
• virtual organizations
• distributed teams and tools

Page 4: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Aim of the work

KDD in a collaborative/distributed scenario
• complexity: users have various expertise
• heterogeneity: tools have different interfaces

KDDVM project: a service-oriented platform for sharing, discovering, accessing and executing data analysis and knowledge discovery tools
• KDD tools produced by different organizations are remotely accessible as basic services through standard protocols
• Formalization of experts' knowledge in a conceptual semantic model, to support advanced services: auto-parameter setting, coordination management, service discovery, process composition

Usability, Integration

Page 5: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Approach

Separation of information in different layers:
• KDD services: ID3_v1.2, ID3_v2.0, SVM_v.1.0
• KDD algorithms: ID3, SVM

Benefits: loose-coupling, reusability

Advanced services rely on such a layer:
• service discovery
• process composition
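The two-layer separation can be illustrated with a minimal Python sketch. The registry structure and the function name `services_by_algorithm` are assumptions made for illustration only; they are not part of KDDVM. The sketch simply groups the concrete services named on the slide under the abstract algorithm they implement, which is the kind of lookup a service discovery step would rely on.

```python
from collections import defaultdict

# Hypothetical registry (illustration only): each concrete service is
# registered together with the abstract algorithm it implements.
SERVICES = [
    ("ID3_v1.2", "ID3"),
    ("ID3_v2.0", "ID3"),
    ("SVM_v.1.0", "SVM"),
]


def services_by_algorithm(services):
    """Group service names by the algorithm they implement."""
    index = defaultdict(list)
    for service_name, algorithm in services:
        index[algorithm].append(service_name)
    return dict(index)


index = services_by_algorithm(SERVICES)
print(index["ID3"])
```

Reasoning at the algorithm layer and resolving to services only at the end keeps a composed process independent of any specific implementation or version.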

Page 6: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Methodology in a nutshell

• Formalizing knowledge of KDD experts into an ontology for describing algorithms, their interfaces and their relations
• Defining techniques for matching algorithms with compatible interfaces
• Defining a goal-oriented composition procedure which starts from user requests and produces a list of valid processes ranked according to some criteria

[Figure: goal, dataset and constraints as input; processes as output]

Page 7: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

KDD Ontology (1)

KDDONTO is an ontology formalizing the domain of KDD algorithms:
• developed following a formal methodology
• taking into account quality requirements

Main classes and relations:
• Algorithm, Method
• Task, Phase
• Data, DataFeature
• Performance
• has_input/has_output
• ...

Page 8: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

KDD Ontology (2)

KDDONTO is conceived for supporting process composition.

Properties useful for representing algorithms' interfaces:
• has_condition: pre/postcondition for some input/output data
• not_with/not_before: explicit incompatibilities between methods

Properties useful for representing relations among data:
• part_of/has_part: relations between a compound datum and its subcomponents
• in_contrast: explicit incompatibilities between conditions

Page 9: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

KDD Ontology (3)

Example: SOM (interface)

has_input:
    input_type: UNLABELED_DATASET
    has_precondition:
        condition_type: FLOAT
        condition_strength: 0.4
    has_precondition:
        condition_type: NO_MISSING_VALUES
        condition_strength: 1.0
has_input:
    input_type: VQ
has_input:
    input_type: LEARNING_RATE
    is_parameter: yes
has_output:
    output_type: VQ
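As a rough illustration, the SOM interface above could be held in memory as follows. This is a sketch under assumptions: the class names (`Precondition`, `Input`, `Interface`) and helper functions are hypothetical and do not come from KDDONTO, which is an ontology rather than a Python model. Reading a strength of 1.0 as "cannot be relaxed" is also an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Precondition:
    condition_type: str
    strength: float  # assumed reading: 1.0 cannot be relaxed, lower values can


@dataclass(frozen=True)
class Input:
    input_type: str
    preconditions: Tuple[Precondition, ...] = ()
    is_parameter: bool = False


@dataclass(frozen=True)
class Interface:
    name: str
    inputs: Tuple[Input, ...]
    outputs: Tuple[str, ...]


# The SOM interface from the slide, transcribed into the hypothetical model.
SOM = Interface(
    name="SOM",
    inputs=(
        Input(
            "UNLABELED_DATASET",
            (Precondition("FLOAT", 0.4), Precondition("NO_MISSING_VALUES", 1.0)),
        ),
        Input("VQ"),
        Input("LEARNING_RATE", is_parameter=True),
    ),
    outputs=("VQ",),
)


def data_inputs(interface):
    """Inputs that are not parameters, i.e. data another step has to supply."""
    return [i.input_type for i in interface.inputs if not i.is_parameter]


def full_strength_conditions(interface):
    """Condition types whose strength is exactly 1.0."""
    return [
        p.condition_type
        for i in interface.inputs
        for p in i.preconditions
        if p.strength == 1.0
    ]


print(data_inputs(SOM))
print(full_strength_conditions(SOM))
```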

Page 10: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Algorithm Matchmaking

Linking algorithms with compatible interfaces

A is compatible with B iff, for each input IN_B of B:
• (either) IN_B is_parameter
• (or) ∃ OUT_A such that:
    • OUT_A and IN_B are valid w.r.t. preconditions
    • OUT_A and IN_B are similar datatypes (is_a, part_of)

[Figure: matching example involving algorithm A and the data types LDS, UDS and L, linked by part_of]
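The compatibility rule can be sketched in Python as below. Everything concrete here is an assumption made for illustration: the tiny `IS_A` and `PART_OF` tables (including the idea that UDS and L are parts of LDS), the direction in which the relations are followed, and the function names. Precondition checking is reduced to a pluggable callback that accepts everything by default.

```python
# Hypothetical fragment of a datatype taxonomy (not taken from KDDONTO).
IS_A = {"LDS": "DATASET", "UDS": "DATASET"}  # child -> parent
PART_OF = {"UDS": "LDS", "L": "LDS"}         # part -> whole


def similar(out_type, in_type):
    """True if out_type equals in_type, specializes it, or contains it as a part."""
    if out_type == in_type:
        return True
    # Follow is_a links upwards from the output type.
    current = out_type
    while current in IS_A:
        current = IS_A[current]
        if current == in_type:
            return True
    # The required input is a component of the produced output.
    return PART_OF.get(in_type) == out_type


def compatible(outputs_a, inputs_b, preconditions_ok=lambda out, inp: True):
    """A is compatible with B iff every non-parameter input of B is covered
    by some output of A that is similar and passes the precondition check."""
    for in_type, is_parameter in inputs_b:
        if is_parameter:
            continue  # parameters are supplied by the user, not by A
        if not any(
            similar(out, in_type) and preconditions_ok(out, in_type)
            for out in outputs_a
        ):
            return False
    return True


print(compatible(["LDS"], [("UDS", False), ("LEARNING_RATE", True)]))
print(compatible(["VQ"], [("UDS", False)]))
```

In this toy setting an algorithm producing LDS can feed one that requires UDS, because UDS is declared a part of LDS.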

Page 11: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Matchmaking: cost

How to evaluate the cost of a match?
• Degree of similarity between I/O
    • weighted distance between IN and OUT
    • weight(specialization) < weight(part_of)
• Preconditions and their possible relaxation
    • the higher the condition_strength, the higher the cost
• Performance of algorithms
    • e.g.: the higher the complexity, the higher the cost
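One possible shape for such a cost function is sketched below. The slides do not give the actual evaluation function, so the additive form, the weight values and all names are assumptions; the only properties taken from the slide are that a specialization step weighs less than a part_of step, and that relaxing a stronger precondition or using a more complex algorithm costs more.

```python
# Assumed weights; the slide only requires W_SPECIALIZATION < W_PART_OF.
W_SPECIALIZATION = 1.0
W_PART_OF = 2.0


def similarity_cost(is_a_steps, part_of_steps):
    """Weighted distance between an output and an input in the taxonomy."""
    return W_SPECIALIZATION * is_a_steps + W_PART_OF * part_of_steps


def relaxation_cost(relaxed_strengths):
    """Each relaxed precondition contributes its strength to the cost."""
    return sum(relaxed_strengths)


def match_cost(is_a_steps, part_of_steps, relaxed_strengths, complexity_penalty=0.0):
    """Hypothetical additive combination of the three cost components."""
    return (
        similarity_cost(is_a_steps, part_of_steps)
        + relaxation_cost(relaxed_strengths)
        + complexity_penalty
    )


exact = match_cost(0, 0, [])
via_specialization = match_cost(1, 0, [])
via_part_of = match_cost(0, 1, [])
print(exact, via_specialization, via_part_of)
```

With these assumed weights an exact match is free, a match through one specialization step is cheaper than one through part_of, and relaxing a precondition of strength 0.4 adds less than relaxing one of strength 0.9.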

Page 12: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Composition Procedure (1)

Goal-driven procedure for composing KDD processes, exploiting KDDONTO and matching functionalities
• produces a subset of all possible valid processes

I. Definition of dataset, goal and user constraints
• An instance of Task class, e.g.: CLASSIFICATION
• A Dataset type and a set of instances of DataFeature class, e.g.: LabeledDataset {float, balanced, normalized, missing_values}
• Pruning criteria:
    • max number of algorithms in a process
    • max cost of a process
    • max computational complexity

Page 13: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Composition Procedure (2)

II. Process building
• Starts from task and goes backwards iteratively
• At each iteration, algorithms are added to processes by exploiting matching functionalities

[Figure: process graph with nodes labeled task, ds and A]

Stop conditions:
• no process can be further expanded
• some process constraints are violated

Output only valid processes:
• satisfying the user goal
• compatible with the given dataset
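The backward, goal-driven construction can be sketched as a breadth-first search. This is a simplified illustration, not the authors' procedure: the toy `ALGORITHMS` catalogue and its names are invented, each algorithm has a single input and output, matching is reduced to exact type equality instead of semantic similarity, and the only pruning criterion is the maximum number of algorithms.

```python
from collections import deque

# Hypothetical toy catalogue (illustration only).
ALGORITHMS = {
    "Classifier": {"task": "CLASSIFICATION", "input": "CLEAN_LDS", "output": "MODEL"},
    "FillMissing": {"task": None, "input": "LDS", "output": "CLEAN_LDS"},
    "Normalize": {"task": None, "input": "CLEAN_LDS", "output": "CLEAN_LDS"},
}


def compose(task, dataset_type, max_algorithms):
    """Build processes backwards from the task until they accept the dataset."""
    valid = []
    # Start from the algorithms that directly achieve the task.
    queue = deque([name] for name, a in ALGORITHMS.items() if a["task"] == task)
    while queue:
        process = queue.popleft()
        open_input = ALGORITHMS[process[0]]["input"]
        if open_input == dataset_type:
            valid.append(process)  # compatible with the given dataset
            continue
        if len(process) >= max_algorithms:
            continue  # constraint violated: prune this process
        # Prepend every algorithm whose output feeds the first open input.
        for name, a in ALGORITHMS.items():
            if a["output"] == open_input:
                queue.append([name] + process)
    return valid


processes = compose("CLASSIFICATION", "LDS", max_algorithms=3)
print(processes)
```

The loop ends when no process can be expanded further, and the length bound is what stops the self-feeding `Normalize` step from generating unbounded chains.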

Page 14: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Composition Procedure (3)

III. Process ranking

Several possible ranking functions:
• number of algorithms in the process
• process cost ($\sum_{i=1}^{n} C_i$)
• easiness-of-usage (function of the number of user-parameters)
• overall computational complexity (function of the max complexity among the algorithms in the process)
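The ranking functions listed above can be sketched as interchangeable sort keys. The process records, their field names and the numbers are made up for illustration; in particular, reading each C_i as the cost of the i-th match in the process is an assumption, as is using the raw parameter count and the raw maximum complexity as scores.

```python
# Hypothetical process descriptions (illustration only).
PROCESSES = [
    {"name": "P1", "algorithms": 3, "match_costs": [1.0, 2.0],
     "user_parameters": 1, "complexities": [1, 2, 3]},
    {"name": "P2", "algorithms": 2, "match_costs": [0.5],
     "user_parameters": 4, "complexities": [5, 1]},
]

# Each ranking function maps a process to a score; lower is better.
RANKING_FUNCTIONS = {
    "number_of_algorithms": lambda p: p["algorithms"],
    "process_cost": lambda p: sum(p["match_costs"]),        # sum of the C_i
    "easiness_of_usage": lambda p: p["user_parameters"],     # fewer is easier
    "computational_complexity": lambda p: max(p["complexities"]),
}


def rank(processes, criterion):
    """Return process names ordered by the chosen ranking function."""
    key = RANKING_FUNCTIONS[criterion]
    return [p["name"] for p in sorted(processes, key=key)]


for criterion in RANKING_FUNCTIONS:
    print(criterion, rank(PROCESSES, criterion))
```

Different criteria can reverse the order of the same two processes, which is why the choice of ranking function is left to the user.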

Page 15: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

KDDVM Framework

[Figure: KDDVM framework architecture]
• Basic services: BVQ_1.0, PCA_1.0, id3_1.2, ...
• Advanced services: WSMatch, Semantic Broker, ...
• Support services: ...
• Resources: KDDONTO, UDDI
• Clients: KDDComposer, KDDWebDesigner, BrokerClient, OntoViewer, BasicClient

Page 16: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

KDDComposer

A prototype implementing the composition procedure

Example scenario:
• Task: CLASSIFICATION
• Dataset: LabeledDataset
• Dataset features: {float, normalized, missing_values, ...}
• Constraints: max 5 algorithms, ...

Results:
• a ranked list of many valid processes
• detailed information about each process, algorithm, match, connection

Page 17: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

WSMatch

A WS implementing the matchmaking functionality

[Figure: a WS Client queries WSMatch, which relies on KDDONTO]
• match (A, B)? → cost
• match (?, B) → match set = {...}
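The two query modes shown for WSMatch can be mimicked locally with a small function. This is not the WSMatch interface: the signature, the use of `None` for the "?" placeholder and the precomputed `COSTS` table are all assumptions chosen to keep the sketch self-contained.

```python
# Hypothetical precomputed match costs: (producer, consumer) -> cost.
COSTS = {("A", "B"): 1.5, ("C", "B"): 0.5, ("A", "D"): 2.0}


def match(producer, consumer):
    """match(A, B) returns the cost of linking A to B, or None if they do not match.
    match(None, B) returns every algorithm that can precede B, with its cost."""
    if producer is None:
        return {a: cost for (a, b), cost in COSTS.items() if b == consumer}
    return COSTS.get((producer, consumer))


print(match("A", "B"))
print(match(None, "B"))
```

The second mode is the one a backward composition step needs, since it asks which algorithms could be placed before a given one.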

Page 18: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

KDD WebDesigner

• search services by name/algorithm (call to SemanticBroker)
• check compatibility (WSMatch)

Page 19: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Conclusion

Open environments and heterogeneous tools
• different interfaces: need of a common representation (service)
• abstraction for a high-level description of tools (algorithm)

Process composition procedure
• abstract processes are reusable:
    • steps to be performed with real tools
    • composition patterns for solving certain types of problems
• valid and useful knowledge, valuable for both novice and expert users

Algorithm matchmaking based on algorithms
• different similarity relations: subsumption, part_of
• verification of preconditions/postconditions
• reusable for several applications

Page 20: Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach

Future Work

Enhancement of KDDONTO's descriptive capabilities
• add information about statistical characteristics of data
• identify which algorithm is likely to perform best

Translation of abstract processes into concrete workflows
• for each algorithm, find corresponding services
• check for possible mismatches, evaluate syntactic compatibility, perform syntactic translations between different formats

Comprehensive tests
• evaluate effectiveness of composition procedure
• evaluate ranking functions
