Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach


Claudia Diamantini, Domenico Potena, Emanuele Storti

Università Politecnica delle Marche
Dipartimento di Ingegneria Informatica, Gestionale e dell'Automazione

Ancona, Italy



PlanLearn 2010, Lisbon, August 17

Outline

I. Introduction
   a) Aim of the work
   b) Scenario

II. Methodology
   a) General approach
   b) KDD ontology
   c) Algorithm matchmaking
   d) Process composition

III. Applications
   a) Our framework
   b) Software & services

IV. Conclusion & Future Work


Aim of the work

How can the Data Mining process be automated? (Yang et al., 10 Challenging Problems for Data Mining Research, ICDM 2005)
• filling the gap between the knowledge hidden in data and the know-how needed for its extraction

Examples: KD for enterprises, e-science projects

New scenario: collaboration/distribution
• virtual organizations
• distributed teams and tools

Aim of the work

KDD in a collaborative/distributed scenario
• complexity: users have various levels of expertise
• heterogeneity: tools have different interfaces

KDDVM project: a service-oriented platform for sharing, discovering, accessing and executing data analysis and knowledge discovery tools
• KDD tools produced by different organizations are remotely accessible as basic services through standard protocols
• Formalization of experts' knowledge in a conceptual semantic model, to support advanced services: auto-parameter setting, coordination management, service discovery, process composition

Usability, Integration

Approach

Separation of information in different layers:
• KDD services: ID3_v1.2, ID3_v2.0, SVM_v.1.0
• KDD algorithms: ID3, SVM

Benefits: loose coupling, reusability

Advanced services rely on such a layer:
• service discovery
• process composition

Methodology in a nutshell

• Formalizing the knowledge of KDD experts into an ontology that describes algorithms, their interfaces and their relations
• Defining techniques for matching algorithms with compatible interfaces
• Defining a goal-oriented composition procedure which starts from user requests and produces a list of valid processes ranked according to some criteria

[Figure: goal, dataset and constraints as inputs; processes as output]


KDD Ontology (1)

KDDONTO is an ontology formalizing the domain of KDD algorithms:
• developed following a formal methodology
• taking into account quality requirements

Main classes and relations:
• Algorithm, Method
• Task, Phase
• Data, DataFeature
• Performance
• has_input/has_output
• ...


KDDONTO is conceived for supporting process composition.

Properties useful for representing algorithms' interfaces:
• has_condition: pre/postcondition for some input/output data
• not_with/not_before: explicit incompatibilities between methods

Properties useful for representing relations among data:
• part_of/has_part: relations between a compound datum and its subcomponents
• in_contrast: explicit incompatibilities between conditions


Example: SOM (interface)

has_input:
    input_type: UNLABELED_DATASET
    has_precondition:
        condition_type: FLOAT
        condition_strength: 0.4
    has_precondition:
        condition_type: NO_MISSING_VALUES
        condition_strength: 1.0
has_input:
    input_type: VQ
has_input:
    input_type: LEARNING_RATE
    is_parameter: yes
has_output:
    output_type: VQ
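The SOM interface above can be mirrored in a small data structure. The following is only a sketch: the class names `Precondition`, `Input` and `AlgorithmInterface`, the helper `mandatory_conditions`, and the reading of a strength of 1.0 as "cannot be relaxed" are illustrative assumptions, not part of KDDONTO.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Precondition:
    condition_type: str        # e.g. "FLOAT"
    condition_strength: float  # assumption: 1.0 means the condition cannot be relaxed


@dataclass
class Input:
    input_type: str
    is_parameter: bool = False
    preconditions: List[Precondition] = field(default_factory=list)


@dataclass
class AlgorithmInterface:
    name: str
    inputs: List[Input]
    outputs: List[str]

    def mandatory_conditions(self) -> List[str]:
        """Condition types with strength 1.0 (assumed non-relaxable)."""
        return [p.condition_type
                for i in self.inputs
                for p in i.preconditions
                if p.condition_strength >= 1.0]


# The SOM example from the slide, encoded with the hypothetical classes above.
som = AlgorithmInterface(
    name="SOM",
    inputs=[
        Input("UNLABELED_DATASET", preconditions=[
            Precondition("FLOAT", 0.4),
            Precondition("NO_MISSING_VALUES", 1.0),
        ]),
        Input("VQ"),
        Input("LEARNING_RATE", is_parameter=True),
    ],
    outputs=["VQ"],
)
```

Keeping parameters and data inputs in the same list, distinguished only by `is_parameter`, follows the layout of the slide; a real encoding would live in the ontology rather than in code.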



Algorithm Matchmaking

Linking algorithms with compatible interfaces.

A is compatible with B iff, for each input IN_B of B:
• (either) IN_B is_parameter
• (or) ∃ OUT_A such that:
   - OUT_A and IN_B are valid w.r.t. preconditions
   - OUT_A and IN_B are similar datatypes (is_a, part_of)
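The compatibility rule can be sketched as a small check. Everything concrete here is an assumption made for illustration: the toy `IS_A` and `PART_OF` tables, the dictionary layout of an algorithm, and the simplification that preconditions are valid when they are all listed among the output's postconditions.

```python
# Hypothetical toy taxonomy: child -> parent (is_a) and part -> whole (part_of).
IS_A = {"LABELED_DATASET": "DATASET", "UNLABELED_DATASET": "DATASET"}
PART_OF = {"UNLABELED_DATASET": "LABELED_DATASET"}


def similar(out_type, in_type):
    """True if out_type equals in_type, specializes it (is_a) or contains it (part_of)."""
    if out_type == in_type:
        return True
    t = out_type
    while t in IS_A:  # walk up the is_a chain from the output type
        t = IS_A[t]
        if t == in_type:
            return True
    # The required input is a component of the produced output.
    return PART_OF.get(in_type) == out_type


def compatible(a, b):
    """A is compatible with B iff every non-parameter input of B is served by an output of A."""
    for inp in b["inputs"]:
        if inp.get("is_parameter"):
            continue
        served = any(
            similar(out["type"], inp["type"])
            # Simplified validity test: all preconditions appear as postconditions.
            and set(inp.get("preconditions", [])) <= set(out.get("postconditions", []))
            for out in a["outputs"]
        )
        if not served:
            return False
    return True


a = {"outputs": [{"type": "LABELED_DATASET", "postconditions": ["NO_MISSING_VALUES"]}]}
b = {"inputs": [
    {"type": "UNLABELED_DATASET", "preconditions": ["NO_MISSING_VALUES"]},
    {"type": "LEARNING_RATE", "is_parameter": True},
]}
print(compatible(a, b))
```

In this toy case the match succeeds through part_of: the unlabeled dataset required by B is a component of the labeled dataset produced by A, and the learning rate is skipped because it is a parameter.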

[Figure: example of a candidate match between an LDS and a UDS through a part_of relation]

Matchmaking: cost

How to evaluate the cost of a match?
• Degree of similarity between I/O
   - weighted distance between IN and OUT
   - weight(specialization) < weight(part_of)
• Preconditions and their possible relaxation
   - the higher the condition_strength, the higher the cost
• Performance of algorithms
   - e.g.: the higher the complexity, the higher the cost
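A toy cost function makes the three ingredients concrete. Only the ordering weight(specialization) < weight(part_of) and the monotonic trends come from the slide; the numeric weights, the additive form and the name `match_cost` are assumptions chosen for illustration.

```python
# Assumed weights: only the ordering W_IS_A < W_PART_OF is taken from the slide.
W_IS_A = 1.0
W_PART_OF = 2.0


def match_cost(is_a_steps, part_of_steps, relaxed_strengths, complexity=0.0):
    """Toy additive cost of one match (the additive form is an assumption)."""
    # Weighted distance between the output and input datatypes.
    similarity = W_IS_A * is_a_steps + W_PART_OF * part_of_steps
    # Relaxing a stronger precondition costs more.
    relaxation = sum(relaxed_strengths)
    return similarity + relaxation + complexity


exact = match_cost(0, 0, [])
via_is_a = match_cost(1, 0, [])
via_part_of = match_cost(0, 1, [])
print(exact, via_is_a, via_part_of)
```

With these assumed weights an exact match costs nothing, a match through specialization is cheaper than one through part_of, and relaxing a condition of strength 0.9 is more expensive than relaxing one of strength 0.4.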


Composition Procedure (1)

Goal-driven procedure for composing KDD processes, exploiting KDDONTO and matching functionalities. It produces a subset of all possible valid processes.

I. Definition of dataset, goal and user constraints
• Goal: an instance of the Task class, e.g.: CLASSIFICATION
• Dataset: a dataset type and a set of instances of the DataFeature class, e.g.: LabeledDataset {float, balanced, normalized, missing_values}
• Pruning criteria:
   - max number of algorithms in a process
   - max cost of a process
   - max computational complexity

II. Process building
• Starts from the task and goes backwards iteratively
• At each iteration, algorithms are added to processes by exploiting matching functionalities
• Stop conditions:
   - no process can be further expanded
   - some process constraints are violated
• Output only valid processes:
   - satisfying the user goal
   - compatible with the given dataset
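The backward, goal-driven construction can be illustrated with a simplified search. This is a sketch under strong assumptions, not the KDDVM procedure itself: the catalogue `ALGORITHMS` is invented, interfaces are reduced to sets of required and provided features, matching is plain set containment, and only the "max number of algorithms" pruning criterion is modelled.

```python
from collections import deque

# Hypothetical catalogue: the task solved (if any), required and provided features.
ALGORITHMS = {
    "ID3": {"task": "CLASSIFICATION",
            "requires": {"labeled", "no_missing_values"}, "provides": set()},
    "SVM": {"task": "CLASSIFICATION",
            "requires": {"labeled", "no_missing_values", "normalized"}, "provides": set()},
    "RemoveMissing": {"task": None, "requires": set(),
                      "provides": {"no_missing_values"}},
    "Normalize": {"task": None, "requires": {"no_missing_values"},
                  "provides": {"normalized"}},
}


def first_unmet(process, dataset_features, algorithms):
    """Return (index, missing features) of the first step that cannot run, or (None, set())."""
    features = set(dataset_features)
    for idx, name in enumerate(process):
        missing = algorithms[name]["requires"] - features
        if missing:
            return idx, missing
        features |= algorithms[name]["provides"]
    return None, set()


def compose(task, dataset_features, algorithms, max_len=5):
    """Start from algorithms solving `task`, then add providers of missing requirements."""
    valid, seen = [], set()
    queue = deque([name] for name, a in algorithms.items() if a["task"] == task)
    while queue:
        process = queue.popleft()
        idx, missing = first_unmet(process, dataset_features, algorithms)
        if idx is None:
            valid.append(process)  # satisfies the goal and runs on the dataset
            continue
        if len(process) >= max_len:
            continue  # pruning: too many algorithms in the process
        for name, a in algorithms.items():
            if name not in process and a["provides"] & missing:
                candidate = process[:idx] + [name] + process[idx:]
                if tuple(candidate) not in seen:
                    seen.add(tuple(candidate))
                    queue.append(candidate)
    return valid


print(compose("CLASSIFICATION", {"labeled"}, ALGORITHMS, max_len=3))
```

Each candidate process is expanded just before the first step whose requirements are unmet, so the search grows from the goal towards the dataset. Lowering `max_len` to 2 prunes the SVM-based process, which needs two preprocessing steps.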


III. Process ranking


Several possible ranking functions:
• number of algorithms in the process
• process cost ($\sum_{i=1}^{n} C_i$)
• ease of use (function of the number of user parameters)
• overall computational complexity (function of the max complexity among the algorithms in the process)
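The ranking functions listed above can be expressed as interchangeable sort keys. The process records, their field names and the numbers below are hypothetical; the sketch also assumes that each C_i is the cost of one match in the process and that lower values rank first.

```python
# Hypothetical process records; field names and values are illustrative only.
p1 = {"algorithms": ["RemoveMissing", "ID3"],
      "match_costs": [1.0], "user_parameters": 0, "complexities": [1, 2]}
p2 = {"algorithms": ["RemoveMissing", "Normalize", "SVM"],
      "match_costs": [0.2, 0.3], "user_parameters": 2, "complexities": [1, 1, 3]}

RANKERS = {
    "length": lambda p: len(p["algorithms"]),      # number of algorithms
    "cost": lambda p: sum(p["match_costs"]),       # sum of C_i for i = 1..n
    "ease_of_use": lambda p: p["user_parameters"],  # fewer user parameters first
    "complexity": lambda p: max(p["complexities"]),  # max complexity in the process
}


def rank(processes, criterion):
    """Sort processes in ascending order of the chosen ranking function."""
    return sorted(processes, key=RANKERS[criterion])


for criterion in RANKERS:
    print(criterion, [p["algorithms"][-1] for p in rank([p1, p2], criterion)])
```

Different criteria can disagree: in this toy data the longer SVM-based process has the lower total cost, while the ID3-based one wins on length, ease of use and complexity.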

KDDVM Framework

[Figure: KDDVM architecture. Layers: basic services (BVQ_1.0, PCA_1.0, id3_1.2, ...), support services, advanced services, clients. Resources: KDDONTO, UDDI. Components shown: WSMatch, Semantic Broker, KDDComposer, KDDWebDesigner, BrokerClient, OntoViewer, BasicClient.]

KDDComposer

A prototype implementing the composition procedure.

Example scenario:
• Task: CLASSIFICATION
• Dataset: LabeledDataset
• Dataset features: {float, normalized, missing_values, ...}
• Constraints: max 5 algorithms, ...

Results:
• a ranked list of many valid processes
• detailed information about each process, algorithm, match, connection

WSMatch

A WS implementing the matchmaking functionality.

[Figure: a WS client queries WSMatch, which relies on KDDONTO. Query match(A, B)? returns a cost; query match(?, B) returns a match set = {...}.]

KDD WebDesigner

• search services by name/algorithm (call to SemanticBroker)
• check compatibility (WSMatch)

Conclusion

Open environments and heterogeneous tools
• different interfaces: need for a common representation (service)
• abstraction for a high-level description of tools (algorithm)

Process composition procedure
• abstract processes are reusable:
   - steps to be performed with real tools
   - composition patterns for solving certain types of problems
   - valid and useful knowledge, valuable for both novice and expert users

Algorithm matchmaking
• based on algorithms
• different similarity relations: subsumption, part_of
• verification of preconditions/postconditions
• reusable for several applications

Future Work

Enhancement of KDDONTO's descriptive capabilities
• add information about statistical characteristics of data
• identify which algorithm is likely to perform best

Translation of abstract processes into concrete workflows
• for each algorithm, find corresponding services
• check for possible mismatches, evaluate syntactic compatibility, perform syntactic translations between different formats

Comprehensive tests
• evaluate the effectiveness of the composition procedure
• evaluate ranking functions
