Full paper: http://boole.diiga.univpm.it/paper/planlearn2010.pdf
Data Mining has reached a quite mature and sophisticated stage, with a plethora of techniques to deal with complex data analysis tasks. In contrast, the capability of users to fully exploit these techniques has not increased proportionately. For this reason, the definition of methods and systems supporting users in Knowledge Discovery in Databases (KDD) activities is gaining increasing attention among researchers. The present work fits into this line of research, proposing a methodology and the related system to support users in composing tools into valid and useful KDD processes. The basic pillar of the methodology is a similarity matching technique devised to recognize valid algorithmic sequences on the basis of their input/output pairs. Similarity is based on a semantic description of algorithms, their properties and interfaces, and is measured by a proper evaluation function. This makes it possible to rank the candidate processes, so that users are provided with a criterion to choose the most suitable process with respect to their requests.
Supporting Users in KDD Processes Design: a Semantic Similarity Matching Approach
Claudia Diamantini, Domenico Potena, Emanuele Storti
UNIVERSITA’ POLITECNICA DELLE MARCHE
Dipartimento di Ingegneria Informatica, Gestionale e dell’Automazione
Ancona, Italy
PlanLearn 2010, Lisbon, August 17
Outline
I. Introduction
   a) Aim of the work
   b) Scenario
II. Methodology
   a) General approach
   b) KDD ontology
   c) Algorithm Matchmaking
   d) Process Composition
III. Applications
   a) Our framework
   b) Software & services
IV. Conclusion & Future Work
Aim of the work
How to automate the Data Mining process? (Yang et al., 10 Challenging Problems for Data Mining Research, ICDM 2005)
• filling the gap between the knowledge hidden in data and the know-how needed for its extraction
New scenario: collaboration/distribution
• virtual organizations
• distributed teams and tools
Examples: KD for enterprises, e-science projects
Aim of the work
KDD in a collaborative/distributed scenario
• complexity: users have various expertise (Usability)
• heterogeneity: tools have different interfaces (Integration)
KDDVM project: service-oriented platform for sharing, discovering, accessing, executing data analysis and knowledge discovery tools
• KDD tools produced by different organizations are remotely accessible as basic services through standard protocols
• Formalization of experts' knowledge in a conceptual semantic model, to support advanced services: auto-parameter setting, coordination management, service discovery, process composition
Approach
Separation of information in different layers:
• KDD services: ID3_v1.2, ID3_v2.0, SVM_v.1.0
• KDD algorithms: ID3, SVM
Benefits: loose-coupling, reusability
Advanced services rely on such a layer:
• service discovery
• process composition
Methodology in a nutshell
• Formalizing knowledge of KDD experts into an ontology for describing algorithms, their interfaces and their relations
• Defining techniques for matching algorithms with compatible interfaces
• Defining a goal-oriented composition procedure which starts from user requests and produces a list of valid processes ranked according to some criteria
[Figure: goal, dataset and constraints as inputs; processes as output]
KDD Ontology (1)
KDDONTO is an ontology formalizing the domain of KDD algorithms:
• developed following a formal methodology
• taking into account quality requirements
Main classes and relations:
• Algorithm, Method
• Task, Phase
• Data, DataFeature
• Performance
• has_input/has_output
• ...
KDD Ontology (2)
KDDONTO is conceived for supporting process composition
Properties useful for representing algorithms' interfaces:
• has_condition: pre/postcondition for some input/output data
• not_with/not_before: explicit incompatibilities between methods
Properties useful for representing relations among data:
• part_of/has_part: relations between a compound datum and its subcomponents
• in_constrast: explicit incompatibilities between conditions
KDD Ontology (3)
Example: SOM (interface)
has_input:
    input_type: UNLABELED_DATASET
    has_precondition:
        condition_type: FLOAT
        condition_strenght: 0.4
    has_precondition:
        condition_type: NO_MISSING_VALUES
        condition_strenght: 1.0
has_input:
    input_type: VQ
has_input:
    input_type: LEARNING_RATE
    is_parameter: yes
has_output:
    output_type: VQ
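The SOM interface description can also be held as a small data structure. The following is only a sketch: the class names `Condition`, `Input` and `Interface` are hypothetical and are not part of KDDONTO, while the field values are the ones listed in the example (including the slide's spelling `condition_strenght`).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Condition:
    condition_type: str
    condition_strenght: float  # spelling follows the slide


@dataclass
class Input:
    input_type: str
    is_parameter: bool = False
    preconditions: List[Condition] = field(default_factory=list)


@dataclass
class Interface:
    name: str
    inputs: List[Input]
    outputs: List[str]


# The SOM interface from the example, encoded with the hypothetical classes above.
som = Interface(
    name="SOM",
    inputs=[
        Input("UNLABELED_DATASET", preconditions=[
            Condition("FLOAT", 0.4),
            Condition("NO_MISSING_VALUES", 1.0),
        ]),
        Input("VQ"),
        Input("LEARNING_RATE", is_parameter=True),
    ],
    outputs=["VQ"],
)

# Inputs flagged as parameters, and preconditions with the maximum strength.
parameters = [i.input_type for i in som.inputs if i.is_parameter]
strongest = [c.condition_type
             for i in som.inputs
             for c in i.preconditions
             if c.condition_strenght == 1.0]
```

Keeping parameters and preconditions on each input mirrors the nesting of the ontology description, so a matcher can inspect one input at a time.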
Algorithm Matchmaking
Linking algorithms with compatible interfaces
A is compatible with B iff $\forall\, IN_B$:
• (either) $IN_B$ is_parameter
• (or) $\exists\, OUT_A$ such that:
  • $OUT_A$ and $IN_B$ are valid w.r.t. preconditions
  • $OUT_A$ and $IN_B$ are similar datatypes (is_a, part_of)
[Figure: matching example for an algorithm A, with datatypes LDS, UDS and L linked by part_of]
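The compatibility rule can be sketched in a few lines. This is an illustrative reading, not the WSMatch implementation: the `IS_A` and `PART_OF` tables are hypothetical, and treating only preconditions with strength 1.0 as mandatory is an assumption made here for simplicity.

```python
# Hypothetical datatype relations, chosen only for illustration.
IS_A = {"LDS": "DATASET", "UDS": "DATASET"}   # subtype -> supertype
PART_OF = {"UDS": "LDS", "L": "LDS"}          # part -> compound datum


def similar(out_type, in_type):
    """True if the output equals the input type, specialises it (is_a),
    or is a compound datum that contains it (part_of)."""
    if out_type == in_type:
        return True
    t = out_type
    while t in IS_A:                  # walk up the is_a hierarchy
        t = IS_A[t]
        if t == in_type:
            return True
    return PART_OF.get(in_type) == out_type


def preconditions_ok(out, inp):
    """Assumption: only preconditions with strength 1.0 are mandatory."""
    mandatory = {c for c, s in inp.get("preconditions", {}).items() if s >= 1.0}
    return mandatory <= out.get("conditions", set())


def compatible(a_outputs, b_inputs):
    """A is compatible with B iff every non-parameter input of B is
    matched by some output of A."""
    for inp in b_inputs:
        if inp.get("is_parameter"):
            continue
        if not any(similar(o["type"], inp["type"]) and preconditions_ok(o, inp)
                   for o in a_outputs):
            return False
    return True


a_out = [{"type": "LDS", "conditions": {"NO_MISSING_VALUES"}}]
b_in = [
    {"type": "UDS", "preconditions": {"FLOAT": 0.4, "NO_MISSING_VALUES": 1.0}},
    {"type": "LEARNING_RATE", "is_parameter": True},
]
result = compatible(a_out, b_in)
```

Here the match succeeds through the part_of relation: B asks for UDS, and A produces an LDS that contains it.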
Matchmaking: cost
How to evaluate the cost of a match?
• Degree of similarity between I/O
  • weighted distance between IN and OUT
  • weight(specialization) < weight(part_of)
• Preconditions and their possible relaxation
  • the higher the condition_strenght, the higher the cost
• Performance of algorithms
  • e.g.: the higher the complexity, the higher the cost
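One way to combine the three cost components is a plain sum, sketched below. The weights, the additive form and the function name `match_cost` are all assumptions made for illustration; the slides only state the ordering weight(specialization) < weight(part_of) and the monotonic behaviour of each component.

```python
# Assumed weights: only their ordering (is_a cheaper than part_of) comes from the slide.
W_SPECIALIZATION = 1.0
W_PART_OF = 2.0


def match_cost(is_a_steps, part_of_steps, relaxed_strengths, complexity=0.0):
    """Hypothetical additive cost of a match.

    is_a_steps / part_of_steps: number of relations crossed between OUT and IN.
    relaxed_strengths: strengths of the preconditions that had to be relaxed.
    complexity: a numeric penalty for the algorithm's complexity.
    """
    similarity = W_SPECIALIZATION * is_a_steps + W_PART_OF * part_of_steps
    relaxation = sum(relaxed_strengths)
    return similarity + relaxation + complexity


exact = match_cost(0, 0, [])
via_is_a = match_cost(1, 0, [])
via_part_of = match_cost(0, 1, [])
```

With these weights an exact match costs nothing, a specialization costs less than a part_of match, and relaxing a stronger precondition always costs more than relaxing a weaker one.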
Composition Procedure (1)
Goal-driven procedure for composing KDD processes, exploiting KDDONTO and matching functionalities
• produces a subset of all possible valid processes
I. Definition of dataset, goal and user constraints
• An instance of Task class, e.g.: CLASSIFICATION
• A Dataset type and set of instances of DataFeature class, e.g.: LabeledDataset {float, balanced, normalized, missing_values}
• Pruning criteria:
  • max number of algorithms in a process
  • max cost of a process
  • max computational complexity
Composition Procedure (2)
II. Process building
• Starts from task and goes backwards iteratively
• At each iteration, algorithms are added to processes by exploiting matching functionalities
• Stop conditions:
  • no process can be further expanded
  • some process constraints are violated
• Output only valid processes:
  • satisfying the user goal
  • compatible with the given dataset
[Figure: processes built backwards from task to ds (dataset)]
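The backward process-building step can be illustrated with a small search. This is a simplified sketch, not the KDDComposer code: the algorithm catalogue is invented, each algorithm has a single input and output, and exact type equality stands in for the semantic matching described earlier.

```python
from collections import deque

# Hypothetical catalogue: name -> (input type, output type, task or None).
ALGORITHMS = {
    "RemoveMissingValues": ("RAW_LDS", "LDS", None),
    "Normalize": ("LDS", "LDS_NORM", None),
    "ID3": ("LDS", "MODEL", "CLASSIFICATION"),
    "SVM": ("LDS_NORM", "MODEL", "CLASSIFICATION"),
}


def compose(task, dataset_type, max_algorithms):
    """Build processes backwards, from the task towards the dataset."""
    valid = []
    # Start from the algorithms that directly perform the requested task.
    queue = deque([name] for name, (_, _, t) in ALGORITHMS.items() if t == task)
    while queue:
        process = queue.popleft()
        needed = ALGORITHMS[process[0]][0]   # input required by the first step
        if needed == dataset_type:           # compatible with the given dataset
            valid.append(process)
            continue
        if len(process) >= max_algorithms:   # pruning: max number of algorithms
            continue
        for name, (_, out, _) in ALGORITHMS.items():
            if out == needed and name not in process:
                queue.append([name] + process)  # prepend a matching algorithm
    return valid


processes = compose("CLASSIFICATION", "RAW_LDS", max_algorithms=3)
```

Lowering `max_algorithms` to 2 prunes the three-step SVM process and leaves only the two-step ID3 one, which shows how the user constraints bound the search.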
Composition Procedure (3)
III. Process ranking
Several possible ranking functions:
• number of algorithms in the process
• process cost ($\sum_{i=1}^{n} C_i$)
• easiness-of-usage (function of the number of user-parameters)
• overall computational complexity (function of the max complexity among the algorithms in the process)
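Ranking by process cost amounts to summing the per-step costs and sorting. In this sketch the process names and cost values are made up, and reading each C_i as the cost of the i-th match in the process is an assumption.

```python
def process_cost(match_costs):
    """Process cost as the sum of its individual costs C_1..C_n."""
    return sum(match_costs)


# Hypothetical candidate processes with invented per-match costs.
candidates = {
    "P1": [0.5, 1.0],
    "P2": [0.25],
    "P3": [2.0, 0.5, 0.5],
}

# Lower total cost ranks first.
ranked = sorted(candidates, key=lambda name: process_cost(candidates[name]))
```

Swapping the `key` function for `len(candidates[name])` would give the ranking by number of algorithms instead.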
KDDVM Framework
• Basic services: BVQ_1.0, PCA_1.0, id3_1.2, ...
• Advanced services: WSMatch, Semantic Broker, ...
• Support services: ...
• Resources: KDDONTO, UDDI
• Clients: KDDComposer, KDDWebDesigner, BrokerClient, OntoViewer, BasicClient
KDDComposer
A prototype implementing the composition procedure
Example scenario:
• Task: CLASSIFICATION
• Dataset: LabeledDataset
• Dataset features: {float, normalized, missing_values, ...}
• Constraints: max 5 algorithms, ...
Results:
• a ranked list of many valid processes
• detailed information about each process, algorithm, match, connection
WSMatch
A WS implementing the matchmaking functionality
A WS Client queries WSMatch, which relies on KDDONTO:
• match (A, B)? returns the cost
• match (?, B) returns the match set={...}
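The two query modes can be mimicked locally with a lookup table. This is not the WSMatch web-service interface: the function signature, the use of `None` for the "?" placeholder and the cost table are all hypothetical stand-ins.

```python
# Hypothetical precomputed match costs: (A, B) -> cost of linking A to B.
MATCH_COSTS = {
    ("PCA", "SOM"): 0.5,
    ("Normalize", "SOM"): 0.0,
    ("PCA", "ID3"): 1.25,
}


def match(a, b):
    """match(A, B): cost of the match, or None if A and B are not compatible.
    match(None, B): every algorithm that can precede B, with its cost."""
    if a is not None:
        return MATCH_COSTS.get((a, b))
    return {src: cost for (src, dst), cost in MATCH_COSTS.items() if dst == b}


single = match("PCA", "SOM")
match_set = match(None, "SOM")
```

The second mode is the one a composition procedure would call repeatedly while expanding a process backwards.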
KDD WebDesigner
• search services by name/algorithm (call to SemanticBroker)
• check compatibility (WSMatch)
Conclusion
Open environments and heterogeneous tools
• different interfaces: need of a common representation (service)
• abstraction for a high-level description of tools (algorithm)
Algorithm matchmaking based on algorithms
• different similarity relations: subsumption, part_of
• verification of preconditions/postconditions
• reusable for several applications
Process composition procedure
• abstract processes are reusable:
  • steps to be performed with real tools
  • composition patterns for solving certain types of problems
• valid and useful knowledge, valuable for both novice and expert users
Future Work
Enhancement of KDDONTO's descriptive capabilities
• add information about statistical characteristics of data
• identify which algorithm is likely to perform best
Translation of abstract processes into concrete workflows
• for each algorithm, find corresponding services
• check for possible mismatches, evaluate syntactic compatibility, perform syntactic translations between different formats
Comprehensive tests
• evaluate effectiveness of composition procedure
• evaluate ranking functions