31
DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS University of Calabria ITALY [email protected] Grid-Based Data Mining and the KNOWLEDGE GRID Framework Minneapolis, September 18, 2003

DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

  • Upload
    ama

  • View
    63

  • Download
    0

Embed Size (px)

DESCRIPTION

Grid-Based Data Mining and the KNOWLEDGE GRID Framework. DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS University of Calabria ITALY [email protected]. Minneapolis, September 18, 2003. OUTLINE. Introduction - PowerPoint PPT Presentation

Citation preview

Page 1: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

DOMENICO TALIA(joint work with M. Cannataro, A. Congiusta, P.

Trunfio)

DEISUniversity of Calabria

ITALY [email protected]

Grid-Based Data Mining and the

KNOWLEDGE GRID Framework

Minneapolis, September 18, 2003

Page 2: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

2

OUTLINE

Introduction

Parallel and Distributed Data Mining on Grids

The KNOWLEDGE GRID

KNOWLEDGE GRID Architecture

KNOWLEDGE GRID Services

KNOWLEDGE GRID Tools

VEGA

Current Work

Conclusion

Page 3: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

3

Data mining is often a compute intensive task.

When large data sets are coupled with

geographic distribution of data, users, and systems,

it is necessary to combine different technologies for

implementing high-performance distributed knowledge

discovery systems (PDKD).

Distributed data mining tools are available but most of them do not

run on Grids.

PARALLEL & DISTRIBUTED DATA MINING

Page 4: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

4

“By providing scalable, secure, high-performance mechanisms for discovering and negotiating access to remote resources, the Grid promises to make it possible for scientific collaborations to share resources on an unprecedented scale, and for geographically distributed groups to work together in ways that were previously impossible”

Ian Foster

WHAT IS A GRIDS ?

Page 5: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

5

Grid middleware targets technical challenges in areas such as communication, scheduling, security, information and data access, and fault detection.

Efforts are needed for the development of knowledge discovery tools and services on the Grid.

PARALLEL & DISTRIBUTED DM ON GRIDS

Grid-aware PDKD systems

Page 6: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

6

PARALLEL & DISTRIBUTED DM ON GRIDS

The basic principles that motivate the architecture design of

the grid-aware PDKD systems

Data heterogeneity and large data size

Algorithm integration and independence

Grid awareness

Openness

Scalability

Security and data privacy.

Page 7: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

7

WHAT THE GRID OFFERS

Grid infrastructure tools, such as the Globus Toolkit and Legion, provide basic services that can be effectively used in the development of a data mining applications.

Data Grid middleware (e.g. Globus Data Grid) implements data management architectures based on two main services: storage system and metadata management.

Data Grids are useful, but are not sufficient for data mining.

Page 8: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

8

KNOWLEDGE GRID - a PDKD architecture that integrates data mining techniques and computational Grid resources.

In the KNOWLEDGE GRID architecture data mining tools are integrated with lower-level Grid mechanisms and services and exploit Data Grid services.

This approach benefits from "standard" Grid services and offers an open PDKD architecture that can be configured on top of generic Grid middleware.

THE KNOWLEDGE GRID

Page 9: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

9

KNOWLEDGE GRID ENVIRONMENT

A KNOWLEDGE GRID application uses:

A set of KNOWLEDGE GRID-enabled computers - K-GRID nodes

declaring their availability to participate to some PDKD computation, that are connected by

A Grid infrastructure

offering basic grid-services (authentication, data location, service level negotiation) and implementing the

KNOWLEDGE GRID services.

Page 10: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

10

KNOWLEDGE GRID ENVIRONMENT

LAN

Cluster containing data setsand/or DM algorithms

K-GRID node Generic Grid node

Basic Grid Infrastucture

K-GRID node

Local Resources

Cluster Element

Cluster Element

Cluster Element Grid Middleware

K-GRID tools

KNOWLEDGE GRID services

Local Resources

Grid Middleware

K-GRID toolsGrid Middleware

Page 11: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

11

KNOWLEDGE GRID SERVICES

The KNOWLEDGE GRID services are organized in two hierarchic layers :

• Core K-Grid layer and

• High-level K-Grid layer.

The former refers to services directly implemented on the

top of generic Grid services.

The latter is used to describe, develop, and execute PDKD

computations over the KNOWLEDGE GRID.

Page 12: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

12

KNOWLEDGE GRID ARCHITECTURE

Generic Grid Services

K N O W L E D G E G R I D

DASData AccessService

TAASTools and Algorithms

Access Service

EPMSExecution Plan

Management Service

RPSResult

Presentation Service

KDSKnowledge Directory

Service

RAEMSResource Alloc.Execution Mng.

KEPRKMR KBR

High level K-Grid layer

Core K-Grid layer

Resource MetadataExecution Plan MetadataModel Metadata

Page 13: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

13

KNOWLEDGE GRID SERVICES

Core K-Grid layer services:

• Knowledge directory service (KDS). Extends the basic

Globus MDS and GIS services to maintain a description of all

data and tools used in the KNOWLEDGE GRID.

• Resource allocation and execution management service

(RAEMS). RAEMS services are used to find a mapping

between an execution plan and available resources.

• The Core K-Grid layer manages metadata describing features of data sources, third party data mining tools, data management, and data visualization tools and algorithms.

Page 14: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

14

KNOWLEDGE GRID SERVICES

High-level K-grid layer services:

Data Access •Search, selection (Data search services), extraction, transformation

and delivery (Data extraction services) of data to be mined.

Tools and algorithms access •Search, selection, and downloading of data mining tools and

algorithms.

Execution Plan Management •Generation of a set of different execution plans that satisfy user, data,

and algorithms requirements and constraints.

Results presentation •Specifies how to generate, present and visualize the PDKD results

(rules, associations, models, classification, etc.).

Page 15: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

15

KNOWLEDGE GRID OBJECTS

We use the Globus MDS model only for generic Grid resources, but extended it with an XML metadata model to manage specific KNOWLEDGE GRID resources.

Metadata describing relevant K-Grid objects, such as data sources and data mining tools, are implemented using both LDAP and XML.

The (Knowledge Metadata Repository) KMR is implemented by LDAP entries and XML documents. The LDAP portion is used as a first point of access to more specific information represented by XML documents.

Page 16: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

16

APPLICATION COMPOSITION STEPS

Search and selection of resources

Search and selection of resources

DAS / TAAS

EPMS

KMRs

TMR

Design of the PDKD

computation

Design of the PDKD

computation

Metadata aboutK-grid resourcesMetadata aboutK-grid resources

Metadata about the selected

K-grid resources

Metadata about the selected

K-grid resources

Execution PlanExecution Plan

KEPR

Page 17: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

17

APPLICATION EXECUTION STEPS

RAEMS

GRAM

RPS

Execution Plan optimization and

translation

Execution Plan optimization and

translation

Execution of the PDKD

computation

Execution of the PDKD

computation

Results presentation

Results presentation

Execution PlanExecution Plan

RSL scriptRSL script

Computation results

Computation results

KEPR

KBR

Page 18: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

18

A prototype version f the KNOWLEDGE GRID architecture have been implemented using Java and the Globus Toolkit 2.x.

To allow a user to build a grid-based data mining application, we developed a toolset named VEGA (a Visual Environment for Grid Applications).

VEGA offers users support for :

task composition - definition of the entities involved in the computation and specification of relations among them;

checking of the consistency of the planned task;

generation of the execution plan for a data mining task.

execution of the execution plan through the resource allocation manager of the underlying grid.

A TOOL : VEGA

Page 19: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

19

Objects:

Links:

Hosts

Software

Data

Output

Input

Execute

File Transfer

Objects represent resources

Links represent relations among resources

VEGA : OBJECTS and LINKS

Page 20: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

20

Hosts pane

Resources pane

VEGA

Page 21: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

21

A KGrid application can be composed of several workspaces

VEGA

Page 22: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

22

... <Software> <name>AutoClass</name> <description>Unsupervised Bayesian Classifier </description> <release> <number major=“3” minor=“3” patch=“3”/> <date>01 May 00</date> </release> <author>Nasa Ames Research Center</author> <hostname>icarus.isi.cs.cnr.it</hostname> <executablePath>/share/software/autoclass-c/autoclass </executablePath> <manualPath>/share/software/autoclass-c/read-me.text </manualPath> ... </Software>

XML METADATA in a KMR

Page 23: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

23

<ExecutionPlan> ... <Task ep:label="ws1_dt2"> <DataTransfer> <Source ep:href="g1../Unidb.xml" ep:title="Unidb on g1.isi.cs.cnr.it"/> <Destination ep:href="k2../Unidb.xml“ ep:title="Unidb on k2.deis.unical.it"/> ... </DataTransfer> </Task> ... <Task ep:label="ws2_c2"> <Computation> <Program ep:href="k2../IMiner.xml" ep:title="IMiner on k2.deis.unical.it"/> <Input ep:href="k2../Unidb.xml" ep:title="Unidb on k2.deis.unical.it"/> ... <Output ep:href="k2../IMiner.out.xml" ep:title="IMiner.out on k2.deis.unical.it"/> </Computation> </Task> ... <TaskLink ep:from="ws1_dt2" ep:to="ws2_c2"/> ...</ExecutionPlan>

XML EXECUTION PLAN

Page 24: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

24

+...(&(resourceManagerContact=g1.isi.cs.cnr.it) (subjobStartType=strict-barrier) (label=ws1_dt2) (executable=$(GLOBUS_LOCATION)/bin/globus-url-copy) (arguments=-vb –notpt gsiftp://g1.isi.cs.cnr.it/.../Unidb gsiftp://k2.deis.unical.it/.../Unidb ))...(&(resourceManagerContact=k2.deis.unical.it) (subjobStartType=strict-barrier) (label=ws2_c2) (executable=.../IMiner) ... ))...

A GENERATED RSL SCRIPT

Page 25: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

25

APPLICATION EXECUTION

Page 26: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

26

Some things we have done recently

VEGA :

Support for more complex computation layouts,

Execution plan optimization,

Abstract resources definition and use.

KNOWLEDGE GRID :

A peer-to-peer system for presence management and resource discovery on the Grid,

A tool for optimized file transfer on the Grid based on GridFTP,

A data mining ontology and an associated tool.

ON GOING WORK : OTHER TOOLS

Page 27: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

27

ON GOING WORK

OGSA and KNOWLEDGE DISCOVERY SERVICES

The KNOWLEDGE GRID is an abstract service-based Grid architecture

that does not limit the user in developing and using service-based

knowledge discovery applications.

We are defining a set of Grid Services that export functionalities and

operations of the KNOWLEDGE GRID.

Each of the KNOWLEDGE GRID services is exposed as a persistent

service, using the OGSA conventions and mechanisms.

We intend to offer those OGSA-Compliant services for impementing

distributed Data Mining applications and Knowledge Discovery processes

on Grids.

Page 28: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

28

CONCLUSION

Parallel and distributed data mining suites and computational grid technology are two critical elements of future high-performance computing environments for

• e-science (data-intensive experiments)• e-business (on-line services)• virtual organizations support (virtual teams, virtual enterprises)

Knowledge Grids will enable entirely new classes of advanced

applications for dealing with the data deluge.

The Grid is not yet another distributed computing system: it is a medium to dynamically share heterogeneous resources, services, and knowledge.

Page 29: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

29

CONCLUSION

Grids are coupling computation-oriented services with data-oriented services and knowledge-based services.

This trend enlarges the Grid application scenario and offer new opportunities for high-level applications.

We are much more able to store data than to extract knowledge from it.

The KNOWLEDGE GRID is a framework for the

unification of knowledge discovery and grid technologies

helping us to climb some mountain of data.

Page 30: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

30

MAIN REFERENCES

M. Cannataro, D. Talia,  The Knowledge Grid, Communications of the ACM, 46(1), 2003.

M Cannataro, D. Talia, P. Trunfio, Distributed Data Mining on the Grid, Future Generation Computer Systems, 18(8), 2002.

D. Talia, The Open Grid Services Architecture-Where the Grid Meets the Web, IEEE Internet Computing, 6(6), 2002.

www.icar.cnr.it/kgrid

Page 31: DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS

31

THANKS