29
P. 1 Computer Science and Engineering DataGrid Lab A Middleware for Developing and Deploying Scalable Remote Mining Services A Middleware for Developing and Deploying Scalable Remote Mining Services Leonid Glimcher Gagan Agrawal

Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

Embed Size (px)

Citation preview

Page 1: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 1

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

A Middleware for Developing and Deploying Scalable Remote Mining Services

Leonid Glimcher Gagan Agrawal

Page 2: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 2

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Abundance of Data

Data generated everywhere:• Sensors (intrusion detection,

satellites)• Scientific simulations (fluid

dynamics, molecular dynamics)• Business transactions

(purchases, market trends)

Analysis needed to translate data into knowledge

Growing data size creates problems for datamining

Page 3: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 3

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Grid and Cloud Computing

Provides access to otherwise inaccessible:

• Resources

• Data

Goal of Grid Computing:

• to provide a stable foundation for distributed services

Price for such foundation:

• Standards needed for distributed service integration (OGSA & WSRF)

Cloud computing: access resources (data storage and processing) as services available for consumption

Page 4: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 4

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Remote Data Analysis

• Remote data analysis– Grid is a good fit

Details can be very tedious:• Data retrieval, movement and caching• Parallel data processing• Resource allocation• Application configuration

Middleware can be useful to abstract away details

Page 5: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 5

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Our Approach

Supporting development of scalable applications that process remote data using middleware – FREERIDE-G (Framework for Rapid Implementation of Datamining Engines in Grid)

Repository cluster

Compute cluster

Middleware user

Page 6: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 6

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Outline

• Motivation• Introduction• Middleware Overview• Challenges• Experimental Evaluation• Related work• Conclusion

Page 7: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 7

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

FREERIDE-G Grid Service

Grid computing standards and infrastructures:• OGSA vs. WSRF (merged Grid and Web Services)

Data hosts compliant with repository standards (SRB)

MPICH-G2 -- pre-WS mechanism to support MPI

Globus Toolkit and MPICH-G2 -- most fitting infrastructure for Grid Service conversion of FREERIDE-G

Page 8: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 8

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

SDSC Storage Resource Broker

Standard for remote data:• Storage• Access

Data can be distributed across organizations and heterogeneous storage systems

Used to host data for FREERIDE-G

Access provided through client API

Page 9: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 9

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

FREERIDE-G Processing Structure

KEY observation: most data mining algorithms follow canonical loop

Middleware API: • Subset of data to be

processed• Reduction object • Local and global reduction

operations • Iterator

Derived from precursor system FREERIDE

While( ) {

forall( data instances d) {

I = process(d)

R(I) = R(I) op d

}

…….

}

Page 10: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 10

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

FREERIDE-G Evolution

FREERIDE

data stored locally

FREERIDE-G• ADR responsible for remote data retrieval• SRB responsible for remote data retrieval

FREERIDE-G grid service

Grid service featuring• Load balancing• Data integration

Page 11: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 11

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

EvolutionFREERIDE FREERIDE-G-ADR

FREERIDE-G-SRB FREERIDE-G-GT

Application

Data

ADR

SRB

Globus

Page 12: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 12

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

FREERIDE-G System Architecture

Page 13: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 13

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Implementation Challenges I

• Interaction with Code Repository– Simplified Wrapper and Interface Generator– XML descriptors of API functions– Each API function wrapped in own class

C++

Java

C++

SWIGXML

Page 14: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 14

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Implementation Challenges

Integration with MPICH-G2 and Globus Toolkit

• Supports MPI• Deployed through pre-

WS Globus components• Hides potential

heterogeneity in:– service startup – management

Page 15: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 15

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

FREERIDE-G Applications

VortexPro:• Finds vortices in

volumetric fluid/gas flow datasets

Kmeans Clustering:• Clusters points based on

Euclidean distance in an attribute space

EM Clustering:• Another distance-based

clustering algorithm

Page 16: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 16

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Experimental setup

Organizational Grid:

• Data hosted on Opteron 250 cluster

• Processed on Opteron 254 cluster

• Connected using 2 10 GB optical fibers

Goals:

• Demonstrate parallel scalability of applications

• Evaluate overhead of using MPICH-G2 and Globus Toolkit deployment mechanisms

Repository cluster

Compute cluster

Page 17: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 17

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Scalability Evaluation

Scalable implementations with respect to numbers of:

• Compute nodes,• Data repository nodes.

Vortex Detection with 14.8 GB dataset.

Kmeans Clustering with 6.4 GB dataset (bottom)

0

100

200

300

400

Execution Time (sec)

1 2 4 8

Data Repository Nodes (#)

1 compute 2 compute4 compute 8 compute

Page 18: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 18

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Scalability Evaluation II

0

200

400

600

800

Execution Time (sec)

1 2 4 8

Data Repository Nodes (#)

1 compute 2 compute4 compute 8 compute

EM Clustering (25.6 GB)

0

1000

2000

3000

Execution Time (sec)

1 2 4 8

Data Repository Nodes (#)

1 compute 2 compute4 compute 8 compute

Kmeans Clustering (25.6 GB)

Sub-linear speedup for scaled up compute nodes explained by

sequential scheduling of concurrent data reads

Page 19: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 19

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Deployment Overhead Evaluation

Clearly a small overhead associated with using Globus and MPICH-G2 for middleware deployment.

Kmeans Clustering with 6.4 GB dataset: 18-20%.

0

100

200

300

Execution Time (sec)

4 data - 4compute

4 data - 8compute

8 data - 8compute

GT No GT

Page 20: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 20

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Deployment Overhead Evaluation II

Vortex Detection with 14.8 GB dataset: 17-20%.

Overhead: • scales with the total

execution time• doesn’t effect overall data

processing service scalability

0

20

40

60

80

100

Execution Time (sec)

4 data - 4compute

4 data - 8compute

8 data - 8compute

GT No GT

Page 21: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 21

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Outline

• Motivation• Introduction• Middleware Overview• Challenges• Experimental Evaluation• Related work• Conclusion

Page 22: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 22

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Grid Computing

THE GRID Standards:• Open Grid Service Architecture• Web Services Resource Framework

Implementations:• Globus toolkit, WSRF.NET, WSRF::Lite

Other Grid Systems:• Condor and Legion,• Workflow composition systems:

– GridLab, Nimrod/G (GridBus), Cactus, GRIST

Page 23: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 23

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Data Grids

Metadata cataloging:• Artemis projectRemote data retrieval:• SRB, SEMPLAR• DataCutter, STORMStream processing midleware:• GATES• dQUOBData integration:• Automatic wrapper generation

Page 24: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 24

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Remote Data in Grid

KnowledgeGrid tool-set:• Developing datamining workflows (composition)

GridMiner toolkit:• Composition of distributed, parallel workflows

DiscoveryNet layer:• Creation, deployment, management of services

DataMiningGrid framework:• Distributed knowledge discovery

Other projects partially overlap in goals with ours

Page 25: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 25

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Conclusions

FREERIDE-G – middleware for scalable processing of remote data -- now as a grid service

• Compliance with grid and repository standards• Support for high-end processing• Ease use of parallel configurations• Hide details of data movement and caching• Scalable performance with respect to data and

compute nodes• Low deployment overheads as compared to non-

grid version

Page 26: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 26

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

SRB Data Host

SRB Master:• Connection establishment• Authentication• Forks agent to service I/OSRB Agent:• Performs remote I/O• Services multiple client API

requestsMCAT:• Catalogs data associated with

datasets, users, resources• Services metadata queries

(through SRB agent)

Page 27: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 27

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Compute Node

More compute nodes than data hosts

Each node:

1. Registers IO (from index)

2. Connects to data host

While (chunks to process)

1. Dispatch IO request(s)

2. Poll pending IO

3. Process retrieved chunks

Page 28: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 28

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

FREERIDE-G in Action

SRB Agent

SRB Agent

SRB MasterMCAT

Data Host I/O Registration

Connection establishment

While (more chunks to process)

I/O request dispatched

Pending I/O polled

Retrieved data chunks

analyzed

Compute Node

Compute Node

Page 29: Computer Science and Engineering A Middleware for Developing and Deploying Scalable Remote Mining Services P. 1DataGrid Lab A Middleware for Developing

P. 29

Computer Science and Engineering

DataGrid LabA Middleware for Developing and Deploying Scalable

Remote Mining Services

Conversion Challenges

1. Interfacing C++ middleware with GT 4.0 (Java)• Swig to generate Java Native Interface (JNI)

wrapper

2. Integrating data mining service through WSDL interface• Encapsulate service as a resource

3. Middleware deployment using GT 4.0• Use Web Service – Grid Resource Allocation

Module for deployment