P. 1
Computer Science and Engineering
DataGrid Lab
A Middleware for Developing and Deploying Scalable Remote Mining Services
Leonid Glimcher, Gagan Agrawal
P. 2
Abundance of Data
Data generated everywhere:
• Sensors (intrusion detection, satellites)
• Scientific simulations (fluid dynamics, molecular dynamics)
• Business transactions (purchases, market trends)
Analysis needed to translate data into knowledge
Growing data sizes create problems for data mining
P. 3
Grid and Cloud Computing
Provides access to otherwise inaccessible:
• Resources
• Data
Goal of Grid Computing:
• to provide a stable foundation for distributed services
Price for such foundation:
• Standards needed for distributed service integration (OGSA & WSRF)
Cloud computing: access resources (data storage and processing) as services available for consumption
P. 4
Remote Data Analysis
Remote data analysis: the Grid is a good fit
Details can be very tedious:
• Data retrieval, movement and caching
• Parallel data processing
• Resource allocation
• Application configuration
Middleware can be useful to abstract away details
P. 5
Our Approach
Supporting development of scalable applications that process remote data using middleware – FREERIDE-G (Framework for Rapid Implementation of Datamining Engines in Grid)
[Figure: middleware user, repository cluster, and compute cluster]
P. 6
Outline
• Motivation
• Introduction
• Middleware Overview
• Challenges
• Experimental Evaluation
• Related work
• Conclusion
P. 7
FREERIDE-G Grid Service
Grid computing standards and infrastructures:
• OGSA vs. WSRF (merged Grid and Web Services)
Data hosts compliant with repository standards (SRB)
MPICH-G2 -- pre-WS mechanism to support MPI
Globus Toolkit and MPICH-G2 -- most fitting infrastructure for Grid Service conversion of FREERIDE-G
P. 8
SDSC Storage Resource Broker
Standard for remote data:
• Storage
• Access
Data can be distributed across organizations and heterogeneous storage systems
Used to host data for FREERIDE-G
Access provided through client API
P. 9
FREERIDE-G Processing Structure
KEY observation: most data mining algorithms follow canonical loop
Middleware API:
• Subset of data to be processed
• Reduction object
• Local and global reduction operations
• Iterator
Derived from precursor system FREERIDE
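While ( ) {
    forall (data instances d) {
        I = process(d)        // element(s) of the reduction object R that d updates
        R(I) = R(I) op d      // op is an associative and commutative operator
    }
    ...
}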
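The canonical loop above can be sketched in Python as a local reduction run on each node's share of the data, followed by a global reduction that merges the per-node reduction objects. This is a minimal illustration only, not the FREERIDE-G API: the names `local_reduction`, `global_reduction`, `process`, `op`, and `identity` are hypothetical, and the reduction object is modelled as a plain dictionary.

```python
from collections import defaultdict

def local_reduction(chunk, process, op, identity):
    """One pass of the canonical loop over one node's data instances."""
    R = defaultdict(lambda: identity)      # hypothetical reduction object
    for d in chunk:
        i, value = process(d)              # which element of R to update, and with what
        R[i] = op(R[i], value)             # R(I) = R(I) op d
    return dict(R)

def global_reduction(partials, op, identity):
    """Merge per-node reduction objects; valid because op is associative and commutative."""
    merged = defaultdict(lambda: identity)
    for R in partials:
        for i, value in R.items():
            merged[i] = op(merged[i], value)
    return dict(merged)

# Usage: count integers by parity across two chunks (two simulated nodes).
def add(a, b):
    return a + b

def parity(d):
    return (d % 2, 1)

chunks = [[1, 2, 3, 4], [5, 6, 7]]
partials = [local_reduction(c, parity, add, 0) for c in chunks]
counts = global_reduction(partials, add, 0)
```

Because the update order does not matter, each node can process its chunks independently and only the small reduction objects need to be communicated.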
P. 10
FREERIDE-G Evolution
FREERIDE
data stored locally
FREERIDE-G
• ADR responsible for remote data retrieval
• SRB responsible for remote data retrieval
FREERIDE-G grid service
Grid service featuring:
• Load balancing
• Data integration
P. 11
Evolution
Versions: FREERIDE, FREERIDE-G-ADR, FREERIDE-G-SRB, FREERIDE-G-GT
[Figure labels: Application, Data, ADR, SRB, Globus]
P. 12
FREERIDE-G System Architecture
P. 13
Implementation Challenges I
• Interaction with Code Repository
  – Simplified Wrapper and Interface Generator (SWIG)
  – XML descriptors of API functions
  – Each API function wrapped in own class
[Figure labels: C++, Java, C++, SWIG, XML]
P. 14
Implementation Challenges
Integration with MPICH-G2 and Globus Toolkit
• Supports MPI
• Deployed through pre-WS Globus components
• Hides potential heterogeneity in:
  – service startup
  – management
P. 15
FREERIDE-G Applications
VortexPro:
• Finds vortices in volumetric fluid/gas flow datasets
Kmeans Clustering:
• Clusters points based on Euclidean distance in an attribute space
EM Clustering:
• Another distance-based clustering algorithm
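As an illustration of how Kmeans clustering fits the reduction pattern, one iteration can be sketched as below. This is a hedged sketch, not the authors' implementation: the function names are hypothetical, the reduction object is assumed to hold per-cluster coordinate sums and point counts, and each chunk stands in for the data processed by one compute node.

```python
import math

def nearest(point, centers):
    """Index of the center closest to point in Euclidean distance."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))

def local_reduce(points, centers):
    """Accumulate per-cluster coordinate sums and counts for one chunk."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        k = nearest(p, centers)
        counts[k] += 1
        for j, x in enumerate(p):
            sums[k][j] += x
    return sums, counts

def global_reduce(partials):
    """Merge the (sums, counts) pairs produced by all chunks."""
    sums = [list(row) for row in partials[0][0]]
    counts = list(partials[0][1])
    for s, c in partials[1:]:
        for k in range(len(counts)):
            counts[k] += c[k]
            for j in range(len(sums[k])):
                sums[k][j] += s[k][j]
    return sums, counts

def kmeans_step(chunks, centers):
    """One Kmeans iteration: local reductions, global merge, new centers."""
    sums, counts = global_reduce([local_reduce(c, centers) for c in chunks])
    return [
        [s / n for s in row] if n else list(old)   # keep an empty cluster's old center
        for row, n, old in zip(sums, counts, centers)
    ]

# Usage: two chunks, two initial centers.
chunks = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
new_centers = kmeans_step(chunks, [(0.0, 0.0), (10.0, 10.0)])
```

Only the sums and counts cross node boundaries, so the communicated state stays small regardless of the dataset size.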
P. 16
Experimental setup
Organizational Grid:
• Data hosted on Opteron 250 cluster
• Processed on Opteron 254 cluster
• Connected using two 10 GB optical fibers
Goals:
• Demonstrate parallel scalability of applications
• Evaluate overhead of using MPICH-G2 and Globus Toolkit deployment mechanisms
[Figure: repository cluster and compute cluster]
P. 17
Scalability Evaluation
Scalable implementations with respect to numbers of:
• Compute nodes
• Data repository nodes
Vortex Detection with 14.8 GB dataset.
Kmeans Clustering with 6.4 GB dataset (bottom)
[Figure: execution time (sec) vs. number of data repository nodes (1, 2, 4, 8), with one series each for 1, 2, 4, and 8 compute nodes]
P. 18
Scalability Evaluation II
[Figure: EM Clustering (25.6 GB). Execution time (sec) vs. number of data repository nodes (1, 2, 4, 8), with one series each for 1, 2, 4, and 8 compute nodes]
[Figure: Kmeans Clustering (25.6 GB). Execution time (sec) vs. number of data repository nodes (1, 2, 4, 8), with one series each for 1, 2, 4, and 8 compute nodes]
Sub-linear speedup when scaling up compute nodes is explained by sequential scheduling of concurrent data reads
P. 19
Deployment Overhead Evaluation
Using Globus and MPICH-G2 for middleware deployment adds a small overhead.
Kmeans Clustering with 6.4 GB dataset: 18-20%.
[Figure: execution time (sec) with GT and without GT for three configurations: 4 data / 4 compute, 4 data / 8 compute, and 8 data / 8 compute nodes]
P. 20
Deployment Overhead Evaluation II
Vortex Detection with 14.8 GB dataset: 17-20%.
Overhead:
• scales with the total execution time
• doesn't affect overall data processing service scalability
[Figure: execution time (sec) with GT and without GT for three configurations: 4 data / 4 compute, 4 data / 8 compute, and 8 data / 8 compute nodes]
P. 21
Outline
• Motivation
• Introduction
• Middleware Overview
• Challenges
• Experimental Evaluation
• Related work
• Conclusion
P. 22
Grid Computing
Grid standards:
• Open Grid Services Architecture (OGSA)
• Web Services Resource Framework (WSRF)
Implementations:
• Globus Toolkit, WSRF.NET, WSRF::Lite
Other Grid Systems:
• Condor and Legion
• Workflow composition systems:
  – GridLab, Nimrod/G (GridBus), Cactus, GRIST
P. 23
Data Grids
Metadata cataloging:
• Artemis project
Remote data retrieval:
• SRB, SEMPLAR
• DataCutter, STORM
Stream processing middleware:
• GATES
• dQUOB
Data integration:
• Automatic wrapper generation
P. 24
Remote Data in Grid
KnowledgeGrid tool-set:
• Developing datamining workflows (composition)
GridMiner toolkit:
• Composition of distributed, parallel workflows
DiscoveryNet layer:
• Creation, deployment, management of services
DataMiningGrid framework:
• Distributed knowledge discovery
Other projects partially overlap in goals with ours
P. 25
Conclusions
FREERIDE-G – middleware for scalable processing of remote data -- now as a grid service
• Compliance with grid and repository standards
• Support for high-end processing
• Ease use of parallel configurations
• Hide details of data movement and caching
• Scalable performance with respect to data and compute nodes
• Low deployment overheads as compared to non-grid version
P. 26
SRB Data Host
SRB Master:
• Connection establishment
• Authentication
• Forks agent to service I/O
SRB Agent:
• Performs remote I/O
• Services multiple client API requests
MCAT:
• Catalogs data associated with datasets, users, resources
• Services metadata queries (through SRB agent)
P. 27
Compute Node
More compute nodes than data hosts
Each node:
1. Registers IO (from index)
2. Connects to data host
While (chunks to process)
1. Dispatch IO request(s)
2. Poll pending IO
3. Process retrieved chunks
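The dispatch / poll / process loop listed above could look roughly like the following Python sketch. It is an assumption-laden illustration: `fetch_chunk` stands in for a remote read from the data host (the real system uses the SRB client API, which is not modelled here), `process_chunk` stands in for the application's analysis, and a thread pool plays the role of asynchronous I/O. All names, including `max_pending`, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_compute_node(chunk_ids, fetch_chunk, process_chunk, max_pending=2):
    """Overlap retrieval of remote chunks with processing of retrieved ones."""
    todo = list(chunk_ids)        # I/O registered from the index
    pending = set()               # outstanding I/O requests
    results = []
    with ThreadPoolExecutor(max_workers=max_pending) as pool:
        while todo or pending:
            # 1. Dispatch I/O request(s), up to the allowed number in flight.
            while todo and len(pending) < max_pending:
                pending.add(pool.submit(fetch_chunk, todo.pop(0)))
            # 2. Poll pending I/O until at least one request completes.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            # 3. Process retrieved chunks.
            for future in done:
                results.append(process_chunk(future.result()))
    return results

# Usage with stand-in functions: each "chunk" is three consecutive integers.
def fake_fetch(chunk_id):
    return list(range(chunk_id * 3, chunk_id * 3 + 3))

totals = run_compute_node([0, 1, 2], fake_fetch, sum)
```

Chunks may complete out of order, so `totals` should be treated as unordered; a reduction with an associative and commutative operator is insensitive to that ordering.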
P. 28
FREERIDE-G in Action
[Figure: data host (SRB Master, MCAT, SRB Agents) interacting with compute nodes]
I/O Registration
Connection establishment
While (more chunks to process)
    I/O request dispatched
    Pending I/O polled
    Retrieved data chunks analyzed
P. 29
Conversion Challenges
1. Interfacing C++ middleware with GT 4.0 (Java)
• SWIG to generate Java Native Interface (JNI) wrapper
2. Integrating data mining service through WSDL interface
• Encapsulate service as a resource
3. Middleware deployment using GT 4.0
• Use WS-GRAM (Web Services Grid Resource Allocation and Management) for deployment