CSS497 Undergraduate Research

Performance Comparison Among Agent Teamwork, Globus and Condor

By Timothy Chuang

Advisor: Professor Munehiro Fukuda

Overview

Agent Teamwork: deployment of mobile agents
  Agents launch, monitor and resume jobs
  Fault-tolerant

Condor: opportunistic job dispatcher
  Condor daemon searches for idle computing nodes on which to dispatch jobs
  Emphasizes job migration upon encountering an error

Globus: widely used grid computing middleware
  MPICH is required for parallel applications

[Figure: Condor architecture. Labels: User, Condor Pool X, Gateway, Class Manager, Snapshot]

[Figure: Globus architecture. Labels: User, DUROC/MPICH-G2, GRAMs, LFS, PBS]

[Figure: Agent Teamwork architecture. Labels: FTP Server, User A, User B, Commander Agent, Sentinel Agent, Resource Agent, Bookkeeper Agent, User program wrapper, Snapshot Methods, GridTCP, User A’s Process, User B’s Process, TCP Communication, snapshots, Results]

Project Objectives

Establish reference platform
  Condor installation
  PVM installation
Implement parallel applications to run on PVM
  Matrix Multiplication
  Wave2D Simulation
  Mandelbrot Set Simulation
  Distributed Grep
Modify the same parallel applications to use Agent Teamwork’s checkpointing feature
Check previous Globus status
Convert the same parallel applications to MPICH-G2
Conduct performance evaluation
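As an illustration of how an application such as Matrix Multiplication can be divided among computing nodes, the following is a minimal sketch of a master-worker row partition. It is not the project's PVM or Agent Teamwork code: the function names (split_rows, multiply_block, master_multiply) are hypothetical, and the workers are simulated by sequential calls rather than remote processes.

```python
def split_rows(n_rows, n_workers):
    """Return one (start, end) row range per worker, split as evenly as possible."""
    base, extra = divmod(n_rows, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        # The first `extra` workers take one additional row each.
        end = start + base + (1 if w < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def multiply_block(a_rows, b):
    """Multiply a block of rows of A by the full matrix B (one worker's share)."""
    cols = list(zip(*b))  # columns of B
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a_rows]


def master_multiply(a, b, n_workers):
    """Master: hand out row blocks, then concatenate partial results in order."""
    result = []
    for start, end in split_rows(len(a), n_workers):
        # Sequential stand-in for sending the block to a remote worker.
        result.extend(multiply_block(a[start:end], b))
    return result
```

In this sketch the master holds both full matrices and the complete result, which is the reason the problem size of such a decomposition is bounded by the master node's memory.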

Problems with Condor/PVM

Condor no longer fully supports PVM
  The PVM universe, in which jobs would be dispatched, is no longer functional
As a result, Condor was dropped from the project

Evaluation of Agent Teamwork’s Fault-tolerance Performance

Applications used
  Matrix Multiplication
  Mandelbrot Set Renderer
  Wave2D Simulation
  Distributed Grep
Fault-tolerance performance
  Evaluate the extra overhead of checkpointing and resumption
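One simple way to express the extra overhead is as a percentage of the run time without checkpointing. The sketch below assumes that definition; the helper names (time_run, overhead_percent) are hypothetical and are not taken from the project's evaluation scripts.

```python
import time


def time_run(job):
    """Run a zero-argument callable and return its wall-clock time in seconds."""
    start = time.perf_counter()
    job()
    return time.perf_counter() - start


def overhead_percent(baseline_seconds, checkpointed_seconds):
    """Extra time of the checkpointed run, relative to the baseline run."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (checkpointed_seconds - baseline_seconds) / baseline_seconds * 100.0
```

The same function applies to resumption: pass the total time of a run that was interrupted and resumed as checkpointed_seconds.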

Challenges

Finding a large problem set that scales well with an increasing number of computing nodes
  Certain problem sizes are limited by the master node’s memory (Matrix Multiplication)
Debugging parallel applications
  Requires time-consuming diagnosis
Finding the best checkpointing frequency for all applications
  With the frequency set too low, a job could take up to three hours to finish!
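The checkpointing trade-off can be sketched with a toy iterative job that saves a snapshot every fixed number of steps. This is an assumption-laden illustration, not Agent Teamwork's API: run_with_checkpoints, resume, the interval parameter and the use of pickle are all hypothetical stand-ins. A smaller interval produces more snapshots (more overhead) but loses less work after a crash.

```python
import pickle


def run_with_checkpoints(total_steps, interval, state=None, stop_after=None):
    """Sum the integers 0..total_steps-1, snapshotting every `interval` steps.

    Returns (state, snapshots). `stop_after` simulates a crash after that
    many steps have been executed in this call.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    state = dict(state) if state else {"step": 0, "acc": 0}
    snapshots = []
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            break  # simulated failure
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % interval == 0:
            # Serialize the whole state, as a checkpoint would.
            snapshots.append(pickle.dumps(state))
    return state, snapshots


def resume(snapshot, total_steps, interval):
    """Continue the job from a previously saved snapshot."""
    return run_with_checkpoints(total_steps, interval,
                                state=pickle.loads(snapshot))
```

After a simulated crash, calling resume on the last snapshot repeats only the steps executed since that snapshot, while the number of snapshots taken grows as the interval shrinks.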

Performance - MatrixMult

Performance – Wave2D

Performance – Mandelbrot

Performance – Distributed Grep

Continued Work

Scale problem size to utilize all 64 computing nodes
Conduct performance evaluation on multi-clusters
Conduct performance evaluation on Globus
  Compare Globus’ performance with Agent Teamwork’s

Useful Classes

CSS301: Technical Writing
CSS343: Data Structures and Algorithms
CSS430: Operating Systems
CSS432: Network Design
CSS434: Parallel and Distributed Computing

Acknowledgements

My Faculty Advisor:
  Professor Munehiro Fukuda
UWB Linux System Administrators:
  Mr. David Grimmer
  Mrs. Meryll Larkin
My Sponsor:
  Mr. Joshua Phillips

Questions?
