*My job was to compare Agent Teamwork with two contemporary grid-computing middleware alternatives, Condor and Globus.
*A Condor pool is a collection of computers whose resource advertisements are managed by the pool's central manager. Condor pools exchange their resource information through their gateway nodes. Thus, a user can deploy a job to available computing nodes allocated by his/her local central manager.
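For context, a user hands a job to the pool through a submit description file, which the matchmaking machinery uses to find suitable nodes. The sketch below shows roughly what such a file looks like; the executable name and file names are illustrative, not from the actual setup:

```
# Hypothetical Condor submit description file (condor_submit matmult.sub)
universe   = vanilla
executable = matmult
arguments  = 1024
output     = matmult.out
error      = matmult.err
log        = matmult.log
queue
```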
*Here is an illustration of DUROC-based hierarchical multi-cluster job execution.
Given an RSL-described job, DUROC deploys the job to available clusters, each generally controlled by its local commodity scheduler. This is the typical job-execution infrastructure for MPICH-G2, which supports MPI connections across clusters.
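An RSL multi-request of roughly the following shape is what DUROC co-allocates: each parenthesized sub-request is a subjob sent to one cluster's local scheduler. The host names, counts, and executable path below are hypothetical:

```
+
( &(resourceManagerContact="cluster-a.example.edu")
   (count=8)
   (jobtype=mpi)
   (label="subjob 0")
   (executable="/home/user/wave2d")
)
( &(resourceManagerContact="cluster-b.example.edu")
   (count=8)
   (jobtype=mpi)
   (label="subjob 1")
   (executable="/home/user/wave2d")
)
```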
However, when we encounter a node failure, or even a cluster failure, things become difficult. DUROC requires user interaction: since no snapshots are supported at the system level, users are responsible for resuming a job from where it crashed, including remapping MPI ranks.
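Because recovery is left to the user, application code typically has to checkpoint its own state periodically and reload it on restart. Here is a minimal sketch of that pattern in Python; the file name, interval, and the trivial "simulation step" are all illustrative, not taken from the actual Wave2D code:

```python
import os
import pickle

CKPT_FILE = "wave2d.ckpt"   # hypothetical checkpoint file name
TOTAL_ITERS = 5000
CKPT_INTERVAL = 1000        # checkpoint once every 1000 iterations

def run():
    start, state = 0, 0.0
    # On restart, resume from the last checkpoint if one exists.
    if os.path.exists(CKPT_FILE):
        with open(CKPT_FILE, "rb") as f:
            start, state = pickle.load(f)
    for i in range(start, TOTAL_ITERS):
        state += 1.0        # stand-in for one simulation time step
        if (i + 1) % CKPT_INTERVAL == 0:
            # Persist the next iteration index and the current state.
            with open(CKPT_FILE, "wb") as f:
                pickle.dump((i + 1, state), f)
    return state
```

After a crash, rerunning the program picks up from the last completed checkpoint instead of iteration zero; at worst, the work since the previous checkpoint is repeated.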
**My first task was to establish the Condor reference platform. I installed and tested Condor and PVM, which is required for parallel applications.
I also implemented several parallel applications using PVM for testing purposes.
Matrix Mult: simple matrix multiplication
Wave2D: a two-dimensional wave simulation
Mandelbrot set
Distributed grep: a text search program
*I converted the same programs to MPICH for Globus, and modified them to take advantage of Agent Teamwork's own features.
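To give a flavor of these benchmarks, here is a minimal sketch of the Mandelbrot-set kernel together with the kind of master-slave row partition such programs use. The function names and the partitioning scheme are illustrative, not from the original PVM code:

```python
def mandelbrot_iters(c: complex, max_iters: int = 255) -> int:
    """Return the escape-time iteration count for point c."""
    z = 0j
    for n in range(max_iters):
        z = z * z + c
        if abs(z) > 2.0:
            return n
    return max_iters  # treated as inside the set

def row_band(rank: int, nslaves: int, height: int):
    """Assign each slave a contiguous band of pixel rows."""
    rows_per = height // nslaves
    lo = rank * rows_per
    hi = height if rank == nslaves - 1 else lo + rows_per
    return lo, hi
```

In the actual parallel versions, the master hands each slave its row band, the slave runs the per-pixel kernel over its band, and the results are gathered back at the master.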
Since the lab machines were upgraded to a new Red Hat release, I had to make sure that Globus, which had been installed by a former student, would still work.
And finally, I had to conduct the performance evaluation.
I also had to check the status of the existing Globus installation, which was not in a very organized state. The previous student who did the installation left very little documentation on how things were organized and how jobs could be run, so I had to go through the installation process again to make sure it worked.
*About halfway into the project, I learned from the Condor team that PVM is no longer fully supported by Condor, so we had to drop it from the project.
*So, I switched my focus to evaluating the performance of Agent Teamwork's fault tolerance instead. I modified the same applications to take advantage of Agent Teamwork's checkpointing features and performed some basic tests with these programs.
*Challenges I encountered during testing:
It is difficult to find a large problem that scales with an increasing number of computing nodes: most master-slave models reach a saturation point where communication overhead becomes the dominant factor in performance. The problem size is often limited by the master node's memory. I also needed to find a time when not many people were using the computers to perform the tests. I settled on checkpointing once every 1,000 iterations.
*This is a pretty interesting behavior: as the number of computing nodes increases, the performance degrades. This is because only one bookkeeper agent was used in this scenario, and it seems to have become the bottleneck. Regardless of this behavior, the results still showed that the overhead caused by frequent checkpointing in Agent Teamwork is within a tolerable margin.
*Wave2D is a simulation program that requires each rank to exchange boundary data with its neighbors. As you can see, it performs much slower than the other programs due to the extra communication overhead.
*The Mandelbrot set data I used scaled pretty well up to 32 computing nodes. Again, it shows the solid performance of Agent Teamwork.
**Although I am presenting now, I plan to continue this performance evaluation by expanding the computation to 64 computing nodes, conducting a performance evaluation on Globus, and comparing it with Agent Teamwork.***