CSS497 Undergraduate Research


Transcript

  • CSS497 Undergraduate Research
    Performance Comparison Among Agent Teamwork, Globus and Condor
    By Timothy Chuang
    Advisor: Professor Munehiro Fukuda

  • Overview
    Agent Teamwork: deployment of mobile agents
    - Agents launch, monitor, and resume jobs
    - Fault-tolerant
    Condor: opportunistic job dispatcher
    - The Condor daemon searches for idle computing nodes on which to dispatch jobs
    - Emphasizes job migration upon encountering an error
    Globus: widely used grid-computing middleware
    - MPICH is required for parallel applications

  • Condor
    (Diagram: a Condor pool, "Condor Pool X")

  • Globus
    (Diagram: DUROC/MPICH-G2 on top of GRAMs, which submit to local schedulers such as LSF and PBS)

  • Agent Teamwork
    (Diagram: user program wrapper with snapshot methods and GridTCP; user A's and user B's processes communicating over TCP; commander, sentinel, resource, and bookkeeper agents)

  • Project Objectives
    Establish a reference platform
    - Condor installation
    - PVM installation
    Implement parallel applications to run on PVM (see the PVM sketch after these objectives)
    - Matrix Multiplication
    - Wave2D Simulation
    - Mandelbrot Set Simulation
    - Distributed Grep

  • Modify the same parallel applications to utilize Agent Teamwork's checkpointing feature
    Check the status of the previous Globus installation
    Convert the same parallel applications to MPICH-G2
    Conduct performance evaluation
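
    The PVM versions of these applications follow a master/worker structure. Below is a minimal, hedged sketch of that pattern for the matrix-multiplication case; the worker task name mm_worker, the message tags, and the row-block protocol are illustrative assumptions, not the project's actual code.

    /* Minimal PVM master sketch: spawn workers, send each a block of rows of A
     * plus all of B, and collect the partial results.  Illustrative only. */
    #include <stdio.h>
    #include <pvm3.h>

    #define NWORKERS 4
    #define N 512                          /* matrix dimension (illustrative) */

    int main(void)
    {
        int tids[NWORKERS];
        static double a[N][N], b[N][N], c[N][N];

        /* ... initialize a and b here ... */

        int spawned = pvm_spawn("mm_worker", NULL, PvmTaskDefault, "",
                                NWORKERS, tids);
        if (spawned < NWORKERS) {
            fprintf(stderr, "only spawned %d workers\n", spawned);
            pvm_exit();
            return 1;
        }

        int rows = N / NWORKERS;           /* assume N divides evenly */
        for (int w = 0; w < NWORKERS; w++) {
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&rows, 1, 1);
            pvm_pkdouble(&a[w * rows][0], rows * N, 1);  /* this worker's rows of A */
            pvm_pkdouble(&b[0][0], N * N, 1);            /* all of B */
            pvm_send(tids[w], 1);          /* msgtag 1: work */
        }

        for (int w = 0; w < NWORKERS; w++) {
            pvm_recv(tids[w], 2);          /* msgtag 2: result rows of C */
            pvm_upkdouble(&c[w * rows][0], rows * N, 1);
        }

        pvm_exit();
        return 0;
    }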

  • Problems with Condor/PVM
    Condor no longer fully supports PVM
    The PVM universe used to dispatch jobs is no longer functional
    As a result, Condor was dropped from the project

  • Evaluation of Agent Teamwork's Fault-Tolerance Performance
    Applications used
    - Matrix Multiplication
    - Mandelbrot Set Renderer
    - Wave2D Simulation
    - Distributed Grep
    Fault-tolerance performance
    - Evaluate the extra overhead of checkpointing and resumption

  • Challenges
    Finding a large problem set that scales well with an increasing number of computing nodes
    - Certain problem sizes are limited by the master node's memory (Matrix Multiplication)
    Debugging parallel applications
    - Requires time-consuming diagnosis
    Finding the best checkpointing frequency for all applications
    - Setting the checkpoint interval too low (checkpointing too often) could make a job take up to three hours to finish!

  • Performance - MatrixMult

  • Performance - Wave2D

  • Performance - Mandelbrot

  • Performance - Distributed Grep

  • Continued Work
    Scale the problem size to utilize all 64 computing nodes
    Conduct performance evaluation on multi-clusters
    Conduct performance evaluation on Globus
    Compare Globus performance with Agent Teamwork

  • Useful Classes
    CSS301 Technical Writing
    CSS343 Data Structures and Algorithms
    CSS430 Operating Systems
    CSS432 Network Design
    CSS434 Parallel and Distributed Computing

  • Acknowledgements
    My faculty advisor: Professor Munehiro Fukuda
    UWB Linux system administrators: Mr. David Grimmer and Mrs. Meryll Larkin
    My sponsor: Mr. Joshua Phillips

  • Questions?

    *My job was to compare Agent Teamwork with two contemporary grid-computing middleware alternatives, Condor and Globus.

    *A Condor pool is a collection of computers whose resource advertisements are managed by the pool's central manager. Condor pools exchange their resource information through their gateway nodes. Thus, a user can deploy a job to available computing nodes allocated by his/her local central manager.

    *Here is an illustration of DUROC-based hierarchical multi-cluster job execution.

    Given an RSL-described job, DUROC deploys the job to available clusters, each generally controlled by its local commodity scheduler. This is a typical job-execution infrastructure with MPICH-G2, which supports MPI connections across clusters.

    However, when we encounter a node failure or even a cluster failure, things become difficult. DUROC requires user interaction: since no snapshots are supported at the system level, users are responsible for resuming a job from where it crashed, including remapping MPI ranks.

    **My first task was to establish the Condor reference platform. I installed and tested Condor as well as PVM, which is required for parallel applications.

    I also implemented several parallel applications using PVM for testing purposes.

    Matrix Mult: simple matrix multiplication. Wave2D: a simulation of waves. Mandelbrot set renderer. Distributed grep: a text search program. *I converted the same programs to MPICH for Globus and modified them to take advantage of Agent Teamwork's own features; a minimal MPI sketch of the conversion follows.
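
    For the MPICH(-G2) conversion, the same master/worker decomposition maps onto standard MPI calls, since MPICH-G2 programs are written against the ordinary MPI API. The following is a hedged sketch of what that conversion can look like for the matrix-multiplication kernel; the data layout and the assumption that the matrix dimension divides evenly among ranks are illustrative, not the project's actual code.

    /* Minimal MPI sketch of the same master/worker decomposition used when
     * converting the PVM programs to MPICH(-G2).  Illustrative only. */
    #include <stdio.h>
    #include <mpi.h>

    #define N 512                          /* matrix dimension (illustrative) */

    static double a[N][N], b[N][N], c[N][N];
    static double ablock[N][N], cblock[N][N];   /* oversized for simplicity */

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int rows = N / size;               /* assume N divides evenly */

        if (rank == 0) {
            /* ... initialize a and b here ... */
        }

        /* Every rank needs all of B and its own block of rows of A. */
        MPI_Bcast(&b[0][0], N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Scatter(&a[0][0], rows * N, MPI_DOUBLE,
                    &ablock[0][0], rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Local multiplication of this rank's row block. */
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < N; j++) {
                cblock[i][j] = 0.0;
                for (int k = 0; k < N; k++)
                    cblock[i][j] += ablock[i][k] * b[k][j];
            }

        MPI_Gather(&cblock[0][0], rows * N, MPI_DOUBLE,
                   &c[0][0], rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("c[0][0] = %f\n", c[0][0]);

        MPI_Finalize();
        return 0;
    }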

    Since the lab machines were upgraded to a new version of Red Hat, I had to make sure that Globus, which was installed by a former student, would still work.

    And finally, conduct the performance evaluation.

    I also had to check the status of an existing Globus installation, which was not in a very organized state. The previous student who did the installation left very little documentation on how things are organized and how jobs can be run, so I had to go through the installation process again to make sure it worked.

    *About halfway into the project, I learned from the Condor team that PVM is no longer fully supported by Condor, so we had to drop it from the project.

    *So, I switched my focus to evaluating the performance of Agent Teamwork's fault tolerance instead. I modified the same applications to take advantage of Agent Teamwork's checkpointing features and performed some basic tests with these programs.

    *Challenges I encountered during testing:

    It is difficult to find a large problem that can scale with an increasing number of computing nodes. Most master-slave models reach a saturation point where communication overhead becomes the dominant factor in performance. The problem size is often limited by the master node's memory. I also needed to find a time when not many people were using the computers to run the tests. I settled on checkpointing once every 1000 iterations (see the sketch after these notes).

    *This is a pretty interesting behavior: as the number of computing nodes increases, the performance degrades. This is because only one bookkeeper agent was used in this scenario, and it seems to have become the bottleneck. Regardless of this behavior, it still showed that the overhead caused by frequent checkpointing in Agent Teamwork is within a tolerable margin.

    *Wave2D is a simulation program that requires each rank to exchange boundary data with its neighbors. As you can see, it performs much slower than the other programs due to the extra communication overhead.

    *The Mandelbrot set data I used scaled pretty well up to 32 computing nodes. Again, it shows the solid performance of Agent Teamwork.

    **Although I am presenting now, I plan to continue this performance evaluation: scale the computation up to 64 computing nodes, conduct the performance evaluation on Globus, and compare it with Agent Teamwork.
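
    For reference, the checkpoint-interval trade-off discussed above (settling on one checkpoint every 1000 iterations) boils down to the loop structure below. This is only a generic illustration in C; checkpoint() is a hypothetical stand-in, not Agent Teamwork's actual snapshot API, in which the user-program wrapper takes the snapshots.

    /* Generic checkpoint-every-N-iterations pattern (illustration only). */
    #include <stdio.h>

    #define TOTAL_ITERATIONS    100000
    #define CHECKPOINT_INTERVAL 1000   /* the interval the talk settled on */

    /* Hypothetical helper: persist whatever state is needed to resume,
     * e.g., the simulation arrays and the iteration counter. */
    static void checkpoint(int iteration)
    {
        printf("checkpoint at iteration %d\n", iteration);
    }

    int main(void)
    {
        for (int t = 1; t <= TOTAL_ITERATIONS; t++) {
            /* ... one computation step (e.g., one Wave2D time step) ... */

            /* A small interval means frequent snapshots and more overhead;
             * a large interval means more lost work to redo after a crash. */
            if (t % CHECKPOINT_INTERVAL == 0)
                checkpoint(t);
        }
        return 0;
    }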