Becky Gietzel, Computer Sciences Department, University of Wisconsin-Madison

Using the Parallel Universe beyond MPI


www.cs.wisc.edu/~bgietzel

Parallel Universe applications using Metronome

Metronome’s support for running parallel jobs builds on Condor’s Parallel Universe

Possible to run coordinated Metronome jobs on multiple machines at the same time with available communication between them

Provides advanced testing opportunities

Some examples: client/server, cross-platform, compatibility, stress/scalability


Service testing challenges

Starting multiple services on the same machine does not allow for testing across a network or different platforms

Deciding when to start the services and when to start tests requires human intervention

Setup of the services is usually a manual process; otherwise the services simply go untested.

The same goes for the teardown of services to return the machines to their original state


Benefits of using Metronome

Condor manages dynamic claiming of resources, communication between job nodes and cleaning up after the jobs run

Metronome publishes basic information about each task to the job ad where it’s accessible by any node, acting as a “scratch space” for the job

The hostnames of all job nodes, the start time, return code, and end time for each task on each node are published to this shared job ad

This information is useful for communication between nodes and synchronization in the user’s glue scripts.
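As an illustration of how a glue script might synchronize on the shared job ad, here is a minimal Python sketch of a polling helper. The slides do not show how the ad is read, so the lookup is passed in as a function; the attribute name `server_hostname` and the in-memory `fake_job_ad` are hypothetical stand-ins. On a real Condor execute node the lookup could perhaps wrap `condor_chirp get_job_attr`, but that is an assumption, not something the slides confirm.

```python
import time
from typing import Callable, Optional


def wait_for_attr(
    fetch: Callable[[str], Optional[str]],
    name: str,
    timeout: float = 60.0,
    interval: float = 1.0,
) -> str:
    """Poll fetch(name) until it returns a value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        value = fetch(name)
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{name} was not published within {timeout}s")
        time.sleep(interval)


# Hypothetical stand-in for the shared job ad: a plain dictionary.
fake_job_ad = {}


def fetch_from_fake_ad(name: str) -> Optional[str]:
    """Look up an attribute in the stand-in job ad; None if not yet published."""
    return fake_job_ad.get(name)


if __name__ == "__main__":
    # Another node would normally publish this value; here we set it ourselves.
    fake_job_ad["server_hostname"] = "node1.example.org"
    print(wait_for_attr(fetch_from_fake_ad, "server_hostname", timeout=1.0, interval=0.1))
```

Keeping the lookup separate from the polling loop means the same helper works for any attribute a node needs to wait on, such as a hostname or a task's return code.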


Client/server test example

[Diagram: the Submit Node launches a Parallel Job that runs on Execute Node 0 and Execute Node 1; one node runs the SERVER steps and the other runs the CLIENT steps]

SERVER

Start server

Send port to client

Handle client requests

Poll for ALLDONE from client

Exit

CLIENT

Discover server hostname and port

Start client

Run queries against server

Send ALLDONE message to server

Exit
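The handshake in this example can be sketched in a few lines of Python. This is only a local illustration under stated assumptions: both roles run as threads on one machine, a dictionary stands in for the shared job ad, the keys `server_host` and `server_port` are hypothetical, and the "service" simply upper-cases each request. It is not the glue code used by Metronome.

```python
import socket
import threading


def server(shared, ready):
    """Start the server, publish its address, answer requests until ALLDONE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        srv.listen(1)
        shared["server_host"], shared["server_port"] = srv.getsockname()
        ready.set()  # "send port to client"
        conn, _ = srv.accept()
        with conn, conn.makefile("r") as reader, conn.makefile("w") as writer:
            while True:
                request = reader.readline().strip()
                # Stop on ALLDONE; an empty read (client hung up) also ends the loop.
                if request in ("ALLDONE", ""):
                    break
                writer.write(request.upper() + "\n")
                writer.flush()


def client(shared, ready, queries):
    """Discover the server, run the queries, then send ALLDONE."""
    if not ready.wait(timeout=5):
        raise TimeoutError("server address was never published")
    address = (shared["server_host"], shared["server_port"])
    replies = []
    with socket.create_connection(address, timeout=5) as sock:
        with sock.makefile("r") as reader, sock.makefile("w") as writer:
            for query in queries:
                writer.write(query + "\n")
                writer.flush()
                replies.append(reader.readline().strip())
            writer.write("ALLDONE\n")
            writer.flush()
    return replies


def run_demo(queries):
    """Run both roles locally; return the replies and whether the server is still running."""
    shared, ready = {}, threading.Event()
    thread = threading.Thread(target=server, args=(shared, ready), daemon=True)
    thread.start()
    replies = client(shared, ready, queries)
    thread.join(timeout=5)
    return replies, thread.is_alive()


if __name__ == "__main__":
    print(run_demo(["ping", "status"]))
```

Binding to port 0 and publishing the chosen port mirrors the "send port to client" step: the client does not need to know the port in advance, only where to look it up.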


How to submit a parallel job in Metronome

Several minor modifications to the Metronome submit file are necessary for parallel jobs

The list of platforms is comma-separated, with parentheses around the whole list

Platforms = (x86_rhas_3, x86_rhas_4)


Parallel job submit files continued

Add a glue script for each task/node combination to be executed remotely.

› platform_pre_0 = client/platform_pre

› platform_pre_1 = server/platform_pre

› remote_declare_0 = client/remote_declare

› remote_declare_1 = server/remote_declare

› remote_task_0 = client/remote_task

› remote_task_1 = server/remote_task

› remote_task_args_0 = 9000

› remote_task_args_1 = 9001

… and so forth for all glue scripts.
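Because the entries follow a regular `<task>_<node> = <role>/<task>` pattern, they can be generated rather than typed by hand. The Python sketch below reproduces only the lines shown on the slide; the helper name `glue_entries` and its arguments are hypothetical, and the task list is limited to the three tasks that appear above rather than the full set of glue scripts.

```python
def glue_entries(roles, tasks, task_args=None):
    """Build submit-file lines of the form '<task>_<node> = <role>/<task>'.

    roles     -- role directory for each node, indexed by node number
    tasks     -- glue script names to emit for every node
    task_args -- optional mapping of task name to one argument per node
    """
    task_args = task_args or {}
    lines = []
    for task in tasks:
        for node, role in enumerate(roles):
            lines.append(f"{task}_{node} = {role}/{task}")
        # Argument lines for a task follow that task's script lines.
        for node, arg in enumerate(task_args.get(task, [])):
            lines.append(f"{task}_args_{node} = {arg}")
    return lines


if __name__ == "__main__":
    entries = glue_entries(
        roles=["client", "server"],  # node 0 is the client, node 1 the server
        tasks=["platform_pre", "remote_declare", "remote_task"],
        task_args={"remote_task": [9000, 9001]},
    )
    print("\n".join(entries))
```

Generating the entries from a list of roles keeps the node numbering consistent across tasks, which matters more as the number of nodes grows.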


Other parallel job use cases

Cross platform testing (Linux to Solaris)

Scalability/stress testing (1 server, many clients)

Compatibility testing (cross version, stable vs. development series)


For more information

Documentation is available on the NMI site

See http://nmi.cs.wisc.edu/node/1001 for information on running parallel jobs using Metronome

http://nmi.cs.wisc.edu/node/282 describes how to set up your own Metronome installation for running parallel jobs