
Page 1: Scaling Condor on XSEDE for LIGO

Scaling Condor on XSEDE for LIGO

Peter Couvares, Syracuse University

LIGO Scientific Collaboration

Page 2: Scaling Condor on XSEDE for LIGO

Who am I? What is LIGO?

• Former Condor Team member (‘99-’08).

• Now at Syracuse University, focused on distributed computing problems for the LIGO Scientific Collaboration, and on fostering a research computing community at SU more generally.

• LIGO (the Laser Interferometer Gravitational-Wave Observatory) is a large scientific experiment to detect cosmic gravitational waves and harness them for scientific research.

• http://ligo.org/

Page 3: Scaling Condor on XSEDE for LIGO

The Project

• The Charge:
  – Demonstrate whether LIGO can effectively utilize XSEDE resources for its large-scale computing. (And if not, why not?)

• The Challenge:
  – LIGO’s existing computing model doesn’t map perfectly to XSEDE.

Page 4: Scaling Condor on XSEDE for LIGO

Four Talks

• The political story (NSF)
• The cultural story (LIGO + TACC)
• The architectural story (what we did)
• The technical story (how we did it)

Page 5: Scaling Condor on XSEDE for LIGO

The Political Story

• LIGO plans to buy millions of dollars’ worth of computers later this year to be ready for the Advanced LIGO detectors as they come online.

• LIGO has always done most of its computing “in-house” on dedicated LIGO clusters, with good results – so we haven’t tried very hard (at least not lately) to utilize opportunistic resources we don’t manage ourselves.*

  * notable exception = E@H (Einstein@Home)

Page 6: Scaling Condor on XSEDE for LIGO

The Political Story

• Before writing us a check, the NSF wanted to understand why we only planned to buy our own private clusters, when some other large NSF projects are successfully using (or contributing to) shared resources.

• Given the size of the check, the NSF also probably wanted to know whether we were doing our computing sensibly, and weren’t building something unnecessarily inefficient or eccentric.

• WARNING: this is my speculation based on fourth-hand accounts of other people’s guesses. I could be wrong.

Page 7: Scaling Condor on XSEDE for LIGO

The Cultural Story

• The NSF asked LIGO to see if it could run some or all of its large-scale computing work on XSEDE.

• Stampede was the closest thing to an HTC cluster in XSEDE, so the NSF told LIGO and TACC to work together on it.

Page 8: Scaling Condor on XSEDE for LIGO

LIGO View of XSEDE Resources

Page 9: Scaling Condor on XSEDE for LIGO

LIGO View of LIGO Computing

Page 10: Scaling Condor on XSEDE for LIGO

Shotgun Wedding

Page 11: Scaling Condor on XSEDE for LIGO

Shotgun Wedding

• LIGO: “We don’t need a car with 12 cylinders and molybdenum brakes to commute to work. These Hyundais we’ve got lined up are fine.”

• TACC: “You need how many cars?!?”

Page 12: Scaling Condor on XSEDE for LIGO

Shotgun Wedding

• LIGO: So, we just need Condor everywhere, no firewall, a bunch of yum repos and RPMs installed on all your machines, single sign-on for our users using their LIGO.ORG credentials, and the ability to run VMs as jobs.

• TACC: Uh, we don’t normally do any of that stuff. And no way are you running VM jobs.

Page 13: Scaling Condor on XSEDE for LIGO

Shotgun Wedding

• TACC: Have you optimized your code?

• LIGO: Who do they think we are, amateurs? Have we optimized our code! Harrumph!

• TACC: Here, look at these FFT results.

• LIGO: Oh. Uh… wow, that’s faster. Nice!

Page 14: Scaling Condor on XSEDE for LIGO

The Cultural Story

• Like any shotgun wedding, neither party was thrilled to be at the altar under duress.

• But we got to work, and quickly dropped the grumpiness.

• The TACC staff turned out to be great to work with; they have all kinds of valuable expertise LIGO can use, and they have been extremely helpful.

• Despite the impedance mismatch, together we succeeded in running a production LIGO workflow, at scale, on Stampede.

Page 15: Scaling Condor on XSEDE for LIGO

The Architectural Story

Key points of contrast between the LDG (LIGO Data Grid) and Stampede:

• Central NFS fileservers (LDG) vs. Lustre DFS (Stampede).
• Persistent compute nodes with state (LDG) vs. transient/stateless execute nodes (Stampede).
  – LDG uses persistent local disk for distributed checkpointing and Condor logging.
• NFS for job input and output, local scratch disks for runtime file I/O (LDG) vs. Lustre for everything (Stampede).
• Condor batch queue system (LDG) vs. SLURM (Stampede).
• Scientific Linux 6.1 (LDG) vs. CentOS 6.5 (Stampede).
• Software pre-installed in system locations on dedicated resources (LDG) vs. local builds on shared resources (Stampede).
• Long-running jobs (LDG) vs. 48h maximum (Stampede).

Page 16: Scaling Condor on XSEDE for LIGO

Design Choice

Make LDG look more like Stampede,
or
Make Stampede look more like LDG?

Given our experience porting a large body of LDG software and workflows to new OS platforms and versions, we knew the former would take more time than we had, so we started with the latter and worked back the other way when it was necessary or easier.

Page 17: Scaling Condor on XSEDE for LIGO

An LDG Site “Overlay” on Stampede

Page 18: Scaling Condor on XSEDE for LIGO

LDG Overlay on Stampede

• Glide-in Condor pool via SLURM (sketched after this list)
  – Persistent Condor central manager
  – Persistent login/submit machine
• Make heavy use of Condor standard universe checkpointing to handle the mismatch between SLURM scheduling policies and long-running analysis jobs with unpredictable runtimes.
• Pre-install LIGO software (RPMs) site-wide on Stampede.
• Use LDR and Globus for data transfer to Lustre via GridFTP.
• Set up LIGO web services in a dedicated VM:
  – Data discovery service
  – LIGO.ORG-protected web site to post analysis results
• Enable access to LDG with XSEDE credentials, and vice versa.
  – Stampede was an early XSEDE adopter of CILogon.
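As a rough illustration of the glide-in pattern above (not the actual Stampede configuration), the sketch below submits a SLURM batch job whose payload starts an HTCondor startd that reports back to the persistent central manager. The hostname, partition, walltime, and idle-shutdown timeout are placeholder assumptions.

```python
#!/usr/bin/env python3
# Hedged sketch of a Condor glide-in submitted via SLURM.
# All site-specific values below are hypothetical.
import subprocess
import textwrap

CENTRAL_MANAGER = "ligo-cm.example.edu"   # hypothetical persistent central manager
WALLTIME = "48:00:00"                     # Stampede's 48h queue limit

batch_script = textwrap.dedent(f"""\
    #!/bin/bash
    #SBATCH --job-name=condor-glidein
    #SBATCH --nodes=1
    #SBATCH --partition=normal
    #SBATCH --time={WALLTIME}

    # Minimal per-node HTCondor config: join the LIGO pool's central
    # manager and run only a startd for the life of this SLURM job.
    LOCAL=$(mktemp -d)
    cat > $LOCAL/condor_config <<EOF
    CONDOR_HOST = {CENTRAL_MANAGER}
    DAEMON_LIST = MASTER, STARTD
    LOCAL_DIR = $LOCAL
    STARTD_NOCLAIM_SHUTDOWN = 1200
    EOF
    export CONDOR_CONFIG=$LOCAL/condor_config

    # Run in the foreground so the glide-in lives exactly as long as
    # the SLURM allocation does.
    condor_master -f
    """)

with open("glidein.slurm", "w") as f:
    f.write(batch_script)

subprocess.run(["sbatch", "glidein.slurm"], check=True)
```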

Page 19: Scaling Condor on XSEDE for LIGO

The Goal

Validate the ability to transfer simulated aLIGO data from a LIGO Engineering Run to Stampede, and confirm that the CBC (compact binary coalescence) offline detection pipeline can run and generate the same results on Stampede as on the LDG.

• Select one LDG site (Syracuse) for detailed comparison runs.
• Start with the Initial LIGO (iLIGO) pipeline and well-understood input data.
• Perform correctness and scaling tests.
• Optimize performance.
• Switch to the aLIGO pipeline currently being developed.
• Perform longer-running stability tests.
• In the background, allow for other small-scale LIGO tests.

Page 20: Scaling Condor on XSEDE for LIGO

Round 0

Set up systems for testing:

• Install LIGO software, including Condor – PASS
  – After a few iterations, official releases of LIGO software from package repositories were installed on all Stampede systems.
• Set up a VM for LIGO to install and manage web services – PASS
  – Took a few iterations, including installing extra certificates and mailing physical security tokens, but straightforward.
  – Minor change to LIGO web services authentication configuration to handle the different network topology at TACC.
• Set up 10G network transfer of LIGO data via Globus and LDR using GridFTP – PASS (a transfer sketch follows this list)
  – Took a while to track down a performance issue due to a mismatched MTU, but eventually solved.
• Manually support registration of CILogon credentials before XSEDE deployed that during the test.
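For flavor, here is a hedged sketch of the kind of GridFTP transfer exercised above, driven from Python via globus-url-copy. The hostnames, paths, and stream count are invented placeholders, not the actual LDR configuration, and the real transfers were orchestrated by LDR rather than ad-hoc scripts.

```python
# Hypothetical GridFTP transfer of a LIGO frame file to Stampede's Lustre
# filesystem; all endpoints and paths below are placeholders.
import subprocess

SRC = "gsiftp://ldr.example-ldg-site.edu/data/frames/H-H1_EXAMPLE-0000000000-4096.gwf"
DST = "gsiftp://stampede-gridftp.example.edu/scratch/ligo/frames/"

subprocess.run(
    [
        "globus-url-copy",
        "-vb",       # report throughput; useful when chasing issues like the MTU mismatch
        "-p", "8",   # several parallel TCP streams to help fill the 10G path
        SRC,
        DST,
    ],
    check=True,
)
```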

Page 21: Scaling Condor on XSEDE for LIGO

Round 1

Analyze one day of LIGO data on Stampede using iLIGO code:

• Condor glide-in via SLURM – PASS
• Data transfer via LDR – PASS
• Central checkpointing – FAIL
• Network firewall issues – FAIL

Page 22: Scaling Condor on XSEDE for LIGO

Round 2

Analyze two weeks of data:

• Solved initial firewall problems – PASS
• Improving security by moving the Condor Central Manager to a dedicated host (VM) caused new firewall problems – FAIL
• Tried solving checkpoint scaling with parallel checkpoint writing and central resume – FAIL

Page 23: Scaling Condor on XSEDE for LIGO

Round 3

Analyze six weeks of data:

• Condor code patch to support parallel checkpoint save/restore to a shared filesystem, without persistent checkpoint servers – PASS
• Scaling to 9,000 concurrent jobs with synchronous checkpoint/resume woke up the TACC support team at an inconvenient hour – MIX
  – >2000 load average on the submit node
• Moved Condor LOCK and LOG files to /dev/shm to reduce load on the Submit machine (temporary solution; sketched after this list) – PASS
• Scaling to 25k concurrent jobs hit the limit of a single submit machine at 13k jobs – MIX
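The /dev/shm workaround is an HTCondor configuration change; a minimal sketch of the idea follows, assuming a stock /etc/condor/config.d layout rather than TACC's actual paths. It relocates the submit machine's lock and log directories onto a RAM-backed tmpfs so shadow bookkeeping I/O stops loading the disk; a condor_reconfig or restart is needed afterwards, and the directories vanish on reboot, which is part of why this was only a temporary solution.

```python
# Hedged sketch: point HTCondor's LOCK and LOG directories at /dev/shm
# on the submit machine. Paths and config-file location are assumptions.
import os

SHM_BASE = "/dev/shm/condor"
for sub in ("lock", "log"):
    os.makedirs(os.path.join(SHM_BASE, sub), exist_ok=True)

config_fragment = (
    "LOCK = /dev/shm/condor/lock\n"
    "LOG = /dev/shm/condor/log\n"
)

# Requires root; drop the fragment where the local config is read from.
with open("/etc/condor/config.d/99-shm-io.config", "w") as f:
    f.write(config_fragment)
```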

Page 24: Scaling Condor on XSEDE for LIGO

Future

• Submit machine scalability (13k != 25k)
  – Several straightforward ways to solve:
    • Submit fewer but multi-core jobs (see the sketch after this list)
    • Split work between multiple Submit machines
    • Further investigate/enhance Condor Shadow scalability
• Use a factory to manage glide-ins automatically.
• What happens when we don’t have a fortuitous alignment of OS?
  – Virtual Machines (not supported on Stampede)
  – Restrict the amount of needed software (focus on production rather than development computing)
  – Port necessary packages as opt-in modules
• Enhance LIGO packages to be relocatable, as more appropriate for a shared resource.
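One possible shape for the "fewer but multi-core jobs" idea, using the htcondor Python bindings (assuming a version new enough to support Schedd.submit with a Submit object): pack several analysis segments into each 16-core job so the schedd manages far fewer shadows. The executable, arguments, and resource numbers are hypothetical.

```python
# Hedged sketch: submit 100 x 16-core jobs instead of 1600 x 1-core jobs,
# reducing per-job shadow overhead on the submit machine.
import htcondor

sub = htcondor.Submit({
    "executable": "run_segments.sh",       # hypothetical wrapper that fans work out over cores
    "arguments": "--segments-per-job 16",
    "request_cpus": "16",
    "request_memory": "32GB",
    "output": "seg_$(Cluster)_$(Process).out",
    "error": "seg_$(Cluster)_$(Process).err",
    "log": "segments.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(sub, count=100)
print("Submitted cluster", result.cluster())
```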

Page 25: Scaling Condor on XSEDE for LIGO

Lessons

• It takes a lot of work to migrate a "big" computing system to a new environment. Something has to give.
• It can be done.
• Miron might say we “cheated” by statically reproducing much of our existing environment on Stampede, rather than bringing it with us – but we had a deadline and it’s a big first step.
  – And the cultural accomplishment inside LIGO may end up being bigger than the technical accomplishment…
• XSEDE sites like TACC have incredibly valuable expertise – you should take advantage of it. Not being HPC-focused, we underappreciated it before this exercise.

Page 26: Scaling Condor on XSEDE for LIGO

Lessons

• Speaking for myself, not LIGO:
• We should have been more optimistic, and more humble, up front – but we got there.
• The NSF should be clearer about what’s going on when it arranges this kind of thing, to limit FUD.
• While LIGO must manage its own significant computing resources for some work (e.g., low-latency analysis, detector characterization, software development, testing, and training students), we can use shared resources like Stampede today for a large fraction of our computing.

• Longer-term, LIGO should develop its “grid plumbing” to enable more flexible use of other shared resources that can’t be made to look like LDG sites as easily as Stampede.

Page 27: Scaling Condor on XSEDE for LIGO
Page 28: Scaling Condor on XSEDE for LIGO

Acknowledgements

• Apologies in advance to those I surely forgot…
• LIGO
  – Stuart Anderson, Duncan Brown, Kent Blackburn, Josh Willis, Patrick Brady, many others
• TACC
  – Yaakoub El Khamra, Luke Wilson, John Cazes, John McCalpin, Bill Barth, Nathaniel Mendoza, many others
• Condor
  – Greg Thain, Alan De Smet, many others
• NSF
  – Faceless bureaucrats who forced us out of our rut!