The NMI Build and Test Framework

Peter F. Couvares
Associate Researcher, Condor Team
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
www.cs.wisc.edu/condor
How Condor Got Started in the Build/Test Business: Prehistory

› Oracle shamed^H^H^H^H^H^Hinspired us.
› The Condor team was in the stone age -- producing modern software to help people reliably automate their computing tasks, with our bare hands.
• Every Condor release took weeks/months to do.
• Build by hand on each platform, discover lots of bugs introduced since the last release, track them down, re-build, etc.
What Did Oracle Do?
› Oracle selected Condor as the resource manager underneath their Automated Integration Management Environment (AIME)
› Decided to rely on Condor to perform automated build and regression testing of multiple components for Oracle's flagship Database Server product.
› Oracle chose Condor because they liked the maturity of Condor's core components.
Doh!
› Oracle used distributed computing to automate their build/test cycle, with huge success.
› If Oracle can do it, why can’t we?
› Use Condor to build Condor!
› NSF Middleware Initiative (NMI)
• the right initiative at the right time!
• an opportunity to collaborate with others to do for production software developers like Condor what Oracle was doing for themselves
• an important service to the scientific computing community
NMI Statement
› Purpose – to develop, deploy and sustain a set of reusable and expandable middleware functions that benefit many science and engineering applications in a networked environment
› Program encourages open source software development and development of middleware standards
Why should you care?

› From our experience, the functionality, robustness, and maintainability of a production-quality software component depend on the effort invested in building, deploying, and testing that component.
• If it is true for a single component, it is definitely true for a software stack.
• Doing it right is much harder than it appears from the outside.
• Most of us had very little experience in this area.
Goals of the NMI Build & Test System
› Design, develop and deploy a complete build system (HW and SW) capable of performing daily builds and tests of a suite of disparate software packages on a heterogeneous (HW, OS, libraries, …) collection of platforms
› And make it:
• Dependable
• Traceable
• Manageable
• Portable
• Extensible
• Schedulable
• Distributed
The Build Challenge
› Automation – “build the component at the push of a button!”
• always more to it than just configure & make
• e.g., ssh to the “right” host; cvs checkout; untar; setenv; etc.
› Reproducibility – “build the version we released 2 years ago!”
• Well-managed & comprehensive source repository
• Know your “externals” and keep them around
› Portability – “build the component on nodeX.cluster.net!”
• No dependencies on magic “local” capabilities
• Understand your hardware & software requirements
› Manageability – “run the build daily on 20 platforms and email me the outcome!”
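The automation point above -- that a "push of a button" build is always more than configure & make -- can be sketched as a small driver script. This is a minimal illustration, not the actual NMI scripts; the repository module, tag, and environment settings are hypothetical placeholders.

```python
# Sketch of "push of a button" build automation: every manual step
# (checkout, unpack, environment setup, build) becomes an explicit,
# repeatable function. Module name, tag, and env vars are hypothetical.
import os
import subprocess
import tarfile

def run(cmd, **kw):
    """Run a command and fail loudly, so errors are never silently lost."""
    subprocess.run(cmd, check=True, **kw)

def checkout(tag, dest):
    # Reproducibility: always build from an explicit repository tag,
    # never from whatever happens to be lying in a working directory.
    run(["cvs", "checkout", "-r", tag, "-d", dest, "condor"])

def unpack(tarball, dest):
    # Unpack external dependencies kept alongside the source.
    with tarfile.open(tarball) as tf:
        tf.extractall(dest)

def build(srcdir):
    # Portability: set the environment explicitly instead of relying
    # on magic "local" shell configuration on each machine.
    env = dict(os.environ, CC="gcc")
    run(["./configure"], cwd=srcdir, env=env)
    run(["make"], cwd=srcdir, env=env)
```

The same functions then run unattended on every platform, which is what makes the "run it daily on 20 platforms" manageability goal realistic.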
The Testing Challenge
› All the same challenges as builds (automation, reproducibility, portability, manageability), plus:
› Flexibility
• “test our RHEL4 binaries on RHEL5!”
• “run our new tests on our old binaries!”
› Important to decouple build & test functions
• making tests just a part of a build -- instead of an independent step -- makes it difficult/impossible to:
• run new tests against old builds
• test one platform’s binaries on another platform
• run different tests at different frequencies
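The decoupling above falls out naturally when build and test are separate nodes in a DAGMan workflow: the test node simply takes some build's output as input, so it can be pointed at last night's binaries or at a two-year-old release. A minimal hypothetical DAG (file names invented for illustration):

```
# Hypothetical DAGMan workflow: build and test are independent nodes
# connected only by a dependency edge and the files they exchange.
JOB  Build  build.submit
JOB  Test   test.submit
PARENT Build CHILD Test
```

Swapping `test.submit` to point at an older build's binaries, or re-running only the `Test` node, requires no change to the build itself.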
“Eating Our Own Dogfood”

› What Did We Do?
• We built the NMI Build & Test Lab on top of Condor, DAGMan, and other distributed computing technologies to automate the build, deploy, and test cycle.
• To support it, we’ve had to construct and manage a dedicated, heterogeneous distributed computing facility.
• Opposite extreme from typical “cluster” -- instead of 1000’s of identical CPUs, we have a handful of CPUs for each of ~40 platforms.
• Much harder to manage! You try finding a nifty system/network/cluster admin tool that works on 40 platforms!
• We’re JABCU (just another big Condor user)
• If Condor sucks, we feel the pain.
How does grid s/w help?

› Build & Test jobs are a lot like scientific computing jobs. Same problems...
› Resource management
• Advertising machine capabilities (hw, OS, installed software, config, etc.)
• Advertising job requirements (hw, OS, prereq software, config, etc.)
• Matchmaking substitution -- replacing dynamic parameters in a build (e.g., available ports to use) with specifics of the matched machine
› Fault tolerance & reliable job results reporting!
• never ever ever have to "babysit" a build or test to deal with external failures -- submit & forget until done, even if the network goes down or a machine reboots
• garbage collection -- we never have to clean up processes or disk droppings after a misbehaving build
› DAGMan!
• make dependencies explicit in a DAG, and get the same fault tolerance & reliability
› Data management, file transfer, etc.
• no shared filesystem! -- we need to make sure the build/test node gets the files it needs from the submit machine, and gets the results back
› Authentication
› "Gateway to the grid" -- grid resource access
• in theory we can build/test on any remote grid using resources we don't manage (e.g., ANL, OMII, SDSC, NCSA machines)
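The resource-management and file-transfer points above map directly onto a Condor submit description. A minimal sketch -- the executable, input file names, and the particular requirements expression are hypothetical, not the actual NMI job descriptions:

```
# Hypothetical Condor submit description for one build job:
# the requirements expression advertises what the job needs, and
# explicit file transfer replaces any shared filesystem.
universe                = vanilla
executable              = nmi_build.sh
requirements            = (Arch == "INTEL") && (OpSys == "LINUX")
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = condor-src.tar.gz
output                  = build.out
error                   = build.err
log                     = build.log
queue
```

Condor's matchmaker pairs this job only with machines whose advertised ClassAds satisfy the requirements, and the log file gives the reliable results reporting the slide describes.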
NMI Build & Test Facility

[Architecture diagram. INPUT: customer source code, customer build/test scripts, and spec files enter via a Web Portal. The NMI Build & Test software and DAGMan turn the DAG into build/test jobs in a Condor queue, which runs them on the distributed build/test pool. OUTPUT: results flow into a MySQL results DB, and finished binaries are returned.]
Numbers
Name                          Arch    OS
atlantis.mcs.anl.gov          sparc   sol9
grandcentral                  i386    rh9
janet                         i386    winxp
nmi-build15                   i386    rh72
nmi-build16                   i386    rh8
nmi-build17                   i386    rh9
nmi-build18                   sparc   sol9
nmi-build21                   i386    fc2
nmi-build29                   sparc   sol8
nmi-build33                   ia64    sles8
nmi-build5                    i386    rhel3
nmi-build6                    G5      osx
nmi-rhas3-amd64               amd64   rhel3
nmi-sles8-amd64               amd64   sles8
nmi-test-3                    i386    rh9
nmi-test-4                    i386    rh9
[unknown]                     hp      hpux11
[unknown]                     sgi     irix6?
[unknown]                     sparc   sol10
[unknown]                     sparc   sol7
[unknown]                     sparc   sol8
[unknown]                     sparc   sol9
nmi-build1                    i386    rh9
nmi-build14                   ppc     aix52
nmi-build24                   i386    tao1
nmi-build31                   ppc     aix52
nmi-build32                   i386    fc3
nmi-build8                    ia64    rhel3
nmi-dux40f                    alpha   dux4
nmi-hpux11                    hp      hpux11
nmi-ia64-1                    ia64    sles8
nmi-sles8-ia64                ia64    sles8
rebbie                        i386    winxp
rocks-{122,123,124}.sdsc.edu  i386    ???
supermicro2                   i386    rhel4
b80n15.sdsc.edu               ppc     aix51
imola                         i386    rh9
nmi-aix                       ppc     aix52
nmi-build2                    i386    rh8
nmi-build3                    i386    rh72
nmi-build4                    i386    winxp
nmi-build7                    G4      osx
nmi-build9                    ia64    rhel3
nmi-hpux                      hp      hpux10
nmi-irix                      sgi     irix65
nmi-redhat72-build            i386    rh72
nmi-redhat72-dev              i386    rh72
nmi-redhat80-ia32             i386    rh8
nmi-rh72-alpha                alpha   rh72
nmi-solaris8                  sparc   sol8
nmi-solaris9                  sparc   sol9
nmi-test-1                    i386    rh9
nmi-tru64                     alpha   dux51
vger                          i386    rh73
monster                       i386    rh9
nmi-test-5                    i386    rh9
nmi-test-6                    i386    rh9
nmi-test-7                    i386    rh9
nmi-build22                   i386
nmi-build25                   i386
nmi-build26                   i386
nmi-build27                   i386
nmi-fedora                    i386    fc2

› 100+ CPUs
› 40+ HW/OS “Platforms”
› 34+ OS
› 9 HW Arch
› 3 Sites
› ~100 GB of results per day
› ~1400 builds/tests per month
› ~350 Condor jobs per day
Condor Build & Test

› Automated Condor Builds
• Two (sometimes three) separate Condor versions, each automatically built using NMI on 13-17 platforms nightly
• Stable, developer, special release branches
› Automated Condor Tests• Each nightly build’s output becomes the input to a
new NMI run of our full Condor test suite
› Ad-Hoc Builds & Tests• Each Condor developer can use NMI to submit ad-hoc
builds & tests of their experimental workspaces or CVS branches to any or all platforms
More Condor Testing Work
• Advanced Test Suite• Using binaries from each build, we deploy an entire
self-contained Condor pool on each test machine• Runs a battery of Condor jobs and tests to verify
critical features• Currently >150 distinct tests
• each executed for each build, on each platform, for each release, every night
› Flightworthy Initiative
• Ensuring continued “core” Condor scalability, robustness
• NSF funded, like NMI
• Producing new tests all the time
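A quick back-of-envelope calculation, using only the lower-bound figures quoted on these slides, shows the test volume the nightly runs imply:

```python
# Rough nightly test volume implied by the slides' own figures:
# >150 distinct tests per build, builds on 13-17 platforms, and
# two (sometimes three) Condor branches built per night.
tests_per_build = 150   # ">150 distinct tests"
platforms = 13          # lower bound of "13-17 platforms"
branches = 2            # "two (sometimes three) separate Condor versions"

nightly_test_executions = tests_per_build * platforms * branches
print(nightly_test_executions)  # prints 3900
```

Even at the lower bounds, that is thousands of test executions per night -- a volume that is only practical because the runs are fully automated.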
NMI Build & Test Customers
› NMI Build & Test Facility was built to serve all NMI projects
› Who else is building and testing?
• Globus
• NMI Middleware Distribution
• many “grid” tools, including Condor & Globus
• Virtual Data Toolkit (VDT) for the Open Science Grid (OSG)
• 40+ components
• Soon TeraGrid, NEESgrid, others…
Recent Experience: SRB Client

› Storage Resource Broker (SRB)
› work done by Wayne Schroeder @ SDSC
› started gently; took a little while for Wayne to warm up to the system
• ran into a few problems with bad matches before mastering how we use prereqs
• Our challenge: better docs, better error messages
• emailed Tolya with questions; Tolya responded “to shed some more general light on the system and help avoid or better debug such problems in the future”
› soon he got pretty comfortable with the system
• moved on to write his own glue scripts
• expanded builds to 34 platforms (!)
SRB Client
› But… couldn’t get the HP/UX build to work
• at first we all thought it was a B&T system problem
• once we looked closer, Wayne realized that SRB in fact would not build there, so it was informative
› Now with “one button” Wayne can test his SRB client build any time he wants, on 34 platforms, with no babysitting.
Build & Test Beyond NMI
› We want to integrate with other, related software quality projects, and share build/test resources...
• an international (US/Europe/China) federation of build/test grids…
• Offer our tools as the foundation for other B&T systems
• Leverage others’ work to improve our own B&T service
OMII-UK

› Integrating software from multiple sources
• Established open-source projects
• Commissioned services & infrastructure
› Deployment across multiple platforms
• Verify interoperability between platforms & versions
› Automatic Software Testing vital for the Grid
• Build Testing – cross-platform builds
• Unit Testing – local verification of APIs
• Deployment Testing – deploy & run package
• Distributed Testing – cross-domain operation
• Regression Testing – compatibility between versions
• Stress Testing – correct operation under real loads
› Distributed Testbed
• Need a breadth & variety of resources, not raw power
• Needs to be a managed resource – process
Next: ETICS

[Partner diagram; each partner’s contribution:]
• Build system, software configuration, service infrastructure, dissemination, EGEE, gLite, project coordination
• Software configuration, service infrastructure, dissemination
• Web portals and tools, quality process, dissemination, DILIGENT
• Test methods and metrics, unit testing tools, EBIT
• NMI Build & Test Framework, Condor, distributed testing tools, service infrastructure
ETICS Project Goals

› ETICS will provide a multi-platform environment for building and testing middleware and applications for major European e-Science projects
› “Strong point is automation: of builds, of tests, of reporting, etc. The goal is to simplify life when managing complex software management tasks”
• One button to generate finished package (e.g., RPMs) for any chosen component
› ETICS is developing a higher-level web service and DB to generate B&T jobs -- and use multiple, distributed NMI B&T Labs to execute & manage them
• This work complements the existing NMI Build & Test system and is something we want to integrate & use to benefit other NMI users!
OMII-Japan

› What They’re Doing
• “…provide service which can use on-demand autobuild and test systems for Grid middlewares on on-demand virtual cluster. Developers can build and test their software immediately by using our autobuild and test systems”
• Underlying B&T Infrastructure is NMI Build & Test Software
This was a Lot of Work… But It Got Easier Each Time

› Deployments of the NMI B&T Software with international collaborators taught us how to export Build & Test as a service.
› Tolya Karp: International B&T Hero
• Improved (i.e., wrote) NMI install scripts
• Improved the configuration process
• Debugged and solved a myriad of details that didn’t work in new environments
What We Don’t Do Well
› Documentation
• much better than ~6 months ago, but still incomplete
• most existing users were walked through the system in person, and given lots of support
› Submission/Specification API
• we’re living comfortably in the ’80s: all command-line, all the time
• we hope ETICS will improve this!
New Condor+NMI Users
› Yahoo• First industrial user to deploy NMI
B&T Framework to build/test custom Condor contributions
› Hartford Financial• Deploying it as we speak…
What’s to Come

› More US & international collaborations
• More industrial users/developers…
› New Features
• Becky Gietzel: parallel testing!
• Major new feature: multiple co-scheduled resources for individual tests
• Going beyond multi-platform testing to cross-platform parallel testing
› UW-Madison B&T Lab: ever more platforms
• “it’s time to make the doughnuts”

› Questions?