Belle Data Grid Deployment … Beyond the Hype

Lyle Winton, Experimental Particle Physics, University of Melbourne

eScience, December 2005

Belle Experiment

• Belle in KEK, Japan
– Investigates symmetries in nature
– CPU and Data requirements explosion!
  • 4 billion events needed simulating in 2004 to keep up with data production
• Belle MC Production effort
• Australian HPC has contributed
• Belle’s an ideal case
– has real research data
– has known application workflow
– has real need for distributed access and processing


Background

• The general idea…
– Investigation of Grid tools (Globus v1, v2, LCG)
– Deployment to distributed testbed
– Utilisation of the APAC and partner facilities
– Deployment to the APAC National Grid


Australian Belle Testbed

• Rapid deployment at 5 sites in 9 days
– U.Melb. Physics + CS, U.Syd., ANU/GrangeNet, U.Adelaide CS
– IBM Australia donated dual Xeon 2.6 GHz nodes
• Belle MC generation of 1,000,000 events
• Simulation and Analysis
• Demonstrated at PRAGMA4 and SC2003
• Globus 2.4 middleware
• Data management
– Globus 2 replica catalogue
– GSIFTP
• Job management
– GQSched (U.Melb Physics)
– GridBus (U.Melb CS)


Initial Production Deployment

• Custom built central job dispatcher
– Initially used ssh and PBS commands
– feared Grid was unreliable
– then only 50% of facilities Grid accessible
• SRB (Storage Resource Broker)
– Transfer of input data: KEK → ANUSF → Facility
– Transfer of output data: Facility → ANUSF → KEK
• Successfully participated in Belle’s 4×10^9 event MC production during 2004
• Now running on APAC NG using LCG2/EGEE
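The first dispatcher approach above (ssh plus PBS commands) might be sketched roughly as follows. This is an illustration only, not the dispatcher used in production: the function names, host name, and script name are hypothetical. Only `ssh` and the PBS `qsub` command come from the slide's description.

```python
import subprocess
from typing import List


def build_submit_command(host: str, script: str) -> List[str]:
    """Return the argv that submits `script` to PBS on a remote front end."""
    # BatchMode=yes makes ssh fail instead of waiting on a password prompt.
    return ["ssh", "-o", "BatchMode=yes", host, "qsub", script]


def submit(host: str, script: str, timeout_s: float = 60.0) -> str:
    """Run the remote qsub and return the job identifier it prints."""
    result = subprocess.run(
        build_submit_command(host, script),
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    return result.stdout.strip()
```

A central dispatcher would call `submit("frontend.example.org", "belle_mc.pbs")` once per job and record the returned identifier for later status polling.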


Issues

• Deployment
– time consuming for experts
– even more time consuming for site admins with no experience
– requires loosening security (network, unknown services, NFS on exposed boxes)
– Grid services and clients generally require public IPs with open ports
• Middleware/Globus bugs, instabilities, failures
– too many to list here
– errors, logs, and manuals are frequently insufficient
• Distributed management
– version problems between Globus releases (eg. globus-url-copy can hang)
– stable middleware is compiled from source, but OS upgrades can break it
– once installed, how do we keep it configured considering…
  • growing numbers of users and communities (VOs)
  • expanding interoperable Grids (more CAs)
• Applications
– installing by hand at each site
– many require access to DB or remote data while processing
– most clusters/facilities have private/off-internet compute nodes


Issues

• Staging work around
– GridFTP is not a problem, however, SRB is more difficult
– remote queues for staging (APAC NF)
– front end node staging to shared FS (via jobmanager-fork)
– front end node staging via SSH
• No National CA (for a while)
– started with explosion of toy CAs
• User Access Barriers
– user has cert. from CA … then what?
– access to facilities is more complicated (allocation/account/VO applications)
– then all the above problems start!
– Is Grid worth the effort?


Observations

• Middleware
– Everything is fabric, lack of user tools!
  • Initially only Grid fabric (low level)
    – eg. Globus2
  • Application level or 3rd Generation middleware
    – eg. LCG/EGEE, VDT
    – Overarching, joining, coordinating fabric
    – User tools for application deployment
– Everybody must develop additional tools/portals for everyday user access (non-expert)
  • No out of box solutions
• Real Data Grids!
– Many international research big-science collaborations are data focused
– This is not simply a staging issue!
– Jobs need seamless access to data (at start, middle, end of job)
  • Many site compute nodes have no external access
  • Middleware cannot stage/replicate databases
  • In some cases file access is determined at run time (ATLAS)
– Current jobs must be modified/tailored for each site: that is not Grid


Observations

• Information Systems
– Required for resource brokering, debugging problems
– MDS/GRIS/BDII are often unused (eg. Nimrod/G, GridBus)
  • not because of the technology
  • never given a certificate
  • never started
  • never configured for the site (PBS etc.)
  • never configured to publish (GIIS or top level BDII)
  • never checked


Lessons/Recommendations

• NEED tools to determine what's going on (debug)
– jobs and scripts must have debug output/modes
– middleware debugging MUST be well documented
  • Error codes and messages
  • Troubleshooting
  • Log files
– application middleware must be coded for failure!
  • service death, intermittent connection failure, data removal, proxy timeout, hangs are all to be expected
  • all actions must include external retry and timeout
– information systems
  • eg. queue is full, application not installed, not enough memory
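The "external retry and timeout" recommendation above could be implemented along these lines. This is a minimal sketch, not code from the Belle deployment. The function name, default limits, and linear backoff are assumptions chosen for illustration. It wraps any command-line client so that a hang is killed after `timeout_s` and a failure is retried, with each failure printed as debug output.

```python
import subprocess
import time
from typing import Callable, List, Optional


def run_with_retry(
    argv: List[str],
    attempts: int = 3,
    timeout_s: float = 300.0,
    backoff_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[subprocess.CompletedProcess]:
    """Run an external command, killing hangs and retrying failures.

    Returns the successful CompletedProcess, or None once every
    attempt has failed or timed out.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this.
            print(f"attempt {attempt}: timed out after {timeout_s}s")
        except OSError as exc:
            # e.g. the client binary is not installed on this host
            print(f"attempt {attempt}: could not start: {exc}")
        else:
            if result.returncode == 0:
                return result
            print(f"attempt {attempt}: exit {result.returncode}: "
                  f"{result.stderr.strip()}")
        if attempt < attempts:
            sleep(backoff_s * attempt)  # linear backoff between attempts
    return None
```

A transfer that may hang, such as the `globus-url-copy` case mentioned earlier, would be wrapped as `run_with_retry(["globus-url-copy", src_url, dst_url])`, where `src_url` and `dst_url` are placeholders for the caller's own URLs.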


Lessons/Recommendations

• Quality and Availability are key issues
• Create service regression test scripts!
– small config changes or updates can have big consequences
– run from local site (tests services)
– run from remote site (tests network)
• Site validation/quality checks
– 1. are all services up and accessible?
– 2. can stagein+run+stageout a baseline batch job?
– 3. do I.S. conform to minimum schema standards?
– 4. are I.S. populated, accurate, and up to date?
– 5. repeat 1-4 regularly
• Operational metrics are essential
– help determine stability and usability
– eventually provide justification for using Grid
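Check 1 in the list above (are all services up and accessible?) can be approximated with a plain TCP probe. This is a sketch under assumptions. The port numbers are the conventional defaults for the Globus gatekeeper, GridFTP, and MDS, and a given site may use others. The function names are hypothetical. A reachable port shows only that something is listening, not that the service works, so checks 2 to 4 are still needed.

```python
import socket
from typing import Dict

# Assumed default ports for a Globus 2 style site; adjust per site.
DEFAULT_SERVICES: Dict[str, int] = {
    "gatekeeper": 2119,  # GRAM gatekeeper
    "gridftp": 2811,     # GridFTP / GSIFTP server
    "mds": 2135,         # MDS GRIS/GIIS
}


def port_open(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # covers refused connections, timeouts, and name lookup failures
        return False


def check_site(host: str, services: Dict[str, int] = DEFAULT_SERVICES,
               timeout_s: float = 5.0) -> Dict[str, bool]:
    """Check 1: is each expected service port reachable on `host`?"""
    return {name: port_open(host, port, timeout_s)
            for name, port in services.items()}
```

Running the same script from the local site and from a remote site separates service failures from network or firewall failures, as the regression-test bullets suggest.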


Lessons/Recommendations

• Start talking to System/Network Admins early
– education about Grid, GSI, and Globus
– logging and accounting
– public IPs with shared home filesystem
• Have a dedicated node manager, both OS and middleware
– don't underestimate time required
– installation and testing ~ 2-4 days expert, 5-10 days novice (with instruction)
– maintenance (testing, metrics, upgrades) ~ 1/10 days
• Have a middleware distribution bundle
– too many steps to do at each site
– APAC NG hoping to solve with Xen VM images
• Automate general management tasks
– authentication lists (VO)
– CA files, especially CRLs
– host cert checks and imminent expiry warnings
– service up checks (auto restart?)
– file clean up (GRAM logs, GASS cache?, GT4 persisted)
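The "file clean up" task in the list above could be automated with something like the following. This is a sketch only. The function name and the age-based policy are assumptions, and the directories holding GRAM logs or cached files differ between installations, so the path is left to the caller. It defaults to a dry run so the selection can be reviewed before anything is deleted.

```python
import os
import time
from typing import List, Optional


def remove_old_files(directory: str, max_age_days: float,
                     now: Optional[float] = None,
                     dry_run: bool = True) -> List[str]:
    """Select regular files under `directory` older than max_age_days.

    Returns the selected paths. With dry_run=True nothing is deleted.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    selected: List[str] = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                # lstat: judge a symlink itself, never its target
                if os.lstat(path).st_mtime < cutoff:
                    selected.append(path)
                    if not dry_run:
                        os.remove(path)
            except OSError:
                continue  # file vanished or is not ours to touch
    return sorted(selected)
```

Run from cron with `dry_run=False` once the selection has been checked by hand.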

BADG Installer: single step, guided GT2 installation (http://epp.ph.unimelb.edu.au/EPPGrid)

GridMgr: manages VOs, certs, CRLs (http://epp.ph.unimelb.edu.au/EPPGrid)
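An imminent-expiry warning for a host certificate can be built on the end date that `openssl x509 -enddate -noout -in <cert>` prints. The sketch below is not how GridMgr works (its internals are not described in the talk). The function names and the 30-day threshold are assumptions.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(enddate_line: str, now: Optional[float] = None) -> float:
    """Days left before a certificate expires.

    `enddate_line` is the text printed by
    `openssl x509 -enddate -noout -in <cert>`, for example
    "notAfter=Jan  1 00:00:00 2006 GMT".
    """
    not_after = enddate_line.strip().split("=", 1)[-1]
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since epoch, UTC
    now = time.time() if now is None else now
    return (expires - now) / 86400.0


def expiry_warning(enddate_line: str, warn_days: float = 30.0,
                   now: Optional[float] = None) -> Optional[str]:
    """Return a warning string if expiry is within warn_days, else None."""
    left = days_until_expiry(enddate_line, now)
    if left < 0:
        return f"certificate EXPIRED {-left:.1f} days ago"
    if left <= warn_days:
        return f"certificate expires in {left:.1f} days"
    return None
```

A nightly job could mail any non-`None` result to the node manager.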


International Interoperability

• HEP case study
– application groups had to develop coordinated dispatchers and adapters
  • researchers jumping through hoops -> in my opinion failure
– limited manpower, limited influence over implementation
– if we are serious we MUST allocate serious manpower and priority with authority over Grid infrastructure
– minimal services, same middleware, is not enough
– test case applications are essential
– operational metrics are essential
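As one possible reading of "operational metrics", per-site job success rates can be computed from a dispatcher's job records. The record fields and the choice of metrics here are assumptions for illustration. The talk does not specify which metrics were collected.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class JobRecord:
    site: str
    succeeded: bool
    wall_hours: float


def site_metrics(records: Iterable[JobRecord]) -> Dict[str, Dict[str, float]]:
    """Per-site job count, success rate, and hours spent in successful jobs."""
    totals: Dict[str, Dict[str, float]] = {}
    for r in records:
        m = totals.setdefault(r.site, {"jobs": 0, "ok": 0, "useful_hours": 0.0})
        m["jobs"] += 1
        if r.succeeded:
            m["ok"] += 1
            m["useful_hours"] += r.wall_hours
    return {
        site: {
            "jobs": m["jobs"],
            "success_rate": m["ok"] / m["jobs"],
            "useful_hours": m["useful_hours"],
        }
        for site, m in totals.items()
    }
```

Tracked over time, numbers like these give the stability evidence and the justification for using Grid that the earlier slides call for.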


Benefits

• Access to resources
– Funding to develop expertise and for manpower
– Central expertise and manpower (APAC NG)
– Other infrastructure (GrangeNet, APAC NG, TransPORT SX)
• Early adoption has been important
– Initially access to more infrastructure
– Ability to provide experienced feedback
• Enabling large scale collaboration
– eg. ATLAS
  • produces up to 10PB/year of data
  • 1800 people, 150+ institutes, 34 countries
  • Aim to provide low latency access to data within 48hrs of production
