gLite Status Stephen Burke RAL GridPP 13 - Durham

gLite Status

Stephen Burke, RAL

GridPP 13 - Durham

July 6th 2005 gLite Status

Overview

• gLite releases
• gLite deployment
• WMS
• DMS
• R-GMA
• VOMS
• Outstanding issues
• E&OE!

Releases

gLite releases so far

• Release 1.0 on April 5th
  – Released to meet deadline
  – WMS + CE + Fireman + gLite i/o + R-GMA + VOMS
  – AliEn, GAS and package manager gone
  – Several things missing or not working well
    • No SE in gLite
  – Documentation is reasonable
• Release 1.1 on May 12th
  – First versions of File Transfer Service (FTS), metadata catalogue
  – Secure file catalogues
  – Bug fixes

Future releases

• Release 1.2 should have been on June 1st
  – Delayed to end of June, now expected late July
    • Was expected to be in LCG July release
    • Have gLite R-GMA and VOMS as LCG upgrades
• “Final” gLite release (2.0) for EGEE 1 by end of the year
  – Updated architecture/design/workplan documents
  – Code freeze October (?)
  – Maybe a 1.3 release (August?), time is tight

Timelines

[Figure: project timeline from June 2005 to March 2006. Recoverable labels: June 2005, October 2005, November 2005, December 2005, March 2006; TODAY; Release 1.2; Func. freeze (?); Integrated 2.0; Release 2.0; Xmas vacation; Mid Dec.; Final Report; Review; End of EGEE 1.]

Consequences

• ~2.5 months of development left
• Probably only 1 or 2 releases between 1.2 and 2.0
• Focus on consolidation of 1.2 and small improvements requested by applications
• Very careful in introducing new services

Release priorities

• Driven by service challenges
  – Especially data management
  – LCG Baseline Services document
• No time to change anything for EGEE 1
• EGEE PTF disbanded
  – Not seen as effective
  – Who collects requirements?
  – Do non-LCG VOs have influence?

Deployment

gLite deployments – JRA1

• gLite “prototype” system
  – Used by ARDA team, biomed, some others
  – Very small, basically just CERN
  – Not properly maintained
• JRA1 testing testbed
  – Was CERN, RAL and NIKHEF
  – Two sites + manpower added at Imperial
    • One person subtracted at CERN
  – Still small and under-resourced
  – Releases are not sufficiently tested
    • 928 open bugs in Savannah, 84 critical
    • 281 “ready for test”, but no time to test!

gLite deployments - LCG

• Pre-production system now being installed
  – ~8 sites so far, more coming
    • None in UK?
  – Currently a “pure” gLite system
    • Role seems to change from week to week!
  – Partly working but many problems
  – Some users allowed in soon (now?)
• Production system
  – Various plans considered
  – LCG 2.6 has R-GMA and VOMS
  – Next steps unclear (to me at least!)

Status as of release 1.1

Workload management

• Broker is a development of the EDG/LCG RB
  – Seems to be largely backward-compatible
  – Main new feature is DAGMAN (composite jobs)
  – Push and pull job submission
  – No web services
• Hybrid info system (CEMON + BDII)
  – Static configuration of WMS-CE relationships
  – Should change to R-GMA (?)
• Condor-C replaces Globus gatekeeper on CE
  – Several security problems
  – Current performance is poor
    • Submissions often fail
    • Cryptic error messages
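To illustrate what a composite (DAG) job means in practice, the sketch below orders a small set of jobs so that each one is submitted only after the jobs it depends on. It is a minimal illustration using Python's standard `graphlib` module; the job names and the `deps` mapping are hypothetical, and this is not the gLite JDL or the DAGMan interface.

```python
from graphlib import TopologicalSorter

# Hypothetical composite job: each key maps to the set of jobs it must wait for.
deps = {
    "prepare": set(),
    "sim_a": {"prepare"},
    "sim_b": {"prepare"},
    "merge": {"sim_a", "sim_b"},
}

def submission_order(dependencies):
    """Return job names in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())

order = submission_order(deps)
print(order)
```

The two `sim_*` jobs have no dependency on each other, so a broker would be free to run them in parallel; only `prepare` first and `merge` last are fixed.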

Data Management

• First version of metadata catalogue
  – No command-line clients yet, MySQL only
• Fireman file catalogue
  – Competes with new LCG File Catalogue
  – Various experiment-specific solutions
• gLite i/o
  – Security model still under debate (delegation, file ownership)
  – Doesn’t yet work with dCache or DPM SRMs, only Castor!
• FTS (developed for service challenges)
  – Point-to-point reliable file transfer
  – No interaction with Fireman catalogue
• No File Placement Service (FPS) yet, hence no replication!
• No Data Scheduler
• Interaction with WMS still under discussion

R-GMA

• Should be an information system
  – But both LCG and gLite still use BDII
• New Service Discovery API
  – Still discussing service types and names
• LCG now making substantial use of R-GMA for monitoring, accounting etc.
  – Lots of pressure to fix bugs!
  – Some stability problems, needs more testing
    • Not ideal to test in production, but …
  – Seems generally in a good state

Security

• gLite VOMS server now used by LCG
  – Some problems with gLite installation scripts
• WMS and DMS have limited support for VOMS
  – SRM, Condor-C and R-GMA don’t yet
• Many test VOMS servers exist, but still not in production
  – Will probably need a long learning period to get the best use of VOMS
  – Not a panacea!
• Security requirements mostly still not being addressed
  – Most date back to the start of EDG
• Many known security vulnerabilities

Outstanding Issues

General

• Error messages, logging and fault-tolerance
  – Still very poor
    • Proposal on common error handling by Steve Fisher
• Configuration
  – gLite has a common config tool (Python/XML)
  – Underlying config not unified
  – Still complex, fragile and error prone
  – Not clear if LCG will switch
    • May get many layers: YAIM -> XML -> m/w-specific config files?
• Monitoring
  – Getting better, but all from LCG, not in gLite
• Single points of failure
  – Still have many, but some positive movement

WMS

• Job submission rate too slow
  – Not tested (?), but probably no change
• Failover (RB goes down -> jobs lost)
  – No change so far
• Bulk job submission
  – Partial support via DAGs
  – Parameterised jobs coming
• Space management on WNs
  – Not being addressed
• Access to output from running jobs
  – Not yet
• Advance reservation
  – Some work, but not yet available
• Interaction with data management (pre-staging)
  – Discussion but nothing yet
• CPU speed, memory etc. requirements not passed to batch system
  – May appear in future
• Job distribution is poor (ERT etc.)
  – Partly addressed by new Glue schema
  – Still no direct support in broker

DMS

• Need a metadata solution
  – Much discussion, seems to be converging
• File catalogue performance, bulk operations
  – Partly addressed by Fireman, LFC
  – LFC seems to have better performance but no bulk operations
• Catalogue replication
  – Oracle replication by LCG
  – gLite working towards local catalogues
• Small files
  – Not being addressed
• Reliable file replication
  – Partly addressed by FTS, need FPS as well
• File pinning
  – Not yet in SRMs or FTS
• POSIX file access
  – May be addressed by gLite i/o
  – Security model unclear
• High level data management
  – Not yet (wait for Data Scheduler in 2.0)

Information systems

• Not many issues!
• Glue schema not ideal
  – Minor update just released
  – Maybe new major version in ~1 year?
• Stability, scalability
  – Need to test in production: test systems too small

Security

• VO management, groups and roles
  – Should come with VOMS
• VO policies for CEs
  – Some tools (LCAS, LCMAPS)
  – Needs experience
• ACLs on files
  – Should come with gLite File Access Service (FAS)
  – Not ready yet
  – Need to check security model satisfies sites
  – No support in SRM yet
• No outbound IP access
  – Some discussion, nothing yet
• Secure file management
  – Not needed for HEP, but strong need for biomed
  – Some work, not there yet
• Quotas
  – Some work on measurement
  – Enforcement?
• Vulnerabilities
  – Many known, little work
  – New group (Linda Cornwall)

Summary

• First gLite releases are out, but are buggy and incomplete

• Next release is late, not much time to the end of EGEE 1

• Many long-standing issues not addressed
  – Developers tend to follow their own interests rather than user/sysadmin needs
  – Functionality is less than at the end of EDG!
• Probably still >~ 1 year to get production quality
  – OK for EGEE if EGEE 2 is approved
  – Mismatch with LCG timescale
• LHC experiments are building their own Grids
  – How much of gLite do they need?

• Who decides requirements and priorities?