Page 1: EGEE Operation Procedures

FP6−2004−Infrastructures−6-SSA-026409

www.eu-eela.org

E-infrastructure shared between Europe and Latin America

EGEE Operation Procedures

Alexandre Duarte

CERN IT-GD-OPS

Page 2: EGEE Operation Procedures

Lost Island, First EELA ROC-on-Duty Tutorial, 29.11.2006

COD

• COD is Operator on Duty

• global LCG/EGEE GRID monitoring

• 1 (2) ROCs responsible for the whole GRID operations at a time
– 12 ROCs involved
– weekly rotation

• weekly WLCG-OSG-EGEE Operations meeting
– ROCs, Tier1, experiments
– all sites invited

Page 3: EGEE Operation Procedures

COD Procedures

• https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEROperationalProcedures

• Looking at monitoring tools
– SAM, gstat, Certificate Monitoring pages

• Open tickets using the COD Dashboard

• Escalate expired tickets

• Process site responses (update tickets accordingly)

• End of duty: hand-over notes

Page 4: EGEE Operation Procedures

COD Dashboard

• summary of necessary monitoring information + tools for ticket processing

• tickets linked to GGUS

• GOCDB information

• SAM + gstat results

• ticket creation and management tool

• tools for related e-mail

Page 5: EGEE Operation Procedures

COD Dashboard

Page 6: EGEE Operation Procedures

Escalation Procedure

• defines the steps to be taken during the lifetime of a ticket

• available on the CIC Operations Portal
– https://edms.cern.ch/document/701575

• distinction between sites depending on the amount of resources

Page 7: EGEE Operation Procedures

Escalation Steps

1. ticket creation

2. first mail (to: site + ROC)

3. second mail (to: site + ROC)

4. suspension from the GRID

• before 4.:
a) mail to ROC
b) weekly operations meeting call
c) mail to OMC for validation

Page 8: EGEE Operation Procedures

Escalation Procedure

• site categories
– low: CPU < 20
– normal: 20 < CPU < 100
– high: 100 < CPU

• delay between steps 2.-3. and 3.-4.
– low + normal: 3 days
– high: 1 day
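
The site categories and delays on this slide can be expressed as a small lookup. The following is only an illustrative sketch, not part of the EGEE tooling: the function names are hypothetical, the slide does not say how sites with exactly 20 or 100 CPUs are classified (here they are assumed to fall into the higher category), and the delays are assumed to be calendar days.

```python
from datetime import date, timedelta

# Delays (in days) between escalation steps 2.-3. and 3.-4., per site category,
# as given on the slide. Treating them as calendar days is an assumption.
ESCALATION_DELAY_DAYS = {"low": 3, "normal": 3, "high": 1}


def site_category(cpus: int) -> str:
    """Classify a site by its CPU count (hypothetical helper).

    The slide leaves the boundary values (exactly 20 or 100 CPUs) open;
    this sketch assumes they belong to the higher category.
    """
    if cpus < 20:
        return "low"
    if cpus < 100:
        return "normal"
    return "high"


def next_escalation_date(cpus: int, last_step: date) -> date:
    """Return the earliest date for the next escalation step."""
    delay = ESCALATION_DELAY_DAYS[site_category(cpus)]
    return last_step + timedelta(days=delay)


if __name__ == "__main__":
    # A 500-CPU site that received its first mail on the tutorial date.
    print(next_escalation_date(500, date(2006, 11, 29)))
```

Larger sites get the shorter delay, so a failure there moves through the escalation steps faster than at a small site.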

Page 9: EGEE Operation Procedures

Escalation Procedure

[Flowchart. Nodes: Create ticket; When deadline reached; Problem solved?; Close ticket; last escalation?; Escalate; Suspend site; Extend deadline. Edge labels: yes; no; no; site responds; mail (on four edges).]
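
The flowchart on this slide can be read as a decision taken each time a ticket's deadline is reached. The sketch below is one possible reading, not the documented procedure: the wiring between the boxes is an assumption (solved tickets are closed, a responding site gets an extended deadline, otherwise the ticket is escalated or, at the last escalation, the site is suspended), and the names `Action` and `on_deadline` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    """Outcomes named in the flowchart boxes."""
    CLOSE_TICKET = "close ticket"
    EXTEND_DEADLINE = "extend deadline"
    ESCALATE = "escalate"
    SUSPEND_SITE = "suspend site"


def on_deadline(problem_solved: bool, site_responded: bool,
                last_escalation: bool) -> Action:
    """Decide what to do with a ticket whose deadline has been reached.

    Assumed order of checks; the checks required before a suspension
    (mail to ROC, operations meeting call, OMC validation) are not modeled.
    """
    if problem_solved:
        return Action.CLOSE_TICKET
    if site_responded:
        return Action.EXTEND_DEADLINE
    if last_escalation:
        return Action.SUSPEND_SITE
    return Action.ESCALATE


if __name__ == "__main__":
    # An unsolved ticket with no site response, not yet at the last escalation.
    print(on_deadline(False, False, False).value)
```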

Page 10: EGEE Operation Procedures

What a site should do

• Look at the monitoring tools (SAM)
– try to notice & fix failures before the CODs do

• COD notification about a failure
– fix it ASAP

• Scheduled downtime
– announce it in advance
– announce when it's finished

• problems → contact the ROC
– best way: create a ticket

• questions → ask the ROC