This document presents how Glidein Factory operations help solve problems that develop on Grid resources.
glideinWMS training Grid debugging 1
glideinWMS training
Solving Grid problems through glidein monitoring
i.e. the Grid debugging part of G.Factory operations
by Igor Sfiligoi (UCSD)
Glidein Factory Operations
● Factory node operations
● Serving VO Frontend admin requests
● Keeping up with changes in the Grid
● Debugging Grid problems
● The most time-consuming part
● Effectively, we help solve Grid problems through glidein monitoring
Reminder - Glideins
● A glidein is a properly configured Condor startd daemon submitted as a Grid job
[Diagram: the Frontend monitors the Condor pool and requests glideins from the Factory; the Factory submits glideins to the CE, which runs them on worker nodes; the glidein's startd registers with the central manager and is matched to jobs from the submit node]
What can go wrong in the Grid?
● Many places where things can go wrong
● Essentially at any of the arrows in the architecture diagram
What can go wrong in the Grid?
● In particular
● CE may refuse to accept glideins
What can go wrong in the Grid?
● In particular
● CE may not start glideins
● Or fail to tell us what the status of the job is
What can go wrong in the Grid?
● In particular
● The worker node may be broken/misconfigured
– Thus validation will fail
● Many reasons
What can go wrong in the Grid?
● In particular
● The WAN networking may not work properly
● The CM never hears from the startd
● Or the schedd cannot talk to the startd
● Can be selective
What can go wrong in the Grid?
● In particular
● Or the security infrastructure could be broken
– CAs missing
– Time discrepancies
– Etc.
What can go wrong in the Grid?
● In particular
● The site may refuse to start the user job
– e.g. gLExec
What can go wrong with glideins?
● And there are also non-Grid problems
● Jobs not matching
● But that's beyond the scope of this document
Problem classification
● Most often we see WN problems
● Followed by CEs refusing glideins
– Both typically easy to diagnose
● Then there are misbehaving CEs
● Very hard to diagnose!
● Everything else quite rare
● But usually hard to diagnose as well
Grid debugging
Validation problems
i.e. problems on Worker Nodes
WN problems
● The glidein startup script runs a list of validation scripts
● If any of them fails, the WN is considered broken
● This way user jobs never get to broken WNs
● Two sources of tests
● Glidein Factory
● VO Frontend
● Of course, if the validation script cannot be fetched from either Web server, it is considered a failure as well
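The fail-fast logic above can be sketched as follows. This is an illustrative model, not the actual glideinWMS code (the real validation scripts are shell scripts run by glidein_startup), and the function and test names are hypothetical.

```python
# Hypothetical sketch of the glidein validation logic: run each
# fetched test in order and treat the node as broken on the first
# failure. A test that cannot even be fetched counts as a failure.

def validate_worker_node(tests):
    """tests: list of (name, callable) pairs; each callable returns
    True on success. Returns (ok, name_of_failed_test)."""
    for name, test in tests:
        try:
            ok = test()
        except Exception:   # e.g. the script could not be fetched
            ok = False
        if not ok:
            return (False, name)   # WN considered broken; no user jobs land here
    return (True, None)
```

Because the first failure short-circuits the loop, user jobs never get matched to a node that failed any test, which is exactly the guarantee the slide describes.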
Types of tests
● The glideinWMS SW comes with a set of standard tests (provided by the factory):
● Grid environment present (e.g. CAs)
● Some free disk on $PWD and on /tmp
● Enough FE-provided proxy lifetime remaining
● gLExec-related tests
● OS type
● Each VO may have its own needs, e.g.:
● Is VO SW pre-installed and accessible?
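As an illustration, the free-disk test could look like the sketch below; the 1 GB threshold and the function name are assumptions, not the actual glideinWMS defaults.

```python
import os
import shutil

# Illustrative version of one standard factory test: require a
# minimum amount of free disk on $PWD and on /tmp. The 1 GB
# threshold is an assumed value, not the glideinWMS default.
MIN_FREE_BYTES = 1 * 1024**3

def enough_free_disk(paths=(os.getcwd(), "/tmp"), min_free=MIN_FREE_BYTES):
    """Return False if any of the given paths has less free space
    than min_free bytes."""
    for p in paths:
        if shutil.disk_usage(p).free < min_free:
            return False
    return True
```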
Discovering the problems
● Any error message printed out by the validation script will be delivered back to the factory
● After the glidein terminates
● Most validation scripts provide a clear indication of what went wrong
● And we strive to get all to do it!
● New machine-readable format being introduced
● With v2_6_2
Typical ops
● Noticing that a large fraction of glideins for a site are failing is easy
● Just look at the monitoring
● And we are getting a daily email as well
● Discovering what exactly is broken is not too difficult either
● Just parse the logs (with appropriate tools)
● Will get even easier when all scripts return machine-readable information
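A minimal log-parsing helper along these lines might look like this; the `ERROR:` line format is an assumed convention for illustration, not the real glidein log format.

```python
import re
from collections import Counter

# Hypothetical log-scanning helper: tally validation-error messages
# across glidein logs so the dominant failure mode for a site stands
# out. The matched line format is an assumption for illustration.
ERROR_RE = re.compile(r"^ERROR: (?P<msg>.+)$")

def summarize_errors(log_lines):
    """Count each distinct error message seen in the given lines."""
    counts = Counter()
    for line in log_lines:
        m = ERROR_RE.match(line.strip())
        if m:
            counts[m.group("msg")] += 1
    return counts
```

Sorting the resulting counter by frequency immediately shows which failure dominates at a given site.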
Action items
● Not much we can do directly
● Typically, we open a ticket with the site
(unless this is the result of a misconfiguration on our part)
● Provide the list of nodes where it happens (rare to have the whole site broken)
● A concise but complete error report is essential for a speedy resolution
● In a minority of cases we have to contact the VO FE admin, e.g.
● Unclear error messages
● Non-WN-specific validation errors
Black hole nodes
● There is one further WN problem
● Black hole WNs
● WNs that accept glidein jobs, but don't execute them
● glidein_startup never has the chance to log anything
● Not even the node it is running on
● Thus, empty log files!
● We can infer we have a black hole node at a site by looking at job timing (in Condor-G logs)
● Good jobs run for at least 20 mins
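The timing heuristic can be expressed as a small predicate; the 20-minute threshold comes from the slide, while the function shape and the empty-log check are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Sketch of the black-hole heuristic: a glidein that produced an
# empty log AND ran well under the expected minimum runtime (the
# slide says good jobs run for at least 20 minutes) likely landed
# on a black-hole WN.
MIN_GOOD_RUNTIME = timedelta(minutes=20)

def looks_like_black_hole(start, end, log_size):
    """start/end: datetimes from the Condor-G log; log_size: bytes
    of glidein output received back."""
    return log_size == 0 and (end - start) < MIN_GOOD_RUNTIME
```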
Grid debugging
CE refusing the glideins
CE Refusing the glideins
● The CE admin has the right to refuse anyone
● But usually does not change his mind overnight
● Accessing a site for the first time is an issue of its own
– Not covered here
● When things go wrong, the typical reasons are
● CE service down
● Problems in the security/auth infrastructure
● CE seriously misconfigured/broken
Expected vs Unexpected
● Some “problems” are expected
● e.g. the CE is down for scheduled maintenance
● Nothing to do in this case!
– Just a monitoring issue
● So, checking the maintenance DB is important!
● If not, we have to notify the site
● The VO FEs are not getting the CPU slots they are asking for
Discovering the problem
● Condor-G reacts in two different ways
● Does nothing – we still have monitoring showing the job did not progress from Waiting→Pending
● Puts the job on Hold
● The G.Factory will react on Held jobs
● Releasing them a few times → Condor-G retries
● Removing them after a while
– Just to be replaced with identical glideins
For most non-trivial problems, the issue does not solve itself
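The release-then-remove policy can be modeled as follows; the retry limit of 3 is an assumed value, not the factory's actual configuration.

```python
# Illustrative model of the factory's reaction to Held glideins:
# release up to a fixed number of times (so Condor-G retries the
# submission), then remove the job. MAX_RELEASES is an assumption.
MAX_RELEASES = 3

def next_action(release_count):
    """Decide what to do with a Held glidein, given how many times
    it has already been released."""
    if release_count < MAX_RELEASES:
        return "release"   # let Condor-G retry the submission
    return "remove"        # give up; the factory submits a fresh glidein
```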
Action items (for unexpected problems)
● Most of the time, not much we can do directly
● Will just open a ticket with the site
● If there is any useful info in the HoldReason, we pass it on
● The DN of the proxy is the most valuable info
● But it could be our problem, too
● Found many Condor-G problems in the past
● Comparing the behavior of many G.Factory instances can confirm or exclude this
Ad-hoc solutions needed if this is the case
Grid debugging
CE not properly handling the glideins
Problematic CE
● Three basic types of problems:
● Glideins not starting
● Improper monitoring information
● Output files not being delivered to the client
● And there are two more
● Unexpected policies that kill glideins
● Preemption
Glideins not starting
● The CE scheduling policy is not available to us
● So it is often not obvious if we are just low priority or something else is going on
● GF/Condor-G does not see it as an error condition
● We usually don't act on it, unless
● The VO FE admin complains, or
● We have been given explicit guidance on the expected startup rates
● Not much for us to investigate
● Just tell the site admin “Jobs are not starting”
Glideins being killed by the site
● Ideally, our glideins should fit within the policies of the site
● But sometimes they don't
● So they get killed hard
● Discovering this from our side is very hard
● We often just notice empty log files
● Not an error for Condor-G
● Often we learn of this because the VO complains
● If and when we understand the problem, we can deal with it ourselves
● i.e. we configure the glideins to stay within the limits
But getting this info is not trivial, remember?
Preemption
● Some sites will preempt our glideins if higher-priority jobs get into the queue
● Effectively killing our glideins
● Not an actual error
● Sites have the right to do it!
● But it can mess with our monitoring/ops
● We may see killed glideins, or
● We may see glideins that seem to run for a very long time (when automatically rescheduled on the CE)
● We have to efficiently filter these events out
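One way to filter such events out could look like the sketch below; the event field names and the 48-hour runtime cap are assumptions for illustration.

```python
# Hypothetical filter for preemption artifacts in the monitoring
# stream: drop glideins killed by site preemption, and drop entries
# with implausibly long runtimes (likely auto-rescheduled on the CE).
# Field names and the 48 h cap are illustrative assumptions.
MAX_PLAUSIBLE_HOURS = 48

def filter_preemption(events):
    """events: list of dicts describing finished glideins."""
    kept = []
    for ev in events:
        if ev.get("preempted"):
            continue   # the site's right; not an error for us
        if ev.get("runtime_hours", 0) > MAX_PLAUSIBLE_HOURS:
            continue   # likely rescheduled on the CE, not a real run
        kept.append(ev)
    return kept
```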
Improper monitoring info from CE
● A CE may not provide reliable information
● Each VO FE provides us with monitoring information about its central manager
● By comparing what it tells us with what the CE tells us, we can infer if there are problems
● A large, consistent discrepancy typically signals problems in the CE monitoring
● Very difficult to figure out what is going on
● We have no direct detailed data to act upon
● Mostly ad-hoc detective work, prodding the black box
● Often inconclusive
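The CE-vs-CM comparison could be reduced to a simple check like the following; the 20% tolerance is an illustrative assumption, and what matters operationally is that the discrepancy is large and consistent over time.

```python
# Sketch of the cross-check described above: compare the number of
# running glideins the CE reports with the number of startds the VO
# central manager actually sees. A large relative discrepancy points
# at broken CE monitoring. The 20% tolerance is an assumed value.
def ce_monitoring_suspect(ce_running, cm_running, rel_tolerance=0.2):
    """True if the two counts disagree by more than rel_tolerance."""
    if ce_running == 0 and cm_running == 0:
        return False
    baseline = max(ce_running, cm_running)
    return abs(ce_running - cm_running) / baseline > rel_tolerance
```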
Lack of output files
● The glidein output files contain
● Accounting information
● Detailed logging
● Without other problems, mostly an annoyance
● But much more often paired with glideins failing
● Making failure diagnostics close to impossible
● Extremely hard to diagnose the root cause
● Sometimes we may infer it (black holes, killed glideins, ...)
● For actual CE problems it requires help from many parties, including us, the site admins, and SW developers
Grid debugging
Networking problems
Glideins are network heavy
● Each glidein opens several long-lived TCP connections (in CCB mode)
● Can overwhelm networking gear
– e.g. NATs can run out of spare ports
● Problems can have non-linear behavior
● Will work fine at small scale
● Will degrade after a while
– Not necessarily a step function, though
Although straight-out denials due to firewalls are also a problem
Diagnostics and action items
● Not trivial to detect
● Errors are often in the glidein logs
● But difficult to interpret
● Not much we can do directly
● A problem between the VO services and the site
– So we notify both
● However
● We usually end up assisting as experts
And we are lacking tools for automatically detecting this.
Grid debugging
Authentication problems
Security is delicate stuff
● Grid security mechanisms are paranoid by design
● “Availability” is the last thing to be considered
● The main focus is keeping the “bad guys” out
● So they are extremely delicate
● If any piece of the chain breaks, everything breaks
● Things that can go wrong (non-exhaustive list):
● Missing CA(s)
● Expired CRLs
● Expired glidein proxy
● Wrong system time (clock skew)
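Two of these failure modes, insufficient remaining proxy lifetime and clock skew, can be sketched as simple predicates; the thresholds below are assumptions, not values mandated by the Grid middleware.

```python
import time

# Minimal sketches of two of the checks listed above, with assumed
# thresholds: remaining proxy lifetime and clock skew against a
# reference time (e.g. obtained from a trusted server). GSI-style
# authentication breaks when clocks disagree by too much.
MIN_PROXY_SECONDS = 6 * 3600   # assumed minimum remaining lifetime
MAX_CLOCK_SKEW = 300           # 5 min; an assumed tolerance

def proxy_ok(proxy_expiry_epoch, now=None):
    """True if the proxy still has enough lifetime remaining."""
    now = time.time() if now is None else now
    return proxy_expiry_epoch - now >= MIN_PROXY_SECONDS

def clock_ok(reference_epoch, now=None):
    """True if the local clock agrees with the reference time."""
    now = time.time() if now is None else now
    return abs(now - reference_epoch) <= MAX_CLOCK_SKEW
```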
Diagnostics and action items
● Finding the root cause is usually hard
● Errors are in the glidein logs
● But usually do not provide enough info (to avoid giving up too much info to a hypothetical attacker)
● Have to distinguish between site problems and VO problems, too
● Only obvious if only a fraction fails (→ WN problem)
● Else, may need to get both sides involved to properly diagnose the root cause
And we are lacking tools for automatically detecting this.
Grid debugging
Job startup problems
gLExec (1)
● The biggest source of problems, by far, is gLExec refusing to accept a user proxy
● Resulting in jobs not starting
● BTW, Condor is not good at handling gLExec denials
● We can only partially test gLExec during validation
● May behave differently based on the proxy used
● Its behavior can change over time
● And final users may be the source of the problem
● e.g. by letting the proxy expire
Condor could catch these, and hopefully soon will
gLExec (2)
● Non-trivial to detect
● Errors are in the glidein logs
● But we lack the tools to extract them
● Finding the root cause is impossible without site admin help
● gLExec policies are a site secret
● We thus just notify the site, providing the failing user DN
Configuration problems
● Condor can be configured to run a wrapper around the user job
● To customize the user environment
● Usually provided by the VO FE
● If that fails, the user job fails with it
● Luckily, failures are rare
● If we notice them, we notify the VO FE admins
● However, they often notice before we do
Other job startup problems
● By default, we validate the node only at glidein startup
● WN conditions may change by the time a job is scheduled to run
– e.g. the disk fills up
● The errors are usually only seen by the final users
● So we hardly ever notice these kinds of problems
We should do better. Condor supports periodic validation tests, we just don't use them right now.
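For reference, periodic re-validation could be wired up with Condor's startd cron mechanism along these lines. This is a hypothetical sketch: the job tag, script path, period, and the `NodeHealthy` attribute are all assumed names, and (as the note above says) the glideins do not currently do this.

```
# Hypothetical startd cron setup: re-run a validation script every
# 15 minutes and only accept new jobs while the last run succeeded.
# The script is expected to print "NodeHealthy = True/False".
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) REVALIDATE
STARTD_CRON_REVALIDATE_EXECUTABLE = /path/to/check_node.sh
STARTD_CRON_REVALIDATE_PERIOD = 15m
STARTD_CRON_REVALIDATE_MODE = Periodic
START = $(START) && (NodeHealthy =?= True)
```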
Summary
● The Grid world is a good approximation of a chaotic system
● There are thus many failure modes
● The pilot paradigm hides most of the failures from the final users
● But the failures are still there
● Resulting in wasted/underused CPU cycles
● The G.Factory operators are in the best position to diagnose the root cause of the failures
● By having a global view
● However, they cannot solve the problems by themselves
Acknowledgments
● This document was sponsored by grants from the US NSF and US DOE, and by the UC system