This document presents how Glidein Factory operations help solve problems that develop on Grid resources.
glideinWMS training Grid debugging 1
glideinWMS training
Solving Grid problems through glidein monitoring
i.e. the Grid debugging part of G.Factory operations
by Igor Sfiligoi (UCSD)
Glidein Factory Operations
● Factory node operations
● Serving VO Frontend admin requests
● Keeping up with changes in the Grid
● Debugging Grid problems
● The most time-consuming part
● Effectively, we help solve Grid problems through glidein monitoring
Reminder - Glideins
● A glidein is a properly configured Condor startd daemon submitted as a Grid job
[Diagram: the Frontend monitors the Condor pool and requests glideins from the Factory; the Factory submits glideins to the CE, which runs them on worker nodes; the glidein's startd registers with the central manager and is matched to jobs from the submit node]
What can go wrong in the Grid?
● Many places where things can go wrong
● Essentially at any of the arrows in the architecture diagram
What can go wrong in the Grid?
● In particular
● CE may refuse to accept glideins
What can go wrong in the Grid?
● In particular
● CE may not start glideins
● Or fail to tell us what the status of the job is
What can go wrong in the Grid?
● In particular
● The worker node may be broken/misconfigured
– Thus validation will fail
● Many reasons
What can go wrong in the Grid?
● In particular
● The WAN networking may not work properly
● The CM never hears from the startd
● Or the schedd cannot talk to the startd
● Can be selective
What can go wrong in the Grid?
● In particular
● Or the security infrastructure could be broken
– CAs missing
– Time discrepancies
– Etc.
What can go wrong in the Grid?
● In particular
● The site may refuse to start the user job
– e.g. gLExec
What can go wrong with glideins?
● And there are also non-Grid problems
● Jobs not matching
● But that's beyond the scope of this document
Problem classification
● Most often we see WN problems
● Followed by CEs refusing glideins
– Both typically easy to diagnose
● Then there are misbehaving CEs
● Very hard to diagnose!
● Everything else quite rare
● But usually hard to diagnose as well
Grid debugging
Validation problems
i.e. problems on Worker Nodes
WN problems
● The glidein startup script runs a list of validation scripts
● If any of them fails, the WN is considered broken
● This way user jobs never get to broken WNs
● Two sources of tests
● Glidein Factory
● VO Frontend
● Of course, if the validation script cannot be fetched from either Web server, it is considered a failure as well
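The fail-fast logic above can be sketched as follows. This is an illustrative model, not the actual glideinWMS code (the real validation scripts are shell scripts run by glidein_startup), and the function and test names are hypothetical.

```python
# Hypothetical sketch of the glidein validation logic: run each
# fetched test in order and treat the node as broken on the first
# failure. A test that cannot even be fetched counts as a failure.

def validate_worker_node(tests):
    """tests: list of (name, callable) pairs; each callable returns
    True on success. Returns (ok, name_of_failed_test)."""
    for name, test in tests:
        try:
            ok = test()
        except Exception:   # e.g. the script could not be fetched
            ok = False
        if not ok:
            return (False, name)   # WN considered broken; no user jobs land here
    return (True, None)
```

Because the first failure short-circuits the loop, user jobs never get matched to a node that failed any test, which is exactly the guarantee the slide describes.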
Types of tests
● The glideinWMS SW comes with a set of standard tests (provided by the factory):
● Grid environment present (e.g. CAs)
● Some free disk on $PWD and on /tmp
● Enough FE-provided proxy lifetime remaining
● gLExec-related tests
● OS type
● Each VO may have its own needs, e.g.:
● Is VO SW pre-installed and accessible?
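As an illustration, the free-disk test could look like the sketch below; the 1 GB threshold and the function name are assumptions, not the actual glideinWMS defaults.

```python
import os
import shutil

# Illustrative version of one standard factory test: require a
# minimum amount of free disk on $PWD and on /tmp. The 1 GB
# threshold is an assumed value, not the glideinWMS default.
MIN_FREE_BYTES = 1 * 1024**3

def enough_free_disk(paths=(os.getcwd(), "/tmp"), min_free=MIN_FREE_BYTES):
    """Return False if any of the given paths has less free space
    than min_free bytes."""
    for p in paths:
        if shutil.disk_usage(p).free < min_free:
            return False
    return True
```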
Discovering the problems
● Any error message printed out by the validation script will be delivered back to the factory
● After the glidein terminates
● Most validation scripts provide a clear indication of what went wrong
● And we strive to get all to do it!
● New machine-readable format being introduced
● With v2_6_2
Typical ops
● Noticing that a large fraction of glideins for a site are failing is easy
● Just look at the monitoring
● And we are getting a daily email as well
● Discovering what exactly is broken is not too difficult either
● Just parse the logs (with appropriate tools)
● Will get even easier when all scripts return machine-readable information
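A minimal log-parsing helper along these lines might look like this; the `ERROR:` line format is an assumed convention for illustration, not the real glidein log format.

```python
import re
from collections import Counter

# Hypothetical log-scanning helper: tally validation-error messages
# across glidein logs so the dominant failure mode for a site stands
# out. The matched line format is an assumption for illustration.
ERROR_RE = re.compile(r"^ERROR: (?P<msg>.+)$")

def summarize_errors(log_lines):
    """Count each distinct error message seen in the given lines."""
    counts = Counter()
    for line in log_lines:
        m = ERROR_RE.match(line.strip())
        if m:
            counts[m.group("msg")] += 1
    return counts
```

Sorting the resulting counter by frequency immediately shows which failure dominates at a given site.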
Action items
● Not much we can do directly
● Typically, we open a ticket with the site
(unless this is the result of a misconfiguration on our part)
● Provide the list of nodes where it happens (rare to have the whole site broken)
● A concise but complete error report is essential for a speedy resolution
● In a minority of cases we have to contact the VO FE admin, e.g.
● Unclear error messages
● Non-WN-specific validation errors
Black hole nodes
● There is one further WN problem
● Black hole WNs
● WNs that accept glidein jobs, but don't execute them
● glidein_startup never has the chance to log anything
● Not even the node it is running on
● Thus, empty log files!
● We can infer we have a black hole node at a site by looking at job timing (in Condor-G logs)
● Good jobs run for at least 20 mins
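The timing heuristic can be expressed as a small predicate; the 20-minute threshold comes from the slide, while the function shape and the empty-log check are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Sketch of the black-hole heuristic: a glidein that produced an
# empty log AND ran well under the expected minimum runtime (the
# slide says good jobs run for at least 20 minutes) likely landed
# on a black-hole WN.
MIN_GOOD_RUNTIME = timedelta(minutes=20)

def looks_like_black_hole(start, end, log_size):
    """start/end: datetimes from the Condor-G log; log_size: bytes
    of glidein output received back."""
    return log_size == 0 and (end - start) < MIN_GOOD_RUNTIME
```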
Grid debugging
CE refusing the glideins
CE Refusing the glideins
● The CE admin has the right to refuse anyone
● But usually does not change his mind overnight
● Accessing a site for the first time is an issue of its own
– Not covered here
● When things go wrong, the typical reasons are
● CE service down
● Problems in the security/auth infrastructure
● CE seriously misconfigured/broken
Expected vs Unexpected
● Some “problems” are expected
● e.g. the CE is down for scheduled maintenance
● Nothing to do in this case!
– Just a monitoring issue
● So, checking the maintenance DB is important!
● If not, we have to notify the site
● The VO FEs are not getting the CPU slots they are asking for
Discovering the problem
● Condor-G reacts in two different ways
● Does nothing – we still have monitoring showing the job did not progress from Waiting→Pending
● Puts the job on Hold
● The G.Factory will react on Held jobs
● Releasing them a few times → Condor-G retries
● Removing them after a while
– Just to be replaced with identical glideins
For most non-trivial problems, the issue does not solve itself
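The release-then-remove policy can be modeled as follows; the retry limit of 3 is an assumed value, not the factory's actual configuration.

```python
# Illustrative model of the factory's reaction to Held glideins:
# release up to a fixed number of times (so Condor-G retries the
# submission), then remove the job. MAX_RELEASES is an assumption.
MAX_RELEASES = 3

def next_action(release_count):
    """Decide what to do with a Held glidein, given how many times
    it has already been released."""
    if release_count < MAX_RELEASES:
        return "release"   # let Condor-G retry the submission
    return "remove"        # give up; the factory submits a fresh glidein
```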
Action items (for unexpected problems)
● Most of the time, not much we can do directly
● Will just open a ticket with the site
● If there is any useful info in the HoldReason, we pass it on
● The DN of the proxy is the most valuable info
● But it could be our problem, too
● Found many Condor-G problems in the past
● Comparing the behavior of many G.Factory instances can confirm or exclude this
Ad-hoc solutions needed if this is the case
Grid debugging
CE not properly handling the glideins
Problematic CE
● Three basic types of problems:
● Glideins not starting
● Improper monitoring information
● Output files not being delivered to the client
● And there are two more
● Unexpected policies that kill glideins
● Preemption
Glideins not starting
● The CE scheduling policy is not available to us
● So it is often not obvious if we are just low priority or something else is going on
● GF/Condor-G does not see it as an error condition
● We usually don't act on it, unless
● The VO FE admin complains, or
● We have been given explicit guidance on the expected startup rates
● Not much for us to investigate
● Just tell the site admin “Jobs are not starting”
Glideins being killed by the site
● Ideally, our glideins should fit within the policies of the site
● But sometimes they don't
● So they get killed hard
● Discovering this from our side is very hard
● We often just notice empty log files
● Not an error for Condor-G
● Often we learn of this because the VO complains
● If and when we understand the problem, we can deal with it ourselves
● i.e. we configure the glideins to stay within the limits
But getting this info is not trivial, remember?
Preemption
● Some sites will preempt our glideins if higher-priority jobs get into the queue
● Effectively killing our glideins
● Not an actual error
● Sites have the right to do it!
● But it can mess with our monitoring/ops
● We may see killed glideins, or
● We may see glideins that seem to run for a very long time (when automatically rescheduled on the CE)
● We have to efficiently filter these events out
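One way to filter such events out could look like the sketch below; the event field names and the 48-hour runtime cap are assumptions for illustration.

```python
# Hypothetical filter for preemption artifacts in the monitoring
# stream: drop glideins killed by site preemption, and drop entries
# with implausibly long runtimes (likely auto-rescheduled on the CE).
# Field names and the 48 h cap are illustrative assumptions.
MAX_PLAUSIBLE_HOURS = 48

def filter_preemption(events):
    """events: list of dicts describing finished glideins."""
    kept = []
    for ev in events:
        if ev.get("preempted"):
            continue   # the site's right; not an error for us
        if ev.get("runtime_hours", 0) > MAX_PLAUSIBLE_HOURS:
            continue   # likely rescheduled on the CE, not a real run
        kept.append(ev)
    return kept
```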
Improper monitoring info from CE
● A CE may not provide reliable information
● Each VO FE provides us with monitoring information about its central manager
● By comparing what it tells us with what the CE tells us, we can infer if there are problems
● A large, consistent discrepancy typically signals problems in the CE monitoring
● Very difficult to figure out what is going on
● We have no direct detailed data to act upon
● Mostly ad-hoc detective work, prodding the black box
● Often inconclusive
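The CE-vs-CM comparison could be reduced to a simple check like the following; the 20% tolerance is an illustrative assumption, and what matters operationally is that the discrepancy is large and consistent over time.

```python
# Sketch of the cross-check described above: compare the number of
# running glideins the CE reports with the number of startds the VO
# central manager actually sees. A large relative discrepancy points
# at broken CE monitoring. The 20% tolerance is an assumed value.
def ce_monitoring_suspect(ce_running, cm_running, rel_tolerance=0.2):
    """True if the two counts disagree by more than rel_tolerance."""
    if ce_running == 0 and cm_running == 0:
        return False
    baseline = max(ce_running, cm_running)
    return abs(ce_running - cm_running) / baseline > rel_tolerance
```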
Lack of output files
● The glidein output files contain
● Accounting information
● Detailed logging
● Without other problems, mostly an annoyance
● But much more often paired with glideins failing
● Making failure diagnostics close to impossible
● Extremely hard to diagnose the root cause
● Sometimes we may infer it (black holes, killed glideins, ...)
● For actual CE problems it requires help from many parties, including us, the site admins, and SW developers
Grid debugging
Networking problems
Glideins are network heavy
● Each glidein opens several long-lived TCP connections (in CCB mode)
● Can overwhelm networking gear
– e.g. NATs can run out of spare ports
● Problems can have non-linear behavior
● Will work fine at small scale
● Will degrade after a while
– Not necessarily a step function, though
Although straight-out denials due to firewalls are also a problem
Diagnostics and action items
● Not trivial to detect
● Errors are often in the glidein logs
● But difficult to interpret
● Not much we can do directly
● A problem between the VO services and the site
– So we notify both
● However
● We usually end up assisting as experts
And we are lacking tools for automatically detecting this.
Grid debugging
Authentication problems
Security is delicate stuff
● Grid security mechanisms are paranoid by design
● “Availability” is the last thing to be considered
● The main focus is keeping the “bad guys” out
● So they are extremely delicate
● If any piece of the chain breaks, everything breaks
● Things that can go wrong (non-exhaustive list):
● Missing CA(s)
● Expired CRLs
● Expired glidein proxy
● Wrong system time (clock skew)
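Two of these failure modes, insufficient remaining proxy lifetime and clock skew, can be sketched as simple predicates; the thresholds below are assumptions, not values mandated by the Grid middleware.

```python
import time

# Minimal sketches of two of the checks listed above, with assumed
# thresholds: remaining proxy lifetime and clock skew against a
# reference time (e.g. obtained from a trusted server). GSI-style
# authentication breaks when clocks disagree by too much.
MIN_PROXY_SECONDS = 6 * 3600   # assumed minimum remaining lifetime
MAX_CLOCK_SKEW = 300           # 5 min; an assumed tolerance

def proxy_ok(proxy_expiry_epoch, now=None):
    """True if the proxy still has enough lifetime remaining."""
    now = time.time() if now is None else now
    return proxy_expiry_epoch - now >= MIN_PROXY_SECONDS

def clock_ok(reference_epoch, now=None):
    """True if the local clock agrees with the reference time."""
    now = time.time() if now is None else now
    return abs(now - reference_epoch) <= MAX_CLOCK_SKEW
```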
Diagnostics and action items
● Finding the root cause is usually hard
● Errors are in the glidein logs
● But usually do not provide enough info (to avoid giving up too much info to a hypothetical attacker)
● Have to distinguish between site problems and VO problems, too
● Only obvious if only a fraction fails (→ WN problem)
● Else, may need to get both sides involved to properly diagnose the root cause
And we are lacking tools for automatically detecting this.
Grid debugging
Job startup problems
gLExec (1)
● The biggest source of problems, by far, is gLExec refusing to accept a user proxy
● Resulting in jobs not starting
● BTW, Condor is not good at handling gLExec denials
● We can only partially test gLExec during validation
● May behave differently based on the proxy used
● Its behavior can change over time
● And final users may be the source of the problem
● e.g. by letting the proxy expire
Condor could catch these, and hopefully soon will
gLExec (2)
● Non-trivial to detect
● Errors are in the glidein logs
● But we lack the tools to extract them
● Finding the root cause is impossible without site admin help
● gLExec policies are a site secret
● We thus just notify the site, providing the failing user DN
Configuration problems
● Condor can be configured to run a wrapper around the user job
● To customize the user environment
● Usually provided by the VO FE
● If that fails, the user job fails with it
● Luckily, failures are rare
● If we notice them, we notify the VO FE admins
● However, they often notice before we do
Other job startup problems
● By default, we validate the node only at glidein startup
● WN conditions may change by the time a job is scheduled to run
– e.g. the disk fills up
● The errors are usually only seen by the final users
● So we hardly ever notice these kinds of problems
We should do better. Condor supports periodic validation tests, we just don't use them right now.
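For reference, periodic re-validation could be wired up with Condor's startd cron mechanism along these lines. This is a hypothetical sketch: the job tag, script path, period, and the `NodeHealthy` attribute are all assumed names, and (as the note above says) the glideins do not currently do this.

```
# Hypothetical startd cron setup: re-run a validation script every
# 15 minutes and only accept new jobs while the last run succeeded.
# The script is expected to print "NodeHealthy = True/False".
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) REVALIDATE
STARTD_CRON_REVALIDATE_EXECUTABLE = /path/to/check_node.sh
STARTD_CRON_REVALIDATE_PERIOD = 15m
STARTD_CRON_REVALIDATE_MODE = Periodic
START = $(START) && (NodeHealthy =?= True)
```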
Summary
● The Grid world is a good approximation of a chaotic system
● There are thus many failure modes
● The pilot paradigm hides most of the failures from the final users
● But the failures are still there
● Resulting in wasted/underused CPU cycles
● The G.Factory operators are in the best position to diagnose the root cause of the failures
● By having a global view
● However, they cannot solve the problems by themselves
Acknowledgments
● This document was sponsored by grants from the US NSF and US DOE, and by the UC system