UCSD Jan 18th 2012 Frontend Monitoring 1 glideinWMS Training @ UCSD glideinWMS Frontend Monitoring by Igor Sfiligoi (UCSD)

glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

DESCRIPTION

This talk walks you through the monitoring options a glideinWMS Frontend operator has. Part of the glideinWMS Training session held in Jan 2012 at UCSD.

Page 1: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

UCSD Jan 18th 2012 Frontend Monitoring 1

glideinWMS Training @ UCSD

glideinWMS Frontend Monitoring

by Igor Sfiligoi (UCSD)

Page 2: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Overview

● Refresher
● What is available
● What to look for

Page 3: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Refresher – glideinWMS

● A glidein is just a properly configured Condor execution node submitted as a Grid job
● Frontend drives submission

[Figure: glideinWMS architecture diagram. Labels: Factory node, Condor, Factory, Frontend node, Frontend, CREAM, Globus, Submit node (x2), Central manager, Execution node / glidein (x2), Worker node, glidein, Startd, Job, Condor G.N. Arrows: Monitor Condor, Request glideins, Submit glideins, Configure, Match.]

Page 4: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Reminder

Condor is king! (glideinWMS is just a small layer on top)

Page 5: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Refresher – Frontend arch

● Many Groups
● With a “Master” Frontend as an aggregator

[Figure: Frontend architecture diagram. Labels: Frontend node, Frontend, Entry, Group (several), Spawn, Factory (several), glidein, WebServer, Submit node (x2), Central manager.]

Page 6: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Available monitoring

● Condor monitoring
  ● It is just a Condor pool! (Even if a dynamic one)
  ● Any Condor monitoring tools will work
● VO Frontend monitoring
  ● The VO Frontend provides some basic Condor monitoring
  ● Plus the monitoring of its own internal workings
● Glidein Factory monitoring
  ● You should not need to use it, but it is publicly accessible

Page 7: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Condor monitoring

Page 8: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Condor Monitoring

● Out of the box you get
  ● Command line tools
  ● Log parsing
● Several external tools available, e.g.
  ● CondorView (Condor external package)
  ● CycleServer (Commercial tool, (semi-)free for Academia)

Your portal may provide additional monitoring, too

Page 9: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Glidein monitoring

● The glideins will register with the Collector
● Condor command to monitor them: condor_status
  ● -constraint - To select a subset of them (same syntax as Requirements)
  ● -total - For a quick summary
● Output formatting options
  ● No arguments - In use/unused
  ● -long - Full ClassAds
  ● -format - Select attributes only
  ● -xml - XML formatting
  (Easier to machine parse)

http://www.cs.wisc.edu/condor/manual/v7.6/condor_status.html

Page 10: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Example

$ condor_status

Name OpSys Arch State Activity LoadAv Mem ActvtyTime

glidein_17848@alic LINUX X86_64 Claimed Busy 7.440 18037 0+01:06:06
glidein_15842@alic LINUX X86_64 Claimed Busy 7.010 18037 0+00:35:21
glidein_18249@alic LINUX X86_64 Claimed Busy 7.510 18037 0+01:24:09
glidein_17825@wn89 LINUX X86_64 Unclaimed Idle 11.990 16056 0+00:15:12
glidein_10082@wn91 LINUX X86_64 Claimed Idle 7.000 16056 0+00:02:46
…
glidein_3964@wp-05 LINUX X86_64 Claimed Busy 24.000 64464 0+16:00:29
glidein_5614@wp-05 LINUX X86_64 Claimed Busy 23.360 64464 0+16:12:56
glidein_5861@wp-05 LINUX X86_64 Claimed Retiring 22.140 64464 0+00:23:18

 Total Owner Claimed Unclaimed Matched Preempting Backfill

X86_64/LINUX 23249 0 22697 552 0 0 0

Total 23249 0 22697 552 0 0 0
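The per-glidein rows above can also be tallied by script when the built-in totals are not granular enough. A minimal sketch, assuming the default eight-column layout shown on this slide (Name, OpSys, Arch, State, Activity, LoadAv, Mem, ActvtyTime); the function name is hypothetical and not part of Condor:

```python
from collections import Counter


def count_by_state(rows):
    """Count glideins per (State, Activity) pair in default condor_status output."""
    counts = Counter()
    for row in rows:
        fields = row.split()
        # Data rows have 8 columns and a slot name containing '@';
        # this skips the header, blank lines and the totals table.
        if len(fields) != 8 or "@" not in fields[0]:
            continue
        counts[(fields[3], fields[4])] += 1
    return counts


SAMPLE_ROWS = [
    "Name OpSys Arch State Activity LoadAv Mem ActvtyTime",
    "glidein_17848@alic LINUX X86_64 Claimed Busy 7.440 18037 0+01:06:06",
    "glidein_17825@wn89 LINUX X86_64 Unclaimed Idle 11.990 16056 0+00:15:12",
    "glidein_5861@wp-05 LINUX X86_64 Claimed Retiring 22.140 64464 0+00:23:18",
    "Total 23249 0 22697 552 0 0 0",
]

if __name__ == "__main__":
    for key, n in sorted(count_by_state(SAMPLE_ROWS).items()):
        print(key, n)
```

Whitespace splitting is fragile if a column is ever empty; for anything beyond a quick look, the -format or -xml options are the safer input.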

Page 11: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Another example

$ condor_status -format '%-50s ' Name -format '%6i\n' GLIDEIN_Max_Walltime \
    -const "GLIDEIN_Max_Walltime>83000"
[email protected] [email protected] [email protected] [email protected] [email protected] [email protected] 114840

$ condor_status -format '%-50s ' Name -format '%6i\n' GLIDEIN_Max_Walltime -xml \
    -const "GLIDEIN_Max_Walltime>83000"
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
<c>
 <a n="MyType"><s>Machine</s></a>
 <a n="TargetType"><s>Job</s></a>
 <a n="Name"><s>[email protected]</s></a>
 <a n="GLIDEIN_Max_Walltime"><i>86040</i></a>
 <a n="CurrentTime"><e>time()</e></a>
</c>
...
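The XML form is the easiest one to consume from a script. A minimal sketch, assuming the element layout shown above (one <c> per ClassAd, one <a n="..."> per attribute, with a single typed child such as <s>, <i> or <e>); the function name and the sample glidein name are made up for illustration, and only integers are converted:

```python
import xml.etree.ElementTree as ET


def parse_classads_xml(xml_text):
    """Return one dict per <c> element, mapping attribute name to value."""
    root = ET.fromstring(xml_text)
    ads = []
    for c in root.iter("c"):
        ad = {}
        for a in c.findall("a"):
            if len(a) == 0:
                continue  # attribute without a typed child; skip it
            value = a[0]
            if value.tag == "i":
                ad[a.get("n")] = int(value.text)
            else:
                # Strings, expressions and other types are kept as raw text.
                ad[a.get("n")] = value.text
        ads.append(ad)
    return ads


# Hypothetical sample in the same shape as the slide output.
SAMPLE_XML = """<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
<c>
 <a n="MyType"><s>Machine</s></a>
 <a n="Name"><s>glidein_1@example.node</s></a>
 <a n="GLIDEIN_Max_Walltime"><i>86040</i></a>
 <a n="CurrentTime"><e>time()</e></a>
</c>
</classads>"""

if __name__ == "__main__":
    for ad in parse_classads_xml(SAMPLE_XML):
        print(ad["Name"], ad["GLIDEIN_Max_Walltime"])
```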

Page 12: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Collector log(s)

● The Collector(s) will log any errors
  ● The interesting errors will likely be in the leaves of the Collector tree
    ~condor/glidecondor/condor_local/log/CondorXXXLog
    (Yes, you will have 100s of them!)
  ● Logs rotate, so be sure to look in .old as well
● You also get the glidein authentication logs
● And log verbosity can be further increased with COLLECTOR_DEBUG
  http://www.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:SubsysDebug

Place to look when things seem fishy!

Page 13: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Example

01/13/12 17:24:13 ZKM: 1: attempting to map '/DC=org/DC=doegrids/OU=Services/CN=uscmspilot47/glidein-1.t2.ucsd.edu'
01/13/12 17:24:13 ZKM: 2: mapret: 0 included_voms: 0 canonical_user: glidein47
01/13/12 17:24:13 ZKM: successful mapping to glidein47
...
01/13/12 17:24:19 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 4 bytes from <130.104.133.245:7812>.
01/13/12 17:24:19 DaemonCore: Can't receive command request from 130.104.133.245 (perhaps a timeout?)
...
01/13/12 17:24:41 CCB: rejecting request from SHADOW <169.228.130.26:9615?sock=10716_242d_201658> on <169.228.130.26:21142> for ccbid 14018 because no daemon is currently registered with that id (perhaps it recently disconnected).
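With hundreds of Collector logs, a small filter helps to surface the lines worth reading. A minimal sketch, assuming the timestamp layout shown above; the marker substrings are simply picked from this one sample and are an assumption, not a list defined by Condor:

```python
import re

# Hypothetical markers taken from the sample above; tune them for your pool.
TROUBLE_MARKERS = ("failed", "Can't", "rejecting")

# "MM/DD/YY HH:MM:SS message"
LINE_RE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) (.*)$")


def suspicious_lines(lines):
    """Return (timestamp, message) pairs whose message contains a marker."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and any(marker in m.group(2) for marker in TROUBLE_MARKERS):
            hits.append((m.group(1), m.group(2)))
    return hits


SAMPLE_LOG = [
    "01/13/12 17:24:13 ZKM: successful mapping to glidein47",
    "01/13/12 17:24:19 condor_read() failed: recv() returned -1, errno = 104",
    "01/13/12 17:24:41 CCB: rejecting request from SHADOW for ccbid 14018",
]

if __name__ == "__main__":
    for stamp, message in suspicious_lines(SAMPLE_LOG):
        print(stamp, message)
```

Remember to feed it the rotated .old file as well as the current log.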

Page 14: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Job monitoring

● You can monitor local jobs
  ● For jobs still in the queue (still waiting or running): condor_q
  ● For finished jobs: condor_history (limited number of jobs preserved)
● Similar cmdline args as condor_status
● Remote condor_q possible with -name

http://www.cs.wisc.edu/condor/manual/v7.6/condor_q.html
http://www.cs.wisc.edu/condor/manual/v7.6/condor_history.html

Page 15: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Example

$ condor_q

-- Submitter: my.node : <192.168.130.11:9615?sock=9763_cd4c_2> : my.node
 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
367788.0 uscms2330 1/6 11:03 0+00:00:00 I 0 0.0 CMSSW.sh 1 1
367788.1 uscms2330 1/6 11:03 0+00:00:00 I 0 0.0 CMSSW.sh 2 1
383995.19 uscms1789 1/11 02:26 2+13:35:38 R 0 1953.1 CMSSW.sh 1118 4
383995.179 uscms1789 1/11 02:26 2+11:29:06 R 0 1464.8 CMSSW.sh 1310 4
383999.32 uscms1789 1/11 02:31 2+09:36:12 R 0 1953.1 CMSSW.sh 299 4
383999.46 uscms1789 1/11 02:31 2+11:00:25 R 0 1953.1 CMSSW.sh 316 4
…
385002.7 uscms3015 1/13 17:31 0+00:01:51 R 0 0.0 CMSSW.sh 70 2
385002.8 uscms3015 1/13 17:31 0+00:01:49 R 0 0.0 CMSSW.sh 89 2
385002.9 uscms3015 1/13 17:31 0+00:01:29 R 0 0.0 CMSSW.sh 91 2
385002.10 uscms3015 1/13 17:31 0+00:01:00 R 0 0.0 CMSSW.sh 97 2

58707 jobs; 39484 idle, 11694 running, 7529 held
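The summary line at the end of the condor_q output is often all a monitoring script needs. A minimal sketch that parses it, assuming the exact wording shown on this slide (other Condor versions may print additional categories); the function name is hypothetical:

```python
import re

# Matches the summary layout shown on the slide above.
SUMMARY_RE = re.compile(r"^(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_q_summary(line):
    """Return the job counts from a condor_q summary line, or None."""
    m = SUMMARY_RE.match(line.strip())
    if m is None:
        return None
    total, idle, running, held = (int(g) for g in m.groups())
    return {"total": total, "idle": idle, "running": running, "held": held}


SUMMARY_LINE = "58707 jobs; 39484 idle, 11694 running, 7529 held"

if __name__ == "__main__":
    counts = parse_q_summary(SUMMARY_LINE)
    # In this sample the three states add up to the total.
    print(counts, counts["idle"] + counts["running"] + counts["held"])
```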

Page 16: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Job logs

● Users are encouraged to have a log for jobs
  ● Provides an easy way to monitor progress without calling condor_q/condor_history

000 (001.000.000) 12/15 12:28:05 Job submitted from host: <127.0.0.1:43569>
...
001 (001.000.000) 12/16 08:30:02 Job executing on host: <169.228.130.213:36422>
...
005 (001.000.000) 12/16 13:30:32 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 01:00:00, Sys 0 00:05:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 01:00:00, Sys 0 00:05:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    217 - Run Bytes Sent By Job
    76 - Run Bytes Received By Job
    217 - Total Bytes Sent By Job
    76 - Total Bytes Received By Job
...
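Following a job through its log amounts to picking out the event header lines. A minimal sketch, assuming the header layout shown above (three-digit event code, job id as cluster.proc.subproc, date, time, text); detail lines and the "..." separators are ignored, and the function name is hypothetical:

```python
import re

# "005 (001.000.000) 12/16 13:30:32 Job terminated."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)


def parse_events(lines):
    """Return one dict per event header line in a job log."""
    events = []
    for line in lines:
        m = EVENT_RE.match(line.rstrip("\n"))
        if m is None:
            continue  # indented detail line or "..." separator
        events.append({
            "code": m.group(1),
            "cluster": int(m.group(2)),
            "proc": int(m.group(3)),
            "when": m.group(5),
            "text": m.group(6),
        })
    return events


SAMPLE_JOB_LOG = [
    "000 (001.000.000) 12/15 12:28:05 Job submitted from host: <127.0.0.1:43569>",
    "...",
    "001 (001.000.000) 12/16 08:30:02 Job executing on host: <169.228.130.213:36422>",
    "...",
    "005 (001.000.000) 12/16 13:30:32 Job terminated.",
    "    (1) Normal termination (return value 0)",
    "...",
]

if __name__ == "__main__":
    for event in parse_events(SAMPLE_JOB_LOG):
        print(event["code"], event["when"], event["text"])
```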

Literally ...

Page 17: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Condor Daemon logs

● By default
  ● Schedd writes a log
    /opt/glidecondor/condor_local/log/ScheddLog
  ● Shadows share a common log
    /opt/glidecondor/condor_local/log/ShadowLog
  ● The logs rotate, look for .old files as well
● Lots of interesting info in them
  ● Quite high verbosity by default

Page 18: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

ScheddLog Example

01/12/12 20:38:37 (pid:32035) ZKM: 2: mapret: 0 included_voms: 0 canonical_user: cfeng
01/12/12 20:38:37 (pid:32035) ZKM: successful mapping to cfeng
01/12/12 20:38:37 (pid:28485) GET_JOB_CONNECT_INFO failed: No such job: 157170.4
01/12/12 20:39:27 (pid:32035) ZKM: 1: attempting to map '/DC=org/DC=doegrids/OU=Services/CN=rokpilot01/osg.ctbp.ucsd.edu'
01/12/12 20:39:27 (pid:32035) ZKM: 2: mapret: 0 included_voms: 0 canonical_user: pilot
01/12/12 20:39:27 (pid:32035) ZKM: successful mapping to pilot
01/12/12 20:39:32 (pid:32035) Shadow pid 11058 for job 157170.82 exited with status 100
...
01/13/12 18:05:02 (pid:32035) Activity on stashed negotiator socket: <169.228.40.37:60824>
01/13/12 18:05:02 (pid:32035) Using negotiation protocol: NEGOTIATE
01/13/12 18:05:02 (pid:32035) Negotiating for owner: [email protected]
01/13/12 18:05:02 (pid:32035) Finished negotiating for cfeng in local pool: 4 matched, 1 rejected
01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd [email protected] <169.228.131.224:55393?CCBID=169.228.40.37:9673#57&noUDP> for cfeng
01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd [email protected] <169.228.131.217:34929?CCBID=169.228.40.37:9623#108&noUDP> for cfeng
01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd [email protected] <169.228.131.156:33784?CCBID=169.228.40.37:9708#95&noUDP> for cfeng
01/13/12 18:05:04 (pid:32035) Starting add_shadow_birthdate(157177.138)
01/13/12 18:05:04 (pid:32035) Started shadow for job 157177.138 on [email protected] <169.228.131.224:55393?CCBID=169.228.40.37:9673#57&noUDP> for cfeng, (shadow pid = 5238)

Page 19: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

ShadowLog Example

01/12/12 21:52:36 DaemonCore: command socket at <169.228.40.37:47586?noUDP>
01/12/12 21:52:36 DaemonCore: private command socket at <169.228.40.37:47586>
01/12/12 21:52:36 Setting maximum accepts per cycle 4.
01/12/12 21:52:36 Initializing a VANILLA shadow for job 157171.108
01/12/12 21:52:36 (157171.97) (32318): Request to run on [email protected] <169.228.131.154:48495?CCBID=169.228.40.37:9644#87&noUDP> was ACCEPTED
01/12/12 21:52:36 (157170.39) (10937): Job 157170.39 terminated: exited with status 0
01/12/12 21:52:36 (157170.39) (10937): **** condor_shadow (condor_SHADOW) pid 10937 EXITING WITH STATUS 100
…
01/13/12 18:01:08 (157177.52) (4768): DoUpload: (Condor error code 12, subcode 28) SHADOW at 169.228.40.37 failed to send file(s) to <169.228.131.178:47976>; STARTER at 169.228.131.178 failed to write to file /data6/condor_local/execute/dir_13776/glide_J13834/execute/dir_19620/griddatasourceB.tgz: (errno 28) No space left on device
01/13/12 18:01:15 (157177.52) (4768): **** condor_shadow (condor_SHADOW) pid 4768 EXITING WITH STATUS 112
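Since all shadows share one log, grouping the exit lines by job makes unusual exits easier to spot. A minimal sketch, assuming the "EXITING WITH STATUS" line layout shown in this example; the function name is hypothetical, and the only status meanings relied on are the two visible here (100 next to a job that exited with status 0, 112 next to a failed file transfer):

```python
import re

# "(157170.39) (10937): **** condor_shadow (condor_SHADOW) pid 10937 EXITING WITH STATUS 100"
EXIT_RE = re.compile(
    r"\((\d+\.\d+)\) \(\d+\): \*\*\*\* condor_shadow \(condor_SHADOW\) "
    r"pid (\d+) EXITING WITH STATUS (\d+)"
)


def shadow_exit_statuses(lines):
    """Map job id to the last shadow exit status seen for it."""
    statuses = {}
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            statuses[m.group(1)] = int(m.group(3))
    return statuses


SAMPLE_SHADOW_LOG = [
    "01/12/12 21:52:36 (157170.39) (10937): Job 157170.39 terminated: exited with status 0",
    "01/12/12 21:52:36 (157170.39) (10937): **** condor_shadow (condor_SHADOW) pid 10937 EXITING WITH STATUS 100",
    "01/13/12 18:01:15 (157177.52) (4768): **** condor_shadow (condor_SHADOW) pid 4768 EXITING WITH STATUS 112",
]

if __name__ == "__main__":
    for job, status in shadow_exit_statuses(SAMPLE_SHADOW_LOG).items():
        print(job, status)
```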

Page 20: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Submitter ClassAds

● The schedd will advertise two types of ClassAds to the Collector
  ● Schedd daemon ClassAds: condor_status -schedd
  ● Per-user ClassAds: condor_status -submitter
● Can be useful for getting a summary view of the system

Page 21: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Example

$ condor_status -schedd

Name Machine TotalRunningJobs TotalIdleJobs TotalHeldJobs

cmsfnal01.fnal.gov cmsfnal01. 0 0 0
glidein-2.t2.ucsd.ed glidein-2. 10932 38480 7607
submit-2.t2.ucsd.edu submit-2.t 11103 8955 1667
vocms120.cern.ch vocms120.c 0 4024 2

 TotalRunningJobs TotalIdleJobs TotalHeldJobs

Total 22035 51459 9276

$ condor_status -schedd -l submit-2.t2.ucsd.edu
Name = "submit-2.t2.ucsd.edu"
MaxJobsRunning = 20000
TotalHeldJobs = 1667
TotalIdleJobs = 9347
…
TotalJobAds = 22096
TransferQueueDownloadWaitTime = 0
MyType = "Scheduler"
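The long (-l) form is a list of "Name = value" lines, which is easy to turn into a dictionary for further checks. A minimal sketch, assuming only quoted strings and plain integers need converting and that anything else can be kept as raw text; the function name is hypothetical:

```python
def parse_long_ad(text):
    """Parse 'Attr = value' lines from condor_status -long into a dict."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if not sep:
            continue  # blank line or elision marker
        name, value = name.strip(), value.strip()
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            ad[name] = value[1:-1]
        else:
            try:
                ad[name] = int(value)
            except ValueError:
                ad[name] = value  # expression or other type, kept as text
    return ad


SAMPLE_AD = """Name = "submit-2.t2.ucsd.edu"
MaxJobsRunning = 20000
TotalHeldJobs = 1667
TotalIdleJobs = 9347
TotalJobAds = 22096
MyType = "Scheduler"
"""

if __name__ == "__main__":
    ad = parse_long_ad(SAMPLE_AD)
    print(ad["Name"], ad["TotalIdleJobs"], ad["MaxJobsRunning"])
```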

Page 22: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Example

$ condor_status -submitter

Name Machine Running IdleJobs HeldJobs

uscms1789@glidein-2. glidein-2. 344 0 20
uscms1811@glidein-2. glidein-2. 176 1141 0
uscms1976@glidein-2. glidein-2. 629 0 7
…
[email protected] submit-2.t 405 0 [email protected] vocms120.c 0 4000 0

 RunningJobs IdleJobs HeldJobs

[email protected] 11 0 1
uscms1537@glidein-2. 0 0 1
uscms1811@glidein-2. 176 1141 [email protected] 177 3324 0
…
[email protected] 3107 289 [email protected] 405 0 [email protected] 0 0 42

Total 22092 51518 9280

Page 23: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Negotiator Monitoring

● To check for user priorities, use condor_userprio
  ● -allusers - Without it, only running users are shown
  ● -all - Provides detailed info
● Negotiator Log useful to troubleshoot
  ~/glidecondor/condor_local/log/NegotiatorLog
  ● Look for errors and monitor cycle times
● Negotiator also advertises a ClassAd
  ● Use condor_status -negotiator -long

Page 24: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Example 1/2

$ condor_userprio -all -allusers
Last Priority Update: 1/13 18:33
                               Effective Real Priority Res ...
User Name                      Priority Priority Factor Used ...
------------------------------ --------- -------- ------------ ----
[email protected] 158.01 15.80 10.00 0
[email protected] 205.37 20.54 10.00 0
[email protected] 559.11 0.56 1000.00 0
[email protected] 576.15 0.58 1000.00 0
[email protected] 775.26 0.78 1000.00 0
[email protected] 827.95 0.83 1000.00 0
[email protected] 1455.42 1.46 1000.00 0
[email protected] 1677.00 1.68 1000.00 0
[email protected] 2113.44 2.11 1000.00 0
[email protected] 2493.31 2.49 1000.00 0
[email protected] 2506.61 2.51 1000.00 0
[email protected] 2771.17 2.77 1000.00 0
[email protected] 5150.52 5.15 1000.00 0
[email protected] 5357.76 5.36 1000.00 176
...
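When reading this table it helps to remember that the effective priority is the real priority multiplied by the priority factor. A minimal sketch that checks this against a few rows copied from the output above, allowing for the two-decimal rounding of the printed values; the function name and tolerance rule are illustrative choices:

```python
def is_consistent(effective, real, factor, shown_decimals=2):
    """Check effective == real * factor, within display rounding."""
    # Each printed value may be off by half a unit in its last shown digit.
    half_unit = 0.5 * 10 ** -shown_decimals
    tolerance = half_unit * factor + half_unit
    return abs(effective - real * factor) <= tolerance


# (effective priority, real priority, priority factor) from the slide.
ROWS = [
    (158.01, 15.80, 10.00),
    (205.37, 20.54, 10.00),
    (559.11, 0.56, 1000.00),
    (5357.76, 5.36, 1000.00),
]

if __name__ == "__main__":
    for effective, real, factor in ROWS:
        print(effective, real * factor, is_consistent(effective, real, factor))
```

The large tolerance for factor 1000 comes from the real priority being printed with only two decimals.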

Page 25: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Example 2/2

$ condor_userprio -all -allusers
Last Priority Update: 1/13 18:33
                               … Total Usage Usage Last
User Name                      … (wghted-hrs) Start Time Usage Time
------------------------------ … ----------- ----------------
[email protected] … 82863.87 10/03/2011 01:41 1/11/2012 07:
[email protected] … 202430.74 10/31/2011 01:30 1/12/2012 02:
[email protected] … 437667.09 7/02/2011 08:06 1/08/2012 07:
[email protected] … 47024.87 10/09/2011 13:26 1/07/2012 01:
[email protected] … 3677.14 11/23/2011 08:12 1/10/2012 01:
[email protected] … 1309024.85 6/03/2009 00:48 1/07/2012 15:
[email protected] … 81864.63 9/26/2011 15:22 1/07/2012 05:
[email protected] … 6966.57 10/10/2011 22:48 1/09/2012 17:
[email protected] … 57125.01 5/27/2011 02:00 1/09/2012 21:
[email protected] … 85581.04 8/06/2011 12:45 1/09/2012 07:
[email protected] … 158894.51 10/11/2011 11:11 1/08/2012 17:
[email protected] … 13528.66 9/05/2011 02:15 1/09/2012 23:
[email protected] … 10824.76 9/28/2011 05:02 1/09/2012 03:
[email protected] … 304430.61 11/17/2009 11:04 1/13/2012 18:33

Page 26: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

NegotiatorLog Example

01/13/12 18:23:05 ---------- Finished Negotiation Cycle ----------
01/13/12 18:24:09 ---------- Started Negotiation Cycle ----------
01/13/12 18:24:09 Phase 1: Obtaining ads from collector ...
01/13/12 18:24:09 Getting all public ads ...
01/13/12 18:24:44 Sorting 23021 ads ...
01/13/12 18:24:46 Getting startd private ads ...
01/13/12 18:24:51 Got ads: 23021 public and 22571 private
01/13/12 18:24:51 Public ads include 38 submitter, 22568 startd
01/13/12 18:24:51 Phase 2: Performing accounting ...
01/13/12 18:25:01 Phase 3: Sorting submitter ads by priority ...
01/13/12 18:25:01 Phase 4.1: Negotiating with schedds ...
01/13/12 18:25:01 Negotiating with [email protected] at <169.228.130.26:9615?sock=10263_1229_2>
01/13/12 18:25:01 0 seconds so far
01/13/12 18:25:02 Request 345869.00000:
01/13/12 18:25:02 Rejected 345869.0 [email protected] <169.228.130.26:9615?sock=10263_1229_2>: no match found
01/13/12 18:25:02 Got NO_MORE_JOBS; done negotiating
…
01/13/12 18:25:06 Request 384970.00170:
01/13/12 18:25:06 Matched 384970.170 [email protected] <169.228.130.11:9615?sock=9763_cd4c_2> preempting none <192.168.3.77:55906?CCBID=169.228.130.23:9823#13833&noUDP> [email protected]
01/13/12 18:25:06 Successfully matched with [email protected]
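Cycle times can be read off the "Started"/"Finished" marker lines. A minimal sketch that pairs them up and reports the duration of each completed cycle in seconds, assuming the timestamp and marker layout shown above; the function name is hypothetical, and the final "Finished" line in the sample is invented for illustration because the slide excerpt ends before the cycle completes:

```python
import re
from datetime import datetime

STAMP_FORMAT = "%m/%d/%y %H:%M:%S"
CYCLE_RE = re.compile(
    r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) -+ (Started|Finished) Negotiation Cycle -+"
)


def cycle_durations(lines):
    """Return the length in seconds of every Started..Finished pair."""
    durations = []
    started = None
    for line in lines:
        m = CYCLE_RE.match(line)
        if m is None:
            continue
        when = datetime.strptime(m.group(1), STAMP_FORMAT)
        if m.group(2) == "Started":
            started = when
        elif started is not None:
            durations.append((when - started).total_seconds())
            started = None
    return durations


SAMPLE_NEG_LOG = [
    "01/13/12 18:23:05 ---------- Finished Negotiation Cycle ----------",
    "01/13/12 18:24:09 ---------- Started Negotiation Cycle ----------",
    "01/13/12 18:24:09 Phase 1: Obtaining ads from collector ...",
    # Hypothetical closing line, not part of the slide excerpt.
    "01/13/12 18:26:39 ---------- Finished Negotiation Cycle ----------",
]

if __name__ == "__main__":
    print(cycle_durations(SAMPLE_NEG_LOG))
```

A "Finished" line with no preceding "Started" (as at the top of a rotated log) is ignored rather than guessed at.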

Page 27: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

CycleServer Screenshots

● Can do more than just monitoring
  ● But the rest is beyond the scope of this talk

Page 28: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Frontend Monitoring

Page 29: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Frontend monitoring

● Helper cmdline tool
● Plus, each Group provides:
  ● Activity/Error logs
  ● RRD files with statistics (running, held, etc.)
  ● XML files with current snapshot
  ● Resource ClassAds
● Master frontend aggregates RRD and XML files, and writes them in its own area
  ● Human readable/viewable Web pages available

[Figure: Frontend node diagram. Labels: Frontend node, Frontend, Entry, Group (several), Spawn.]

Page 30: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Helper cmdline tool

● Wrapper around Condor's condor_status
  glideinWMS/tools/glidein_status.py

● Provides useful formatting

~/glideinWMS/tools$ ./glidein_status.py

Name Site Factory Entry State Activity ActvtyTime

[email protected] Bari v1_0@OSGGOC CMS_T2_IT_Bari_ce01 Claimed Busy 0+00:51:
[email protected] Bari v1_0@OSGGOC CMS_T2_IT_Bari_ce01 Claimed Busy 0+00:48:17
…
[email protected] Legnaro v1_0@OSGGOC CMS_T2_IT_Legnaro_. Claimed Retiring 0+02:34:17

 Total Owner Claimed/Busy Claimed/Retiring Claimed/Other Unclaimed Matched Other

CMS_T2_US_Nebraska_Red_gw2@v1_0@OSGGOC 11 0 11 0 0 0 0 0
CMS_T2_US_Purdue_hansen@v1_0@OSGGOC 522 0 517 0 0 5 0 0
CMS_T2_US_Purdue_osg@v1_0@OSGGOC 1201 0 1182 14 0 5 0 0
…
CMS_T2_US_UCSD_gw4@Production_v4_2@UCSD 135 0 132 0 0 3 0 0

Total 21474 0 19742 1264 0 468 0 0

Page 31: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Log files

● Each Frontend group provides 3 types of logs
  log/group_XXX/frontend.date.type.log
  ● info - Progress and warnings
  ● err - One line warnings
  ● debug - Multi line error messages
● The master frontend has similar logs
  log/frontend/frontend.date.type.log
  ● But rarely anything interesting there
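When scripting over a log directory, the file name alone tells you which of the three log types you are looking at. A minimal sketch that splits a name of the form frontend.date.type.log, assuming the date part contains no dots (the exact date format is not shown in these slides); the function name and sample file names are made up for illustration:

```python
LOG_TYPES = ("info", "err", "debug")


def classify_log(filename):
    """Return (date, log_type) for a frontend log file name, else None."""
    parts = filename.split(".")
    if len(parts) != 4 or parts[0] != "frontend" or parts[3] != "log":
        return None
    date, log_type = parts[1], parts[2]
    if log_type not in LOG_TYPES:
        return None
    return date, log_type


# Hypothetical file names; the real date format may differ.
SAMPLE_NAMES = [
    "frontend.20120113.info.log",
    "frontend.20120113.err.log",
    "frontend.20120113.debug.log",
    "frontend.xml",
]

if __name__ == "__main__":
    for name in SAMPLE_NAMES:
        print(name, classify_log(name))
```

A sensible reading order when something looks wrong is err first, then the matching entries in debug.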

Page 32: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Example Info Log

[2011-11-15T10:44:01-07:00 15037] Iteration at Tue Nov 15 10:44:01 2011
[2011-11-15T10:44:01-07:00 15037] Query condor
[2011-11-15T10:44:01-07:00 15037] Child processes created
[2011-11-15T10:44:05-07:00 31633] WARNING: Failed to talk to schedd submit-1.t2.ucsd.edu. See debug log for more details.
[2011-11-15T10:44:05-07:00 15037] All children terminated
[2011-11-15T10:44:05-07:00 15037] Jobs found total 4836 idle 1732 (old 1732, voms 1703) running 3104
[2011-11-15T10:44:05-07:00 15037] Glideins found total 639 idle 8 running 630 limit 800 curb 600
[2011-11-15T10:44:05-07:00 15037] Using 1 proxies
[2011-11-15T10:44:05-07:00 15037] Match
[2011-11-15T10:44:05-07:00 15037] Counting
[2011-11-15T10:44:05-07:00 15037] Child processes created
[2011-11-15T10:44:06-07:00 15037] All children terminated
[2011-11-15T10:44:06-07:00 15037] Total matching idle 1732 (old 1703) running 3104
[2011-11-15T10:44:06-07:00 15037] Jobs in schedd queues | Glideins | Request
[2011-11-15T10:44:06-07:00 15037] Idle (match eff old uniq ) Run ( here max ) | Total Idle Run | Idle MaxRun Down Factory
[2011-11-15T10:44:06-07:00 15037] 171( 1705 170 169 0) 3104( 102 250) | 105 1 103 | 10 3276 Up CMS_T2_US_Nebraska_Red@Production_v4_1@[email protected]
[2011-11-15T10:44:06-07:00 15037] 171( 1705 167 169 0) 3104( 187 250) | 197 4 193 | 10 3276 Up CMS_T2_US_Nebraska_Red_gw1@Production_v4_1@[email protected]
[2011-11-15T10:44:06-07:00 15037] 171( 1705 171 169 0) 3104( 0 250) | 0 0 0 | 10 3276 Down CMS_T2_US_Nebraska_Red_gw2@Production_v4_1@[email protected]
[2011-11-15T10:44:06-07:00 15037] 171( 1705 171 169 0) 3104( 62 250) | 62 0 62 | 10 3276 Up CMS_T2_US_Wisconsin_cms01@Production_v4_1@[email protected]
[2011-11-15T10:44:06-07:00 15037] 171( 1705 171 169 0) 3104( 71 250) | 71 0 71 | 10 3276 Up CMS_T2_US_Wisconsin_cms02@Production_v4_1@[email protected]
[2011-11-15T10:44:06-07:00 15037] 171( 1705 169 169 0) 3104( 88 250) | 96 2 94 | 10 3276 Up CMS_T2_US_Nebraska_Red@v1_0@[email protected]
[2011-11-15T10:44:06-07:00 15037] 171( 1705 171 169 0) 3104( 1 250) | 1 0 1 | 10 3276 Up CMS_T2_US_Nebraska_Red_gw1@v1_0@[email protected]
[2011-11-15T10:44:06-07:00 15037] 171( 1705 171 169 0) 3104( 0 250) | 0 0 0 | 10 3276 Down CMS_T2_US_Nebraska_Red_gw2@v1_0@[email protected]
[2011-11-15T10:44:06-07:00 15037] 171( 1705 171 169 0) 3104( 45 250) | 45 0 45 | 10 3276 Up CMS_T2_US_Wisconsin_cms01@v1_0@[email protected]
[2011-11-15T10:44:06-07:00 15037] 171( 1705 170 169 0) 3104( 60 250) | 62 1 61 | 10 3276 Up CMS_T2_US_Wisconsin_cms02@v1_0@[email protected]
[2011-11-15T10:44:06-07:00 15037] Jobs in schedd queues | Glideins | Request
[2011-11-15T10:44:06-07:00 15037] Idle (match eff old uniq ) Run ( here max ) | Total Idle Run | Idle MaxRun Down Factory
[2011-11-15T10:44:06-07:00 15037] 1368(13640 1360 1352 0) 24832( 616 2000) | 639 8 630 | 80 26208 Up Sum of useful factories
[2011-11-15T10:44:06-07:00 15037] 342( 3410 342 338 0) 6208( 0 500) | 0 0 0 | 20 6552 Down Sum of down factories
[2011-11-15T10:44:06-07:00 15037] 27( 27 27 14 27) 0( 0 0) | 0 0 0 | 0 0 Down Unmatched
[2011-11-15T10:44:06-07:00 15037] Advertizing 10 requests
[2011-11-15T10:44:07-07:00 15037] Done advertizing
[2011-11-15T10:44:07-07:00 15037] Advertising 10 glideresource classads to the user pool
[2011-11-15T10:44:07-07:00 15037] Done advertising glideresource classads
[2011-11-15T10:44:07-07:00 15037] Writing stats
[2011-11-15T10:44:07-07:00 15037] Sleep
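
The two summary lines of each iteration ("Jobs found" and "Glideins found") are easy to pick out with a script. The following is a minimal sketch, assuming the log lines keep exactly the wording shown in the example above; the function name and the dictionary keys are illustrative choices, not part of glideinWMS.

```python
import re

# Patterns mirror the two summary lines in the example info log.
JOBS_RE = re.compile(
    r"Jobs found total (\d+) idle (\d+) \(old (\d+), voms (\d+)\) running (\d+)"
)
GLIDEINS_RE = re.compile(
    r"Glideins found total (\d+) idle (\d+) running (\d+) limit (\d+) curb (\d+)"
)


def parse_summary(log_text):
    """Return the job and glidein totals found in one Frontend iteration."""
    summary = {}
    jobs = JOBS_RE.search(log_text)
    if jobs:
        keys = ("total", "idle", "old_idle", "voms_idle", "running")
        summary["jobs"] = dict(zip(keys, map(int, jobs.groups())))
    glideins = GLIDEINS_RE.search(log_text)
    if glideins:
        keys = ("total", "idle", "running", "limit", "curb")
        summary["glideins"] = dict(zip(keys, map(int, glideins.groups())))
    return summary


# Two lines copied from the example log.
SAMPLE = """\
[2011-11-15T10:44:05-07:00 15037] Jobs found total 4836 idle 1732 (old 1732, voms 1703) running 3104
[2011-11-15T10:44:05-07:00 15037] Glideins found total 639 idle 8 running 630 limit 800 curb 600
"""

if __name__ == "__main__":
    print(parse_summary(SAMPLE))
```

Running it over a whole day of log text would need one call per iteration; this sketch only extracts the first match it finds.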

Page 33: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Example log files

frontend.20120113.err.log

[2012-01-13T18:50:47-07:00 16444] Advertizing failed for 2 requests. See debug log for more details.
[2012-01-13T18:52:59-07:00 19312] Failed to talk to schedd. See debug log for more details.

frontend.20120113.debug.log

[2012-01-13T18:50:47-07:00 16444] Advertizing failed: Error running '/home/frontend/glidecondor/sbin/condor_advertise -pool glidein-1.t2.ucsd.edu -tcp -multiple UPDATE_MASTER_AD /tmp/gfi_aw_276509240_16444_2'
code 1:
failed to send classad to <169.228.130.10:9618>
failed to send classad to <169.228.130.10:9618>
failed to send classad to <169.228.130.10:9618>
failed to send classad to <169.228.130.10:9618>
[2012-01-13T18:52:59-07:00 19312] Failed to talk to schedd: Schedd 'vocms120.cern.ch' not found

Page 34: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Web pages 1/3

Historical overview

Fully dynamic, allows for zooming and selecting of elements to plot

Default shows everything, but can restrict to a group and/or a Factory

frontendStatus.html

Page 35: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Web pages 2/3

frontendGroupGraphStatusNow.html

Current snapshot in tabular form

Useful for spotting problems

Page 36: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Web pages 3/3

frontendGroupGraphStatusNow.html

Also contains pie charts with the same info

Page 37: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

RRDs and XML files

● The Web pages are just renderings of the RRDs and XML files
  ● Raw data is loaded in the browser and rendered there
  ● No server side code

● Other tools could use the same data
  ● Publicly available, if one knows the URL
  ● No user-identifying data, only summary stats

Page 38: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Resource ClassAds

● Each Frontend Group advertises one ClassAd for each Factory it is requesting glideins from
  ● Type glideresource

● They contain pretty much everything the Frontend Group knows about the Factory:
  ● Factory attributes used for matchmaking
  ● Stats about the matching jobs
  ● What is being requested
  ● Even what the Factory is doing!

Page 39: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Example query

$ condor_status -any -const 'MyType=="glideresource"' -format '%s\n' Name
CMS_T2_US_Caltech_cit2@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Caltech_cit@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Florida_iogw1@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Florida_pg@v1_0@OSGGOC@UCSD-v5_3.main
...
CMS_T2_US_UCSD_gw2@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Wisconsin_cms01@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Wisconsin_cms02@v1_0@OSGGOC@UCSD-v5_3.main

● Not a Condor native type, must use
  ● -any
  ● Then constrain the type

Remotely queryable
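
The query above can also be driven from a script. This is a minimal sketch, assuming `condor_status` is on the PATH of the machine running it. The last `@`-separated field matches `GlideClientName` in the ClassAd; the labels given to the first three fields (`entry`, `glidein`, `factory`) are an assumption made for illustration.

```python
import subprocess
from collections import namedtuple

# Field labels are illustrative; only "client" is confirmed by GlideClientName.
ResourceName = namedtuple("ResourceName", "entry glidein factory client")

# Same command line as in the slide; "%s\\n" passes a literal backslash-n.
QUERY = ["condor_status", "-any", "-const", 'MyType=="glideresource"',
         "-format", "%s\\n", "Name"]


def parse_names(output):
    """Split each 'a@b@c@d' name into its four components."""
    names = []
    for line in output.splitlines():
        parts = line.strip().split("@")
        if len(parts) == 4:
            names.append(ResourceName(*parts))
    return names


def query_names():
    """Run condor_status and return the parsed glideresource names."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_names(result.stdout)


# Two names copied from the example output.
SAMPLE = """\
CMS_T2_US_Caltech_cit2@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main
"""

if __name__ == "__main__":
    for name in parse_names(SAMPLE):
        print(name.entry, name.client)
```

`query_names()` needs a reachable pool; `parse_names()` can be tried on saved output first.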

Page 40: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Example ClassAd

$ condor_status -any \
    CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main -l
MyType = "glideresource"
Name = "CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main"
GlideClientName = "UCSD-v5_3.main"
...
GlideClientMonitorJobsIdle = 210.000000
GlideClientMonitorJobsRunningHere = 213
...
GlideClientMonitorGlideinsRequestIdle = 50
GlideClientMonitorGlideinsRequestMaxRun = 445
...
GLIDEIN_Site = "UCSD"
GLEXEC_BIN = "OSG"
...
GlideClientMonitorGlideinsRunning = 215
GlideClientMonitorGlideinsTotal = 216
...
GlideFactoryMonitorStatusRunning = 339
GlideFactoryMonitorStatusPending = 277
GlideFactoryMonitorStatusHeld = 0
...

Identification

Info about local jobs

What is being requested

Factory attributes

Factory status

Currently more information than you get on the Web

Info about registered glideins
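
The long-format output is plain `Attribute = value` text, so a few lines of Python are enough to turn it into a dictionary. This is a minimal sketch that only handles quoted strings and plain numbers, which covers the attributes shown above but not arbitrary ClassAd expressions. Reading "Total minus Running" as the number of registered glideins not running jobs is an interpretation, not something the slide states.

```python
def parse_classad(text):
    """Parse 'Name = value' lines into a dict (strings and numbers only)."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if not sep:
            continue  # skips "..." and blank lines
        name, value = name.strip(), value.strip()
        if value.startswith('"') and value.endswith('"'):
            ad[name] = value[1:-1]
        else:
            try:
                ad[name] = float(value)
            except ValueError:
                ad[name] = value  # keep unparsed expressions as text
    return ad


# Lines copied from the example ClassAd.
SAMPLE = """\
MyType = "glideresource"
GLIDEIN_Site = "UCSD"
...
GlideClientMonitorGlideinsRunning = 215
GlideClientMonitorGlideinsTotal = 216
GlideFactoryMonitorStatusPending = 277
"""

if __name__ == "__main__":
    ad = parse_classad(SAMPLE)
    not_running = (ad["GlideClientMonitorGlideinsTotal"]
                   - ad["GlideClientMonitorGlideinsRunning"])
    print(ad["GLIDEIN_Site"], "registered but not running:", not_running)
```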

Page 41: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

OK, now you know what's available.

What will you do with all that information?

(i.e. What to look for)

Page 42: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Monitoring the health of the system

● Six major areas to look after; your goal is
  ● Few unclaimed glideins (both globally, and per site)
  ● No unmatched jobs
  ● Reasonably low restart rate (both globally, and per site)
  ● Reasonably low job failure rate (both globally, and per site)
  ● Negotiation cycle reasonably short
  ● Schedd node not overloaded

Page 43: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Unclaimed glideins

● Frontend and Negotiator policies are not identical
  ● You may end up with glideins that never run any jobs

● The discrepancy can be big enough to be noticed on a global scale
  ● But more often it is just for one (or a few) sites

● Short spikes are not a problem
  ● But long periods are
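
The "short spikes are fine, long periods are not" rule is simple to automate once you have a time series of unclaimed glidein counts. This is a minimal sketch, assuming the samples were already extracted from the monitoring data; the thresholds `min_unclaimed` and `min_duration` are hypothetical values to tune for your own pool.

```python
def long_unclaimed_periods(samples, min_unclaimed=10, min_duration=3600):
    """Return (start, end) pairs where the unclaimed count stayed high.

    samples: iterable of (unix_time, unclaimed_count), sorted by time.
    A period is reported only if the count stayed >= min_unclaimed for
    at least min_duration seconds, so short spikes are ignored.
    """
    periods = []
    start = last = None
    for when, count in samples:
        if count >= min_unclaimed:
            if start is None:
                start = when
            last = when
        else:
            if start is not None and last - start >= min_duration:
                periods.append((start, last))
            start = None
    if start is not None and last - start >= min_duration:
        periods.append((start, last))
    return periods


if __name__ == "__main__":
    # One short spike at t=300, then a sustained period from t=900 to t=4800.
    samples = [(0, 0), (300, 50), (600, 0)]
    samples += [(t, 20) for t in range(900, 4801, 300)]
    samples += [(5100, 0)]
    print(long_unclaimed_periods(samples))
```

The same function can be run once globally and once per site, matching the two views mentioned earlier.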

Page 44: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

How do you notice it?

● Historical Web monitoring

● Ask for daily emails from the Factory
● Or write your own scripts

Good

Bad

No Frontend report generators in glideinWMS at this time

Parse the RRDs

Page 45: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

How do you find the root cause?

● Analyze the latest snapshots
  ● condor_status/glidein_status
  ● condor_q
  ● Frontend Web

● Limit the search to a few sites, if possible
  ● Then start comparing

● Job Requirements, with
● Glidein Start expressions

Can be daunting!

In theory, there is “condor_q -ana”, but it is usually worthless

Page 46: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Unmatched jobs

● The other side of the problem
  ● Glideins are never requested for some jobs

● Two possible reasons
  ● Wrong Frontend matchmaking policy
  ● No available Factory entries to serve the job

Jobs will never start!

Page 47: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

How do you notice it?

● “Unmatched Factory” in Web monitoring

Page 48: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

How do you find the root cause?

● Again, start with the latest snapshot
  ● condor_q
  ● condor_status -any -const 'MyType=="glideresource"'

● Get the (python) Match expression from XML
  ● Start comparing!

Can be daunting!

Page 49: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Restarted jobs

● Any restart == wasted CPU

● How do you notice it?
  ● condor_q is your friend here
    condor_q -format '%i\n' NumJobStarts

● Why does it happen?
  ● Glidein disappears!
  ● End of lifetime hit
  ● Preemption policies
  ● Submit node overload

No historical/Web monitoring provided

Not in the default config, but you may set Condor to do it

Condor daemons do not like being resource constrained!
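
Since no historical monitoring of restarts is provided, a small script can summarize the `condor_q -format '%i\n' NumJobStarts` output. This is a minimal sketch, assuming the output holds one integer per line as that command produces; the function name and the summary keys are illustrative.

```python
def restart_summary(output):
    """Summarize NumJobStarts values, one integer per line of input."""
    starts = [int(token) for token in output.split()]
    restarted = [n for n in starts if n > 1]
    return {
        "jobs": len(starts),
        "restarted_jobs": len(restarted),
        # Every start after the first one is wasted CPU.
        "extra_starts": sum(n - 1 for n in restarted),
    }


if __name__ == "__main__":
    # Hypothetical output: five jobs, two of them restarted.
    print(restart_summary("1\n1\n3\n0\n2\n"))
```

Saving the summary at regular intervals gives you the historical view that the Web monitoring lacks.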

Page 50: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Why do glideins disappear?

● Three main reasons
  ● Remote node just died
  ● Site preemption policy
  ● Glidein killed by Site because it exceeded slot limits
    – Most likely Memory

● Why can limits be exceeded?
  ● Job underestimated resource use
  ● Frontend matchmaking logic problem
  ● Wrong advertised limits

Rare

Some sites do this; nothing you can do. Learn who they are and act accordingly.

Factory problem!

Job told you it needed more resources than the limit!

One of 2 limits the OSG factory advertises: GLIDEIN_MaxMemMBs

Page 51: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Wallclock limits

● Main resource limit is time
  ● The glidein automatically deals with it
    – Will go away before the deadline
    – … killing/preempting any jobs if needed!
  ● Limit advertised as
    – Factory: GLIDEIN_Max_Walltime (-Δ)
    – Glidein: GLIDEIN_ToDie

● Why may jobs reach the deadline?
  ● Like with all other resources
    – Job underestimates the time it needs
    – Frontend matchmaking logic problems

In seconds

UNIX time
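
Because GLIDEIN_ToDie is a UNIX timestamp, checking whether a job still fits on a glidein is plain arithmetic. This is a minimal sketch; `expected_runtime` and `safety_margin` are hypothetical inputs (in seconds) that a job or a Frontend policy would have to supply, and they are not glideinWMS attributes.

```python
import time


def seconds_left(glidein_to_die, now=None):
    """Seconds until the glidein deadline (GLIDEIN_ToDie is UNIX time)."""
    now = time.time() if now is None else now
    return max(0, int(glidein_to_die - now))


def job_fits(glidein_to_die, expected_runtime, now=None, safety_margin=600):
    """True if the job should finish before the glidein goes away."""
    return seconds_left(glidein_to_die, now) >= expected_runtime + safety_margin


if __name__ == "__main__":
    # Glidein dies at t=10000; it is now t=4000, so 6000 seconds remain.
    print(seconds_left(10000, now=4000))
    print(job_fits(10000, expected_runtime=5000, now=4000))
    print(job_fits(10000, expected_runtime=5500, now=4000))
```

A job that underestimates `expected_runtime` defeats this check, which is the first cause listed above.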

Page 52: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Job failures

● Jobs can fail for many reasons
  ● You should monitor the ExitCode

condor_history -back -const 'JobStatus==5' -format '%i\n' ExitCode

● Knowing what users run is often needed to interpret errors

● For common WN errors, Frontend admin should create appropriate validation script
  ● So glideins fail, not user jobs
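
A histogram of exit codes makes recurring failures stand out. This is a minimal sketch that works on the one-integer-per-line output of a `condor_history ... -format '%i\n' ExitCode` query; treating exit code 0 as success is the usual convention, but it is an assumption about what your users run.

```python
from collections import Counter


def exit_code_histogram(output):
    """Count how often each ExitCode appears, one integer per input line."""
    return Counter(int(token) for token in output.split())


def failure_rate(histogram):
    """Fraction of jobs with a non-zero exit code (0 assumed to mean success)."""
    total = sum(histogram.values())
    failed = total - histogram.get(0, 0)
    return failed / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical output: three successes and two different failures.
    hist = exit_code_histogram("0\n0\n1\n0\n127\n")
    print(hist.most_common())
    print(failure_rate(hist))
```

Computing the same numbers per site requires adding a site attribute to the query and grouping on it.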

Page 53: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Negotiation time

● The negotiation time should be << 5 mins
  ● If much longer, glideins may terminate without running any jobs
  ● Monitor the NegotiatorLog on the central manager (CM)

● Possible causes
  ● CPU starvation (e.g. other processes)
  ● Autocluster explosion
    – Condor tries to be smart about Matchmaking
    – But if users don't cooperate, cannot do much
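
Negotiation cycle length can be estimated from the NegotiatorLog by pairing the "Started" and "Finished" cycle markers. This is a minimal sketch, assuming each marker line begins with a `MM/DD HH:MM:SS` timestamp; the timestamp layout differs between Condor versions and configurations, so check your own log and adjust the pattern. The `year` parameter is only there because that timestamp style carries no year.

```python
import re
from datetime import datetime

# Assumed line layout: "<MM/DD HH:MM:SS> ---- Started Negotiation Cycle ----"
LINE_RE = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) -+ (Started|Finished) Negotiation Cycle -+"
)


def cycle_durations(log_text, year=2012):
    """Return the length in seconds of each completed negotiation cycle."""
    durations = []
    started = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(f"{year}/{match.group(1)}",
                                  "%Y/%m/%d %H:%M:%S")
        if match.group(2) == "Started":
            started = stamp
        elif started is not None:
            durations.append((stamp - started).total_seconds())
            started = None
    return durations


# Hypothetical log excerpt in the assumed format.
SAMPLE = """\
01/13 18:50:47 ---------- Started Negotiation Cycle ----------
01/13 18:51:30 (other log lines)
01/13 18:52:59 ---------- Finished Negotiation Cycle ----------
"""

if __name__ == "__main__":
    for seconds in cycle_durations(SAMPLE):
        flag = "TOO LONG" if seconds >= 300 else "ok"
        print(seconds, flag)
```

The 300 second threshold is the 5 minute bound from the slide; a healthy pool should stay well below it.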

Page 54: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Autoclustering

● Condor Schedd will try to group jobs
  ● All “similar jobs” will be matched together!

● What does “similar” mean?
  ● Similar == Would result in the same match

● How is it implemented?
  ● Tuple of attributes considered during matchmaking
  ● E.g. (DESIRED_Sites,ImageSize)

● How can the number of autoclusters explode?
  ● If an attribute that changes a lot is added

Much faster if only few groups exist

https://condor-wiki.cs.wisc.edu/index.cgi/attach_get/220/cs739.pdf

Example of a really bad one: JobID
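
The explosion is easy to see by counting distinct attribute tuples. This is a toy model of the grouping idea, not Condor's actual implementation; the job dictionaries and their values are made up for illustration.

```python
def count_autoclusters(jobs, significant_attrs):
    """Number of distinct value tuples over the significant attributes."""
    return len({tuple(job.get(attr) for attr in significant_attrs)
                for job in jobs})


# 1000 hypothetical jobs that differ only in JobID and desired site.
JOBS = [
    {"JobID": i,
     "DESIRED_Sites": "UCSD" if i % 2 else "Nebraska",
     "ImageSize": 2000}
    for i in range(1000)
]

if __name__ == "__main__":
    # Two groups: one per desired site.
    print(count_autoclusters(JOBS, ("DESIRED_Sites", "ImageSize")))
    # One group per job once a unique attribute is considered.
    print(count_autoclusters(JOBS, ("DESIRED_Sites", "ImageSize", "JobID")))
```

With only the first two attributes the negotiator has two groups to match; adding JobID forces it to consider every job separately.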

Page 55: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Submit node health

● Condor is very sensitive to resource starvation
  ● If submit node overloaded, expect problems!

● How can we get to resource starvation?
  ● Poor planning
  ● Other processes

● Interactive activity particularly risky
  ● Due to its unpredictable nature
    – Including user errors
  ● But portals not immune to resource overuse

Trying to run 3k jobs on a 1G RAM node???

May steal CPU/RAM/IO from Condor

Page 56: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Summary

Page 57: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Summary

● You have plenty of Monitoring options
  ● Some prettier, some more powerful

● Most of the time, things just work
  ● So you don't need to constantly watch over your installation

● But occasionally things will break
  ● It is in your interest to notice it
  ● Having good monitoring tools will help you there!

Or the users will tell you!

Page 58: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

The End

Page 59: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Pointers

● The official glideinWMS project Web page is http://tinyurl.com/glideinWMS

● glideinWMS development team is reachable at [email protected]

● The OSG glidein factory is reachable at [email protected]

Page 60: glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012

Acknowledgments

● The glideinWMS is a CMS-led project developed mostly at FNAL, with contributions from UCSD and ISI

● The glideinWMS factory operations at UCSD are sponsored by OSG

● The funding comes from NSF, DOE and the UC system