FAX status. Overview: Status of endpoints and redirectors; Monitoring; Failover; Overflow
- Slide 1
- FAX status
- Slide 2
- Overview: Status of endpoints and redirectors; Monitoring; Failover; Overflow
- Slide 3
- Endpoints: status on Sat. 15 Nov. Got one more site: RO-07-NIPNE.
Problems: we are working on CSCS; not working at all: Nikhef;
flip-flopping: FZK-LCG2 and NDGF-T1.
- Slide 4
- Direct access: expired cert; wrong config; test jobs were unable
to get a proxy.
- Slide 5
- Upstream redirection
- Slide 6
- Downstream redirection: redirectors moved to AI machines
- Slide 7
- Moving redirectors: Herve had to move all the EU redirectors to
the Agile Infrastructure, simultaneously upgrading to xrootd 4.0.4.
Started with the DE redirector; had to re-implement access rules.
Continued with two redirectors per day, but the old machines got
re-introduced, which confused everybody. A new set of changes is
being applied right now. The situation is now clear, but sites need
to restart their services as the IPs changed.
- Slide 8
- Monitoring: the machine receiving info from AMQ and feeding it to
the SSB etc. had to move to the Agile Infrastructure. This took much
more time than expected, but it is done now. EU sites were moving to
sending monitoring data to CERN. The current state can be seen here
(thanks to Igor Pelevanyuk):
http://dashb-xrootd-comp.cern.ch/cosmic/ATLASmigrationMonitoring/
A lot of effort is still needed to make summary and detailed
monitoring match:
http://dashb-ai-621.cern.ch/cosmic/DB_ML_Comparator/
Started a deeper analysis of PanDA job info data transported into
Hadoop at CERN. Further improvements in the SSB.
- Slide 9
- Cost matrix
- Slide 10
- Overflow: slowly expanding. BNL is still missing, even though the
reverse proxy hardware is there. ANALY_AGLT2_SL6, ANALY_INFN-T1,
ANALY_CONNECT, ANALY_IN2P3-CC, ANALY_BU_ATLAS, ANALY_MPPMU,
ANALY_MWT2_SL6, ANALY_DESY-HH, ANALY_OU_OCHEP, ANALY_QMUL_SL6,
ANALY_SLAC, ANALY_SFU. Can't use data from the rest of the EU cloud.
- Slide 11
- Snakey overflow plots - success
- Slide 12
- Snakey overflow plots - failures
- Slide 13
- Overflow - workload
- Slide 14
- Overflow workload
- Slide 15
- Overflow job efficiency
- Slide 16
- Slide 17
- Overflow CPU efficiency
- Slide 18
- Reactions: up to now only two sites have noticed the overflows.
TRIUMF: JEDI sent a lot of jobs to almost all US cloud sites, all
reading from TRIUMF, which saturated their proxy (1 Gb/s); they have
since upgraded it to 2 Gb/s. QMUL: Chris Walker noticed 5 Gbps+ at
their NAT gateway, ~10 TB/day. Not a problem for now.
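The rates quoted above can be cross-checked with a quick unit conversion. This is a minimal sketch (not from the slides), assuming decimal units (1 Gb = 10^9 bits, 1 TB = 10^12 bytes) and a 24-hour day; the function names are illustrative.

```python
def gbps_to_tb_per_day(gbps: float) -> float:
    """Daily volume moved by a link kept busy at `gbps` gigabits/s."""
    bits_per_day = gbps * 1e9 * 86400  # 86400 seconds in a day
    return bits_per_day / 8 / 1e12    # bits -> bytes -> terabytes

def tb_per_day_to_gbps(tb: float) -> float:
    """Average rate needed to move `tb` terabytes in 24 hours."""
    return tb * 1e12 * 8 / 86400 / 1e9

# A saturated 1 Gb/s proxy (the TRIUMF case) moves ~10.8 TB/day,
# and QMUL's ~10 TB/day corresponds to an average of ~0.93 Gb/s,
# so the 5 Gbps+ seen at the NAT gateway was a peak, not the mean.
print(gbps_to_tb_per_day(1.0))   # ~10.8
print(tb_per_day_to_gbps(10.0))  # ~0.93
```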
- Slide 19
- Failover: jobs per 4 hours