ATLAS WORLD-cloud and networking in PanDA

F Barreiro Megino1, K De1, A Di Girolamo2, T Maeno3 and R Walker4 on behalf of the ATLAS Collaboration

1 University of Texas at Arlington, Arlington, USA
2 European Organization for Nuclear Research, Geneva, Switzerland
3 Brookhaven National Laboratory, Upton, USA
4 Ludwig-Maximilians-Universität München, Munich, Germany

E-mail: [email protected]

Abstract. The ATLAS computing model was originally designed as static clouds (usually national or geographical groupings of sites) around the Tier 1 centres, which confined tasks and most of the data traffic. Since those early days, the sites' network bandwidth has increased by O(1000) and the difference in functionality between Tier 1s and Tier 2s has narrowed. After years of manual, intermediate solutions, we have now ramped up to full usage of World-cloud, the latest step in the PanDA Workload Management System to increase resource utilization on the ATLAS Grid, for all workflows (MC production, data (re)processing, etc.).
We have based the development on two new site concepts. Nuclei sites are the Tier 1s and large Tier 2s, where tasks are assigned and the output aggregated; satellites are the sites that execute the jobs and send the output to their nucleus. PanDA dynamically pairs nuclei and satellite sites for each task based on input data availability, capability matching, site load and network connectivity. This contribution introduces the conceptual changes for World-cloud, the development necessary in PanDA, an insight into the network model and the first half-year of operational experience.

1. Introduction
The original ATLAS [1] Computing Model was designed as static clouds, setting the boundaries for data transfers. Clouds were mostly defined as national or geographic groupings of sites following the layout of national research networks. The model was hierarchical, with clear distinctions between the Tier 1, 2 and 3 levels.
PanDA (Production and Distributed Analysis system) [2] is ATLAS' data-driven workload management system, developed to cope with the ATLAS scale and implement the Computing Model requirements. PanDA enforced particular policies for the hierarchical model:

1. The output of tasks1 had to be aggregated in the Tier 1s (O(10));
2. Tasks had to be executed inflexibly within a single, static cloud.

This model worked, but was becoming outdated as the WLCG (Worldwide LHC Computing Grid) [3] networks evolved significantly [4]. Over the last two decades the bandwidth has increased by O(1000), and the network has become far more reliable. Limiting transfers to within a cloud was generally no longer needed. The static cloud model also had negative effects on storage and led to uneven usage between Tiers: Tier 2 storage was only used for secondary replicas and was therefore not optimally exploited.

1 In ATLAS nomenclature a task is a grouping of jobs.

In addition, high priority tasks tended to get stuck in small clouds, needing frequent manual intervention.
Before WORLD cloud, ATLAS evaluated the intermediate Multi Cloud Production solution, where some sites belonged to multiple clouds. WORLD cloud goes one step further and breaks all boundaries in a controlled fashion.

2. WORLD cloud
WORLD cloud introduces a new model, where processing site groups are defined dynamically for each task. It is based on two new concepts:
1. Nuclei:
   a. Task brokerage (see section 4.1) will choose a nucleus for each task based on various criteria;
   b. The task output will be aggregated in the nucleus;
   c. The capability of a site to be a nucleus is defined manually in AGIS (ATLAS Grid Information System [5]): Tier 1s and the most reliable, large Tier 2s are defined as nuclei.
2. Satellites:
   a. Satellite sites will execute jobs and transfer the output to the nucleus;
   b. Job brokerage (see section 4.2) selects up to ten satellites for each task, based on the usual criteria (e.g. number of jobs and data availability);
   c. Satellites are selected across the globe: a network weight biases the selection towards well connected nuclei and satellites.

3. Configurator
We have implemented a new component to collect the information necessary for WORLD cloud: Configurator (see Figure 1). The data is cached in dedicated tables in the PanDA database, so it can be used by task brokerage to define the nucleus and job brokerage to define the satellites of a task.

Figure 1: Configurator sources of information

3.1. Storage occupancy
Configurator polls Rucio [6], the ATLAS data management system, for the storage usage (total size, used space, free space and expired space2) of each endpoint.

2 Expired datasets are candidates for deletion when space is needed.
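As an illustration of the information cached per storage endpoint, the following sketch models the occupancy record and the free-space estimate later used by task brokerage (section 4.1). The record fields mirror the quantities listed above; the fetch_rse_usage helper and the field names are hypothetical placeholders, not the actual Configurator or Rucio client code.

```python
from dataclasses import dataclass


@dataclass
class StorageOccupancy:
    """Occupancy snapshot cached by Configurator for one storage endpoint.

    Quantities follow section 3.1: total, used, free and expired space (bytes).
    """
    endpoint: str
    total: int
    used: int
    free: int
    expired: int  # expired datasets are deletion candidates when space is needed

    def usable_space(self) -> int:
        # Free-space estimate in the spirit of section 4.1: expired data can be
        # reclaimed, so it counts on top of the nominally free space.
        return self.free + self.expired


def poll_storage(endpoints, fetch_rse_usage):
    """Hypothetical polling loop; fetch_rse_usage() would wrap the Rucio query."""
    records = []
    for endpoint in endpoints:
        usage = fetch_rse_usage(endpoint)  # e.g. {'total': ..., 'used': ..., 'expired': ...}
        records.append(StorageOccupancy(
            endpoint=endpoint,
            total=usage['total'],
            used=usage['used'],
            free=usage['total'] - usage['used'],
            expired=usage.get('expired', 0),
        ))
    return records
```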

3.2. Network data

The ATLAS analytics platform [7] collects Rucio and FAX [8] transfer events, as well as PerfSonar [9] test metrics. The Network Weather Service [10] (NWS) periodically aggregates the data and publishes a JSON file with the following metrics for each source-destination pair:
• Number of files transferred by Rucio in the last hour;
• Number of files queued in Rucio;
• Throughput according to FTS [11] (the WLCG File Transfer System used by Rucio), aggregated over the last hour, day and week. Note that some links might be missing if there were no transfers;
• Throughput according to FAX;
• Latency, packet loss and throughput as measured by PerfSonar tests.

The Configurator agent downloads this information every 30 minutes and caches it in a key-value table in the PanDA DB. The key-value design allows new metrics to be added without modifying the structure of the table.
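The key-value caching can be illustrated with a short sketch: each metric of a source-destination pair becomes one row, so new NWS metrics appear in the cache without any schema change. The URL, JSON layout, metric names and the insert_rows DB helper shown here are illustrative assumptions, not the actual NWS schema or PanDA DB code.

```python
import json
import time
import urllib.request

# Hypothetical NWS endpoint; the real URL and JSON layout may differ.
NWS_URL = "https://example.cern.ch/nws/latest.json"


def flatten_nws(nws_data):
    """Turn nested per-link metrics into (src, dst, key, value, timestamp) rows."""
    now = int(time.time())
    rows = []
    for src, destinations in nws_data.items():
        for dst, metrics in destinations.items():
            # e.g. metrics = {"fts_mbps_1h": 120.5, "rucio_queued_files": 340, ...}
            for key, value in metrics.items():
                rows.append((src, dst, key, value, now))
    return rows


def refresh_network_cache(insert_rows, url=NWS_URL):
    """Download the NWS snapshot (roughly every 30 min) and hand rows to the DB layer."""
    with urllib.request.urlopen(url) as response:
        nws_data = json.load(response)
    insert_rows(flatten_nws(nws_data))  # insert_rows: hypothetical DB helper
```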

3.3. Site configuration
Configurator queries AGIS, the ATLAS Grid Information System, to determine whether a site qualifies as a task nucleus. Currently the operations team defines nuclei manually based on size, reliability and good network connectivity.
AGIS also provides a static network link classification, which is used when no recent FTS/Rucio transfer data is available. This classification is calculated on a monthly basis, relying on the Sonar framework3, which periodically submits transfers to test the whole ATLAS Grid mesh.
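A minimal sketch of the fallback logic described above, assuming illustrative data structures and metric names: recent FTS/Rucio throughput from the cache is preferred, and the monthly AGIS classification is only consulted when no dynamic measurement exists for a link.

```python
def link_quality(cache, agis_closeness, src, dst):
    """Return a throughput-like score for a link, falling back to AGIS.

    cache: dict keyed by (src, dst, metric) holding recent NWS values.
    agis_closeness: dict keyed by (src, dst) with the static monthly classification.
    Both structures and the metric name "fts_mbps_1h" are assumptions for this sketch.
    """
    dynamic = cache.get((src, dst, "fts_mbps_1h"))
    if dynamic is not None:
        return dynamic                         # recent FTS/Rucio measurement available
    return agis_closeness.get((src, dst), 0)   # static, Sonar-based classification
```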

4. Brokerage
Before WORLD cloud, task brokerage would choose a cloud for the task and the output would be consolidated at the cloud's Tier 1. Job brokerage would then select a site within the cloud for each job. These algorithms have been adapted for WORLD cloud. Now task brokerage chooses the task nucleus, while job brokerage pairs the nucleus with up to ten satellites capable of processing the task. The algorithms for task and job brokerage are described in the next subsections.

4.1. Task brokerage
One nucleus is chosen for each task. On the first pass, all nuclei are checked against a series of hard criteria:
• The nucleus must be in active state;
• The free space in the nucleus must be larger than 5 TB. The calculation of free space includes the expired space and an estimate of the space that will be filled by the remaining work assigned to the nucleus;
• In order to keep data movement under control, if the total input size is larger than 1 GB or over 100 files, at least 10% of the input data must be available in the nucleus;
• In order to avoid overloading the link to a nucleus, the queue of output files must be under 2000.

If multiple nuclei meet the selection criteria, one nucleus is chosen based on a weight that combines the following factors (the complete two-pass selection is sketched after the list):
• Total RW: the total sum of Remaining Work (RW) in the nucleus with priority equal to or higher than the current task;
• Available input size: the input size already available for the task in the nucleus;
• Total input size: the total input size of the task;
• Tape weight: a penalty if the input is only available on tape.
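The two-pass selection can be summarized with the following sketch. The thresholds (5 TB, 1 GB/100 files, 10%, 2000 queued output files) are taken from the criteria above; the candidate attributes, the exact weight expression and the weighted random choice are illustrative assumptions, since the precise formula is not spelled out here.

```python
import random


def passes_hard_criteria(nucleus, task):
    """First pass: hard cuts from section 4.1 (a nucleus failing any cut is dropped)."""
    if not nucleus.active:
        return False
    if nucleus.free_space_tb < 5:              # includes expired space and pending work
        return False
    large_input = task.input_size_gb > 1 or task.n_input_files > 100
    if large_input and nucleus.available_input_fraction(task) < 0.10:
        return False
    if nucleus.queued_output_files >= 2000:    # avoid overloading the link to the nucleus
        return False
    return True


def choose_nucleus(nuclei, task):
    """Second pass: weight the surviving nuclei and pick one (formula is illustrative)."""
    candidates = [n for n in nuclei if passes_hard_criteria(n, task)]
    if not candidates:
        return None
    weights = []
    for n in candidates:
        data_term = (1 + n.available_input_size(task)) / (1 + task.total_input_size)
        load_term = 1.0 / (1 + n.total_remaining_work(min_priority=task.priority))
        tape_penalty = 0.1 if n.input_only_on_tape(task) else 1.0
        weights.append(data_term * load_term * tape_penalty)
    # Weighted random choice among the eligible nuclei
    return random.choices(candidates, weights=weights, k=1)[0]
```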

3 Sonar is a cron-like tool that submits transfers across all links and thereby ensures base measurements for each source-destination pair.

4.2. Job brokerage
Up to ten satellites are selected to execute the jobs of each task. On the first pass, all candidates are checked against a series of hard criteria:
• The satellites must be able to run the jobs:
   o Memory requirements;
   o Walltime requirements;
   o Core count requirements;
   o Software releases must be installed at the site.
• Sites must have less than 150 files in transfer to the task nucleus. This value was established empirically and is configurable. According to our observations, links reaching the transfer backlog threshold were unusable and transfers piled up.

Eligible sites then compete on a weight based on the relationship between queued and running jobs, data availability and network connectivity to the nucleus, since they will have to transfer the output files there; a sketch of this selection follows.
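The sketch below illustrates the satellite selection, under the same caveat that the exact weight expression is not given here: sites failing the capability requirements or the 150-file transfer backlog cut are removed, and the remainder are ranked by a weight combining the queued/running job relationship, data availability and the network weight towards the nucleus. All site and task attributes and the network_weight callable are hypothetical.

```python
MAX_SATELLITES = 10
MAX_FILES_IN_TRANSFER = 150   # empirical, configurable threshold from section 4.2


def eligible(site, task, nucleus):
    """Hard criteria: capability matching plus the transfer backlog cut."""
    return (site.fits_memory(task) and site.fits_walltime(task)
            and site.fits_core_count(task) and site.has_releases(task)
            and site.files_in_transfer_to(nucleus) < MAX_FILES_IN_TRANSFER)


def satellite_weight(site, task, nucleus, network_weight):
    """Illustrative combination of site load, data availability and connectivity."""
    load_term = (1 + site.running_jobs) / (1 + site.queued_jobs)
    data_term = 1 + site.available_input_fraction(task)
    return load_term * data_term * network_weight(site, nucleus)


def choose_satellites(sites, task, nucleus, network_weight):
    candidates = [s for s in sites if eligible(s, task, nucleus)]
    ranked = sorted(candidates,
                    key=lambda s: satellite_weight(s, task, nucleus, network_weight),
                    reverse=True)
    return ranked[:MAX_SATELLITES]
```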

5. Discussion

5.1. Nuclei-satellite matching
On each brokerage decision we send the calculated network weights to the ATLAS analytics platform, so we can visualize aggregated statistics. Figure 2 shows the average network weight over 24 hours between the nucleus AGLT2 (University of Michigan Tier 2) and all satellites, sorted by decreasing weight.

Figure 2: Network weight from nucleus AGLT2 (Michigan Univ.) to satellites in decreasing order

The top 10 matched sites are marked with a dot: yellow for US intra-cloud and blue for inter-cloud. It is interesting to see that among the top 10 matched sites we find some of the expected intra-cloud US sites, but also several inter-cloud sites. The US Tier 1 is the best matched site to AGLT2 (apart from AGLT2 itself) and we find other US Tier 2s in the top 10: MWT2, SWT2_CPB, UTA_SWT2. Among the well connected inter-cloud sites we can see GRIF-IRFU in France, the T0 CERN-PROD, several UK sites (UKI-SOUTHGRID, UKI-LT2-RHUL, UKI-LT2-BRUNEL) and CYFRONET-LCG2 in Poland. These European sites have shown high throughput to AGLT2 and low output transfer queues, supporting the idea that satellite selection does not have to be confined to a static, regional cloud: it is possible to dynamically exploit inter-cloud links using recent transfer metrics.

5.2. Impact on Tier 2 disk usage
WORLD cloud was activated at the end of March 2016 and nuclei are being added progressively. Currently all Tier 1s and approximately 20% of the Tier 2s qualify as nuclei. More Tier 2s will tentatively be added, targeting around half of the sites.
WORLD cloud shows an immediate effect on the usage of Tier 2 nuclei space. Tier 2 nuclei started holding primary copies of data, balancing out the distribution of primary data. Figure 3 shows the evolution at the AGLT2 nucleus, where the effect is clearly visible. Over the ensemble of Tier 2s the effect is noticeable, but could be further improved by adding more Tier 2 nuclei.


Figure 3: Evolution of disk space usage for the AGLT2_DATADISK endpoint and for all Tier 2 DATADISK endpoints

5.3. WORLD cloud impact on file transfer duration
In order to see the impact of WORLD cloud on transfer duration, we have plotted the average file transfer duration over 2016 (see Figure 4). WORLD cloud was activated at the end of March 2016 and the network weights for brokerage were implemented in May 2016. During April and May ATLAS also conducted heavy campaigns. These facts could have caused the increase in average file transfer duration during those two months. Once the network weights were put in production in May, the average file transfer duration dropped back to pre-WORLD averages. This shows that implementing dynamic, controlled, worldwide clouds does not penalize the output file transfer duration.

Figure 4: Average file transfer duration from January to September 2016

6. Conclusions and future work
We have replaced the concept of static clouds with dynamically generated processing clouds, and introduced the concepts of nucleus and satellite for a task. We have introduced a series of controls to match well-connected nuclei and satellites, and hard queue limits to alleviate issues and divert traffic away from blocked sites. The new system has reduced operational effort and manual interventions to re-broker problematic tasks stuck at particular clouds. The usage of Tier 2 storage and the distribution of primary data copies have improved. In the future, the manual enabling of nuclei could be automated in order to improve storage usage further.
We have optimized output file transfers, but still need to work on the optimization of input file transfers. This needs to be solved together with the Rucio team, since it involves synchronized decisions across both components for cases of multiple copies, tape staging, etc.
A study of the overall effect of WORLD cloud on job/task queue and execution times is in progress.

References
[1] The ATLAS Collaboration 2008 JINST 3 S08003
[2] Maeno T et al 2008 Journal of Physics: Conference Series 119 062036
[3] LHC Computing Grid: Technical Design Report, document LCG-TDR-001, CERN-LHCC-2005-024 (The LCG TDR Editorial Board) (2005)
[4] Martelli E et al 2015 Journal of Physics: Conference Series 664 052025
[5] Anisenkov A et al 2015 Journal of Physics: Conference Series 664 062001
[6] Garonne V et al 2014 Journal of Physics: Conference Series 513 042021
[7] Vukotic I et al 2016 Big Data analytics tools as applied to ATLAS event data Journal of Physics: Conference Series (in press)
[8] Gardner R et al 2014 Journal of Physics: Conference Series 513 042049
[9] McKee S et al 2012 Journal of Physics: Conference Series 396 042038
[10] Lassnig M et al 2016 Using machine learning algorithms to forecast network and system load metrics for ATLAS Distributed Computing Journal of Physics: Conference Series (in press)
[11] Ayllon A et al 2014 Journal of Physics: Conference Series 513 032081