
Cartesius to Snellius migration


Introduction

After a very successful 8 years of service, Cartesius is being replaced this year by a new Dutch National Supercomputer called Snellius.

Snellius is built by Lenovo and will contain predominantly AMD technology, plus NVIDIA GPUs.

A high-level news item on the new system can be read here. Snellius was opened by Queen Máxima on 16 September. There you will also find a link to a virtual tour of the system, with background information.

This page will be the main source of information where we will keep you, as a user, updated on the progress of the installation of Snellius and the corresponding transition from Cartesius to Snellius. The purpose of this page is to help you adapt your research schedule to take the upcoming transition into account.

[Photos: Snellius (phase 1) at the Amsterdam Science Park; backside view of a water-cooled GPU rack]

What is the timeline for this transition?

The Snellius system has been installed over the past months by Lenovo in the data center at Amsterdam Science Park, with Snellius located next to Cartesius. There are currently system setup and administration steps left, plus finishing the set of system-wide acceptance tests. The dates provided below are therefore only valid if everything goes according to plan and no further delays are encountered in the remaining installation, setup and acceptance of the system.

Access to Cartesius and the data on its file systems will be frozen from Friday October 15, 17:00. The weekend of 15-17 October is used to perform the final incremental step of migrating user data from Cartesius to Snellius.

Snellius was made available on Monday 18 October, at 14:00, but project spaces were initially still read-only since synchronization had not finished. As of Wednesday October 20, 10:30, project spaces are fully synchronized and available (in read/write mode).

In the week in which Snellius becomes available (18-22 October), please check that the migrated data on Snellius matches what you expect to be there. If not, please contact our Service Desk, so we can decide on a user-by-user basis what action to take. Also see the next section on important differences between home and project spaces.

Last update: Wednesday December 15, 10:23

The transition to Snellius has been completed, and most information on this page has therefore become less relevant. We leave it up for now, in order to provide background on the migration.

For up-to-date information on Snellius you can visit these pages:

Cartesius to Snellius changes, for a summary of how Snellius is different from Cartesius, in hardware as well as software and user environment
Snellius hardware and file systems, for a detailed overview of Snellius hardware
Snellius usage and accounting, for background on budgets and the way use of Snellius is accounted
Snellius known issues, for a list of known issues that you should be aware of as a user

If you have questions on the migration please contact our Service Desk.

Below is a tabular summary of the migration roadmap:

Date | What | Status
Fri 27 August | Start of home directory migration from Cartesius to Snellius | Migration finished
 | Start of project space migration from Cartesius to Snellius | Migration finished
Fri 15 October | Access to Cartesius disabled from 17:00 onwards |
Fri 15 - Sun 17 October | Home directory migration finalized |
 | Project space migration finalized | Filesystem issues delayed completion of the project space migration. Synchronization was complete as of October 20, 10:30; from that moment onwards, project spaces have been fully available (read/write).
Mon 18 October | Snellius accessible and fully operational from 12:00 | Delayed until 14:00. At that time, Snellius was opened with home directories fully available, and project spaces read-only (and potentially incomplete).
 | Snellius use will not be accounted in October |
Mon 18 - Wed 20 October | Please check that the data available in your Snellius home directory and project spaces is as expected | Project space data should be complete as of October 20, 10:30. Please check your data.
End October | Cartesius taken offline |
Beginning of November | Start of accounting on Snellius |

How do I get my relevant data from Cartesius to Snellius?

SURF will migrate relevant user data. The exact data migration schedule is given in the previous section, while the included set of data is described here.

In general, the data migration includes the data related to Cartesius SBU accounts that were active on, or after, the Cutoff Date of 1 June 2021. Note that the Cutoff Date is different from the Freeze Date from the previous section, and is a date about 3 months earlier. We plan to only migrate data of projects that are still active, or that have become inactive only fairly recently.

There are two types of end-user data collections: home directories and project spaces. Home directories are owned by a single login and have a logical pathname of the pattern /home/<loginname>. Project spaces are collectively co-owned by a group of logins - in some cases even by a group of logins that belong to several different SBU accounts. Project spaces for end-users have a logical pathname of the pattern /projects/0/<projectname>.

Home directories

The home directory of a Cartesius login is migrated if both of the following conditions apply:

The login is associated with an SBU account that was active on, or after, the Cutoff Date (i.e. did not expire before the Cutoff Date).
A valid Usage Agreement for the login exists and has been accepted/duly renewed by the person to whom the login was handed out. This agreement can be reviewed and accepted here.

Time needed for project space migration

In general, users can do more to help reduce the migration time of project space data, by critically selecting what should really be migrated to Snellius and by actively purging their Cartesius project spaces of data that does not need to be migrated. Some may also have forgotten the last step: actually deleting files on Cartesius after migrating data elsewhere.

Note that SURF cannot make the selection of what needs to be migrated; that is up to the user. But refraining from substantial purging will result in a huge aggregate data volume (approaching 5 PiB) still present in the Cartesius project spaces that would need to be migrated, taking a substantial amount of time.
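To get a feel for why purging matters, a rough back-of-the-envelope calculation of the transfer time for the unpurged case can help. The sustained migration bandwidth of 10 GB/s below is an assumed figure purely for illustration; it is not stated on this page.

```python
# Rough transfer-time arithmetic for an unpurged ~5 PiB project space volume.
# The 10 GB/s sustained bandwidth is an assumption for illustration only.
volume_bytes = 5 * 2**50       # 5 PiB in bytes
bandwidth = 10e9               # assumed sustained rate in bytes/second
seconds = volume_bytes / bandwidth
print(f"{seconds / 86400:.1f} days")   # roughly a week of continuous copying
```

Even at such an optimistic sustained rate, the full volume takes on the order of a week to copy, which is why removing obsolete data first makes a real difference.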

Project spaces

A project space is migrated to Snellius if both of the following conditions apply to at least one member login of the group co-owning the project space:

The login is associated with an SBU account that was active on, or after, the Cutoff Date (i.e. did not expire before the Cutoff Date).
A valid Usage Agreement for the login exists and has been accepted/duly renewed by the person to whom the login was handed out. This agreement can be reviewed and accepted here.

The group co-owning the project space is the group of logins that share the allocated disk quota and have read and write access to the project space root.

Scratch spaces

Non-native file systems

Data cleanup and preparation by users

Minimize the content of your /home directory and of your project space(s) as much as possible:

Clean up your /home directory and project space as much as possible.
Remove obsolete files and directories.
Move files from project space that you would have transferred to local storage anyway to this local storage as soon as possible.
If you have access to the tape archive, please compress and move relevant data that you will not immediately need after migration to the tape archive. You can restore it at a later time on Snellius from the tape archive.
Do not forget to actually delete the files on the project space after having done such transfers, otherwise they will still add to the migration volume!
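The purge steps above amount to first finding out what is large before deciding what to delete. A minimal Python sketch for listing cleanup candidates; the PROJECT_DIR environment variable is an illustrative placeholder, and by default the script just scans your home directory:

```python
import os
from pathlib import Path

def large_files(project_dir: str, min_bytes: int = 1 << 30) -> list[tuple[str, int]]:
    """Return (path, size) pairs for files of at least min_bytes, largest first."""
    found = []
    for root, _dirs, files in os.walk(project_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or unreadable: skip it
            if size >= min_bytes:
                found.append((path, size))
    return sorted(found, key=lambda item: -item[1])

# PROJECT_DIR is a placeholder; point it at your own project space.
project_dir = os.environ.get("PROJECT_DIR", str(Path.home()))
for path, size in large_files(project_dir)[:10]:
    print(f"{size / 2**30:6.1f} GiB  {path}")
```

On Cartesius you would run this against each project space you co-own, then decide per file whether it should move to local storage, the tape archive, or be deleted.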

Make sure that you don't have links in your /home folder that reference storage outside the /home folder, as these links will be broken after the migration to Snellius.
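The link check above can be automated. A minimal sketch, assuming a Unix filesystem; the HOME_DIR environment variable is an illustrative override and defaults to your home directory:

```python
import os
from pathlib import Path

def external_links(home_dir: str) -> list[tuple[str, str]]:
    """Return (link, resolved_target) for symlinks pointing outside home_dir."""
    home = os.path.realpath(home_dir)
    outside = []
    for root, dirs, files in os.walk(home_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                target = os.path.realpath(path)
                # the target is "inside" only if it stays under home_dir
                if os.path.commonpath([home, target]) != home:
                    outside.append((path, target))
    return outside

home_dir = os.environ.get("HOME_DIR", str(Path.home()))
for link, target in external_links(home_dir):
    print(f"will break after migration: {link} -> {target}")
```

Any link this reports should either be removed or have its target copied into your home directory before the freeze date.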

Cartesius home backup available only until 31 December 2021

Note that for home directories a daily backup service is maintained, both on Cartesius and Snellius. Offline backups of Cartesius home directories will be kept until 31 December 2021, including backups of directories that are not migrated. Consequently, non-migrated home directories will become unavailable and non-restorable after 31 December 2021.

Note that for project spaces no backup service is in place, as project space is a user-managed scratch resource, not a data-preservation resource. All project space data that is not included in the above will not be migrated to Snellius and will become unavailable as soon as Cartesius is taken offline.

Scratch - Possible Data Loss

Files that reside on scratch filesystems of Cartesius will not be migrated to Snellius. If you want to preserve data currently on a Cartesius scratch file system, you will have to copy this data yourself to an external data storage facility.

Archive

The migration only pertains to data on native Cartesius filesystems. In particular, data associated with the same login, but residing on the SURF archive facility, are not affected in any way. 

To keep the data migration from Cartesius to Snellius to manageable proportions, we kindly ask the cooperation of all users. Take seriously into account that Cartesius and Snellius are both computational resources for active projects. They have no data-preservation or archival function at all. Relevant data to migrate are sources, scripts, and data to be operated on by computation and visualisation runs to be performed on Snellius. Data may be intrinsically valuable for other reasons, but those do not constitute valid reasons to keep them on the file systems of these compute platforms.


Around 100 long-term active logins apparently still have kept some contents in the legacy directories migrated from the previous system, Huygens, in 2013. They are in a location with a pathname pattern /nfs/home[12345]/huygens_data/h0[12345678]/<loginname>. For convenient access, a symbolic link, ~/HUYGENS, pointing to the legacy directory was created in 2013 in the regular Cartesius home directory of the login. None of the Huygens legacy home directories will be migrated to Snellius. If you really want to keep some of that content, make sure that you move the relevant legacy files into a different location under your home directory. If you intend to upload large input datasets to project space, consider postponing this operation until the transition to Snellius is complete.

What does the Snellius system look like?

Like Cartesius, Snellius will also be a heterogeneous system, with thin nodes, fat nodes, high-memory nodes and a number of nodes with GPU accelerators. Snellius will be delivered in several phases, so that the growth of Snellius follows the anticipated growth in its usage. The growth phases are as follows.

Phase 1 (Q3 2021)

The hardware installed in this phase provides a peak compute performance of 3.1 PFLOP/s (CPU) + 3.0 PFLOP/s (GPU).

Type | Amount | Technical details | Memory/core (GiB) | Total #cores | Total #GPU cards
thin nodes | 504 | AMD Rome 7H12 @ 2.6 GHz, dual socket, 64 cores/socket | 2 | 64,512 |
fat nodes | 72 | AMD Rome 7H12 @ 2.6 GHz, dual socket, 64 cores/socket | 8 | 9,216 |
high-memory nodes | 2 | AMD Rome 7H12 @ 2.6 GHz, dual socket, 64 cores/socket | 32 | 256 |
high-memory nodes | 2 | AMD Rome 7H12 @ 2.6 GHz, dual socket, 64 cores/socket | 64 | 256 |
GPU nodes | 36 | Intel Ice Lake, dual socket, 36 cores/socket, 4x NVIDIA A100 (40 GB) Tensor Core GPU | 7.1 | 2,592 | 144
total | 616 | | | 76,832 | 144

Phase 2 (Q3 2022)

An extension will be added with more CPU-only thin nodes (future generation AMD EPYC processors, 2 GB per core), with a peak performance of 5.1 PFLOP/s.

Phase 3 (Q3 2023)

There are three options for this extension:

CPU thin nodes (same future generation AMD EPYC processors, aggregate: 2.4 PFLOP/s), or
GPU nodes (future generation NVIDIA GPUs, aggregate: 10.3 PFLOP/s), or
Storage (the amount still needs to be determined)

The choice will be made 1.5 years after the start of production of Phase 1 and will be based on actual usage and demand of the system.

When Phase 3 is complete Snellius will have a total performance (CPU+GPU) in the range 13.6 - 21.5 PFLOP/s. This corresponds to roughly 7.6 - 11.9 times the total peak compute performance of Cartesius.
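The quoted range follows from the per-phase peak figures above. The only number not stated on this page is the Cartesius total peak, assumed here to be about 1.8 PFLOP/s so as to match the quoted ratios:

```python
# Recompute the post-Phase-3 performance range from the per-phase figures above.
phase1 = 3.1 + 3.0                   # Phase 1: CPU + GPU peak, PFLOP/s
phase2 = 5.1                         # Phase 2: thin-node extension
phase3_cpu, phase3_gpu = 2.4, 10.3   # Phase 3: the two compute options

low = phase1 + phase2 + phase3_cpu   # if the CPU option is chosen
high = phase1 + phase2 + phase3_gpu  # if the GPU option is chosen
print(f"total: {low:.1f} - {high:.1f} PFLOP/s")

cartesius_peak = 1.8   # assumption: approximate Cartesius total peak, PFLOP/s
print(f"ratio: {low / cartesius_peak:.1f} - {high / cartesius_peak:.1f}x")
```

With that assumed Cartesius peak, the arithmetic reproduces both the 13.6 - 21.5 PFLOP/s range and the 7.6 - 11.9x factor stated above.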

I still have an active account on Cartesius, what will happen with this account and the budget?

Accounts that were active on, or after, the Cutoff Date of 1 June 2021 will be migrated to Snellius. The remainder of the budget which you had on Cartesius will be transferred to Snellius, in a 1:1 conversion of the SBUs left.

Newly granted projects that have a start date after 1 August 2021 will receive new accounts only on Snellius. We will not create a Cartesius account for granted projects which start after 1 August 2021. If the new grant is a continuation of a previous project, the new budget will be accessible on Cartesius along with the already existing account.

How can I get access to Snellius?

The procedure for obtaining access to Snellius will be similar to the one for Cartesius. For large applications, you submit an application to NWO; see the call details here. For small, pilot applications you can apply via the SURF access portal; for more details see the information here. Of course, the definitions of what is considered big and small will be adapted to reflect the increased capacity of Snellius.

How does the Snellius software environment compare to the Cartesius environment?

Snellius will use the same type of modules environment for providing software packages as used on Cartesius. We will do our best to port the software from the 2020 modules environment on Cartesius to the new 2021 modules environment on Snellius. This implies that Snellius will host newer versions of the software that is currently available on Cartesius.

We have been building a new 2021 modules environment on Cartesius already. This environment currently contains the most important core libraries and development tools. We will continue to enhance the 2021 modules environment on Cartesius by frequently adding more packages. Please note, the 2021 modules environment on Cartesius is subject to changes. This environment is opened to users for testing only. Thus, users can already try to rebuild their software on Cartesius, test new versions of libraries, and adapt their current workflow to the upcoming new system. This will significantly reduce the effort of setting up your environment on Snellius after the migration.

On Snellius we will use a GNU-based toolchain (foss) as the main one and install the software modules in the 2021 environment using this toolchain. However, for advanced users, we will provide a complete intel toolchain as a module in the new 2021 modules environment.

2019 and pre2019 module environments

We will not migrate the 2019 and pre2019 modules environments to Snellius.

Custom built software and locally-installed modules

Users that use in-house developed software, or more generally that build software themselves on Cartesius, will have to rebuild that software on Snellius. The same applies to locally installed modules, which you will have to reinstall on Snellius.

To facilitate the transition from Cartesius to Snellius for users who have locally installed modules, we intend to install the intel/2020a and foss/2020a toolchains on Snellius. Please note, we will provide these toolchains only for compatibility with the previous modules environment on Cartesius. We will not install any software system-wide on Snellius using these toolchains.

Similarities and Differences between Cartesius and Snellius

Hardware characteristics

Feature | Cartesius | Snellius
CPU architecture | Intel Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Knights Landing | AMD 7H12 (Rome), 64 cores/socket, 2.6 GHz
GPU architecture | NVIDIA Kepler (K40) | NVIDIA Ampere (A100)
Node types | thin nodes, fat nodes, GPU nodes | thin nodes, fat nodes, high-memory nodes, GPU nodes
Number of nodes (total cores/GPUs) | thin: 1620 (38,880 cores); Broadwell: 177 (5,664 cores); fat: 32 (1,024 cores); GPU: 66 (132 GPUs, 1,056 cores) | thin: 504 (64,512 cores); fat: 72 (9,216 cores); high-memory: 4 (512 cores); GPU: 36 (144 GPUs, 2,592 cores)
Cores per node | thin: 24 (2S x 12 cores/socket); Broadwell: 32 (2S x 16 cores/socket); fat: 32 (4S x 8 cores/socket); GPU: 2 GPUs/node + 16 CPU cores (2S x 8 cores/socket) | thin: 128 (2S x 64 cores/socket); fat: 128 (2S x 64 cores/socket); high-memory: 128 (2S x 64 cores/socket); GPU: 4 GPUs/node + 72 CPU cores (2S x 36 cores/socket)
Memory per node | thin: 64 GB (2.66 GB/core); Broadwell: 64 GB (2 GB/core); fat: 256 GB (8 GB/core); GPU: 96 GB (6 GB/core, 48 GB/GPU) | thin: 256 GB (2 GB/core); fat: 1 TB (8 GB/core); high-memory: 4/8 TB (32-64 GB/core); GPU: 512 GB (7.11 GB/core, 128 GB/GPU)
Interconnect | Infiniband FDR (56 Gbps), pruned | Infiniband HDR100 (100 Gbps), fat tree
Storage filesystems | home filesystem: 180 TB; parallel filesystem: Lustre, 7.7 PB | home filesystem: 720 TB; parallel filesystem: Spectrum Scale (GPFS), 12.4 PB; NVMe parallel filesystem: Spectrum Scale (GPFS), 200 TB; fat nodes additionally include 6.4 TB of NVMe local storage

If you are missing essential software from the current 2020 environment on Cartesius, please let us know via the Service Desk as soon as possible. You can also install the missing modules yourself in your home folder; see the EasyBuild tutorial.

If you still use modules from the pre2019 or 2019 modules environment, replace them as much as possible with their equivalents from the 2021 modules environment that is already available on Cartesius.

Software and usage characteristics

Feature | Cartesius | Snellius
Scheduler | SLURM | SLURM
Node usage | exclusive (jobs take full nodes) | shared (jobs can share nodes) and exclusive
Operating system | CentOS 7 | CentOS 8 / Rocky Linux / RHEL 8
Modules environments | pre2019, 2019, 2020, (part of) 2021 | 2021
Provided compiler suites | Intel, GNU, PGI (NVIDIA) | Intel, GNU, PGI (NVIDIA), AMD, LLVM
Provided toolchains | foss and intel; versions 2018b and 2020a | foss and intel; versions 2020a and 2021a
Toolchains used for system-wide software installation | foss, intel | foss
Accounting | 1 SBU/core-hour | 1 SBU/core-hour
Pilot application limits | 500,000 SBUs | 1,000,000 SBUs (tentatively)

SBU cost per node type

Node type | SBU per core-hour | CPU cores per node | SBU per hour (full node)
Thin | 1 | 128 | 128
Fat | 1.5 | 128 | 192
High-memory | 2 | 128 | 256
Super high-memory | 3 | 128 | 384
GPU | 128 SBU for 1 GPU + 18 CPU cores | | 512

Note that all node types will allow for shared allocations, where multiple users are using part of a node simultaneously.
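As a worked example of the accounting above: the SBU cost of a CPU job is the per-core rate times the cores used times the wall-clock hours, and GPU jobs are charged 128 SBU per GPU-hour. A small sketch, with the rates copied from the table; the job sizes are purely illustrative:

```python
# SBU rates per core-hour for the CPU node types, from the table above.
SBU_PER_CORE_HOUR = {
    "thin": 1.0,
    "fat": 1.5,
    "high-memory": 2.0,
    "super high-memory": 3.0,
}

def cpu_job_cost_sbu(node_type: str, cores: int, hours: float) -> float:
    """SBU cost of a (possibly shared-node) CPU job."""
    return SBU_PER_CORE_HOUR[node_type] * cores * hours

def gpu_job_cost_sbu(gpus: int, hours: float) -> float:
    """128 SBU per hour per GPU (each GPU comes with 18 CPU cores)."""
    return 128.0 * gpus * hours

print(cpu_job_cost_sbu("thin", 128, 24))  # full thin node, 24 h -> 3072.0
print(cpu_job_cost_sbu("fat", 64, 10))    # half a fat node, 10 h -> 960.0
print(gpu_job_cost_sbu(4, 1))             # full GPU node, 1 h -> 512.0
```

Note that the full-GPU-node figure (4 x 128 = 512 SBU/hour) matches the rightmost column of the table, and that shared allocations are charged only for the part of the node you request.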