SAN DIEGO SUPERCOMPUTER CENTER
XSEDE12 – Using Gordon, a Data-Intensive Supercomputer (part 2)
Data Transfer, Filesystems, Running on vSMP Nodes
Mahidhar Tatineni
07/16/2012
SAN DIEGO SUPERCOMPUTER CENTER
Data Transfer (scp, globus-url-copy)
• scp is fine for simple transfers of small files (<1 GB). Example:
  $ scp w.txt [email protected]:/home/train40/
  w.txt                                    100%   15KB  14.6KB/s   00:00
• globus-url-copy for large-scale data transfers between XD resources (and local machines with a Globus client).
  • Uses your XSEDE-wide username and password.
  • Retrieves your certificate proxies from the central server.
  • Highest performance between XSEDE sites; uses striping across multiple servers and multiple threads on each server.
SAN DIEGO SUPERCOMPUTER CENTER
Data Transfer – globus-url-copy
• Step 1: Retrieve certificate proxies:
  $ module load globus
  $ myproxy-logon -l xsedeusername
  Enter MyProxy pass phrase:
  A credential has been received for user xsedeusername in /tmp/x509up_u555555.
• Step 2: Initiate globus-url-copy:
  $ globus-url-copy -vb -stripe -tcp-bs 16m -p 4 \
      gsiftp://gridftp.ranger.tacc.teragrid.org:2811///scratch/00342/username/test.tar \
      gsiftp://trestles-dm2.sdsc.xsede.org:2811///oasis/scratch/username/temp_project/test-gordon.tar
  Source: gsiftp://gridftp.ranger.tacc.teragrid.org:2811///scratch/00342/username/
  Dest:   gsiftp://trestles-dm2.sdsc.xsede.org:2811///oasis/scratch/username/temp_project/
    test.tar -> test-gordon.tar
SAN DIEGO SUPERCOMPUTER CENTER
Data Transfer – Globus Online
• Works from Windows/Linux/Mac via the Globus Online website:
  • https://www.globusonline.org
• Gordon and Trestles endpoints already exist. Authentication can be done using your XSEDE-wide username and password.
• The Globus Connect application (available for Windows/Linux/Mac) can turn your laptop/desktop into an endpoint.
SAN DIEGO SUPERCOMPUTER CENTER
Data Transfer – Globus Online
• Step 1: Create a Globus Online account.
SAN DIEGO SUPERCOMPUTER CENTER
Data Transfer – Globus Online
• Step 2: Set up your local machine as an endpoint using Globus Connect.
SAN DIEGO SUPERCOMPUTER CENTER
Data Transfer – Globus Online
• Step 3: Pick endpoints and initiate transfers.
SAN DIEGO SUPERCOMPUTER CENTER
Gordon: Filesystems
• Lustre filesystems – good for scalable, large-block I/O
  • Accessible from both native and vSMP nodes.
  • /oasis/scratch/gordon – 1.6 PB; peak measured performance ~50 GB/s on reads and writes.
  • /oasis/projects – ~400 TB.
• SSD filesystems
  • /scratch local to each native compute node – 300 GB each.
  • /scratch on a vSMP node – 4.8 TB SSD-based filesystem.
• NFS filesystems (/home)
A quick way to inspect these mount points is sketched below.
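As orientation only (a hypothetical check, not part of the workshop material), the filesystems listed above can be inspected with standard commands; paths are the ones given on this slide:

  # Space and usage on the Lustre and NFS filesystems (run from a login node)
  df -h /oasis/scratch/gordon /oasis/projects $HOME

  # Inside a running batch job on a native compute node, the per-job SSD scratch is:
  ls -ld /scratch/$USER/$PBS_JOBID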
SAN DIEGO SUPERCOMPUTER CENTER
Gordon Network Architecture!
QDR 40 Gb/s!GbE! 2x10GbE" 10GbE!
3D torus: rail 1 3D torus: rail 2
Mgmt. Nodes (2x)
Mgmt. Edge & Core Ethernet
Public Edge & Core Ethernet
NFS Server (2x)
Compute Node
Compute Node
Compute Node
Data Movers (4x)
Data Oasis Lustre PFS
4 PB
XSEDE & R&E Networks
SDSC Network
IO Nodes
IO Nodes
Login Nodes (4x)
Compute Node 1,024
64
• Dual-‐rail IB • Dual 10GbE storage • GbE management • GbE public • Round robin login • Mirrored NFS • Redundant front-‐end
SAN DIEGO SUPERCOMPUTER CENTER
Gordon 3D Torus Interconnect Fabric (4x4x4 3D Torus Topology)
[Diagram: each torus vertex consists of two 36-port fabric switches (one per rail) serving 16 compute nodes and 2 I/O nodes, with a single connection from each node to each network; 18 x 4X IB network connections per switch and 48 GB/sec per rail.]
• Dual-rail network: increased bandwidth and redundancy.
• 4x4x4 mesh; the ends are folded in all three dimensions to form a 3D torus.
SAN DIEGO SUPERCOMPUTER CENTER
Data Oasis Heterogeneous Architecture – Lustre-based Parallel File System
[Diagram: 64 OSSs (Object Storage Servers), 72 TB each, provide 100 GB/s performance and >4 PB raw capacity; JBODs (Just a Bunch Of Disks), 90 TB each, provide capacity scale-out to an additional 5.8 PB; redundant Arista 7508 10G switches for reliability and performance; dedicated metadata servers (MDS).]
• Three distinct network architectures connect Data Oasis to the clusters:
  • GORDON IB cluster – 64 Lustre LNET routers, 100 GB/s.
  • TRESTLES IB cluster – Mellanox 5020 bridge, 12 GB/s.
  • TRITON Myrinet cluster – Myrinet 10G switch, 25 GB/s.
SAN DIEGO SUPERCOMPUTER CENTER
Data Oasis from Gordon – It's the Routers!
• Gordon has 64 I/O nodes, which host the flash and also serve as routers for the Lustre filesystems.
• Lustre clients are configured to use the local I/O node if available. This maximizes the overall write performance on the system.
• Reads round-robin over the available routers.
• The workshop examples illustrate the locality of the write operations.
SAN DIEGO SUPERCOMPUTER CENTER
Lustre Examples
• Two example scripts in the xsede12 directory (a sketch of the general script structure follows below):
  • IOR_lustre_0_hops.cmd – runs jobs with all nodes on one switch.
  • IOR_lustre_4_hops.cmd – runs jobs with nodes up to 4 hops away.
• Example output:
  • ior_maxhops0.out – all nodes on the same switch, hence only *one* router is used. Max Write: 1796.99 MB/s.
  • ior_maxhops4.out – the nodes ended up on two switches, hence two routers were in play during the write. Max Write: 2601.03 MB/s.
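For orientation, here is a minimal sketch of what an IOR-over-Lustre job script of this kind might look like, modeled on the native-node SSD script shown later in this deck. The queue, account, binary path, and IOR flags are taken from that example; the node and process counts are illustrative, and the switch-placement/hop-limit settings used by the actual workshop scripts are omitted here.

  #!/bin/bash
  #PBS -q normal
  #PBS -N ior_lustre
  #PBS -l nodes=2:ppn=16:native
  #PBS -l walltime=00:25:00
  #PBS -o ior_lustre.out
  #PBS -e ior_lustre.err
  #PBS -V
  #PBS -A use300

  # Run IOR against the Lustre scratch filesystem instead of the local SSDs
  cd /oasis/scratch/$USER/temp_project

  # File-per-process test: 16 GB per process, 1 MB transfers, verbose output
  mpirun_rsh -hostfile $PBS_NODEFILE -np 8 \
      /oasis/scratch/mahidhar/temp_project/Examples/IOR-gordon \
      -i 1 -F -b 16g -t 1m -v -v > IOR_lustre.log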
SAN DIEGO SUPERCOMPUTER CENTER
Data Oasis Performance
SAN DIEGO SUPERCOMPUTER CENTER
Model A: One SSD per Compute Node (only 4 of 16 compute nodes shown)
• One 300 GB flash drive exported to each compute node appears as a local file system.
• The Lustre parallel file system is mounted identically on all nodes.
Use cases:
• Applications that need local, temporary scratch
• Gaussian
• Abaqus
• Hadoop
Logical view: each compute node sees its own SSD alongside Lustre; the file system appears as /scratch/$USER/$PBS_JOBID
SAN DIEGO SUPERCOMPUTER CENTER
Model B: 16 SSDs for 1 Compute Node
• 16 SSDs in a RAID 0 appear as a single 4.8 TB file system to the compute node.
• Flash I/O and Lustre traffic use rail 1 of the torus.
Use cases:
• Database
• Data mining
• Gaussian
Logical view: the 16 SSDs (4.8 TB) attach to a single compute node alongside Lustre; the file system appears as /scratch/$USER/$PBS_JOBID
SAN DIEGO SUPERCOMPUTER CENTER
Model C: 16 SSDs within a vSMP Supernode
• 4.8 TB of flash as a single XFS file system.
• Flash I/O uses both rail 0 and rail 1.
Use cases:
• Serial and threaded applications that need large memory and local disk
• Abaqus
• Genomics (Velvet, Allpaths, etc.)
Logical view: a 16-node virtual compute image (1 TB) with a 4.8 TB flash file system; Lustre is not part of the supernode. The file system appears as /scratch1/$USER/$PBS_JOBID (/scratch2 is available if using a 32-node supernode).
SAN DIEGO SUPERCOMPUTER CENTER
Model D: 16 SSDs / 16 Compute Nodes – Single Parallel/Shared File System (Coming Soon)
• 16 SSDs in a RAID 0 appear as a single 4.8 TB file system to all 16 compute nodes.
Use cases:
• MPI applications
Logical view: the shared 4.8 TB flash file system is mounted alongside Lustre on all 16 compute nodes.
SAN DIEGO SUPERCOMPUTER CENTER
Using SSD Scratch (Native Nodes)
#!/bin/bash
#PBS -q normal
#PBS -N ior_native
#PBS -l nodes=1:ppn=16:native
#PBS -l walltime=00:25:00
#PBS -o ior_scratch_native.out
#PBS -e ior_scratch_native.err
#PBS -V
#PBS -M [email protected]
#PBS -m abe
#PBS -A use300

# Run IOR in the job's local SSD scratch directory
cd /scratch/$USER/$PBS_JOBID

mpirun_rsh -hostfile $PBS_NODEFILE -np 4 /oasis/scratch/mahidhar/temp_project/Examples/IOR-gordon -i 1 -F -b 16g -t 1m -v -v > IOR_native_scratch.log

# Copy the log back to Lustre before the job (and its SSD scratch) goes away
cp /scratch/$USER/$PBS_JOBID/IOR_native_scratch.log /oasis/scratch/mahidhar/temp_project/Examples
SAN DIEGO SUPERCOMPUTER CENTER
Using SSD Scratch (Native Nodes)
• Snapshot on the node during the run:
  $ pwd
  /scratch/mahidhar/72251.gordon-fe2.local
  $ ls -lt
  total 22548292
  -rw-r--r-- 1 mahidhar hpss 5429526528 May 15 23:48 testFile.00000001
  -rw-r--r-- 1 mahidhar hpss 6330253312 May 15 23:48 testFile.00000003
  -rw-r--r-- 1 mahidhar hpss 5532286976 May 15 23:48 testFile.00000000
  -rw-r--r-- 1 mahidhar hpss 5794430976 May 15 23:48 testFile.00000002
  -rw-r--r-- 1 mahidhar hpss       1101 May 15 23:48 IOR_native_scratch.log
• Performance from a single node (in the log file copied back):
  • Max Write: 250.52 MiB/sec (262.69 MB/sec)
  • Max Read:  181.92 MiB/sec (190.76 MB/sec)
SAN DIEGO SUPERCOMPUTER CENTER
IOPS – SSD vs Lustre
• The FIO benchmark is used to measure random I/O performance (a minimal example invocation is sketched below).
• Sample scripts:
  • scratch_native_fio.cmd (uses the SSDs)
  • lustre_native_fio.cmd – Note: we will not run this today! It would overload the metadata server if there are too many simultaneous jobs with lots of random I/O requests. Output from a test run is in ior_lustre_native_fio.out to illustrate the low IOPS.
• Sample performance numbers:
  • SSD – Random write: iops=4782; random read: iops=13738
  • Lustre – Random write: iops=671; random read: iops=101
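For illustration only, here is a rough sketch of the kind of random-I/O fio run these scripts perform against the per-job SSD scratch. This is not one of the workshop .cmd files; the block size, file size, queue depth, and runtime shown are assumptions.

  # Random 4 KB writes to the local SSD scratch using libaio with direct I/O
  fio --name=randwrite --directory=/scratch/$USER/$PBS_JOBID \
      --rw=randwrite --bs=4k --size=1g --direct=1 \
      --ioengine=libaio --iodepth=16 --runtime=60 --time_based

  # Re-run with --rw=randread for the read case; fio reports iops figures
  # comparable to the numbers quoted above.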
SAN DIEGO SUPERCOMPUTER CENTER
Which I/O system is right for my application?

Performance
• Flash-based I/O nodes: SSDs support low-latency I/O, high IOPS, and high bandwidth; one SSD can deliver 37K IOPS. Flash resources are dedicated to the user, and performance is largely independent of what other users are doing on the system.
• Lustre: Lustre is ubiquitous in HPC. It does well for sequential I/O and for workloads that do I/O to a few files from many cores simultaneously. Random I/O is a Lustre killer. Lustre is a shared resource, and performance will vary depending on what other users are doing.

Infrastructure
• Flash-based I/O nodes: SSDs are deployed in I/O nodes using iSER, an RDMA protocol, and accessed over the InfiniBand network.
• Lustre: 64 OSSs; distinct file systems and metadata servers; accessed over a 10GbE network via the I/O nodes. Hundreds of HDDs/spindles.

Persistence
• Flash-based I/O nodes: Data is generally removed at the end of a run so the resource can be made available to the next job.
• Lustre: Most is deployed as scratch and is purgeable by policy (not necessarily at the end of the job). Some is deployed as a persistent project storage resource.

Capacity
• Flash-based I/O nodes: Up to 4.8 TB per user, depending on configuration.
• Lustre: No specific limits or quotas imposed on scratch. The file system is ~2 PB.

Use cases
• Flash-based I/O nodes: Local application scratch (Abaqus, Gaussian); a data mining platform (e.g., Hadoop); graph problems.
• Lustre: Traditional HPC I/O associated with MPI applications; prestaging of data that will be pulled into flash.
SAN DIEGO SUPERCOMPUTER CENTER
vSMP Runtime Guidelines: Overview
• Identify the type of job – serial (large memory), threaded (pthreads, OpenMP), or MPI.
• The workshop directory has examples for the different scenarios. The hands-on section will walk through the different types.
• Use affinity in conjunction with the automatic process placement utility (numabind).
• An optimized MPI (MPICH2 tuned for vSMP) is available.
SAN DIEGO SUPERCOMPUTER CENTER
vSMP Guidelines for Threaded Codes
SAN DIEGO SUPERCOMPUTER CENTER
Compiling OpenMP Example
• Change to the workshop directory:
  cd ~/xsede12/GORDON_PART2
• Compile using the -openmp flag:
  ifort -o hello_vsmp -openmp hello_vsmp.f90
• Verify the executable was created (job submission is sketched below):
  ls -lt hello_vsmp
  -rwxr-xr-x 1 train61 gue998 786207 May  9 10:31 hello_vsmp
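To run the compiled example on a vSMP node, submit the job script shown on the next slide through PBS in the usual way (a standard workflow sketch; the deck itself does not show the submission command):

  qsub hello_vsmp.cmd
  qstat -u $USER   # check the job's status in the queue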
SAN DIEGO SUPERCOMPUTER CENTER
Hello World on a vSMP Node (using OpenMP)
• hello_vsmp.cmd:
#!/bin/bash
#PBS -q vsmp
#PBS -N hello_vsmp
#PBS -l nodes=1:ppn=16:vsmp
#PBS -l walltime=0:10:00
#PBS -o hello_vsmp.out
#PBS -e hello_vsmp.err
#PBS -V
#PBS -M [email protected]
#PBS -m abe
#PBS -A use300
cd /oasis/scratch/mahidhar/temp_project/Examples
# ScaleMP preload library that throttles down unnecessary system calls
export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/0.1/lib64/libvsmpclib.so
# Path to numabind
export PATH="/opt/ScaleMP/numabind/bin:$PATH"
# Bind the OpenMP threads to consecutive cores starting at the offset chosen by numabind
export KMP_AFFINITY=compact,verbose,0,`numabind --offset 8`
export OMP_NUM_THREADS=8
./hello_vsmp
SAN DIEGO SUPERCOMPUTER CENTER
Hello World on a vSMP Node (using OpenMP)
• Code written using OpenMP:

  PROGRAM OMPHELLO
    INTEGER TNUMBER
    INTEGER OMP_GET_THREAD_NUM
!$OMP PARALLEL DEFAULT(PRIVATE)
    TNUMBER = OMP_GET_THREAD_NUM()
    PRINT *, 'HELLO FROM THREAD NUMBER = ', TNUMBER
!$OMP END PARALLEL
    STOP
  END
SAN DIEGO SUPERCOMPUTER CENTER
vSMP OpenMP Binding Info (from the hello_vsmp.err file)
…
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {504}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {505}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {506}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {507}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {508}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {509}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {511}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {510}
SAN DIEGO SUPERCOMPUTER CENTER
Hello World (OpenMP version) Output
  HELLO FROM THREAD NUMBER = 1
  HELLO FROM THREAD NUMBER = 6
  HELLO FROM THREAD NUMBER = 5
  HELLO FROM THREAD NUMBER = 4
  HELLO FROM THREAD NUMBER = 3
  HELLO FROM THREAD NUMBER = 2
  HELLO FROM THREAD NUMBER = 0
  HELLO FROM THREAD NUMBER = 7
Nodes: gcn-3-11
SAN DIEGO SUPERCOMPUTER CENTER
OpenMP Matrix Multiply Example
#!/bin/bash
#PBS -q vsmp
#PBS -N openmp_mm_vsmp
#PBS -l nodes=1:ppn=16:vsmp
#PBS -l walltime=0:10:00
#PBS -o openmp_mm_vsmp.out
#PBS -e openmp_mm_vsmp.err
#PBS -V
#PBS -M [email protected]
#PBS -m abe
#PBS -A use300
cd /oasis/scratch/mahidhar/temp_project/Examples
# Setting stacksize to unlimited.
ulimit -s unlimited
# ScaleMP preload library that throttles down unnecessary system calls.
export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/0.1/lib64/libvsmpclib.so
source ./intel.sh
export MKL_VSMP=1
# Path to NUMABIND.
export PATH=/opt/ScaleMP/numabind/bin:$PATH
np=8
tag=`date +%s`
# Dynamic binding of OpenMP threads using numabind.
export KMP_AFFINITY=compact,verbose,0,`numabind --offset $np`
export OMP_NUM_THREADS=$np
/usr/bin/time ./openmp-mm > log-openmp-nbind-$np-$tag.txt 2>&1
SAN DIEGO SUPERCOMPUTER CENTER
vSMP Pthreads Example
cd ~/xsede12/GORDON_PART2
# PATH to numabind
export PATH=/opt/ScaleMP/numabind/bin:$PATH
# ScaleMP preload library that throttles down unnecessary system calls.
export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/0.1/lib64/libvsmpclib.so
# Specify sleep duration for each pthread. Default = 60 sec if not set.
export SLEEP_TIME=30
# 16 pthreads will be created.
NP=16
log=log-$NP-`date +%s`.txt
./ptest $NP >> $log 2>&1 &
# Wait 15 seconds for all the threads to start.
sleep 15
echo "ptest threads affinity before numabind" >> $log 2>&1
ps -eLo pid,lwp,time,ucmd,psr | grep ptest >> $log 2>&1
# Start numabind with a config file that has a rule for pthreads,
# which places all threads on consecutive CPUs.
numabind --config myconfig >> $log 2>&1
echo "ptest threads affinity after numabind" >> $log 2>&1
ps -eLo pid,lwp,time,ucmd,psr | grep ptest >> $log 2>&1
sleep 300
SAN DIEGO SUPERCOMPUTER CENTER
Summary, Q/A
• Access options – ssh clients, XSEDE User Portal.
• Data transfer options – scp, globus-url-copy (GridFTP), Globus Online, and the XSEDE User Portal File Manager.
• Follow the guidelines for serial, OpenMP, Pthreads, and MPI jobs on the vSMP nodes.
• Lustre is routed over the I/O nodes. Write performance is determined by the number of routers used by a job.
• Use SSD local scratch where possible. Excellent for codes like Gaussian and Abaqus.