SAN DIEGO SUPERCOMPUTER CENTER
XSEDE12 – Using Gordon, a Data-Intensive Supercomputer (part 2)
Data Transfer, Filesystems, Running on vSMP Nodes
Mahidhar Tatineni
07/16/2012
SAN DIEGO SUPERCOMPUTER CENTER
Data Transfer (scp, globus-url-copy)
• scp is fine for simple transfers of small files (<1 GB). Example:
  $ scp w.txt [email protected]:/home/train40/
  w.txt                                    100%   15KB  14.6KB/s   00:00
• globus-url-copy for large-scale data transfers between XD resources (and local machines with a Globus client).
  • Uses your XSEDE-wide username and password.
  • Retrieves your certificate proxies from the central server.
  • Highest performance between XSEDE sites; uses striping across multiple servers and multiple threads on each server.
SAN DIEGO SUPERCOMPUTER CENTER
Data Transfer – globus-url-copy
• Step 1: Retrieve certificate proxies:
  $ module load globus
  $ myproxy-logon -l xsedeusername
  Enter MyProxy pass phrase:
  A credential has been received for user xsedeusername in /tmp/x509up_u555555.
• Step 2: Initiate globus-url-copy:
  $ globus-url-copy -vb -stripe -tcp-bs 16m -p 4 \
      gsiftp://gridftp.ranger.tacc.teragrid.org:2811///scratch/00342/username/test.tar \
      gsiftp://trestles-dm2.sdsc.xsede.org:2811///oasis/scratch/username/temp_project/test-gordon.tar
  Source: gsiftp://gridftp.ranger.tacc.teragrid.org:2811///scratch/00342/username/
  Dest:   gsiftp://trestles-dm2.sdsc.xsede.org:2811///oasis/scratch/username/temp_project/
    test.tar -> test-gordon.tar
SAN DIEGO SUPERCOMPUTER CENTER
Data Transfer – Globus Online
• Works from Windows/Linux/Mac via the Globus Online website:
  • https://www.globusonline.org
• Gordon and Trestles endpoints already exist. Authentication can be done using your XSEDE-wide username and password.
• The Globus Connect application (available for Windows/Linux/Mac) can turn your laptop/desktop into an endpoint.
SAN DIEGO SUPERCOMPUTER CENTER
Data Transfer – Globus Online
• Step 1: Create a Globus Online account.
SAN DIEGO SUPERCOMPUTER CENTER
Data Transfer – Globus Online
• Step 2: Set up your local machine as an endpoint using Globus Connect.
SAN DIEGO SUPERCOMPUTER CENTER
Data Transfer – Globus Online
• Step 3: Pick endpoints and initiate transfers.
SAN DIEGO SUPERCOMPUTER CENTER
Gordon: Filesystems
• Lustre filesystems – good for scalable, large-block I/O
  • Accessible from both native and vSMP nodes.
  • /oasis/scratch/gordon – 1.6 PB; peak measured performance ~50 GB/s on reads and writes.
  • /oasis/projects – ~400 TB.
• SSD filesystems
  • /scratch local to each native compute node – 300 GB each.
  • /scratch on a vSMP node – 4.8 TB SSD-based filesystem.
• NFS filesystems (/home)
A quick way to inspect these mount points is sketched below.
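As orientation only (a hypothetical check, not part of the workshop material), the filesystems listed above can be inspected with standard commands; paths are the ones given on this slide:

  # Space and usage on the Lustre and NFS filesystems (run from a login node)
  df -h /oasis/scratch/gordon /oasis/projects $HOME

  # Inside a running batch job on a native compute node, the per-job SSD scratch is:
  ls -ld /scratch/$USER/$PBS_JOBID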
SAN DIEGO SUPERCOMPUTER CENTER
Gordon Network Architecture!
QDR 40 Gb/s!GbE! 2x10GbE" 10GbE!
3D torus: rail 1 3D torus: rail 2
Mgmt. Nodes (2x)
Mgmt. Edge & Core Ethernet
Public Edge & Core Ethernet
NFS Server (2x)
Compute Node
Compute Node
Compute Node
Data Movers (4x)
Data Oasis Lustre PFS
4 PB
XSEDE & R&E Networks
SDSC Network
IO Nodes
IO Nodes
Login Nodes (4x)
Compute Node 1,024
64
• Dual-‐rail IB • Dual 10GbE storage • GbE management • GbE public • Round robin login • Mirrored NFS • Redundant front-‐end
SAN DIEGO SUPERCOMPUTER CENTER
Gordon 3D Torus Interconnect Fabric (4x4x4 3D Torus Topology)
[Diagram: each torus vertex consists of two 36-port fabric switches (one per rail) serving 16 compute nodes and 2 I/O nodes, with a single connection from each node to each network; 18 x 4X IB network connections per switch and 48 GB/sec per rail.]
• Dual-rail network: increased bandwidth and redundancy.
• 4x4x4 mesh; the ends are folded in all three dimensions to form a 3D torus.
SAN DIEGO SUPERCOMPUTER CENTER
Data Oasis Heterogeneous Architecture – Lustre-based Parallel File System
[Diagram: 64 OSSs (Object Storage Servers), 72 TB each, provide 100 GB/s performance and >4 PB raw capacity; JBODs (Just a Bunch Of Disks), 90 TB each, provide capacity scale-out to an additional 5.8 PB; redundant Arista 7508 10G switches for reliability and performance; dedicated metadata servers (MDS).]
• Three distinct network architectures connect Data Oasis to the clusters:
  • GORDON IB cluster – 64 Lustre LNET routers, 100 GB/s.
  • TRESTLES IB cluster – Mellanox 5020 bridge, 12 GB/s.
  • TRITON Myrinet cluster – Myrinet 10G switch, 25 GB/s.
SAN DIEGO SUPERCOMPUTER CENTER
Data Oasis from Gordon – It's the Routers!
• Gordon has 64 I/O nodes, which host the flash and also serve as routers for the Lustre filesystems.
• Lustre clients are configured to use the local I/O node if available. This maximizes the overall write performance on the system.
• Reads round-robin over the available routers.
• The workshop examples illustrate the locality of the write operations.
SAN DIEGO SUPERCOMPUTER CENTER
Lustre Examples
• Two example scripts in the xsede12 directory (a sketch of the general script structure follows below):
  • IOR_lustre_0_hops.cmd – runs jobs with all nodes on one switch.
  • IOR_lustre_4_hops.cmd – runs jobs with nodes up to 4 hops away.
• Example output:
  • ior_maxhops0.out – all nodes on the same switch, hence only *one* router is used. Max Write: 1796.99 MB/s.
  • ior_maxhops4.out – the nodes ended up on two switches, hence two routers were in play during the write. Max Write: 2601.03 MB/s.
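For orientation, here is a minimal sketch of what an IOR-over-Lustre job script of this kind might look like, modeled on the native-node SSD script shown later in this deck. The queue, account, binary path, and IOR flags are taken from that example; the node and process counts are illustrative, and the switch-placement/hop-limit settings used by the actual workshop scripts are omitted here.

  #!/bin/bash
  #PBS -q normal
  #PBS -N ior_lustre
  #PBS -l nodes=2:ppn=16:native
  #PBS -l walltime=00:25:00
  #PBS -o ior_lustre.out
  #PBS -e ior_lustre.err
  #PBS -V
  #PBS -A use300

  # Run IOR against the Lustre scratch filesystem instead of the local SSDs
  cd /oasis/scratch/$USER/temp_project

  # File-per-process test: 16 GB per process, 1 MB transfers, verbose output
  mpirun_rsh -hostfile $PBS_NODEFILE -np 8 \
      /oasis/scratch/mahidhar/temp_project/Examples/IOR-gordon \
      -i 1 -F -b 16g -t 1m -v -v > IOR_lustre.log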
SAN DIEGO SUPERCOMPUTER CENTER
Data Oasis Performance
SAN DIEGO SUPERCOMPUTER CENTER
Model A: One SSD per Compute Node (only 4 of 16 compute nodes shown)
• One 300 GB flash drive exported to each compute node appears as a local file system.
• The Lustre parallel file system is mounted identically on all nodes.
Use cases:
• Applications that need local, temporary scratch
• Gaussian
• Abaqus
• Hadoop
Logical view: each compute node sees its own SSD alongside Lustre; the file system appears as /scratch/$USER/$PBS_JOBID
SAN DIEGO SUPERCOMPUTER CENTER
Model B: 16 SSDs for 1 Compute Node
• 16 SSDs in a RAID 0 appear as a single 4.8 TB file system to the compute node.
• Flash I/O and Lustre traffic use rail 1 of the torus.
Use cases:
• Database
• Data mining
• Gaussian
Logical view: the 16 SSDs (4.8 TB) attach to a single compute node alongside Lustre; the file system appears as /scratch/$USER/$PBS_JOBID
SAN DIEGO SUPERCOMPUTER CENTER
Model C: 16 SSDs within a vSMP Supernode
• 4.8 TB of flash as a single XFS file system.
• Flash I/O uses both rail 0 and rail 1.
Use cases:
• Serial and threaded applications that need large memory and local disk
• Abaqus
• Genomics (Velvet, Allpaths, etc.)
Logical view: a 16-node virtual compute image (1 TB) with a 4.8 TB flash file system; Lustre is not part of the supernode. The file system appears as /scratch1/$USER/$PBS_JOBID (/scratch2 is available if using a 32-node supernode).
SAN DIEGO SUPERCOMPUTER CENTER
Model D: 16 SSDs / 16 Compute Nodes – Single Parallel/Shared File System (Coming Soon)
• 16 SSDs in a RAID 0 appear as a single 4.8 TB file system to all 16 compute nodes.
Use cases:
• MPI applications
Logical view: the shared 4.8 TB flash file system is mounted alongside Lustre on all 16 compute nodes.
SAN DIEGO SUPERCOMPUTER CENTER
Using SSD Scratch (Native Nodes)
#!/bin/bash
#PBS -q normal
#PBS -N ior_native
#PBS -l nodes=1:ppn=16:native
#PBS -l walltime=00:25:00
#PBS -o ior_scratch_native.out
#PBS -e ior_scratch_native.err
#PBS -V
#PBS -M [email protected]
#PBS -m abe
#PBS -A use300

# Run IOR in the job's local SSD scratch directory
cd /scratch/$USER/$PBS_JOBID

mpirun_rsh -hostfile $PBS_NODEFILE -np 4 /oasis/scratch/mahidhar/temp_project/Examples/IOR-gordon -i 1 -F -b 16g -t 1m -v -v > IOR_native_scratch.log

# Copy the log back to Lustre before the job (and its SSD scratch) goes away
cp /scratch/$USER/$PBS_JOBID/IOR_native_scratch.log /oasis/scratch/mahidhar/temp_project/Examples
SAN DIEGO SUPERCOMPUTER CENTER
Using SSD Scratch (Native Nodes)
• Snapshot on the node during the run:
  $ pwd
  /scratch/mahidhar/72251.gordon-fe2.local
  $ ls -lt
  total 22548292
  -rw-r--r-- 1 mahidhar hpss 5429526528 May 15 23:48 testFile.00000001
  -rw-r--r-- 1 mahidhar hpss 6330253312 May 15 23:48 testFile.00000003
  -rw-r--r-- 1 mahidhar hpss 5532286976 May 15 23:48 testFile.00000000
  -rw-r--r-- 1 mahidhar hpss 5794430976 May 15 23:48 testFile.00000002
  -rw-r--r-- 1 mahidhar hpss       1101 May 15 23:48 IOR_native_scratch.log
• Performance from a single node (in the log file copied back):
  • Max Write: 250.52 MiB/sec (262.69 MB/sec)
  • Max Read:  181.92 MiB/sec (190.76 MB/sec)
SAN DIEGO SUPERCOMPUTER CENTER
IOPS – SSD vs Lustre
• The FIO benchmark is used to measure random I/O performance (a minimal example invocation is sketched below).
• Sample scripts:
  • scratch_native_fio.cmd (uses the SSDs)
  • lustre_native_fio.cmd – Note: we will not run this today! It would overload the metadata server if there are too many simultaneous jobs with lots of random I/O requests. Output from a test run is in ior_lustre_native_fio.out to illustrate the low IOPS.
• Sample performance numbers:
  • SSD – Random write: iops=4782; random read: iops=13738
  • Lustre – Random write: iops=671; random read: iops=101
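For illustration only, here is a rough sketch of the kind of random-I/O fio run these scripts perform against the per-job SSD scratch. This is not one of the workshop .cmd files; the block size, file size, queue depth, and runtime shown are assumptions.

  # Random 4 KB writes to the local SSD scratch using libaio with direct I/O
  fio --name=randwrite --directory=/scratch/$USER/$PBS_JOBID \
      --rw=randwrite --bs=4k --size=1g --direct=1 \
      --ioengine=libaio --iodepth=16 --runtime=60 --time_based

  # Re-run with --rw=randread for the read case; fio reports iops figures
  # comparable to the numbers quoted above.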
SAN DIEGO SUPERCOMPUTER CENTER
Which I/O system is right for my application?

Performance
• Flash-based I/O nodes: SSDs support low-latency I/O, high IOPS, and high bandwidth; one SSD can deliver 37K IOPS. Flash resources are dedicated to the user, and performance is largely independent of what other users are doing on the system.
• Lustre: Lustre is ubiquitous in HPC. It does well for sequential I/O and for workloads that do I/O to a few files from many cores simultaneously. Random I/O is a Lustre killer. Lustre is a shared resource, and performance will vary depending on what other users are doing.

Infrastructure
• Flash-based I/O nodes: SSDs are deployed in I/O nodes using iSER, an RDMA protocol, and accessed over the InfiniBand network.
• Lustre: 64 OSSs; distinct file systems and metadata servers; accessed over a 10GbE network via the I/O nodes. Hundreds of HDDs/spindles.

Persistence
• Flash-based I/O nodes: Data is generally removed at the end of a run so the resource can be made available to the next job.
• Lustre: Most is deployed as scratch and is purgeable by policy (not necessarily at the end of the job). Some is deployed as a persistent project storage resource.

Capacity
• Flash-based I/O nodes: Up to 4.8 TB per user, depending on configuration.
• Lustre: No specific limits or quotas imposed on scratch. The file system is ~2 PB.

Use cases
• Flash-based I/O nodes: Local application scratch (Abaqus, Gaussian); a data mining platform (e.g., Hadoop); graph problems.
• Lustre: Traditional HPC I/O associated with MPI applications; prestaging of data that will be pulled into flash.
SAN DIEGO SUPERCOMPUTER CENTER
vSMP Runtime Guidelines: Overview
• Identify the type of job – serial (large memory), threaded (pthreads, OpenMP), or MPI.
• The workshop directory has examples for the different scenarios. The hands-on section will walk through the different types.
• Use affinity in conjunction with the automatic process placement utility (numabind).
• An optimized MPI (MPICH2 tuned for vSMP) is available.
SAN DIEGO SUPERCOMPUTER CENTER
vSMP Guidelines for Threaded Codes
SAN DIEGO SUPERCOMPUTER CENTER
Compiling OpenMP Example
• Change to the workshop directory:
  cd ~/xsede12/GORDON_PART2
• Compile using the -openmp flag:
  ifort -o hello_vsmp -openmp hello_vsmp.f90
• Verify the executable was created (job submission is sketched below):
  ls -lt hello_vsmp
  -rwxr-xr-x 1 train61 gue998 786207 May  9 10:31 hello_vsmp
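To run the compiled example on a vSMP node, submit the job script shown on the next slide through PBS in the usual way (a standard workflow sketch; the deck itself does not show the submission command):

  qsub hello_vsmp.cmd
  qstat -u $USER   # check the job's status in the queue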
SAN DIEGO SUPERCOMPUTER CENTER
Hello World on a vSMP Node (using OpenMP)
• hello_vsmp.cmd:
#!/bin/bash
#PBS -q vsmp
#PBS -N hello_vsmp
#PBS -l nodes=1:ppn=16:vsmp
#PBS -l walltime=0:10:00
#PBS -o hello_vsmp.out
#PBS -e hello_vsmp.err
#PBS -V
#PBS -M [email protected]
#PBS -m abe
#PBS -A use300
cd /oasis/scratch/mahidhar/temp_project/Examples
# ScaleMP preload library that throttles down unnecessary system calls
export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/0.1/lib64/libvsmpclib.so
# Path to numabind
export PATH="/opt/ScaleMP/numabind/bin:$PATH"
# Bind the OpenMP threads to consecutive cores starting at the offset chosen by numabind
export KMP_AFFINITY=compact,verbose,0,`numabind --offset 8`
export OMP_NUM_THREADS=8
./hello_vsmp
SAN DIEGO SUPERCOMPUTER CENTER
Hello World on a vSMP Node (using OpenMP)
• Code written using OpenMP:

  PROGRAM OMPHELLO
    INTEGER TNUMBER
    INTEGER OMP_GET_THREAD_NUM
!$OMP PARALLEL DEFAULT(PRIVATE)
    TNUMBER = OMP_GET_THREAD_NUM()
    PRINT *, 'HELLO FROM THREAD NUMBER = ', TNUMBER
!$OMP END PARALLEL
    STOP
  END
SAN DIEGO SUPERCOMPUTER CENTER
vSMP OpenMP Binding Info (from the hello_vsmp.err file)
…
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {504}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {505}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {506}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {507}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {508}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {509}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {511}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {510}
SAN DIEGO SUPERCOMPUTER CENTER
Hello World (OpenMP version) Output
  HELLO FROM THREAD NUMBER = 1
  HELLO FROM THREAD NUMBER = 6
  HELLO FROM THREAD NUMBER = 5
  HELLO FROM THREAD NUMBER = 4
  HELLO FROM THREAD NUMBER = 3
  HELLO FROM THREAD NUMBER = 2
  HELLO FROM THREAD NUMBER = 0
  HELLO FROM THREAD NUMBER = 7
Nodes: gcn-3-11
SAN DIEGO SUPERCOMPUTER CENTER
OpenMP Matrix Multiply Example
#!/bin/bash
#PBS -q vsmp
#PBS -N openmp_mm_vsmp
#PBS -l nodes=1:ppn=16:vsmp
#PBS -l walltime=0:10:00
#PBS -o openmp_mm_vsmp.out
#PBS -e openmp_mm_vsmp.err
#PBS -V
#PBS -M [email protected]
#PBS -m abe
#PBS -A use300
cd /oasis/scratch/mahidhar/temp_project/Examples
# Setting stacksize to unlimited.
ulimit -s unlimited
# ScaleMP preload library that throttles down unnecessary system calls.
export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/0.1/lib64/libvsmpclib.so
source ./intel.sh
export MKL_VSMP=1
# Path to NUMABIND.
export PATH=/opt/ScaleMP/numabind/bin:$PATH
np=8
tag=`date +%s`
# Dynamic binding of OpenMP threads using numabind.
export KMP_AFFINITY=compact,verbose,0,`numabind --offset $np`
export OMP_NUM_THREADS=$np
/usr/bin/time ./openmp-mm > log-openmp-nbind-$np-$tag.txt 2>&1
SAN DIEGO SUPERCOMPUTER CENTER
vSMP Pthreads Example
cd ~/xsede12/GORDON_PART2
# PATH to numabind
export PATH=/opt/ScaleMP/numabind/bin:$PATH
# ScaleMP preload library that throttles down unnecessary system calls.
export LD_PRELOAD=/opt/ScaleMP/libvsmpclib/0.1/lib64/libvsmpclib.so
# Specify sleep duration for each pthread. Default = 60 sec if not set.
export SLEEP_TIME=30
# 16 pthreads will be created.
NP=16
log=log-$NP-`date +%s`.txt
./ptest $NP >> $log 2>&1 &
# Wait 15 seconds for all the threads to start.
sleep 15
echo "ptest threads affinity before numabind" >> $log 2>&1
ps -eLo pid,lwp,time,ucmd,psr | grep ptest >> $log 2>&1
# Start numabind with a config file that has a rule for pthreads,
# which places all threads on consecutive CPUs.
numabind --config myconfig >> $log 2>&1
echo "ptest threads affinity after numabind" >> $log 2>&1
ps -eLo pid,lwp,time,ucmd,psr | grep ptest >> $log 2>&1
sleep 300
SAN DIEGO SUPERCOMPUTER CENTER
Summary, Q/A
• Access options – ssh clients, XSEDE User Portal.
• Data transfer options – scp, globus-url-copy (GridFTP), Globus Online, and the XSEDE User Portal File Manager.
• Follow the guidelines for serial, OpenMP, Pthreads, and MPI jobs on the vSMP nodes.
• Lustre is routed over the I/O nodes. Write performance is determined by the number of routers used by a job.
• Use SSD local scratch where possible. Excellent for codes like Gaussian and Abaqus.