IFS Benchmark with Federation Switch
John Hague, IBM
Introduction
• Federation has dramatically improved POWER4 p690 communication, so:
– Measure Federation performance with Small Pages and Large Pages using simulation program
– Compare Federation and pre-Federation (Colony) performance of IFS
– Compare Federation performance of IFS with and without Large Pages and Memory Affinity
– Examine IFS communication using mpi profiling
Colony v Federation
• Colony (hpca)
– 1.3GHz 32-processor p690s
– Four 8-processor Affinity LPARs per p690
• Needed to get communication performance
– Two 180MB/s adapters per LPAR
• Federation (hpcu)
– 1.7GHz p690s
– One 32-processor LPAR per p690
– Memory and MPI MCM Affinity
• MPI Task and Memory from same MCM
• Slightly better than binding task to specific processor
– Two 2-link 1.2GB/s Federation adapters per p690
• Four 1.2GB/s links per node
IFS Communication: transpositions
1. MPI Alltoall in all rows simultaneously
• Mostly shared memory
2. MPI Alltoall in all columns simultaneously
[Diagram: MPI tasks (0…31) distributed across nodes, showing the row (intra-node) and column (inter-node) groupings used by the transpositions]
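The IFS transposition code itself is not shown in the deck; the sketch below only illustrates the row/column Alltoall pattern described above, using MPI_Comm_split. The task layout, communicator names and message size are assumptions for illustration.

```c
/* Sketch of the two-phase transposition pattern: Alltoall within "rows"
 * (tasks on the same node, mostly shared memory), then within "columns"
 * (one task per node, over the switch).  Layout and sizes are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, tasks_per_node = 8;          /* assumed node size */
    MPI_Comm row_comm, col_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Tasks on the same node form a "row"; the same slot across nodes
     * forms a "column". */
    MPI_Comm_split(MPI_COMM_WORLD, rank / tasks_per_node, rank, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, rank % tasks_per_node, rank, &col_comm);

    int nrow, ncol, count = 10000;         /* elements per destination task */
    MPI_Comm_size(row_comm, &nrow);
    MPI_Comm_size(col_comm, &ncol);

    double *a = calloc((size_t)count * nrow, sizeof(double));
    double *b = calloc((size_t)count * nrow, sizeof(double));
    double *c = calloc((size_t)count * ncol, sizeof(double));
    double *d = calloc((size_t)count * ncol, sizeof(double));

    /* Phase 1: Alltoall in all rows simultaneously (intra-node).   */
    MPI_Alltoall(a, count, MPI_DOUBLE, b, count, MPI_DOUBLE, row_comm);
    /* Phase 2: Alltoall in all columns simultaneously (inter-node). */
    MPI_Alltoall(c, count, MPI_DOUBLE, d, count, MPI_DOUBLE, col_comm);

    free(a); free(b); free(c); free(d);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```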
Simulation of transpositions
• All transpositions in "row" use shared memory
• All transpositions in "column" use switch
• Number of MPI tasks per node varied
– But all processors are still used, via OpenMP threads
• Bandwidth measured for MPI Sendrecv calls
– Buffers allocated and filled by threads between each call
• Large Pages give best switch performance
– With current switch software
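The simulation program is not included in the deck; a minimal sketch of the kind of measurement it describes (timed MPI_Sendrecv, with buffers touched by OpenMP threads between calls) might look like the following. Task pairing, message size and repetition count are illustrative assumptions.

```c
/* Sketch of a Sendrecv bandwidth measurement between paired tasks.
 * Assumes an even number of MPI tasks; counts sent bytes only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20;          /* 1 MB messages (example size) */
    const int reps   = 100;
    int rank, partner;
    char *sbuf = malloc(nbytes), *rbuf = malloc(nbytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    partner = rank ^ 1;                  /* simple pairing: 0<->1, 2<->3, ... */

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        /* Threads fill the buffer between calls, as in the simulation. */
        #pragma omp parallel for
        for (int j = 0; j < nbytes; j++)
            sbuf[j] = (char)(i + j);

        MPI_Sendrecv(sbuf, nbytes, MPI_BYTE, partner, 0,
                     rbuf, nbytes, MPI_BYTE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("~%.0f MB/s per task (one direction)\n",
               reps * (double)nbytes / t / 1.0e6);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}
```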
"Transposition" Bandwidth per link (8 nodes, 4 links/node, 8 tasks/node, 4 threads/task, 2 tasks/link)
[Chart: bandwidth (MB/sec, 0-1600) vs message size (bytes, 100 to 10,000,000) for four cases: LP with EAGER_LIMIT=64K, LP with MIN_BULK=50K, LP base, SP. SP = Small Pages, LP = Large Pages]
"Transposition" Bandwidth per link (8 nodes, 4 links/node)
[Chart: bandwidth (MB/sec, 0-1600) vs message size (bytes, 100 to 10,000,000) for 32, 16, 8 and 4 tasks per node]
Multiple threads ensure all processors are used
hpcu v hpca with IFS
• Benchmark jobs (provided 3 years ago)
– Same executable used on hpcu and hpca
– 256 processors used
– All jobs run with mpi_profiling (and barriers before data exchange)
                    Procs    Grid Points   hpca (sec)   hpcu (sec)   Speedup
T399                10x1_4      213988        5828         3810        1.52
T799                16x8_2      843532        9907         5527        1.79
4D-Var T511/T255    16x8_2                    4869         2737        1.78
IFS Speedups: hpcu v hpca
[Three charts: Total, Communication and CPU speedup of hpcu vs hpca (series T799, T399, 4D-Var) for the four configurations SP no MA, SP w MA, LP no MA, LP w MA; Total and CPU speedups plotted on a 1-2 scale, Communication speedups on a 1-5 scale]
LP = Large Pages; SP = Small Pages; MA = Memory Affinity
LP/SP & MA/noMA CPU comparison
[Chart: percentage difference in CPU time (-5% to 15%) for LP/SP (no MA), LP/SP (w MA), MA/noMA (w SP), MA/noMA (w LP); series T799, T399, 4D-Var]
LP/SP & MA/noMA Comms comparison
[Chart: percentage difference in communication time (-20% to 120%) for LP/SP (no MA), LP/SP (w MA), MA/noMA (w SP), MA/noMA (w LP); series T799, T399, 4D-Var]
Percentage Communication
[Chart: percentage of time spent in communication (0-50%) for hpca (SP no MA) and hpcu (SP no MA, SP w MA, LP no MA, LP w MA); series T799, T399, 4D-Var]
Extra Memory needed by Large Pages
Large Pages are allocated in Real Memory in segments of 256 MB
• MPI_INIT
– 80MB which may not be used
– MP_BUFFER_MEM (default 64MB) can be reduced
– MPI_BUFFER_ALLOCATE needs memory which may not be used
• OpenMP threads
– Stack allocated with XLSMPOPTS="stack=…" may not be used
• Fragmentation
– Memory is "wasted"
• Last 256 MB segment
– Only a small part of it may be used
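As a rough illustration of the rounding cost (using the ~80 MB MPI_INIT figure above): a region of that size still occupies a whole large-page segment, ceil(80/256) x 256 MB = 256 MB, so roughly 176 MB of real memory is held but unused; every allocation that spills into a new segment adds a similar overhead.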
mpi_profile
• Examine IFS communication using mpi profiling
– Use libmpiprof.a
– Calls and MB/s rate for each type of call
• Overall
• For each higher level subroutine
– Histogram of blocksize for each type of call
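The source of libmpiprof.a is not part of the deck; the sketch below only shows the standard PMPI interposition technique that such profiling libraries commonly rely on, counting calls, bytes and time for MPI_Send. The routine choice and output format are illustrative.

```c
/* Minimal PMPI interposition sketch (illustrative, not the actual
 * libmpiprof.a source).  Linking this ahead of the MPI library lets it
 * count calls, bytes and time without changing the application.
 * Note: newer MPI versions declare the first argument as const void*. */
#include <mpi.h>
#include <stdio.h>

static long   send_calls = 0;
static double send_bytes = 0.0;
static double send_time  = 0.0;

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size, rc;
    double t0 = MPI_Wtime();
    rc = PMPI_Send(buf, count, type, dest, tag, comm);   /* real call */
    send_time += MPI_Wtime() - t0;
    PMPI_Type_size(type, &size);
    send_calls++;
    send_bytes += (double)count * size;
    return rc;
}

int MPI_Finalize(void)
{
    printf("MPI_Send: %ld calls, %.1f MB, %.3f sec\n",
           send_calls, send_bytes / 1.0e6, send_time);
    return PMPI_Finalize();
}
```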
mpi_profile for T799
128 MPI tasks, 2 threads
WALL time = 5495 sec
--------------------------------------------------------------
MPI Routine       #calls    avg. bytes      Mbytes    time(sec)
--------------------------------------------------------------
MPI_Send           49784       52733.2      2625.3       7.873
MPI_Bsend           6171      454107.3      2802.3       1.331
MPI_Isend          84524     1469867.4    124239.1       1.202
MPI_Recv           91940     1332252.1    122487.3     359.547
MPI_Waitall        75884           0.0         0.0      59.772
MPI_Bcast            362          26.6         0.0       0.028
MPI_Barrier         9451           0.0         0.0     436.818
--------------------------------------------------------------
TOTAL                                                  866.574
--------------------------------------------------------------
Barrier indicates load imbalance
mpi_profile for 4D_Var min0
128 MPI tasks, 2 threads
WALL time = 1218 sec
--------------------------------------------------------------
MPI Routine       #calls    avg. bytes      Mbytes    time(sec)
--------------------------------------------------------------
MPI_Send           43995        7222.9       317.8       1.033
MPI_Bsend          38473       13898.4       534.7       0.843
MPI_Isend         326703      168598.3     55081.6       6.368
MPI_Recv          432364      127061.8     54936.9     220.877
MPI_Waitall       276222           0.0         0.0      23.166
MPI_Bcast            288      374491.7       107.9       0.490
MPI_Barrier        27062           0.0         0.0      94.168
MPI_Allgatherv       466      285958.8       133.3      26.250
MPI_Allreduce       1325          73.2         0.1       1.027
--------------------------------------------------------------
TOTAL                                                  374.223
--------------------------------------------------------------
Barrier indicates load imbalance
MPI Profiles for send/recv
[Histograms: number of send/recv calls vs message size (KBytes) for T799, T399 (hpca) and 4D-Var min0]
mpi_profiles for recv/send
                                 Avg MB     MB/s per task
                                            hpca     hpcu
T799 (4 tasks per link)
  trltom (inter node)              1.84       35      224
  trltog (shrd memory)             4.00      116      890
  slcomm2 (halo)                   0.66       65      363
4D-Var min0 (4 tasks per link)
  trltom (inter node)             0.167        -      160
  trltog (shrd memory)            0.373        -      490
  slcomm2 (halo)                  0.088        -      222
Conclusions
• Speedups of hpcu over hpca:

  Large Pages   Memory Affinity   Speedup
       N               N          1.32 - 1.60
       Y               N          1.43 - 1.62
       N               Y          1.47 - 1.78
       Y               Y          1.52 - 1.85
• Best Environment Variables
– MPI.network=ccc0 (instead of cccs)
– MEMORY_AFFINITY=yes
– MP_AFFINITY=MCM ! With new pvmd
– MP_BULK_MIN_MSG_SIZE=50000
– LDR_CNTRL="LARGE_PAGE_DATA=Y": don't use – else system calls in LP very slow
– MP_EAGER_LIMIT=64K
hpca v hpcu
                        --------Time--------   ------Speedup------      %
        I/O*   LP  Aff   Total   CPU   Comms    Total   CPU  Comms    Comms
min0:
  hpca  ***     N   N     2499  1408    1091                           43.6
  hpcu  H+/22   N   N     1502  1119     383     1.66  1.26   2.85     25.5
        H+/21   N   Y     1321   951     370     1.89  1.48   2.95     28.0
        H+/20   Y   N     1444  1165     279     1.73  1.21   3.91     19.3
        H+/19   Y   Y     1229   962     267     2.03  1.46   4.08     21.7
min1:
  hpca  ***     N   N     1649  1065     584                           43.6
  hpcu  H+/22   N   N     1033   825     208     1.60  1.29   2.81     20.1
        H+/21   N   Y      948   734     214     1.74  1.45   2.73     22.5
        H+/15   Y   N     1019   856     163     1.62  1.24   3.58     16.0
        H+/19   Y   Y      914   765     149     1.80  1.39   3.91     16.3
Conclusions
• Memory Affinity with binding (see the sketch after this list)
– Program binds to: MOD(task_id*nthrds+thrd_id, 32), or
– Use new /usr/lpp/ppe.poe/bin/pmdv4
– How to bind if whole node not used?
– Try VSRAC code from Montpellier
– Bind adapter link to MCM?
• Large Pages
– Advantages
• Need LP for best communication B/W with current software
– Disadvantages
• Uses extra memory (4GB more per node in 4D-Var min1)
• LoadLeveler scheduling
– Prototype switch software indicates Large Pages not necessary
• Collective Communication
– To be investigated
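The MOD(task_id*nthrds+thrd_id, 32) formula maps each OpenMP thread of each MPI task onto one of the 32 processors of a p690 LPAR. Below is a minimal, AIX-only sketch of that mapping using bindprocessor(); the program structure and names are illustrative, not taken from IFS (which is Fortran).

```c
/* Illustrative per-thread CPU binding on a 32-way p690 LPAR following
 * MOD(task_id*nthrds + thrd_id, 32).  AIX only (bindprocessor). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <sys/processor.h>   /* bindprocessor() */
#include <sys/thread.h>      /* thread_self()   */

int main(int argc, char **argv)
{
    int task_id;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &task_id);

    #pragma omp parallel
    {
        int thrd_id = omp_get_thread_num();
        int nthrds  = omp_get_num_threads();
        int cpu     = (task_id * nthrds + thrd_id) % 32;  /* the MOD formula */

        /* Bind the calling kernel thread to logical processor 'cpu'. */
        if (bindprocessor(BINDTHREAD, thread_self(), cpu) != 0)
            perror("bindprocessor");
    }

    /* ... computation would go here ... */
    MPI_Finalize();
    return 0;
}
```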
Linux compared to PWR4 for IFS
• Linux (run by Peter Mayes)
– Opteron, 2GHz, 2 CPUs/node, 6GB/node, Myrinet switch
– Portland Group compiler
– Compiler flags: -O3 -Mvect=sse
– No code optimisation or OpenMP
– Linux 1: 1 CPU/node, Myrinet IP
– Linux 1A: 1 CPU/node, Myrinet GM
– Linux 2: using 2 CPUs/node
• IBM Power4
– MPI (intra-node shared memory) and OpenMP
– Compiler flags: -O3 -qstrict
– hpca: 1.3GHz p690, 8 CPUs/node, 8GB/node, Colony switch
– hpcu: 1.7GHz p690, 32 CPUs/node, 32GB/node, Federation switch
Linux compared to Pwr4
[Chart: seconds for steps 1 to 11 (0-1000) vs number of processors, for T511 on Linux 1, Linux 1A, hpca and hpcu, and T159 on Linux 2, Linux 1, hpca and hpcu]