Application Performance in the Multi-Core Era –
Lessons to Be Learned for Exascale Computing
(“Tales from the trenches”)
Gerhard Wellein
Department of Computer Science / Erlangen Regional Computing Center
Friedrich-Alexander-University Erlangen-Nuremberg
SIAM Conference on Computational Science and Engineering 2011
Computers - Pandora’s Box?
CSE has benefited enormously from technological advancements in semiconductor technology, yet the focus of CSE is often on applications and methods.
[Diagram: the Computational Science and Engineering triangle: Application, Numerical Methods/Math, Computer.]
Ken Olsen (DEC): "Software comes from heaven when you have good hardware"
[Figure: Intel x86 clock frequency in MHz (log scale, 1 to 10,000) vs. year, 1971-2009.]
No reason to touch Pandora's box until 2004/5:
Exponential growth of CPU clock speed for 15+ years (Intel x86 clock speed).
Better architectural features:
• Pipelining
• Superscalarity
• SIMD / vector ops
• Larger caches
BUT: no fundamental architectural changes until 2004/5.
The driving force behind: Moore’s law
Electronics Magazine, April 1965: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year… Certainly over the short term this rate can be expected to continue, if not to increase."
Today's transistor counts: Intel Nehalem EX: 2.3 billion; NVIDIA Fermi: 3 billion. (Source: Intel Corp.)
Trading single thread performance for parallelism
Power consumption limits clock speed: P ~ f^2 (worst case ~ f^3). The core supply voltage approaches a lower limit: V_C ~ 1 V. TDP approaches an economical limit: ~80 W, …, 130 W.
Moore's law is still valid… more cores + new on-chip functionality (PCIe, GPU).
Be prepared for more cores with lower complexity and clock speed!
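As a back-of-the-envelope illustration (our numbers, not from the slide) of why trading clock speed for cores pays off: dynamic power is $P = C\,V_C^2\,f$, and since $V_C$ scales roughly with $f$, $P \propto f^3$. Two cores at $0.8f$ then deliver $1.6\times$ the throughput of one core at $f$ for about the same power, since $2 \cdot 0.8^3 \approx 1.02$.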
Processor (year)      | P5 / 80586 (1993) | Pentium 3 (1999) | Pentium 4 (2003)  | Core i7-960 (2009, quad-core)
Clock speed           | 66 MHz            | 600 MHz          | 2800 MHz          | 3200 MHz
TDP @ supply voltage  | 16 W @ VC = 5 V   | 23 W @ VC = 2 V  | 68 W @ VC = 1.5 V | 130 W @ VC = 1.3 V
Process / transistors | 800 nm / 3 M      | 250 nm / 28 M    | 130 nm / 55 M     | 45 nm / 730 M
Parallelism and GPGPUs – driving the HPC ecosystem from PFlop/s to EFlop/s
TOP 10 supercomputers (Nov. 2010)

Rank | Site                                        | Manufacturer | Computer                                               | Country | Cores   | Rmax [TFlop/s] | Power [MW]
1    | National Supercomputer Center in Tianjin    | NUDT         | Tianhe-1A (NUDT TH MPP, Xeon 6C + NVIDIA, FT-1000 8C)  | China   | 186,368 | 2,566          | 4.04
2    | Oak Ridge National Laboratory               | Cray         | Jaguar (Cray XT5, HC 2.6 GHz)                          | USA     | 224,162 | 1,759          | 6.95
3    | National Supercomputing Centre in Shenzhen  | Dawning      | Nebulae (TC3600 Blade, Intel X5650, NVIDIA Tesla C2050)| China   | 120,640 | 1,271          | 2.58
4    | GSIC, Tokyo Institute of Technology         | NEC/HP       | TSUBAME-2 (HP ProLiant, Xeon 6C, NVIDIA)               | Japan   | 73,278  | 1,192          | 1.40
5    | DOE/SC/LBNL/NERSC                           | Cray         | Hopper (Cray XE6, 6C 2.1 GHz)                          | USA     | 153,408 | 1,054          | 2.91
6    | Commissariat a l'Energie Atomique (CEA)     | Bull         | Tera 100 (bullx super-node S6010/S6030)                | France  | 138,368 | 1,050          | 4.59
7    | DOE/NNSA/LANL                               | IBM          | Roadrunner (BladeCenter QS22/LS21)                     | USA     | 122,400 | 1,042          | 2.34
8    | University of Tennessee                     | Cray         | Kraken (Cray XT5 HC 2.36 GHz)                          | USA     | 98,928  | 831.7          | 3.09
9    | Forschungszentrum Juelich (FZJ)             | IBM          | Jugene (Blue Gene/P Solution)                          | Germany | 294,912 | 825.5          | 2.26
10   | DOE/NNSA/LANL/SNL                           | Cray         | Cielo (Cray XE6, 6C 2.4 GHz)                           | USA     | 107,152 | 816.6          | 2.95
Performance projection based on TOP500
Large investments in hardware AND software are on the way to enable ExaFlop/s (10^18 Flop/second) computers by the end of the decade.
[Figure: TOP500 performance development, 1994-2018 (log scale, 0.1 GFlop/s to 1 EFlop/s): Rmax of the N=1 and N=500 systems plus Intel desktop CPUs; the milestones 1 Gflop/s, 1 Tflop/s, 1 Pflop/s are marked, and the extrapolation reaches 1 Eflop/s near 2018.]
Potential Future Architectures
Systems                    | 2009     | 2015 (+1/-0)                                 | 2018 (+1/-0)
System peak                | 2 Peta   | 100-300 Peta                                 | 1 Exa
Power                      | 6 MW     | ~15 MW                                       | ~20 MW
System memory              | 0.3 PB   | 5 PB                                         | 64 PB (+)
Node performance           | 125 GF   | 0.5 TF or 7 TF                               | 1-2 TF or 10 TF
Node memory BW             | 25 GB/s  | 1-2 TB/s                                     | 2-4 TB/s
Node concurrency           | 12       | O(100)                                       | O(1k) or 10k
Total node interconnect BW | 3.5 GB/s | 100-200 GB/s                                 | 200-400 GB/s
System size (nodes)        | 18,700   | 50,000 or 500,000                            | O(100,000) or O(1M)
Total concurrency          | 225,000  | O(100,000,000) * O(10)-O(50) to hide latency | O(billion) * O(10) to O(100) for latency hiding
Storage                    | 15 PB    | 150 PB                                       | 500-1000 PB (>10x system memory is min)
I/O                        | 0.2 TB/s | 10 TB/s                                      | 60 TB/s (how long to drain the machine?)
MTTI                       | days     | O(1 day)                                     | O(1 day)

By courtesy of John Shalf (LBNL) / DOE Exascale Steering Committee
Trends towards ExaFlops computing
1. Reduced single-core complexity: single-thread performance will be crucial; flexibility is traded for performance.
2. The memory gap and complexity increase further: do not forget basic main memory access optimization! We live in a world of ccNUMA architectures with data locality effects. Multi-core processors have shared caches: why do we ignore them?
3. Are GPGPU-like architectures the holy grail?!
4. Total concurrency O(10^9): MPI has not been designed for that level of concurrency (O(10^2) concurrency in 1993). New programming models?
1. Reduced single thread performance – trading flexibility for performance
A simple question with religious dimensions…
What is the right programming language/style?!
C / C++ / Java / FORTRAN ?
Impact of programming language/style: A representative example?!

Sparse matrix-vector multiplication (e.g., finite element method): y = y + M*x
The classical Fortran/C approach: Compressed Row Storage (CRS) for matrix M
Sparse MVM code snippet:
for(i = 0; i < number_of_unknowns; ++i){      /* loop over matrix rows */
  for(j = row[i]; j < row[i+1]; ++j){         /* nonzeros of row i */
    y[i] = y[i] + entry[j] * x[column[j]];    /* entry: values, column: column indices */
  }
}
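For a self-contained picture, here is a minimal runnable sketch with the three CRS arrays filled for a small example matrix (the values and the 0-based layout are our assumptions for illustration):

    #include <stdio.h>

    /* CRS storage of the 4x4 example matrix (values made up):
     *   [10  0  0 -2]
     *   [ 3  9  0  0]
     *   [ 0  7  8  7]
     *   [ 3  0  8  7]
     */
    int    row[5]     = {0, 2, 4, 7, 10};   /* start of each row in entry[] */
    int    column[10] = {0, 3, 0, 1, 1, 2, 3, 0, 2, 3};
    double entry[10]  = {10, -2, 3, 9, 7, 8, 7, 3, 8, 7};

    int main(void)
    {
        double x[4] = {1, 1, 1, 1}, y[4] = {0, 0, 0, 0};
        int number_of_unknowns = 4;

        for (int i = 0; i < number_of_unknowns; ++i)
            for (int j = row[i]; j < row[i+1]; ++j)
                y[i] += entry[j] * x[column[j]];

        for (int i = 0; i < 4; ++i)
            printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }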
Impact of programming language/style: A representative example?!

Sparse matrix-vector multiplication (e.g., finite element method): y = y + M*x

The object-oriented C++ approach: consider each row of M as an object, e.g. the stencil of a node in FEM; matrix M is a vector of "stencils".

// Class Stencil
class Stencil {
  int     m_row_stencil_length;
  double *m_row_stencil;
  int    *m_row_position;
};

for(i = 0; i < number_of_unknowns; ++i){
  for(j = 0; j < stencil_array[i].m_row_stencil_length; ++j){
    y.m_vektor[i] = y.m_vektor[i]
                  + stencil_array[i].m_row_stencil[j]
                  * x.m_vektor[stencil_array[i].m_row_position[j]];
  }
}
Impact of programming language/style: A representative example?!

Sparse matrix-vector multiplication (e.g., finite element method): y = y + M*x

Problem: FEM on a semi-structured grid with 55,056 vertices
Performance of a simple CG solver including spMVM
Test machine: Intel Core2 ("Penryn"), 2.26 GHz

[Figure: MFlop/s of the CG solver: object-oriented version 82 MFlop/s vs. classical CRS version 610 MFlop/s, i.e., almost 8X.]
A few comments on programming language/style

Don't be religious about the programming style!
Adapt the style to the problem and remember:
PERFORMANCE * FLEXIBILITY = constant
If you program C++ like Fortran, you get Fortran performance.
BUT: do students still know Fortran?!
2. Memory gap and complexity increase further
Don't forget main memory access optimizations: Lattice Boltzmann method

Boltzmann equation and discretization of particle velocity space:

  $\partial_t f + \vec{\xi} \cdot \nabla f = -\frac{1}{\lambda}\,\left[ f - f^{(0)} \right]$

  $\vec{\xi}$ … particle velocity, $f^{(0)}$ … equilibrium distribution function, $\lambda$ … relaxation time

Discretized in velocity space ($\vec{\xi}_\alpha$ determined by the discretization scheme):

  $\partial_t f_\alpha + \vec{\xi}_\alpha \cdot \nabla f_\alpha = -\frac{1}{\lambda}\,\left[ f_\alpha - f_\alpha^{(eq)} \right]$

  $f_\alpha(\vec{x},t) = f(\vec{x}, \vec{\xi}_\alpha, t)$,  $f_\alpha^{(eq)}(\vec{x},t) = f^{(0)}(\vec{x}, \vec{\xi}_\alpha, t)$
Collision step:  $\tilde{f}_\alpha(\vec{x},t) = f_\alpha(\vec{x},t) - \omega\,\left[ f_\alpha(\vec{x},t) - f_\alpha^{(eq)}(\vec{x},t) \right]$

Streaming step:  $f_\alpha(\vec{x} + \vec{e}_\alpha\,\delta t,\ t + \delta t) = \tilde{f}_\alpha(\vec{x},t)$
Often programming follows the physical equations, but doing collision and streaming separately doubles main memory transfer!
Don’t forget main memory access optimizationsThis is not just fun with some kernels…
waLBerla: Widely applicable LB solver from Erlangen (Uli Rüde’s group)
waLBerla (C++)
Code management, standard
implementations Low‐levelkernels foroptimized
architecture‐specific
computations (in C++, CUDA,
Assembler)
Don’t forget main memory access optimizationsLattice Boltzmann method
double precision F(0:18,0:iMax+1,0:jMax+1,0:kMax+1,0:1)do k=1,kMax
do j=1,jMaxdo i=1,iMax
if( fluidcell(x,y,z) ) thenLOAD F(0:18,i,j,k,t)Relaxation (complex computations)SAVE F( 0,i ,j ,k ,t+1)SAVE F( 1,i+1,j+1,k ,t+1)SAVE F( 2,i ,j+1,k ,t+1)SAVE F( 3,i-1,j+1,k ,t+1)…SAVE F(18,i ,j-1,k-1,t+1)
endifenddo
enddoenddo
LD 1-2 Cachelines (cont. access)
LD & ST 19 Cachelines
Collide Step
Stream Step
#load from main memory: 19*iMax*jMax*kMax + 19*iMax*jMax*kMax
#store to main memory: 19*iMax*jMax*kMax
If cache line of store operation is not in cache it must be loaded first (“write allocate”)!
Data layoutF( Q , I , J , K)
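Spelling out the traffic per fluid cell update in double precision (a back-of-the-envelope count, consistent with the "save 1/3" statement on the next slide):

  loads (collide): 19 × 8 B; write-allocate loads for the stores: 19 × 8 B; stores: 19 × 8 B
  total = 3 × 19 × 8 B = 456 B per cell; without write allocate: 2 × 19 × 8 B = 304 B, i.e., one third less.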
Don’t forget main memory access optimizationsEnsure spatial locality
D3Q19 stencil
3 (I*J) planes fit into L3 cache size / (2 * threads)
Don’t forget main memory access optimizationsChoice of data layout and “streaming stores”
F( I , J, K , Q ) data layout:38 cache lines need to be stored in cache
Save 1/3 of datatransfer usingstreaming stores(NT-stores) bypassingcache
Effective memorybandwidth as measuredwith STREAM setsan ultimate limit
First version of CE Master student (2nd year) (seminar code)MS109Thursday
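For illustration, a minimal sketch of a non-temporal ("streaming") store with SSE2 intrinsics; the function and array names are made up, and 16-byte alignment plus an even length are assumed:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* copy src to dst without write-allocate traffic for dst */
    void stream_copy(double *dst, const double *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 2) {
            __m128d v = _mm_load_pd(src + i);  /* normal (cached) load */
            _mm_stream_pd(dst + i, v);         /* NT store: bypasses the cache,
                                                  so no write allocate occurs */
        }
        _mm_sfence();  /* order NT stores before subsequent reads/writes */
    }

In an LBM kernel, the 19 SAVEs of the stream step would use such stores; compilers can also generate them for STREAM-like loops when they can prove the target is not reused.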
Multi-Core specific features – Room for new ideas: The classical parallelization approach

do t=1,tMax
!$OMP PARALLEL DO private(…)
  do k=1,N
    do j=1,N
      do i=1,N
        y(i,j,k) = …
      enddo
    enddo
  enddo
enddo

[Figure: two cores (core0, core1) sharing a cache; both the source array x and the target array y travel through main memory in every sweep over the k/j-directions.]
Multi-Core specific features – Room for new ideas: Make use of shared resources: parallel wavefronts

core0: x(:,:,k-1:k+1) at step t → tmp(:,:,mod(k,4))
core1: tmp(:,:,mod(k-3,4):mod(k-1,4)) → x(:,:,k-2) at step t+2

Reduce main memory accesses by a factor of 2: y(:,:,:) is obsolete, so its main memory data transfers are saved!
Use a ring buffer tmp(:,:,0:3) which fits into the cache.
Sync threads/cores after each k-iteration.

[Figure: two cores sharing a cache; x(:,:,:) resides in memory, the ring buffer tmp(:,:,0:3) lives in the shared cache.]
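To make the scheme concrete, here is a serial C sketch of the ring-buffer idea for a 7-point Jacobi stencil, fusing two time steps. Grid size, array names, and the fixed (Dirichlet) boundary treatment are assumptions for the example; in the slide's parallel version the first update runs on core 0 and the second on core 1, synchronizing after each k-plane:

    #define N 128                 /* grid points per direction (assumed) */
    static double x[N][N][N];     /* holds step t, overwritten with step t+2 */
    static double tmp[4][N][N];   /* ring buffer: four k-planes of step t+1 */

    /* Jacobi update of one interior k-plane; boundary values are copied through */
    static void update_plane(double out[N][N], double in_m[N][N],
                             double in_0[N][N], double in_p[N][N])
    {
        for (int j = 0; j < N; ++j) {                 /* keep fixed boundaries */
            out[j][0] = in_0[j][0];  out[j][N-1] = in_0[j][N-1];
            out[0][j] = in_0[0][j];  out[N-1][j] = in_0[N-1][j];
        }
        for (int j = 1; j < N-1; ++j)
            for (int i = 1; i < N-1; ++i)
                out[j][i] = (in_0[j][i-1] + in_0[j][i+1] +
                             in_0[j-1][i] + in_0[j+1][i] +
                             in_m[j][i]   + in_p[j][i]) / 6.0;
    }

    void two_steps_fused(void)
    {
        for (int k = 0; k < N; ++k) {
            /* first update ("core 0"): plane k of t+1 into the ring buffer */
            if (k == 0 || k == N-1)                   /* boundary plane: copy */
                for (int j = 0; j < N; ++j)
                    for (int i = 0; i < N; ++i)
                        tmp[k % 4][j][i] = x[k][j][i];
            else
                update_plane(tmp[k % 4], x[k-1], x[k], x[k+1]);

            /* second update ("core 1"): once planes kk-1..kk+1 of t+1 sit in
               the ring buffer, plane kk = k-1 of t+2 may overwrite x */
            int kk = k - 1;
            if (kk >= 1 && kk <= N-2)
                update_plane(x[kk], tmp[(kk-1) % 4], tmp[kk % 4],
                             tmp[(kk+1) % 4]);
        }
    }

Only x(:,:,:) travels through main memory; the four tmp planes stay cache-resident as long as 4*N*N*8 bytes fit into the shared cache.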
Multi-Core specific features – Room for new ideas: Wavefront parallelization of Gauß-Seidel solver

Shared caches in multi-core processors provide fast thread synchronization and fast access to shared data structures.
FD discretization of the 3D Laplace equation: parallel lexicographic Gauß-Seidel using a pipeline approach ("threaded"); combining the threaded approach with the wavefront technique ("wavefront").

[Figure: MFlop/s vs. number of threads (1, 2, 4, 8) for the "threaded" and "wavefront" variants on an Intel Core i7-2600 (3.4 GHz, 4 cores, SMT), with the wavefront variant reaching the highest performance.]
Data locality: Anarchy vs. thread pinning

STREAM benchmark on a 12-core Intel Westmere node: no pinning vs. pinning (physical cores first).
Thread/process-core affinity is critical in ccNUMA environments.

[Figure: STREAM bandwidth for both cases on a two-socket ccNUMA node; each socket has six 2-way SMT cores (threads T0/T1), per-core and shared caches, and a memory interface (MI) to its local memory.]
Open-source toolkit LIKWID: http://code.google.com/p/likwid
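As an illustration, a minimal Linux sketch of explicit pinning from within the code; the one-thread-per-core numbering is an assumption. In practice, likwid-pin does this without source changes (e.g., likwid-pin -c 0-5 ./a.out):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <omp.h>

    /* pin OpenMP thread n to core n (assumes cores are numbered 0..N-1) */
    void pin_threads(void)
    {
    #pragma omp parallel
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(omp_get_thread_num(), &set);
            sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
        }
    }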
3. Are GPGPU-like architectures the holy grail?!
No HPC conference without a vast number of GPGPU submissions…
Trading single thread performance for parallelism: GPGPUs vs. CPUs

GPU vs. CPU "light speed" estimate:
1. Compute bound: 4-5X
2. Memory bandwidth: 2-5X

                    | Intel Core i5-2500 ("Sandy Bridge") | Intel X5650 DP node ("Westmere") | NVIDIA C2070 ("Fermi")
Cores @ clock       | 4 @ 3.3 GHz                         | 2 x 6 @ 2.66 GHz                 | 448 @ 1.1 GHz
Performance+ / core | 52.8 GFlop/s                        | 21.3 GFlop/s                     | 2.2 GFlop/s
Threads @ STREAM    | 4                                   | 12                               | 8000+
Total performance+  | 210 GFlop/s                         | 255 GFlop/s                      | 1,000 GFlop/s
STREAM BW           | 17 GB/s                             | 41 GB/s                          | 90 GB/s (ECC=1)
Transistors / TDP   | 1 billion* / 95 W                   | 1.17 billion / 95 W              | 3 billion / 238 W

* Includes on-chip GPU and PCI-Express.   + Single precision.
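The "light speed" ratios above follow directly from the table (single precision): compute bound $1000/255 \approx 3.9$ to $1000/210 \approx 4.8$, i.e., 4-5X; memory bandwidth $90/41 \approx 2.2$ to $90/17 \approx 5.3$, i.e., 2-5X.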
GPGPU application performance: Lattice Boltzmann solver "waLBerla"

[Figure: waLBerla LBM performance, GPU vs. CPU; the measured speed-up is about 2.2X. See MS109, Thursday.]
GPGPUs: MD simulation with AMBER – The continuous pain of long-term simulation

Long-term MD simulation of the human nuclear receptor hNUR77: 26,167 atoms (233 amino acids, 2 chlorine counter ions, 7,486 water molecules); AMBER/pmemd; total simulation time: 2.47 μs. Parallel scalability matters, and performance is measured in ns/day.

Y2004: 18 nodes (2 x Intel Xeon 3.2 GHz) with SDR IB: 8.6 ns/day

Y2010: 2 x 6-core Intel Xeon 2.66 GHz nodes with QDR IB (Amber/11-p11-intel11.1-intelmpi4.0):
  1 node (12 cores): 9.8 ns/day
  2 nodes (24 cores): 16.2 ns/day
  4 nodes (48 cores): 25.4 ns/day

Y2010: NVIDIA C2070 ("Fermi"), mixed SP/DP precision (Amber11-p12-cuda3.2-openmpi1.4):
  1 x C2070 (ECC=0): 14.0 ns/day
  1 x C2070 (ECC=1): 12.1 ns/day
  2 x C2070 (ECC=1): 16.5 ns/day
GPGPU application performance: Computed tomography (CT) reconstruction

General motivation: reconstruction of 3D data from 2D X-ray projection images. Here: X-ray projections acquired during an intervention.
Reconstruction should be as fast as possible:
Interventional: reduce the time the patient lies on the table
Time-resolved (4D) reconstruction
Method: 3D cone-beam reconstruction of a high-resolution C-arm CT dataset
Courtesy of J. Hornegger, H. Hofmann
GPGPU application performance: RabbitCT – 3D reconstruction, an open competition

RabbitCT: an open platform for performance comparison of back-projection implementations, based on a high-resolution C-arm CT dataset of a rabbit. Everyone is invited to submit results.
Department of Neuroradiology and Pattern Recognition Lab, Univ. Erlangen-Nuremberg
References: http://www.rabbitct.com/ and http://code.google.com/p/rabbitct/
GPGPU application performance: RabbitCT – a perfect match for GPUs only?

Original ranking (http://www5.informatik.uni-erlangen.de/research/projects/rabbitct/ranking/?ps=256):
1. NVIDIA GeForce GTX 260 (OpenCL): 5.484 ms/view
2. NVIDIA Quadro FX5800 (CUDA): 9.593 ms/view
3. Intel Core2 Duo T9300 @ 2.50 GHz: 1046.245 ms/view

Test case:
• 256^3 voxels
• 496 "views", i.e., 2D CT images of 4.5 MB each
• Cache bound on the CPU
• Single precision arithmetic

Specific GPU hardware can be used for some operations (bilinear interpolation)!
GPGPU application performance: RabbitCT – a perfect match for GPUs only?

Best available GPU number (NVIDIA GTX 260): 5.48 ms/view

Challenger: Intel Core i7-2600 (3.4 GHz; 4 cores; SMT)
Peak performance (single precision): 217.6 GFlop/s with AVX (108.8 GFlop/s with SSE)

Results with 8 threads:
C++ textbook version: 44.67 ms/view
SSE intrinsics with advanced optimizations: 23.78 ms/view
SSE assembly kernel (1 week of work): 15.28 ms/view

That is a 70X speed-up vs. the old CPU reference number (cf. previous slide); AVX assembly is ongoing work.

More powerful CPU systems:
Intel Westmere (2 x 6 cores): 7.63 ms/view (24 threads)
Intel Nehalem EX (4 x 8 cores): 4.25 ms/view (64 threads) – but this is a 30k+ USD system…
GPGPUs: continuing a long-term trend?

"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?"
(Attributed to Seymour Cray)
4. New parallel programming models
The end of MPI?
The rise of UPC, CoArray Fortran, Chapel, X10,…?
The end of MPI? "MPI + X" should do the job in most cases

Why MPI? After 20 years it is still not perfect, but it works. Industry finally did accept it, after some time. Its efficiency does not depend on specific hardware or compiler capabilities.

Hybrid "MPI + X" approach: use MPI at the coarsest parallel level and the slowest communication path; use another parallel programming model (X = OpenMP, UPC, CoArray, CUDA/GPU) at the intermediate, preferably shared-memory, level. And do not forget massive SIMD architectures: Intel's AVX performs 4 double precision floating point operations per instruction; Intel's MIC (formerly known as Larrabee) can operate on 512-bit registers and instructions.
MPI + X programming model: Hybrid MPI+OpenMP

If your code scales perfectly with MPI – do not touch it!

Hybrid MPI+OpenMP is beneficial, e.g., if load imbalance limits scalability, if a separate communication thread is advantageous, or if the physical problem limits MPI scalability… (a minimal skeleton follows below)
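A minimal hybrid skeleton, for illustration only (the thread-support level and the compute loop are placeholders): MPI between nodes, OpenMP threads within a node, with all MPI calls made outside the threaded region:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* FUNNELED: only the master thread will make MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
        {
            /* compute on this rank's subdomain with all threads ... */
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }
        /* ... halo exchange with MPI outside the parallel region ... */
        MPI_Finalize();
        return 0;
    }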
Example: parallel spMVM. Problem: time evolution of finite quantum systems ("Holstein model"); matrix dimension D_Matrix = 6.2 * 10^6, NNZ = 100 * 10^6. There is a competition between balanced load and communication.
GPU reference: NVIDIA C2070 (ECC=1)
MPI + X programming model: Hybrid parallel spMVM

An "MPI progress thread" helps. Reducing the number of MPI processes helps. Thread-core affinity is critical (pinning!).
Intel Westmere cluster with QDR IB
Acknowledgements

RRZE group: G. Hager, T. Zeiser, J. Treibig, G. Schubert, J. Habich, M. Wittmann
waLBerla: U. Rüde, C. Feichtinger, …
Several of our CE students…
BMBF project SKALB (01 IH08003A)
KONWIHR-II projects OMI4papps, HQS@HPC

Questions?

http://www.rrze.uni-erlangen.de/hpc/