Application Performance in the Multi-Core Era –
Lessons to Be Learned for Exascale Computing
(“Tales from the trenches”)
Gerhard Wellein
Department of Computer Science / Erlangen Regional Computing Center
Friedrich-Alexander-University Erlangen-Nuremberg
SIAM Conference on Computational Science and Engineering 2011
Computers - Pandora’s Box?
CSE has benefited enormously from technological advancements in semiconductor technology, yet the focus of CSE is often on applications and methods.
[Diagram: the Computational Science and Engineering triangle: Application, Numerical Methods/Math, Computer.]
Ken Olsen (DEC): "Software comes from heaven when you have good hardware"
[Figure: Intel x86 clock frequency in MHz (log scale, 1 to 10,000) vs. year, 1971-2009.]
No reason to touch Pandora's box until 2004/5:
Exponential growth of CPU clock speed for 15+ years (Intel x86 clock speed).
Better architectural features:
• Pipelining
• Superscalarity
• SIMD / vector ops
• Larger caches
BUT: no fundamental architectural changes until 2004/5.
The driving force behind: Moore’s law
Electronics Magazine, April 1965: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year… Certainly over the short term this rate can be expected to continue, if not to increase."
Today's transistor counts: Intel Nehalem EX: 2.3 billion; NVIDIA Fermi: 3 billion. (Source: Intel Corp.)
Trading single thread performance for parallelism
Power consumption limits clock speed: P ~ f^2 (worst case ~ f^3). The core supply voltage approaches a lower limit: V_C ~ 1 V. TDP approaches an economical limit: ~80 W, …, 130 W.
Moore's law is still valid… more cores + new on-chip functionality (PCIe, GPU).
Be prepared for more cores with lower complexity and clock speed!
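As a back-of-the-envelope illustration (our numbers, not from the slide) of why trading clock speed for cores pays off: dynamic power is $P = C\,V_C^2\,f$, and since $V_C$ scales roughly with $f$, $P \propto f^3$. Two cores at $0.8f$ then deliver $1.6\times$ the throughput of one core at $f$ for about the same power, since $2 \cdot 0.8^3 \approx 1.02$.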
Processor (year)      | P5 / 80586 (1993) | Pentium 3 (1999) | Pentium 4 (2003)  | Core i7-960 (2009, quad-core)
Clock speed           | 66 MHz            | 600 MHz          | 2800 MHz          | 3200 MHz
TDP @ supply voltage  | 16 W @ VC = 5 V   | 23 W @ VC = 2 V  | 68 W @ VC = 1.5 V | 130 W @ VC = 1.3 V
Process / transistors | 800 nm / 3 M      | 250 nm / 28 M    | 130 nm / 55 M     | 45 nm / 730 M
Parallelism and GPGPUs – driving the HPC ecosystem from PFlop/s to EFlop/s
TOP 10 supercomputers (Nov. 2010)

Rank | Site                                        | Manufacturer | Computer                                               | Country | Cores   | Rmax [TFlop/s] | Power [MW]
1    | National Supercomputer Center in Tianjin    | NUDT         | Tianhe-1A (NUDT TH MPP, Xeon 6C + NVIDIA, FT-1000 8C)  | China   | 186,368 | 2,566          | 4.04
2    | Oak Ridge National Laboratory               | Cray         | Jaguar (Cray XT5, HC 2.6 GHz)                          | USA     | 224,162 | 1,759          | 6.95
3    | National Supercomputing Centre in Shenzhen  | Dawning      | Nebulae (TC3600 Blade, Intel X5650, NVIDIA Tesla C2050)| China   | 120,640 | 1,271          | 2.58
4    | GSIC, Tokyo Institute of Technology         | NEC/HP       | TSUBAME-2 (HP ProLiant, Xeon 6C, NVIDIA)               | Japan   | 73,278  | 1,192          | 1.40
5    | DOE/SC/LBNL/NERSC                           | Cray         | Hopper (Cray XE6, 6C 2.1 GHz)                          | USA     | 153,408 | 1,054          | 2.91
6    | Commissariat a l'Energie Atomique (CEA)     | Bull         | Tera 100 (bullx super-node S6010/S6030)                | France  | 138,368 | 1,050          | 4.59
7    | DOE/NNSA/LANL                               | IBM          | Roadrunner (BladeCenter QS22/LS21)                     | USA     | 122,400 | 1,042          | 2.34
8    | University of Tennessee                     | Cray         | Kraken (Cray XT5 HC 2.36 GHz)                          | USA     | 98,928  | 831.7          | 3.09
9    | Forschungszentrum Juelich (FZJ)             | IBM          | Jugene (Blue Gene/P Solution)                          | Germany | 294,912 | 825.5          | 2.26
10   | DOE/NNSA/LANL/SNL                           | Cray         | Cielo (Cray XE6, 6C 2.4 GHz)                           | USA     | 107,152 | 816.6          | 2.95
Performance projection based on TOP500
Large investments in hardware AND software are on the way to enable ExaFlop/s (10^18 Flop/second) computers by the end of the decade.
[Figure: TOP500 performance development, 1994-2018 (log scale, 0.1 GFlop/s to 1 EFlop/s): Rmax of the N=1 and N=500 systems plus Intel desktop CPUs; the milestones 1 Gflop/s, 1 Tflop/s, 1 Pflop/s are marked, and the extrapolation reaches 1 Eflop/s near 2018.]
Potential Future Architectures
Systems                    | 2009     | 2015 (+1/-0)                                 | 2018 (+1/-0)
System peak                | 2 Peta   | 100-300 Peta                                 | 1 Exa
Power                      | 6 MW     | ~15 MW                                       | ~20 MW
System memory              | 0.3 PB   | 5 PB                                         | 64 PB (+)
Node performance           | 125 GF   | 0.5 TF or 7 TF                               | 1-2 TF or 10 TF
Node memory BW             | 25 GB/s  | 1-2 TB/s                                     | 2-4 TB/s
Node concurrency           | 12       | O(100)                                       | O(1k) or 10k
Total node interconnect BW | 3.5 GB/s | 100-200 GB/s                                 | 200-400 GB/s
System size (nodes)        | 18,700   | 50,000 or 500,000                            | O(100,000) or O(1M)
Total concurrency          | 225,000  | O(100,000,000) * O(10)-O(50) to hide latency | O(billion) * O(10) to O(100) for latency hiding
Storage                    | 15 PB    | 150 PB                                       | 500-1000 PB (>10x system memory is min)
I/O                        | 0.2 TB/s | 10 TB/s                                      | 60 TB/s (how long to drain the machine?)
MTTI                       | days     | O(1 day)                                     | O(1 day)

By courtesy of John Shalf (LBNL) / DOE Exascale Steering Committee
Trends towards ExaFlops computing
1. Reduced single-core complexity: single-thread performance will be crucial; flexibility is traded for performance.
2. The memory gap and complexity increase further: do not forget basic main memory access optimization! We live in a world of ccNUMA architectures with data locality effects. Multi-core processors have shared caches: why do we ignore them?
3. Are GPGPU-like architectures the holy grail?!
4. Total concurrency O(10^9): MPI has not been designed for that level of concurrency (O(10^2) concurrency in 1993). New programming models?
1. Reduced single thread performance – trading flexibility for performance
A simple question with religious dimensions…
What is the right programming language/style?!
C / C++ / Java / FORTRAN ?
Impact of programming language/style: A representative example?!

Sparse matrix-vector multiplication (e.g., finite element method): y = y + M*x
The classical Fortran/C approach: Compressed Row Storage (CRS) for matrix M
Sparse MVM code snippet:
for(i = 0; i < number_of_unknowns; ++i){      /* loop over matrix rows */
  for(j = row[i]; j < row[i+1]; ++j){         /* nonzeros of row i */
    y[i] = y[i] + entry[j] * x[column[j]];    /* entry: values, column: column indices */
  }
}
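For a self-contained picture, here is a minimal runnable sketch with the three CRS arrays filled for a small example matrix (the values and the 0-based layout are our assumptions for illustration):

    #include <stdio.h>

    /* CRS storage of the 4x4 example matrix (values made up):
     *   [10  0  0 -2]
     *   [ 3  9  0  0]
     *   [ 0  7  8  7]
     *   [ 3  0  8  7]
     */
    int    row[5]     = {0, 2, 4, 7, 10};   /* start of each row in entry[] */
    int    column[10] = {0, 3, 0, 1, 1, 2, 3, 0, 2, 3};
    double entry[10]  = {10, -2, 3, 9, 7, 8, 7, 3, 8, 7};

    int main(void)
    {
        double x[4] = {1, 1, 1, 1}, y[4] = {0, 0, 0, 0};
        int number_of_unknowns = 4;

        for (int i = 0; i < number_of_unknowns; ++i)
            for (int j = row[i]; j < row[i+1]; ++j)
                y[i] += entry[j] * x[column[j]];

        for (int i = 0; i < 4; ++i)
            printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }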
Impact of programming language/style: A representative example?!

Sparse matrix-vector multiplication (e.g., finite element method): y = y + M*x

The object-oriented C++ approach: consider each row of M as an object, e.g. the stencil of a node in FEM; matrix M is a vector of "stencils".

// Class Stencil
class Stencil {
  int     m_row_stencil_length;
  double *m_row_stencil;
  int    *m_row_position;
};

for(i = 0; i < number_of_unknowns; ++i){
  for(j = 0; j < stencil_array[i].m_row_stencil_length; ++j){
    y.m_vektor[i] = y.m_vektor[i]
                  + stencil_array[i].m_row_stencil[j]
                  * x.m_vektor[stencil_array[i].m_row_position[j]];
  }
}
Impact of programming language/style: A representative example?!

Sparse matrix-vector multiplication (e.g., finite element method): y = y + M*x

Problem: FEM on a semi-structured grid with 55,056 vertices
Performance of a simple CG solver including spMVM
Test machine: Intel Core2 ("Penryn"), 2.26 GHz

[Figure: MFlop/s of the CG solver: object-oriented version 82 MFlop/s vs. classical CRS version 610 MFlop/s, i.e., almost 8X.]
A few comments on programming language/style

Don't be religious about the programming style!
Adapt the style to the problem and remember:
PERFORMANCE * FLEXIBILITY = constant
If you program C++ like Fortran, you get Fortran performance.
BUT: do students still know Fortran?!
2. Memory gap and complexity increase further
Don't forget main memory access optimizations: Lattice Boltzmann method

Boltzmann equation and discretization of particle velocity space:

  $\partial_t f + \vec{\xi} \cdot \nabla f = -\frac{1}{\lambda}\,\left[ f - f^{(0)} \right]$

  $\vec{\xi}$ … particle velocity, $f^{(0)}$ … equilibrium distribution function, $\lambda$ … relaxation time

Discretized in velocity space ($\vec{\xi}_\alpha$ determined by the discretization scheme):

  $\partial_t f_\alpha + \vec{\xi}_\alpha \cdot \nabla f_\alpha = -\frac{1}{\lambda}\,\left[ f_\alpha - f_\alpha^{(eq)} \right]$

  $f_\alpha(\vec{x},t) = f(\vec{x}, \vec{\xi}_\alpha, t)$,  $f_\alpha^{(eq)}(\vec{x},t) = f^{(0)}(\vec{x}, \vec{\xi}_\alpha, t)$
Collision step:  $\tilde{f}_\alpha(\vec{x},t) = f_\alpha(\vec{x},t) - \omega\,\left[ f_\alpha(\vec{x},t) - f_\alpha^{(eq)}(\vec{x},t) \right]$

Streaming step:  $f_\alpha(\vec{x} + \vec{e}_\alpha\,\delta t,\ t + \delta t) = \tilde{f}_\alpha(\vec{x},t)$
Often programming follows the physical equations, but doing collision and streaming separately doubles main memory transfer!
Don’t forget main memory access optimizationsThis is not just fun with some kernels…
waLBerla: Widely applicable LB solver from Erlangen (Uli Rüde’s group)
waLBerla (C++)
Code management, standard
implementations Low‐levelkernels foroptimized
architecture‐specific
computations (in C++, CUDA,
Assembler)
Don’t forget main memory access optimizationsLattice Boltzmann method
double precision F(0:18,0:iMax+1,0:jMax+1,0:kMax+1,0:1)do k=1,kMax
do j=1,jMaxdo i=1,iMax
if( fluidcell(x,y,z) ) thenLOAD F(0:18,i,j,k,t)Relaxation (complex computations)SAVE F( 0,i ,j ,k ,t+1)SAVE F( 1,i+1,j+1,k ,t+1)SAVE F( 2,i ,j+1,k ,t+1)SAVE F( 3,i-1,j+1,k ,t+1)…SAVE F(18,i ,j-1,k-1,t+1)
endifenddo
enddoenddo
LD 1-2 Cachelines (cont. access)
LD & ST 19 Cachelines
Collide Step
Stream Step
#load from main memory: 19*iMax*jMax*kMax + 19*iMax*jMax*kMax
#store to main memory: 19*iMax*jMax*kMax
If cache line of store operation is not in cache it must be loaded first (“write allocate”)!
Data layoutF( Q , I , J , K)
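Spelling out the traffic per fluid cell update in double precision (a back-of-the-envelope count, consistent with the "save 1/3" statement on the next slide):

  loads (collide): 19 × 8 B; write-allocate loads for the stores: 19 × 8 B; stores: 19 × 8 B
  total = 3 × 19 × 8 B = 456 B per cell; without write allocate: 2 × 19 × 8 B = 304 B, i.e., one third less.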
Don’t forget main memory access optimizationsEnsure spatial locality
D3Q19 stencil
3 (I*J) planes fit into L3 cache size / (2 * threads)
Don’t forget main memory access optimizationsChoice of data layout and “streaming stores”
F( I , J, K , Q ) data layout:38 cache lines need to be stored in cache
Save 1/3 of datatransfer usingstreaming stores(NT-stores) bypassingcache
Effective memorybandwidth as measuredwith STREAM setsan ultimate limit
First version of CE Master student (2nd year) (seminar code)MS109Thursday
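For illustration, a minimal sketch of a non-temporal ("streaming") store with SSE2 intrinsics; the function and array names are made up, and 16-byte alignment plus an even length are assumed:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* copy src to dst without write-allocate traffic for dst */
    void stream_copy(double *dst, const double *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 2) {
            __m128d v = _mm_load_pd(src + i);  /* normal (cached) load */
            _mm_stream_pd(dst + i, v);         /* NT store: bypasses the cache,
                                                  so no write allocate occurs */
        }
        _mm_sfence();  /* order NT stores before subsequent reads/writes */
    }

In an LBM kernel, the 19 SAVEs of the stream step would use such stores; compilers can also generate them for STREAM-like loops when they can prove the target is not reused.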
Multi-Core specific features – Room for new ideas: The classical parallelization approach

do t=1,tMax
!$OMP PARALLEL DO private(…)
  do k=1,N
    do j=1,N
      do i=1,N
        y(i,j,k) = …
      enddo
    enddo
  enddo
enddo

[Figure: two cores (core0, core1) sharing a cache; both the source array x and the target array y travel through main memory in every sweep over the k/j-directions.]
Multi-Core specific features – Room for new ideas: Make use of shared resources: parallel wavefronts

core0: x(:,:,k-1:k+1) at step t → tmp(:,:,mod(k,4))
core1: tmp(:,:,mod(k-3,4):mod(k-1,4)) → x(:,:,k-2) at step t+2

Reduce main memory accesses by a factor of 2: y(:,:,:) is obsolete, so its main memory data transfers are saved!
Use a ring buffer tmp(:,:,0:3) which fits into the cache.
Sync threads/cores after each k-iteration.

[Figure: two cores sharing a cache; x(:,:,:) resides in memory, the ring buffer tmp(:,:,0:3) lives in the shared cache.]
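To make the scheme concrete, here is a serial C sketch of the ring-buffer idea for a 7-point Jacobi stencil, fusing two time steps. Grid size, array names, and the fixed (Dirichlet) boundary treatment are assumptions for the example; in the slide's parallel version the first update runs on core 0 and the second on core 1, synchronizing after each k-plane:

    #define N 128                 /* grid points per direction (assumed) */
    static double x[N][N][N];     /* holds step t, overwritten with step t+2 */
    static double tmp[4][N][N];   /* ring buffer: four k-planes of step t+1 */

    /* Jacobi update of one interior k-plane; boundary values are copied through */
    static void update_plane(double out[N][N], double in_m[N][N],
                             double in_0[N][N], double in_p[N][N])
    {
        for (int j = 0; j < N; ++j) {                 /* keep fixed boundaries */
            out[j][0] = in_0[j][0];  out[j][N-1] = in_0[j][N-1];
            out[0][j] = in_0[0][j];  out[N-1][j] = in_0[N-1][j];
        }
        for (int j = 1; j < N-1; ++j)
            for (int i = 1; i < N-1; ++i)
                out[j][i] = (in_0[j][i-1] + in_0[j][i+1] +
                             in_0[j-1][i] + in_0[j+1][i] +
                             in_m[j][i]   + in_p[j][i]) / 6.0;
    }

    void two_steps_fused(void)
    {
        for (int k = 0; k < N; ++k) {
            /* first update ("core 0"): plane k of t+1 into the ring buffer */
            if (k == 0 || k == N-1)                   /* boundary plane: copy */
                for (int j = 0; j < N; ++j)
                    for (int i = 0; i < N; ++i)
                        tmp[k % 4][j][i] = x[k][j][i];
            else
                update_plane(tmp[k % 4], x[k-1], x[k], x[k+1]);

            /* second update ("core 1"): once planes kk-1..kk+1 of t+1 sit in
               the ring buffer, plane kk = k-1 of t+2 may overwrite x */
            int kk = k - 1;
            if (kk >= 1 && kk <= N-2)
                update_plane(x[kk], tmp[(kk-1) % 4], tmp[kk % 4],
                             tmp[(kk+1) % 4]);
        }
    }

Only x(:,:,:) travels through main memory; the four tmp planes stay cache-resident as long as 4*N*N*8 bytes fit into the shared cache.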
Multi-Core specific features – Room for new ideas: Wavefront parallelization of Gauß-Seidel solver

Shared caches in multi-core processors provide fast thread synchronization and fast access to shared data structures.
FD discretization of the 3D Laplace equation: parallel lexicographic Gauß-Seidel using a pipeline approach ("threaded"); combining the threaded approach with the wavefront technique ("wavefront").

[Figure: MFlop/s vs. number of threads (1, 2, 4, 8) for the "threaded" and "wavefront" variants on an Intel Core i7-2600 (3.4 GHz, 4 cores, SMT), with the wavefront variant reaching the highest performance.]
Data locality: Anarchy vs. thread pinning

STREAM benchmark on a 12-core Intel Westmere node: no pinning vs. pinning (physical cores first).
Thread/process-core affinity is critical in ccNUMA environments.

[Figure: STREAM bandwidth for both cases on a two-socket ccNUMA node; each socket has six 2-way SMT cores (threads T0/T1), per-core and shared caches, and a memory interface (MI) to its local memory.]
Open-source toolkit LIKWID: http://code.google.com/p/likwid
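As an illustration, a minimal Linux sketch of explicit pinning from within the code; the one-thread-per-core numbering is an assumption. In practice, likwid-pin does this without source changes (e.g., likwid-pin -c 0-5 ./a.out):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <omp.h>

    /* pin OpenMP thread n to core n (assumes cores are numbered 0..N-1) */
    void pin_threads(void)
    {
    #pragma omp parallel
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(omp_get_thread_num(), &set);
            sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
        }
    }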
3. Are GPGPU-like architectures the holy grail?!
No HPC conference without a vast number of GPGPU submissions…
Trading single thread performance for parallelism: GPGPUs vs. CPUs

GPU vs. CPU "light speed" estimate:
1. Compute bound: 4-5X
2. Memory bandwidth: 2-5X

                    | Intel Core i5-2500 ("Sandy Bridge") | Intel X5650 DP node ("Westmere") | NVIDIA C2070 ("Fermi")
Cores @ clock       | 4 @ 3.3 GHz                         | 2 x 6 @ 2.66 GHz                 | 448 @ 1.1 GHz
Performance+ / core | 52.8 GFlop/s                        | 21.3 GFlop/s                     | 2.2 GFlop/s
Threads @ STREAM    | 4                                   | 12                               | 8000+
Total performance+  | 210 GFlop/s                         | 255 GFlop/s                      | 1,000 GFlop/s
STREAM BW           | 17 GB/s                             | 41 GB/s                          | 90 GB/s (ECC=1)
Transistors / TDP   | 1 billion* / 95 W                   | 1.17 billion / 95 W              | 3 billion / 238 W

* Includes on-chip GPU and PCI-Express.   + Single precision.
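The "light speed" ratios above follow directly from the table (single precision): compute bound $1000/255 \approx 3.9$ to $1000/210 \approx 4.8$, i.e., 4-5X; memory bandwidth $90/41 \approx 2.2$ to $90/17 \approx 5.3$, i.e., 2-5X.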
GPGPU application performance: Lattice Boltzmann solver "waLBerla"

[Figure: waLBerla LBM performance, GPU vs. CPU; the measured speed-up is about 2.2X. See MS109, Thursday.]
GPGPUs: MD simulation with AMBER – The continuous pain of long-term simulation

Long-term MD simulation of the human nuclear receptor hNUR77: 26,167 atoms (233 amino acids, 2 chlorine counter ions, 7,486 water molecules); AMBER/pmemd; total simulation time: 2.47 μs. Parallel scalability matters, and performance is measured in ns/day.

Y2004: 18 nodes (2 x Intel Xeon 3.2 GHz) with SDR IB: 8.6 ns/day

Y2010: 2 x 6-core Intel Xeon 2.66 GHz nodes with QDR IB (Amber/11-p11-intel11.1-intelmpi4.0):
  1 node (12 cores): 9.8 ns/day
  2 nodes (24 cores): 16.2 ns/day
  4 nodes (48 cores): 25.4 ns/day

Y2010: NVIDIA C2070 ("Fermi"), mixed SP/DP precision (Amber11-p12-cuda3.2-openmpi1.4):
  1 x C2070 (ECC=0): 14.0 ns/day
  1 x C2070 (ECC=1): 12.1 ns/day
  2 x C2070 (ECC=1): 16.5 ns/day
GPGPU application performance: Computed tomography (CT) reconstruction

General motivation: reconstruction of 3D data from 2D X-ray projection images. Here: X-ray projections acquired during an intervention.
Reconstruction should be as fast as possible:
Interventional: reduce the time the patient lies on the table
Time-resolved (4D) reconstruction
Method: 3D cone-beam reconstruction of a high-resolution C-arm CT dataset
Courtesy of J. Hornegger, H. Hofmann
GPGPU application performance: RabbitCT – 3D reconstruction, an open competition

RabbitCT: an open platform for performance comparison of back-projection implementations, based on a high-resolution C-arm CT dataset of a rabbit. Everyone is invited to submit results.
Department of Neuroradiology and Pattern Recognition Lab, Univ. Erlangen-Nuremberg
References: http://www.rabbitct.com/ and http://code.google.com/p/rabbitct/
GPGPU application performance: RabbitCT – a perfect match for GPUs only?

Original ranking (http://www5.informatik.uni-erlangen.de/research/projects/rabbitct/ranking/?ps=256):
1. NVIDIA GeForce GTX 260 (OpenCL): 5.484 ms/view
2. NVIDIA Quadro FX5800 (CUDA): 9.593 ms/view
3. Intel Core2 Duo T9300 @ 2.50 GHz: 1046.245 ms/view

Test case:
• 256^3 voxels
• 496 "views", i.e., 2D CT images of 4.5 MB each
• Cache bound on the CPU
• Single precision arithmetic

Specific GPU hardware can be used for some operations (bilinear interpolation)!
GPGPU application performance: RabbitCT – a perfect match for GPUs only?

Best available GPU number (NVIDIA GTX 260): 5.48 ms/view

Challenger: Intel Core i7-2600 (3.4 GHz; 4 cores; SMT)
Peak performance (single precision): 217.6 GFlop/s with AVX (108.8 GFlop/s with SSE)

Results with 8 threads:
C++ textbook version: 44.67 ms/view
SSE intrinsics with advanced optimizations: 23.78 ms/view
SSE assembly kernel (1 week of work): 15.28 ms/view

That is a 70X speed-up vs. the old CPU reference number (cf. previous slide); AVX assembly is ongoing work.

More powerful CPU systems:
Intel Westmere (2 x 6 cores): 7.63 ms/view (24 threads)
Intel Nehalem EX (4 x 8 cores): 4.25 ms/view (64 threads) – but this is a 30k+ USD system…
GPGPUs: continuing a long-term trend?

"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?"
(Attributed to Seymour Cray)
4. New parallel programming models
The end of MPI?
The rise of UPC, CoArray Fortran, Chapel, X10,…?
The end of MPI? "MPI + X" should do the job in most cases

Why MPI? After 20 years it is still not perfect, but it works. Industry finally did accept it, after some time. Its efficiency does not depend on specific hardware or compiler capabilities.

Hybrid "MPI + X" approach: use MPI at the coarsest parallel level and the slowest communication path; use another parallel programming model (X = OpenMP, UPC, CoArray, CUDA/GPU) at the intermediate, preferably shared-memory, level. And do not forget massive SIMD architectures: Intel's AVX performs 4 double precision floating point operations per instruction; Intel's MIC (formerly known as Larrabee) can operate on 512-bit registers and instructions.
MPI + X programming model: Hybrid MPI+OpenMP

If your code scales perfectly with MPI – do not touch it!

Hybrid MPI+OpenMP is beneficial, e.g., if load imbalance limits scalability, if a separate communication thread is advantageous, or if the physical problem limits MPI scalability… (a minimal skeleton follows below)
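A minimal hybrid skeleton, for illustration only (the thread-support level and the compute loop are placeholders): MPI between nodes, OpenMP threads within a node, with all MPI calls made outside the threaded region:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* FUNNELED: only the master thread will make MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
        {
            /* compute on this rank's subdomain with all threads ... */
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }
        /* ... halo exchange with MPI outside the parallel region ... */
        MPI_Finalize();
        return 0;
    }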
Example: parallel spMVM. Problem: time evolution of finite quantum systems ("Holstein model"); matrix dimension D_Matrix = 6.2 * 10^6, NNZ = 100 * 10^6. There is a competition between balanced load and communication.
GPU reference: NVIDIA C2070 (ECC=1)
MPI + X programming model: Hybrid parallel spMVM

An "MPI progress thread" helps. Reducing the number of MPI processes helps. Thread-core affinity is critical (pinning!).
Intel Westmere cluster with QDR IB
Acknowledgements

RRZE group: G. Hager, T. Zeiser, J. Treibig, G. Schubert, J. Habich, M. Wittmann
waLBerla: U. Rüde, C. Feichtinger, …
Several of our CE students…
BMBF project SKALB (01 IH08003A)
KONWIHR-II projects OMI4papps, HQS@HPC

Questions?

http://www.rrze.uni-erlangen.de/hpc/