1

Hardware Aware Programming - Exploiting the memory hierarchy and parallel multicore processors
Lehrstuhl für Informatik 10 (Systemsimulation), Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
Canberra, July 2008
U. Rüde (LSS Erlangen, [email protected]), joint work with J. Götz, M. Stürmer, K. Iglberger, S. Donath, C. Feichtinger, T. Preclik, T. Gradl, C. Freundl, H. Köstler, T. Pohl, D. Ritter, D. Bartuschat, P. Neumann, G. Wellein, G. Hager, T. Zeiser, J. Habich (RRZE), N. Thürey (ETH Zürich)



2

Overview
• Towards PetaScale and Beyond
• Optimizing Memory Access and Cache-Aware Programming
• Massively Parallel Multigrid: Performance Results
• MultiCore Architectures
• Case study: Lattice Boltzmann Methods for Flow Simulation on the PlayStation
• Conclusions


3

Part I

Towards PetaScale and Beyond


4

HHG Motivation I: Structured vs. Unstructured Grids

[Bar chart (on Hitachi SR 8000): MFlop/s, scale 0-8000, for JDS (sparse) vs. stencil (structured) storage at problem sizes 729; 4,913; 35,937; 274,625; and 2,146,689 unknowns.]

MFlop/s rates for matrix-vector multiplication on one node: structured versus sparse matrix. Many emerging architectures have similar properties (Cell, GPU).

Extinct Dinosaur HLRB-I: Hitachi SR 8000
No. 5 in the TOP 500 in 2000, 2 TFlop/s


5

HHG Motivation II: DiMe Project

Started jointly with Linda Stals in 1996 in Augsburg!
Cache optimizations for sparse matrix/stencil codes (1996-2007)
Efficient hardware-optimized
• multigrid solvers
• lattice Boltzmann CFD
  - with free surface flow
  - with fluid-structure interaction

www10.informatik.uni-erlangen.de/de/Research/Projects/DiME/

DiMe: Data-Local Iterative Methods for the Efficient Solution of Partial Differential Equations (1996-2007)


6

Evolution of Semiconductor Technology

The ITRS (International Technology Roadmap for Semiconductors) collects trends in semiconductor technology.
See http://www.itrs.net/reports.html


7

Where does Computer Architecture Go?Computer architects have capitulated: It may not be possible anymore to exploit progress in semiconductor technology for automatic performance improvements

Even today a single core CPU is a highly parallel system:superscalar execution, complex pipeline, ... and additional tricksInternal parallelism is a major reason for the performance increases until now, but ... There is a limited amount of parallelism that can be exploited automatically

Multi-core systems concede the architects' defeat: even with ever more transistors, architects can no longer build faster single-core CPUs, and clock rates increase only slowly (due to power considerations).

Therefore architects have started to put several cores on a chip:programmers must use them directly


8

What are the consequences?

For the application developers “the free lunch is over”

Without explicitly parallel algorithms, the performance potential cannot be used any more

For HPC: CPUs will have 2, 4, 8, 16, ..., 128, ..., ??? cores - maybe sooner than we are ready for it. We will have to deal with systems with millions of cores.


9

Memory access as a major bottleneck

25+ years ago - Telefunken TR440: 16,000 words
Memory fills a rack with 8 × 20 drawers, each holding 100 data cards: one rack of about 0.8 m × 2 m

Today - HLRB-II (Altix 4700): 5 × 10^12 words
The same packaging would fill a rack 2 m high and reaching roughly from the earth to the moon,
or, better organized, a rack system
• 500 m wide (500 rows of racks)
• 2 m high (20 drawers with 100 cards each)
• 500 km long (5,000,000 columns)


10

Part II

Optimizing Memory Access andCache-Aware Programming


11

Increasing single-CPU performance by optimizing data locality

Caches work due to the locality of memory accesses (instructions + data)

(Numerically intensive) codes should exhibit:
• Spatial locality: data items accessed within a short time period are located close to each other in memory
• Temporal locality: data that has been accessed recently is likely to be accessed again in the near future

Goal: increase spatial and temporal locality in order to enhance cache utilization (cache-aware programming)
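The effect of spatial locality can be made concrete with a toy cache model. The sketch below counts misses of a direct-mapped cache for row-wise vs. column-wise traversal of a row-major array; all parameters (line size, cache size, array size) are illustrative, not any real CPU's:

```python
LINE = 64            # cache line size in bytes (illustrative)
NLINES = 512         # number of lines: a 32 KiB direct-mapped cache
WORD = 8             # element size in bytes (double)

def misses(n, order):
    """Count cache misses for an n x n row-major array traversal."""
    cache = [None] * NLINES              # tag (line number) per cache slot
    miss = 0
    for a in range(n):
        for b in range(n):
            i, j = (a, b) if order == "rows" else (b, a)
            addr = (i * n + j) * WORD    # row-major address of element (i, j)
            line = addr // LINE
            slot = line % NLINES         # direct-mapped placement
            if cache[slot] != line:
                cache[slot] = line
                miss += 1
    return miss

n = 256
print(misses(n, "rows"))   # unit stride: one miss per cache line
print(misses(n, "cols"))   # stride n: one miss per element
```

Unit-stride traversal misses only once per 64-byte line (one miss per eight doubles), while the strided traversal misses on every single access.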


12

Cache performance optimizations

• Data layout optimizations: change the data layout in memory to enhance spatial locality
• Data access optimizations: change the order of data accesses to enhance spatial and temporal locality

These transformations preserve numerical results, and their introduction can (theoretically) be automated!


13

Data access optimizations: Loop fusion

Example: red/black Gauss-Seidel iteration in 2D


14

Data access optimizations: Loop fusion (cont'd)

Code before applying the loop fusion technique (standard implementation with efficient loop ordering; Fortran semantics: column-major order, so the first index j runs fastest):

for it = 1 to numIter do
  // Red nodes
  for i = 1 to n-1 do
    for j = 1+(i+1)%2 to n-1 by 2 do
      relax(u(j,i))
    end for
  end for


15

Data access optimizations: Loop fusion (cont'd)

  // Black nodes
  for i = 1 to n-1 do
    for j = 1+i%2 to n-1 by 2 do
      relax(u(j,i))
    end for
  end for
end for

This requires two sweeps through the whole data set per single GS iteration!


16

Data access optimizations: Loop fusion (cont'd)

How the fusion technique works: [illustration not reproduced]


17

Data access optimizations: Loop fusion (cont'd)

Code after applying the loop fusion technique:

for it = 1 to numIter do
  // Update red nodes in first grid row
  for j = 1 to n-1 by 2 do
    relax(u(j,1))
  end for


18

Data access optimizations: Loop fusion (cont'd)

  // Update red and black nodes in pairs
  for i = 2 to n-1 do
    for j = 1+(i+1)%2 to n-1 by 2 do
      relax(u(j,i))
      relax(u(j,i-1))
    end for
  end for


19

Data access optimizations: Loop fusion (cont'd)

  // Update black nodes in last grid row
  for j = 2 to n-1 by 2 do
    relax(u(j,n-1))
  end for
end for

The solution vector u passes through the cache only once instead of twice per GS iteration!
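That the fused sweep is numerically identical to the two separate sweeps can be checked directly. Below is a small Python sketch of both variants (5-point stencil, u(j,i) written as u[i][j]); the grid size, right-hand side, and initial guess are made up for the test:

```python
def relax(u, f, h, i, j):
    # Gauss-Seidel update for the 5-point Laplacian
    u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]
                      + h * h * f[i][j])

def gs_rb(u, f, h, iters):
    """Two full sweeps per iteration: all red nodes, then all black nodes."""
    n = len(u) - 1
    for _ in range(iters):
        for i in range(1, n):                        # red: (i + j) even
            for j in range(1 + (i + 1) % 2, n, 2):
                relax(u, f, h, i, j)
        for i in range(1, n):                        # black: (i + j) odd
            for j in range(1 + i % 2, n, 2):
                relax(u, f, h, i, j)

def gs_rb_fused(u, f, h, iters):
    """One fused pass: red row i and black row i-1 updated in pairs."""
    n = len(u) - 1                                   # n even, as in the slides
    for _ in range(iters):
        for j in range(1, n, 2):                     # red nodes in first row
            relax(u, f, h, 1, j)
        for i in range(2, n):                        # red/black pairs
            for j in range(1 + (i + 1) % 2, n, 2):
                relax(u, f, h, i, j)                 # red   (i,   j)
                relax(u, f, h, i - 1, j)             # black (i-1, j)
        for j in range(2, n, 2):                     # black nodes in last row
            relax(u, f, h, n - 1, j)

n, h = 8, 1.0 / 8
f = [[1.0] * (n + 1) for _ in range(n + 1)]
ua = [[0.01 * i * j for j in range(n + 1)] for i in range(n + 1)]
ub = [row[:] for row in ua]
gs_rb(ua, f, h, 3)
gs_rb_fused(ub, f, h, 3)
assert ua == ub    # bitwise identical: fusion only reorders the passes
```

The black node (i-1, j) sees exactly the same neighbor values in both orderings - all of its red neighbors are already updated when it is relaxed - so the two variants agree bitwise.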


20

Data access optimizations: Loop split

The inverse transformation of loop fusion: divide the work of one loop into two loops to make the loop body less complicated, and thereby
• leverage compiler optimizations
• enhance instruction cache utilization


21

Data access optimizations: Loop blocking

Loop blocking = loop tiling
• Divide the data set into subsets (blocks) that are small enough to fit in cache
• Perform as much work as possible on the data in cache before moving on to the next block
• This is not always easy to accomplish because of data dependencies
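As a generic illustration of the idea (loop tiling in general, not the red/black Gauss-Seidel scheme of the following slides), here is a blocked matrix multiplication in Python; the matrices and block size are chosen arbitrarily:

```python
def matmul_naive(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):          # i-k-j order: unit-stride inner loop
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]
    return C

def matmul_blocked(A, B, bs):
    """Same arithmetic, but on bs x bs tiles that can stay in cache."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 24
A = [[(i * n + j) % 7 * 0.5 for j in range(n)] for i in range(n)]
B = [[(i - j) * 0.25 for j in range(n)] for i in range(n)]
assert matmul_blocked(A, B, 8) == matmul_naive(A, B)
```

Because the k-loop order per output element is unchanged, the blocked version produces bitwise-identical results; only the traversal order over the data - and hence the cache reuse - differs.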


22

Data access optimizations: Loop blocking

Example: 1D blocking for red/black GS - respect the data dependencies!


23

Data access optimizations: Loop blocking

Code after applying the 1D blocking technique (B = number of GS iterations to be blocked/combined):

for it = 1 to numIter/B do
  // Special handling: rows 1, …, 2B-1
  // Not shown here …


24

Data access optimizations: Loop blocking

  // Inner part of the 2D grid
  for k = 2*B to n-1 do
    for i = k to k-2*B+1 by -2 do
      for j = 1+(k+1)%2 to n-1 by 2 do
        relax(u(j,i))
        relax(u(j,i-1))
      end for
    end for
  end for


25

Data access optimizations: Loop blocking

  // Special handling: rows n-2B+1, …, n-1
  // Not shown here …
end for

Result: data is loaded into the cache only once per B Gauss-Seidel iterations, provided that 2*B+2 grid rows fit in the cache simultaneously. If grid rows are too large, 2D blocking can be applied.


26

Data access optimizations: Loop blocking

More complicated blocking schemes exist. Illustration: 2D square blocking


27

Part III

Towards Scalable FE Software


28

Multigrid: V-Cycle

[Diagram: one V-cycle - relax, compute residual, restrict, solve on the coarser grid (by recursion), interpolate, correct, relax.]

Goal: solve A_h u_h = f_h using a hierarchy of grids
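The cycle can be sketched in a few lines for a 1-D model problem (-u'' = f on (0,1) with homogeneous Dirichlet boundaries). This is a minimal illustration with weighted Jacobi smoothing, full-weighting restriction, and linear interpolation - not the HHG implementation; all parameters are illustrative:

```python
import numpy as np

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / h**2
    return r

def smooth(u, f, h, sweeps, w=2.0 / 3.0):
    for _ in range(sweeps):                 # weighted Jacobi: u += w D^-1 r
        u = u + w * (h**2 / 2) * residual(u, f, h)
    return u

def restrict(r):                            # full weighting, fine -> coarse
    return np.concatenate(
        ([0.0], 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2], [0.0]))

def prolong(e):                             # linear interpolation, coarse -> fine
    m = len(e) - 1
    u = np.zeros(2 * m + 1)
    u[::2] = e
    u[1::2] = 0.5 * (e[:-1] + e[1:])
    return u

def v_cycle(u, f, h, nu=2):
    if len(u) == 3:                         # coarsest grid: one unknown, solve directly
        u[1] = f[1] * h**2 / 2
        return u
    u = smooth(u, f, h, nu)                 # pre-smoothing (relax)
    rc = restrict(residual(u, f, h))        # residual + restriction
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h, nu)   # coarse solve by recursion
    u = u + prolong(ec)                     # interpolation + correction
    return smooth(u, f, h, nu)              # post-smoothing

n = 64
x = np.linspace(0.0, 1.0, n + 1)
h = 1.0 / n
f = np.pi**2 * np.sin(np.pi * x)            # exact solution: sin(pi x)
u = np.zeros(n + 1)
for _ in range(10):
    u = v_cycle(u, f, h)
```

Each V-cycle reduces the residual by a roughly h-independent factor, which is what makes multigrid an O(N) solver.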


29

Cache-optimized multigrid: the DiMEPACK library

DFG project DiME: data-local iterative methods
• Fast algorithm + fast implementation
• Correction scheme: V-cycles, FMG
• Rectangular domains
• Constant 5-/9-point stencils
• Dirichlet/Neumann boundary conditions
http://www10.informatik.uni-erlangen.de/dime


30

V(2,2) cycle - bottom line

Mflops | For what
13     | Standard 5-pt. operator
56     | Cache optimized (loop orderings, data merging, simple blocking)
150    | Constant coeff. + skewed blocking + padding
220    | Eliminating rhs if 0 everywhere but boundary


31

Parallel High Performance FE Multigrid

Parallelize "plain vanilla" multigrid:
• partition the domain
• parallelize all operations on all grids
• use clever data structures
Do not worry (so much) about coarse grids:
• idle processors?
• short messages?
• sequential dependency in the grid hierarchy?

Why we do not use conventional domain decomposition:
• DD without a coarse grid does not scale (algorithmically) and is suboptimal for large problems / many processors
• DD with coarse grids may be as efficient as multigrid but is as difficult to parallelize (the difficulty is in parallelizing the coarse grid)


32

Hierarchical Hybrid Grids (HHG)
• Unstructured input grid: resolves the geometry of the problem domain
• Patch-wise regular refinement: generates nested grid hierarchies naturally suitable for geometric multigrid algorithms
• New: modify storage formats and operations on the grid to exploit the regular substructures

Does an unstructured grid with 1,000,000,000,000 elements make sense?

HHG - Ultimate Parallel FE Performance!


33

HHG refinement example

Input Grid


34

HHG Refinement example

Refinement Level one


35

HHG Refinement example

Refinement Level Two


36

HHG Refinement example

Structured Interior


37

HHG Refinement example

Structured Interior


38

HHG Refinement example

Edge Interior


39

HHG Refinement example

Edge Interior


40

Parallel HHG Framework - Design Goals

To realize good parallel scalability:
• Minimize latency by reducing the number of messages that must be sent
• Optimize for high-bandwidth interconnects ⇒ large messages
• Avoid local copying into MPI buffers
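The first goal - few large messages instead of many small ones - amounts to packing all halo fragments bound for one neighbor into a single buffer. A Python sketch of such packing (the framing format and names are invented for illustration; HHG's actual buffers look different):

```python
import struct

def pack(fragments):
    """fragments: list of float lists -> one length-prefixed bytes buffer."""
    buf = struct.pack("<I", len(fragments))
    for frag in fragments:
        buf += struct.pack("<I", len(frag))            # fragment length
        buf += struct.pack(f"<{len(frag)}d", *frag)    # fragment payload
    return buf

def unpack(buf):
    (count,), off = struct.unpack_from("<I", buf), 4
    frags = []
    for _ in range(count):
        (n,) = struct.unpack_from("<I", buf, off); off += 4
        frags.append(list(struct.unpack_from(f"<{n}d", buf, off))); off += 8 * n
    return frags

halos = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]   # per-edge/vertex halo data
assert unpack(pack(halos)) == halos             # one message instead of three
```

One send per neighbor then replaces one send per edge/vertex fragment, trading a little packing work for far fewer message latencies.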


41

HHG for Parallelization

Use regular HHG patches for partitioning the domain


42

HHG Parallel Update Algorithm

for each vertex do
  apply operation to vertex                  // update vertex primary dependencies
end for

for each edge do
  copy from vertex interior
  apply operation to edge
  copy to vertex halo                        // update edge primary dependencies
end for

for each element do
  copy from edge/vertex interiors
  apply operation to element
  copy to edge/vertex halos                  // update secondary dependencies
end for


43

Towards Scalable FE Software

Performance Results


44

Node Performance is Difficult! (B. Gropp)
DiMe project: cache-aware multigrid (1996-...)

grid size     17³    33³    65³    129³   257³   513³
standard      1072   1344   715    677    490    579
no blocking   2445   1417   995    1065   849    819
2x blocking   2400   1913   1312   1319   1284   1282
3x blocking   2420   2389   2167   2140   2134   2049

Performance of a 3D multigrid smoother for the 7-point stencil, in Mflops, on a 1.4 GHz Itanium:
• array padding
• temporal blocking - in EPIC assembly language
• software pipelining in the extreme (M. Stürmer - J. Treibig)

Node Performance is Possible!


45

Single Processor HHG Performance on Itanium forRelaxation of a Tetrahedral Finite Element Mesh


46

Parallel scalability of a scalar elliptic problem discretized by tetrahedral finite elements. Times for 12 V(2,2) cycles on SGI Altix (Itanium 2, 1.6 GHz):

#Proc   #unkn. × 10^6   Ph. 1: sec   Ph. 2: sec   Time to sol.
4       134.2           3.16         6.38*        37.9
8       268.4           3.27         6.67*        39.3
16      536.9           3.35         6.75*        40.3
32      1,073.7         3.38         6.80*        40.6
64      2,147.5         3.53         4.92         42.3
128     4,295.0         3.60         7.06*        43.2
252     8,455.7         3.87         7.39*        46.4
504     16,911.4        3.96         5.44         47.6
2040    68,451.0        4.92         5.60         59.0
3825    128,345.7       6.90                      82.8
4080    136,902.0       5.68
6102    205,353.1       6.33
8152    273,535.7       7.43*
9170    307,694.1       7.75*

Largest problem solved to date: 3.07 × 10^11 DOFs on 9170 processors, 7.8 s per V(2,2) cycle.

B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: ISC Award 2006; also: "Is 1.7 × 10^10 unknowns the largest finite element system that can be solved today?", SuperComputing, Nov. 2005.


47

So what?

With scalable algorithms, well implemented, we can do (scalar) PDEs with
• > 10 million unknowns on a desktop
• > 300 billion unknowns on a TOP-50 class machine (HLRB-II, 63 TFlop/s peak, 40 TByte memory)

In the future, we will be able to do
• around 2010: ≈ 5 trillion unknowns on a PetaScale machine (assuming 1 PByte of memory)
• around 2012-2015: ≈ 50 trillion unknowns on a machine delivering a petaflop for real applications (assuming 10 PByte of memory)

This is e.g. sufficient to resolve all of the earth's atmosphere with a
• 10 km grid resolution (current desktop)
• 250 m mesh (current supercomputer)
• 100 m mesh (peak-petascale system in 2010?)
• 50 m mesh (application-petascale system in 2015?)

This is a building block for many other applications


48

Programming techniques

Seemingly conflicting goals:
• Portability/flexibility: the code should run on a variety of (parallel) target platforms, including PC clusters, NUMA machines, etc.
• Efficiency: the code should run as efficiently as possible on each target platform

How can this conflict be resolved?


49

Part IV

Multicore Architectures


50

The STI Cell Processor
A hybrid multicore processor based on the IBM Power architecture
• (simplified) PowerPC core
  - runs the operating system
  - controls execution of programs
• multiple co-processors (8; on the Sony PS3 only 6 are available)
  - operate on fast, private on-chip memory
  - optimized for computation
  - a DMA controller copies data from/to main memory
• multi-buffering can hide main memory latencies completely for streaming-like applications
• loading local copies has low and known latencies
• memory with multiple channels and links can be exploited if many memory transactions are in flight


51

The STI Cell Broadband Engine


52

Cell LBM Simulations

Goal: demanding (flow) simulations at moderate cost but very fast, e.g. the simulation of blood flow in an aneurysm for therapy and surgery planning

Available Cell systems:
• blades
• Playstation 3


53

Synergistic Processor Unit
"a very small computer of its own"
• 128 all-purpose 128-bit registers
• operates on 256 kB of Local Store (LS)
  - nearly all operations are SIMD; a single scalar operation is more expensive than a SIMD one
  - only loads and stores of 16 naturally aligned bytes from/to the LS
• 25.6 GFlop/s (single-precision fused multiply-add); rounding by truncation only - fast double precision will be available soon
• no dynamic branch prediction, only hints in software, but around 20 cycles branch-miss penalty
• no system calls or privileged operations


54

Memory Flow Controller
• communication interface (to the PPE and other SPEs)
  - mailboxes and signal notification
  - memory mapping of Local Store and register file, used by the PPU to upload programs and control the SPU
• asynchronous data transfers (DMA)
  - LS <-> main memory, other LSes, or devices
  - 16 DMAs in flight
  - list transfers possible (scatter/gather)
  - only naturally aligned transfers of 1, 2, 4, 8, or n·16 bytes
  - usually multiple transfers on multiple MFCs are necessary to saturate main memory bandwidth
• all interaction with the SPU goes through the channel interface


55

Programming the Cell BE
• the hard way
  - control SPEs using management libraries
  - issue DMAs by language extensions
  - do address calculations manually
  - exchange main memory addresses, array sizes, etc.
  - synchronization using mailboxes, signals, or libraries
• frameworks
  - Accelerated Library Framework (ALF) and Data, Communication, and Synchronization (DaCS) by IBM
  - RapidMind SDK
• accelerated libraries
• single-source compiler
  - IBM's xlc-cbe-sse is in alpha stage, uses OpenMP


56

Naive SPU implementation: A[] = A[]*c

volatile vector float ls_buffer[8] __attribute__((aligned(128)));

void scale( unsigned long long gs_buffer,  // main memory address of vector
            int number_of_chunks,          // number of chunks of 32 floats
            float factor )                 // scaling factor
{
    vector float v_fac = spu_splats(factor);           // create SIMD vector with all
                                                       // four elements being factor
    for ( int i = 0 ; i < number_of_chunks ; ++i ) {
        mfc_get( ls_buffer, gs_buffer, 128, 0, 0, 0 ); // DMA reading i-th chunk
        mfc_write_tag_mask( 1 << 0 );                  // wait for DMA...
        mfc_read_tag_status_all();                     // ...to complete

        for ( int j = 0 ; j < 8 ; ++j )
            ls_buffer[j] = spu_mul( ls_buffer[j], v_fac ); // scale local copy using SIMD

        mfc_put( ls_buffer, gs_buffer, 128, 0, 0, 0 ); // DMA writing i-th chunk
        mfc_write_tag_mask( 1 << 0 );                  // wait for DMA...
        mfc_read_tag_status_all();                     // ...to complete

        gs_buffer += 128;                              // incr. global store pointer
    }
}


57

Remove latencies using multi-buffering

volatile vector float ls_buffer[3][8] __attribute__((aligned(128)));
...
mfc_get( ls_buffer[0], gs_buffer, 128, 0, 0, 0 );     // request first chunk

for (int i = 0; i < number_of_chunks; ++i) {
    int cur  = ( i ) % 3;   // buffer no. and DMA tag for i-th chunk
    int next = (i+1) % 3;   //   "    for (i-2)-th and (i+1)-th chunk

    if (i < number_of_chunks-1) {
        mfc_write_tag_mask( 1 << next );              // make sure the (i-2)-th chunk...
        mfc_read_tag_status_all();                    // ...has been stored
        mfc_get( ls_buffer[next], gs_buffer+128, 128, next, 0, 0 ); // request (i+1)-th chunk
    }

    mfc_write_tag_mask( 1 << cur );                   // wait until i-th chunk...
    mfc_read_tag_status_all();                        // ...is available

    for (int j = 0; j < 8; ++j)
        ls_buffer[cur][j] = spu_mul( ls_buffer[cur][j], v_fac );

    mfc_put( ls_buffer[cur], gs_buffer, 128, cur, 0, 0 ); // store i-th chunk
    gs_buffer += 128;
}

mfc_write_tag_mask( 1 | 2 | 4 );  // wait for any...
mfc_read_tag_status_all();        // outstanding DMA


58

Part V

Case study: Lattice Boltzmann Methods for Flow Simulation on the Play Station


59

Example: OMP-parallel Flow Animation
Resolution: 880×880×336; 260M cells, 6.5M active on average


60

Simulation of Metal Foams

Joint work with C. Körner, WTM Erlangen


61

Aneurysms
• Aneurysms are local dilatations of blood vessels
• Localized mostly at large arteries in soft tissue (e.g. aorta, brain vessels)
• Can be diagnosed by modern imaging techniques (e.g. MRT, DSA)
• Can be treated e.g. by clipping or coiling


62

A data structure for simulating flow in blood vessels

• In a brain geometry, only about 3-10% of the nodes are fluid
• We use a domain decomposition into equally sized blocks, so-called patches, and only allocate patches containing fluid cells
• The memory requirements and the computational time are thereby reduced significantly
• For the Cell processor we use patches of size 8×8×8, fitting into the SPU Local Store
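A minimal Python sketch of this patch allocation (the geometry, sizes, and names are invented for illustration): split the bounding box into 8×8×8 patches and allocate storage only for patches that contain at least one fluid cell.

```python
P = 8  # patch edge length (8x8x8 fits the SPU Local Store)

def build_patches(is_fluid, nx, ny, nz):
    """Allocate a P^3 array only for patches that contain fluid cells."""
    patches = {}
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                if is_fluid(x, y, z):
                    key = (x // P, y // P, z // P)
                    patch = patches.setdefault(key, [0.0] * (P * P * P))
                    # mark the fluid cell inside its patch
                    patch[(x % P) * P * P + (y % P) * P + (z % P)] = 1.0
    return patches

# toy "vessel": a thin tube along x, so only a few percent of cells are fluid
nx = ny = nz = 64
tube = lambda x, y, z: (y - 32) ** 2 + (z - 32) ** 2 < 16
patches = build_patches(tube, nx, ny, nz)
full = (nx // P) * (ny // P) * (nz // P)   # patches a dense grid would need
print(len(patches), "of", full, "patches allocated")
```

Only the patches intersecting the tube are allocated; the dense bounding box would need 512 patches, the sparse structure a small fraction of that.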


63

Results

Velocity near the wall in an aneurysm

Oscillatory shear stress near the wall in an aneurysm


64

Pulsating Blood Flow in an Aneurysm

[Diagram: the data set and workflow - a collaboration between Neuro-Radiology (Prof. Dörfler, Dr. Richter), providing the imaging, and Computer Science, providing the simulation (CFD).]


65

LBM Optimized for Cell
• memory layout
  - optimized for DMA transfers
  - information propagating between patches is reordered on the SPE and stored sequentially in memory for simple and fast exchange
• code optimization
  - kernels hand-optimized in assembly code
  - SIMD-vectorized streaming and collision
  - branch-free handling of bounce-back boundary conditions
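The branch-free bounce-back idea can be illustrated with a toy 1-D streaming step: instead of an if per cell, blend the streamed value and the reflected value with a 0/1 obstacle mask - the pattern that SIMD select instructions implement. This is an illustrative sketch, not the actual hand-vectorized SPE kernel:

```python
def stream_branched(f_east, f_west, solid):
    """East-moving populations after streaming, with bounce-back at walls."""
    n = len(f_east)
    out = [0.0] * n
    for i in range(n):
        src = i - 1                      # periodic via Python's negative index
        if solid[src]:
            out[i] = f_west[i]           # wall: reflect the opposite direction
        else:
            out[i] = f_east[src]         # fluid: plain streaming
    return out

def stream_branchfree(f_east, f_west, solid):
    """Same result, but a 0/1 mask blend instead of a branch (SIMD-friendly)."""
    n = len(f_east)
    out = [0.0] * n
    for i in range(n):
        src = i - 1
        m = solid[src]                   # 0 or 1, no branch
        out[i] = m * f_west[i] + (1 - m) * f_east[src]
    return out

f_e = [0.1, 0.2, 0.3, 0.4, 0.5]
f_w = [0.5, 0.4, 0.3, 0.2, 0.1]
solid = [1, 0, 0, 0, 1]                  # walls at both ends
assert stream_branchfree(f_e, f_w, solid) == stream_branched(f_e, f_w, solid)
```

With the mask in {0, 1}, the blend is exact, so the branch-free kernel is bitwise identical to the branched one while keeping the SPE pipeline free of branch misses.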


66

Performance Results

[Bar chart: MLUP/s for Xeon 5160, PPE, and SPE*, each with straightforward C code vs. SIMD-optimized assembly; scale 0-50, values shown: 2.0, 4.8, 10.4, and 49.0 (the SPE with SIMD assembly).
*measured on Local Store, without DMA transfers]


67

Performance Results

[Chart: performance over the number of SPEs used (1 to 6): 42, 81, 93, 94, 94, 95 - scaling flattens beyond three SPEs.]


68

Performance Results

[Bar chart: MLUP/s for Xeon 5160* and Playstation 3, one core vs. one full CPU; scale 0-50, values shown: 9.1, 11.7, 21.1, and 43.8.
*performance-optimized code by LB-DC]


Other work: LBM on Graphics Hardware

See also: work by Jonas Tölke and M. Krafczyk at TU Braunschweig
Master's thesis by J. Habich (co-supervised jointly with G. Wellein, RRZE Erlangen)

nVidia GeForce 8800 GTX (G80 processor): up to 250 fluid MLUP/s - after careful tuning!

69


Multigrid on the Cell Processor

Master's thesis by Daniel Ritter: A Fast Multigrid Solver for Molecular Dynamics on the Cell Broadband Engine
• Performance limited by the available memory bandwidth
• Local Store too small (?) for blocking techniques

70


71

Part VI

Conclusions


72

What have we learned?

• The future is parallel, on multi-core CPUs
• Memory bandwidth per core will be a severe bottleneck ("inverse Moore's law")
• Programming current leading-edge multi-core architectures to exploit their performance potential requires expert knowledge of the architecture
  - better tool and system support is needed, given the complexity of the architectures


73

An HPC Tutorial! Getting Supercomputer Performance is Easy!

• If parallel efficiency is bad, choose a slower serial algorithm: it is probably easier to parallelize, and it will make your speedups look much more impressive
• Introduce the "CrunchMe" variable for getting high Flops rates; advanced method: disguise CrunchMe by using an inefficient (but compute-intensive) algorithm from the start
• Introduce the "HitMe" variable to get good cache hit rates; advanced version: disguise HitMe within "clever data structures" that introduce a lot of overhead
• Never cite "time-to-solution": who cares whether you solve a real-life problem anyway; it is the MachoFlops that interest the people who pay for your research
• Never waste your time trying to use a complicated algorithm in parallel (such as multigrid): the more primitive the algorithm, the easier it is to maximize your MachoFlops


74

Talk is Over

Questions?