1

Hardware Aware Programming - Exploiting the memory hierarchy and parallel multicore processors
Lehrstuhl für Informatik 10 (Systemsimulation), Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
Canberra, July 2008
U. Rüde (LSS Erlangen, [email protected]), joint work with J. Götz, M. Stürmer, K. Iglberger, S. Donath, C. Feichtinger, T. Preclik, T. Gradl, C. Freundl, H. Köstler, T. Pohl, D. Ritter, D. Bartuschat, P. Neumann, G. Wellein, G. Hager, T. Zeiser, J. Habich (RRZE), N. Thürey (ETH Zürich)



2

Overview
• Towards PetaScale and Beyond
• Optimizing Memory Access and Cache-Aware Programming
• Massively Parallel Multigrid: Performance Results
• MultiCore Architectures
• Case study: Lattice Boltzmann Methods for Flow Simulation on the PlayStation
• Conclusions


3

Part I

Towards PetaScale and Beyond


4

HHG Motivation I: Structured vs. Unstructured Grids

[Bar chart (on Hitachi SR 8000): MFlop/s, scale 0-8000, for JDS (sparse) vs. stencil (structured) storage at problem sizes 729; 4,913; 35,937; 274,625; and 2,146,689 unknowns.]

MFlop/s rates for matrix-vector multiplication on one node: structured versus sparse matrix. Many emerging architectures have similar properties (Cell, GPU).

Extinct Dinosaur HLRB-I: Hitachi SR 8000
No. 5 in the TOP 500 in 2000, 2 TFlop/s


5

HHG Motivation II: DiMe Project

Started jointly with Linda Stals in 1996 in Augsburg!
Cache optimizations for sparse matrix/stencil codes (1996-2007)
Efficient hardware-optimized
• multigrid solvers
• lattice Boltzmann CFD
  - with free surface flow
  - with fluid-structure interaction

www10.informatik.uni-erlangen.de/de/Research/Projects/DiME/

DiMe: Data-Local Iterative Methods for the Efficient Solution of Partial Differential Equations (1996-2007)


6

Evolution of Semiconductor Technology

The ITRS (International Technology Roadmap for Semiconductors) collects trends in semiconductor technology.
See http://www.itrs.net/reports.html


7

Where does Computer Architecture Go?Computer architects have capitulated: It may not be possible anymore to exploit progress in semiconductor technology for automatic performance improvements

Even today a single core CPU is a highly parallel system:superscalar execution, complex pipeline, ... and additional tricksInternal parallelism is a major reason for the performance increases until now, but ... There is a limited amount of parallelism that can be exploited automatically

Multi-core systems concede the architects' defeat: even with ever more transistors, architects can no longer build faster single-core CPUs, and clock rates increase only slowly (due to power considerations).

Therefore architects have started to put several cores on a chip:programmers must use them directly


8

What are the consequences?

For the application developers “the free lunch is over”

Without explicitly parallel algorithms, the performance potential cannot be used any more

For HPC: CPUs will have 2, 4, 8, 16, ..., 128, ..., ??? cores - maybe sooner than we are ready for it. We will have to deal with systems with millions of cores.


9

Memory access as a major bottleneck

25+ years ago - Telefunken TR440: 16,000 words
Memory fills a rack with 8 × 20 drawers, each holding 100 data cards: one rack of about 0.8 m × 2 m

Today - HLRB-II (Altix 4700): 5 × 10^12 words
The same packaging would fill a rack 2 m high and reaching roughly from the earth to the moon,
or, better organized, a rack system
• 500 m wide (500 rows of racks)
• 2 m high (20 drawers with 100 cards each)
• 500 km long (5,000,000 columns)


10

Part II

Optimizing Memory Access andCache-Aware Programming


11

Increasing single-CPU performance by optimizing data locality

Caches work due to the locality of memory accesses (instructions + data)

(Numerically intensive) codes should exhibit:
• Spatial locality: data items accessed within a short time period are located close to each other in memory
• Temporal locality: data that has been accessed recently is likely to be accessed again in the near future

Goal: increase spatial and temporal locality in order to enhance cache utilization (cache-aware programming)
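The effect of spatial locality can be made concrete with a toy cache model. The sketch below counts misses of a direct-mapped cache for row-wise vs. column-wise traversal of a row-major array; all parameters (line size, cache size, array size) are illustrative, not any real CPU's:

```python
LINE = 64            # cache line size in bytes (illustrative)
NLINES = 512         # number of lines: a 32 KiB direct-mapped cache
WORD = 8             # element size in bytes (double)

def misses(n, order):
    """Count cache misses for an n x n row-major array traversal."""
    cache = [None] * NLINES              # tag (line number) per cache slot
    miss = 0
    for a in range(n):
        for b in range(n):
            i, j = (a, b) if order == "rows" else (b, a)
            addr = (i * n + j) * WORD    # row-major address of element (i, j)
            line = addr // LINE
            slot = line % NLINES         # direct-mapped placement
            if cache[slot] != line:
                cache[slot] = line
                miss += 1
    return miss

n = 256
print(misses(n, "rows"))   # unit stride: one miss per cache line
print(misses(n, "cols"))   # stride n: one miss per element
```

Unit-stride traversal misses only once per 64-byte line (one miss per eight doubles), while the strided traversal misses on every single access.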


12

Cache performance optimizations

• Data layout optimizations: change the data layout in memory to enhance spatial locality
• Data access optimizations: change the order of data accesses to enhance spatial and temporal locality

These transformations preserve numerical results, and their introduction can (theoretically) be automated!


13

Data access optimizations: Loop fusion

Example: red/black Gauss-Seidel iteration in 2D


14

Data access optimizations: Loop fusion (cont'd)

Code before applying the loop fusion technique (standard implementation with efficient loop ordering; Fortran semantics: column-major order, so the first index j runs fastest):

for it = 1 to numIter do
  // Red nodes
  for i = 1 to n-1 do
    for j = 1+(i+1)%2 to n-1 by 2 do
      relax(u(j,i))
    end for
  end for


15

Data access optimizations: Loop fusion (cont'd)

  // Black nodes
  for i = 1 to n-1 do
    for j = 1+i%2 to n-1 by 2 do
      relax(u(j,i))
    end for
  end for
end for

This requires two sweeps through the whole data set per single GS iteration!


16

Data access optimizations: Loop fusion (cont'd)

How the fusion technique works: [illustration not reproduced]


17

Data access optimizations: Loop fusion (cont'd)

Code after applying the loop fusion technique:

for it = 1 to numIter do
  // Update red nodes in first grid row
  for j = 1 to n-1 by 2 do
    relax(u(j,1))
  end for


18

Data access optimizations: Loop fusion (cont'd)

  // Update red and black nodes in pairs
  for i = 2 to n-1 do
    for j = 1+(i+1)%2 to n-1 by 2 do
      relax(u(j,i))
      relax(u(j,i-1))
    end for
  end for


19

Data access optimizations: Loop fusion (cont'd)

  // Update black nodes in last grid row
  for j = 2 to n-1 by 2 do
    relax(u(j,n-1))
  end for
end for

The solution vector u passes through the cache only once instead of twice per GS iteration!
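That the fused sweep is numerically identical to the two separate sweeps can be checked directly. Below is a small Python sketch of both variants (5-point stencil, u(j,i) written as u[i][j]); the grid size, right-hand side, and initial guess are made up for the test:

```python
def relax(u, f, h, i, j):
    # Gauss-Seidel update for the 5-point Laplacian
    u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]
                      + h * h * f[i][j])

def gs_rb(u, f, h, iters):
    """Two full sweeps per iteration: all red nodes, then all black nodes."""
    n = len(u) - 1
    for _ in range(iters):
        for i in range(1, n):                        # red: (i + j) even
            for j in range(1 + (i + 1) % 2, n, 2):
                relax(u, f, h, i, j)
        for i in range(1, n):                        # black: (i + j) odd
            for j in range(1 + i % 2, n, 2):
                relax(u, f, h, i, j)

def gs_rb_fused(u, f, h, iters):
    """One fused pass: red row i and black row i-1 updated in pairs."""
    n = len(u) - 1                                   # n even, as in the slides
    for _ in range(iters):
        for j in range(1, n, 2):                     # red nodes in first row
            relax(u, f, h, 1, j)
        for i in range(2, n):                        # red/black pairs
            for j in range(1 + (i + 1) % 2, n, 2):
                relax(u, f, h, i, j)                 # red   (i,   j)
                relax(u, f, h, i - 1, j)             # black (i-1, j)
        for j in range(2, n, 2):                     # black nodes in last row
            relax(u, f, h, n - 1, j)

n, h = 8, 1.0 / 8
f = [[1.0] * (n + 1) for _ in range(n + 1)]
ua = [[0.01 * i * j for j in range(n + 1)] for i in range(n + 1)]
ub = [row[:] for row in ua]
gs_rb(ua, f, h, 3)
gs_rb_fused(ub, f, h, 3)
assert ua == ub    # bitwise identical: fusion only reorders the passes
```

The black node (i-1, j) sees exactly the same neighbor values in both orderings - all of its red neighbors are already updated when it is relaxed - so the two variants agree bitwise.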


20

Data access optimizations: Loop split

The inverse transformation of loop fusion: divide the work of one loop into two loops to make the loop body less complicated, and thereby
• leverage compiler optimizations
• enhance instruction cache utilization


21

Data access optimizations: Loop blocking

Loop blocking = loop tiling
• Divide the data set into subsets (blocks) that are small enough to fit in cache
• Perform as much work as possible on the data in cache before moving on to the next block
• This is not always easy to accomplish because of data dependencies
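As a generic illustration of the idea (loop tiling in general, not the red/black Gauss-Seidel scheme of the following slides), here is a blocked matrix multiplication in Python; the matrices and block size are chosen arbitrarily:

```python
def matmul_naive(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):          # i-k-j order: unit-stride inner loop
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]
    return C

def matmul_blocked(A, B, bs):
    """Same arithmetic, but on bs x bs tiles that can stay in cache."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 24
A = [[(i * n + j) % 7 * 0.5 for j in range(n)] for i in range(n)]
B = [[(i - j) * 0.25 for j in range(n)] for i in range(n)]
assert matmul_blocked(A, B, 8) == matmul_naive(A, B)
```

Because the k-loop order per output element is unchanged, the blocked version produces bitwise-identical results; only the traversal order over the data - and hence the cache reuse - differs.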


22

Data access optimizations: Loop blocking

Example: 1D blocking for red/black GS - respect the data dependencies!


23

Data access optimizations: Loop blocking

Code after applying the 1D blocking technique (B = number of GS iterations to be blocked/combined):

for it = 1 to numIter/B do
  // Special handling: rows 1, …, 2B-1
  // Not shown here …


24

Data access optimizations: Loop blocking

  // Inner part of the 2D grid
  for k = 2*B to n-1 do
    for i = k to k-2*B+1 by -2 do
      for j = 1+(k+1)%2 to n-1 by 2 do
        relax(u(j,i))
        relax(u(j,i-1))
      end for
    end for
  end for


25

Data access optimizations: Loop blocking

  // Special handling: rows n-2B+1, …, n-1
  // Not shown here …
end for

Result: data is loaded into the cache only once per B Gauss-Seidel iterations, provided that 2*B+2 grid rows fit in the cache simultaneously. If grid rows are too large, 2D blocking can be applied.


26

Data access optimizations: Loop blocking

More complicated blocking schemes exist. Illustration: 2D square blocking


27

Part III

Towards Scalable FE Software


28

Multigrid: V-Cycle

[Diagram: one V-cycle - relax, compute residual, restrict, solve on the coarser grid (by recursion), interpolate, correct, relax.]

Goal: solve A_h u_h = f_h using a hierarchy of grids
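The cycle can be sketched in a few lines for a 1-D model problem (-u'' = f on (0,1) with homogeneous Dirichlet boundaries). This is a minimal illustration with weighted Jacobi smoothing, full-weighting restriction, and linear interpolation - not the HHG implementation; all parameters are illustrative:

```python
import numpy as np

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / h**2
    return r

def smooth(u, f, h, sweeps, w=2.0 / 3.0):
    for _ in range(sweeps):                 # weighted Jacobi: u += w D^-1 r
        u = u + w * (h**2 / 2) * residual(u, f, h)
    return u

def restrict(r):                            # full weighting, fine -> coarse
    return np.concatenate(
        ([0.0], 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2], [0.0]))

def prolong(e):                             # linear interpolation, coarse -> fine
    m = len(e) - 1
    u = np.zeros(2 * m + 1)
    u[::2] = e
    u[1::2] = 0.5 * (e[:-1] + e[1:])
    return u

def v_cycle(u, f, h, nu=2):
    if len(u) == 3:                         # coarsest grid: one unknown, solve directly
        u[1] = f[1] * h**2 / 2
        return u
    u = smooth(u, f, h, nu)                 # pre-smoothing (relax)
    rc = restrict(residual(u, f, h))        # residual + restriction
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h, nu)   # coarse solve by recursion
    u = u + prolong(ec)                     # interpolation + correction
    return smooth(u, f, h, nu)              # post-smoothing

n = 64
x = np.linspace(0.0, 1.0, n + 1)
h = 1.0 / n
f = np.pi**2 * np.sin(np.pi * x)            # exact solution: sin(pi x)
u = np.zeros(n + 1)
for _ in range(10):
    u = v_cycle(u, f, h)
```

Each V-cycle reduces the residual by a roughly h-independent factor, which is what makes multigrid an O(N) solver.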


29

Cache-optimized multigrid: the DiMEPACK library

DFG project DiME: data-local iterative methods
• Fast algorithm + fast implementation
• Correction scheme: V-cycles, FMG
• Rectangular domains
• Constant 5-/9-point stencils
• Dirichlet/Neumann boundary conditions
http://www10.informatik.uni-erlangen.de/dime


30

V(2,2) cycle - bottom line

Mflops | For what
13     | Standard 5-pt. operator
56     | Cache optimized (loop orderings, data merging, simple blocking)
150    | Constant coeff. + skewed blocking + padding
220    | Eliminating rhs if 0 everywhere but boundary


31

Parallel High Performance FE Multigrid

Parallelize "plain vanilla" multigrid:
• partition the domain
• parallelize all operations on all grids
• use clever data structures
Do not worry (so much) about coarse grids:
• idle processors?
• short messages?
• sequential dependency in the grid hierarchy?

Why we do not use conventional domain decomposition:
• DD without a coarse grid does not scale (algorithmically) and is suboptimal for large problems / many processors
• DD with coarse grids may be as efficient as multigrid but is as difficult to parallelize (the difficulty is in parallelizing the coarse grid)


32

Hierarchical Hybrid Grids (HHG)
• Unstructured input grid: resolves the geometry of the problem domain
• Patch-wise regular refinement: generates nested grid hierarchies naturally suitable for geometric multigrid algorithms
• New: modify storage formats and operations on the grid to exploit the regular substructures

Does an unstructured grid with 1,000,000,000,000 elements make sense?

HHG - Ultimate Parallel FE Performance!


33

HHG refinement example

Input Grid


34

HHG Refinement example

Refinement Level one


35

HHG Refinement example

Refinement Level Two


36

HHG Refinement example

Structured Interior


37

HHG Refinement example

Structured Interior


38

HHG Refinement example

Edge Interior


39

HHG Refinement example

Edge Interior


40

Parallel HHG Framework - Design Goals

To realize good parallel scalability:
• Minimize latency by reducing the number of messages that must be sent
• Optimize for high-bandwidth interconnects ⇒ large messages
• Avoid local copying into MPI buffers
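The first goal - few large messages instead of many small ones - amounts to packing all halo fragments bound for one neighbor into a single buffer. A Python sketch of such packing (the framing format and names are invented for illustration; HHG's actual buffers look different):

```python
import struct

def pack(fragments):
    """fragments: list of float lists -> one length-prefixed bytes buffer."""
    buf = struct.pack("<I", len(fragments))
    for frag in fragments:
        buf += struct.pack("<I", len(frag))            # fragment length
        buf += struct.pack(f"<{len(frag)}d", *frag)    # fragment payload
    return buf

def unpack(buf):
    (count,), off = struct.unpack_from("<I", buf), 4
    frags = []
    for _ in range(count):
        (n,) = struct.unpack_from("<I", buf, off); off += 4
        frags.append(list(struct.unpack_from(f"<{n}d", buf, off))); off += 8 * n
    return frags

halos = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]   # per-edge/vertex halo data
assert unpack(pack(halos)) == halos             # one message instead of three
```

One send per neighbor then replaces one send per edge/vertex fragment, trading a little packing work for far fewer message latencies.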


41

HHG for Parallelization

Use regular HHG patches for partitioning the domain


42

HHG Parallel Update Algorithm

for each vertex do
  apply operation to vertex                  // update vertex primary dependencies
end for

for each edge do
  copy from vertex interior
  apply operation to edge
  copy to vertex halo                        // update edge primary dependencies
end for

for each element do
  copy from edge/vertex interiors
  apply operation to element
  copy to edge/vertex halos                  // update secondary dependencies
end for


43

Towards Scalable FE Software

Performance Results


44

Node Performance is Difficult! (B. Gropp)
DiMe project: cache-aware multigrid (1996-...)

grid size     17³    33³    65³    129³   257³   513³
standard      1072   1344   715    677    490    579
no blocking   2445   1417   995    1065   849    819
2x blocking   2400   1913   1312   1319   1284   1282
3x blocking   2420   2389   2167   2140   2134   2049

Performance of a 3D multigrid smoother for the 7-point stencil, in Mflops, on a 1.4 GHz Itanium:
• array padding
• temporal blocking - in EPIC assembly language
• software pipelining in the extreme (M. Stürmer - J. Treibig)

Node Performance is Possible!


45

Single Processor HHG Performance on Itanium forRelaxation of a Tetrahedral Finite Element Mesh


46

Parallel scalability of a scalar elliptic problem discretized by tetrahedral finite elements. Times for 12 V(2,2) cycles on SGI Altix (Itanium 2, 1.6 GHz):

#Proc   #unkn. × 10^6   Ph. 1: sec   Ph. 2: sec   Time to sol.
4       134.2           3.16         6.38*        37.9
8       268.4           3.27         6.67*        39.3
16      536.9           3.35         6.75*        40.3
32      1,073.7         3.38         6.80*        40.6
64      2,147.5         3.53         4.92         42.3
128     4,295.0         3.60         7.06*        43.2
252     8,455.7         3.87         7.39*        46.4
504     16,911.4        3.96         5.44         47.6
2040    68,451.0        4.92         5.60         59.0
3825    128,345.7       6.90                      82.8
4080    136,902.0       5.68
6102    205,353.1       6.33
8152    273,535.7       7.43*
9170    307,694.1       7.75*

Largest problem solved to date: 3.07 × 10^11 DOFs on 9170 processors, 7.8 s per V(2,2) cycle.

B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: ISC Award 2006; also: "Is 1.7 × 10^10 unknowns the largest finite element system that can be solved today?", SuperComputing, Nov. 2005.


47

So what?

With scalable algorithms, well implemented, we can do (scalar) PDEs with
• > 10 million unknowns on a desktop
• > 300 billion unknowns on a TOP-50 class machine (HLRB-II, 63 TFlop/s peak, 40 TByte memory)

In the future, we will be able to do
• around 2010: ≈ 5 trillion unknowns on a PetaScale machine (assuming 1 PByte of memory)
• around 2012-2015: ≈ 50 trillion unknowns on a machine delivering a petaflop for real applications (assuming 10 PByte of memory)

This is e.g. sufficient to resolve all of the earth's atmosphere with a
• 10 km grid resolution (current desktop)
• 250 m mesh (current supercomputer)
• 100 m mesh (peak-petascale system in 2010?)
• 50 m mesh (application-petascale system in 2015?)

This is a building block for many other applications


48

Programming techniques

Seemingly conflicting goals:
• Portability/flexibility: the code should run on a variety of (parallel) target platforms, including PC clusters, NUMA machines, etc.
• Efficiency: the code should run as efficiently as possible on each target platform

How can this conflict be resolved?


49

Part IV

Multicore Architectures


50

The STI Cell Processor
A hybrid multicore processor based on the IBM Power architecture
• (simplified) PowerPC core
  - runs the operating system
  - controls execution of programs
• multiple co-processors (8; on the Sony PS3 only 6 are available)
  - operate on fast, private on-chip memory
  - optimized for computation
  - a DMA controller copies data from/to main memory
• multi-buffering can hide main memory latencies completely for streaming-like applications
• loading local copies has low and known latencies
• memory with multiple channels and links can be exploited if many memory transactions are in flight


51

The STI Cell Broadband Engine


52

Cell LBM Simulations

Goal: demanding (flow) simulations at moderate cost but very fast, e.g. the simulation of blood flow in an aneurysm for therapy and surgery planning

Available Cell systems:
• blades
• Playstation 3


53

Synergistic Processor Unit
"a very small computer of its own"
• 128 all-purpose 128-bit registers
• operates on 256 kB of Local Store (LS)
  - nearly all operations are SIMD; a single scalar operation is more expensive than a SIMD one
  - only loads and stores of 16 naturally aligned bytes from/to the LS
• 25.6 GFlop/s (single-precision fused multiply-add); rounding by truncation only - fast double precision will be available soon
• no dynamic branch prediction, only hints in software, but around 20 cycles branch-miss penalty
• no system calls or privileged operations


54

Memory Flow Controller
• communication interface (to the PPE and other SPEs)
  - mailboxes and signal notification
  - memory mapping of Local Store and register file, used by the PPU to upload programs and control the SPU
• asynchronous data transfers (DMA)
  - LS <-> main memory, other LSes, or devices
  - 16 DMAs in flight
  - list transfers possible (scatter/gather)
  - only naturally aligned transfers of 1, 2, 4, 8, or n·16 bytes
  - usually multiple transfers on multiple MFCs are necessary to saturate main memory bandwidth
• all interaction with the SPU goes through the channel interface


55

Programming the Cell BE
• the hard way
  - control SPEs using management libraries
  - issue DMAs by language extensions
  - do address calculations manually
  - exchange main memory addresses, array sizes, etc.
  - synchronization using mailboxes, signals, or libraries
• frameworks
  - Accelerated Library Framework (ALF) and Data, Communication, and Synchronization (DaCS) by IBM
  - RapidMind SDK
• accelerated libraries
• single-source compiler
  - IBM's xlc-cbe-sse is in alpha stage, uses OpenMP


56

Naive SPU implementation: A[] = A[]*c

volatile vector float ls_buffer[8] __attribute__((aligned(128)));

void scale( unsigned long long gs_buffer,  // main memory address of vector
            int number_of_chunks,          // number of chunks of 32 floats
            float factor )                 // scaling factor
{
    vector float v_fac = spu_splats(factor);           // create SIMD vector with all
                                                       // four elements being factor
    for ( int i = 0 ; i < number_of_chunks ; ++i ) {
        mfc_get( ls_buffer, gs_buffer, 128, 0, 0, 0 ); // DMA reading i-th chunk
        mfc_write_tag_mask( 1 << 0 );                  // wait for DMA...
        mfc_read_tag_status_all();                     // ...to complete

        for ( int j = 0 ; j < 8 ; ++j )
            ls_buffer[j] = spu_mul( ls_buffer[j], v_fac ); // scale local copy using SIMD

        mfc_put( ls_buffer, gs_buffer, 128, 0, 0, 0 ); // DMA writing i-th chunk
        mfc_write_tag_mask( 1 << 0 );                  // wait for DMA...
        mfc_read_tag_status_all();                     // ...to complete

        gs_buffer += 128;                              // incr. global store pointer
    }
}


57

Remove latencies using multi-buffering

volatile vector float ls_buffer[3][8] __attribute__((aligned(128)));
...
mfc_get( ls_buffer[0], gs_buffer, 128, 0, 0, 0 );     // request first chunk

for (int i = 0; i < number_of_chunks; ++i) {
    int cur  = ( i ) % 3;   // buffer no. and DMA tag for i-th chunk
    int next = (i+1) % 3;   //   "    for (i-2)-th and (i+1)-th chunk

    if (i < number_of_chunks-1) {
        mfc_write_tag_mask( 1 << next );              // make sure the (i-2)-th chunk...
        mfc_read_tag_status_all();                    // ...has been stored
        mfc_get( ls_buffer[next], gs_buffer+128, 128, next, 0, 0 ); // request (i+1)-th chunk
    }

    mfc_write_tag_mask( 1 << cur );                   // wait until i-th chunk...
    mfc_read_tag_status_all();                        // ...is available

    for (int j = 0; j < 8; ++j)
        ls_buffer[cur][j] = spu_mul( ls_buffer[cur][j], v_fac );

    mfc_put( ls_buffer[cur], gs_buffer, 128, cur, 0, 0 ); // store i-th chunk
    gs_buffer += 128;
}

mfc_write_tag_mask( 1 | 2 | 4 );  // wait for any...
mfc_read_tag_status_all();        // outstanding DMA


58

Part V

Case study: Lattice Boltzmann Methods for Flow Simulation on the Play Station


59

Example: OMP-parallel Flow Animation
Resolution: 880×880×336; 260M cells, 6.5M active on average


60

Simulation of Metal Foams

Joint work with C. Körner, WTM Erlangen


61

Aneurysms
• Aneurysms are local dilatations of blood vessels
• Localized mostly at large arteries in soft tissue (e.g. aorta, brain vessels)
• Can be diagnosed by modern imaging techniques (e.g. MRT, DSA)
• Can be treated e.g. by clipping or coiling


62

A data structure for simulating flow in blood vessels

• In a brain geometry, only about 3-10% of the nodes are fluid
• We use a domain decomposition into equally sized blocks, so-called patches, and only allocate patches containing fluid cells
• The memory requirements and the computational time are thereby reduced significantly
• For the Cell processor we use patches of size 8×8×8, fitting into the SPU Local Store
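A minimal Python sketch of this patch allocation (the geometry, sizes, and names are invented for illustration): split the bounding box into 8×8×8 patches and allocate storage only for patches that contain at least one fluid cell.

```python
P = 8  # patch edge length (8x8x8 fits the SPU Local Store)

def build_patches(is_fluid, nx, ny, nz):
    """Allocate a P^3 array only for patches that contain fluid cells."""
    patches = {}
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                if is_fluid(x, y, z):
                    key = (x // P, y // P, z // P)
                    patch = patches.setdefault(key, [0.0] * (P * P * P))
                    # mark the fluid cell inside its patch
                    patch[(x % P) * P * P + (y % P) * P + (z % P)] = 1.0
    return patches

# toy "vessel": a thin tube along x, so only a few percent of cells are fluid
nx = ny = nz = 64
tube = lambda x, y, z: (y - 32) ** 2 + (z - 32) ** 2 < 16
patches = build_patches(tube, nx, ny, nz)
full = (nx // P) * (ny // P) * (nz // P)   # patches a dense grid would need
print(len(patches), "of", full, "patches allocated")
```

Only the patches intersecting the tube are allocated; the dense bounding box would need 512 patches, the sparse structure a small fraction of that.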


63

Results

Velocity near the wall in an aneurysm

Oscillatory shear stress near the wall in an aneurysm


64

Pulsating Blood Flow in an Aneurysm

[Diagram: the data set and workflow - a collaboration between Neuro-Radiology (Prof. Dörfler, Dr. Richter), providing the imaging, and Computer Science, providing the simulation (CFD).]


65

LBM Optimized for Cell
• memory layout
  - optimized for DMA transfers
  - information propagating between patches is reordered on the SPE and stored sequentially in memory for simple and fast exchange
• code optimization
  - kernels hand-optimized in assembly code
  - SIMD-vectorized streaming and collision
  - branch-free handling of bounce-back boundary conditions
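The branch-free bounce-back idea can be illustrated with a toy 1-D streaming step: instead of an if per cell, blend the streamed value and the reflected value with a 0/1 obstacle mask - the pattern that SIMD select instructions implement. This is an illustrative sketch, not the actual hand-vectorized SPE kernel:

```python
def stream_branched(f_east, f_west, solid):
    """East-moving populations after streaming, with bounce-back at walls."""
    n = len(f_east)
    out = [0.0] * n
    for i in range(n):
        src = i - 1                      # periodic via Python's negative index
        if solid[src]:
            out[i] = f_west[i]           # wall: reflect the opposite direction
        else:
            out[i] = f_east[src]         # fluid: plain streaming
    return out

def stream_branchfree(f_east, f_west, solid):
    """Same result, but a 0/1 mask blend instead of a branch (SIMD-friendly)."""
    n = len(f_east)
    out = [0.0] * n
    for i in range(n):
        src = i - 1
        m = solid[src]                   # 0 or 1, no branch
        out[i] = m * f_west[i] + (1 - m) * f_east[src]
    return out

f_e = [0.1, 0.2, 0.3, 0.4, 0.5]
f_w = [0.5, 0.4, 0.3, 0.2, 0.1]
solid = [1, 0, 0, 0, 1]                  # walls at both ends
assert stream_branchfree(f_e, f_w, solid) == stream_branched(f_e, f_w, solid)
```

With the mask in {0, 1}, the blend is exact, so the branch-free kernel is bitwise identical to the branched one while keeping the SPE pipeline free of branch misses.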


66

Performance Results

[Bar chart: MLUP/s for Xeon 5160, PPE, and SPE*, each with straightforward C code vs. SIMD-optimized assembly; scale 0-50, values shown: 2.0, 4.8, 10.4, and 49.0 (the SPE with SIMD assembly).
*measured on Local Store, without DMA transfers]


67

Performance Results

[Chart: performance over the number of SPEs used (1 to 6): 42, 81, 93, 94, 94, 95 - scaling flattens beyond three SPEs.]


68

Performance Results

[Bar chart: MLUP/s for Xeon 5160* and Playstation 3, one core vs. one full CPU; scale 0-50, values shown: 9.1, 11.7, 21.1, and 43.8.
*performance-optimized code by LB-DC]


Other work: LBM on Graphics Hardware

See also: work by Jonas Tölke and M. Krafczyk at TU Braunschweig
Master's thesis by J. Habich (co-supervised jointly with G. Wellein, RRZE Erlangen)

nVidia GeForce 8800 GTX (G80 processor): up to 250 fluid MLUP/s - after careful tuning!

69


Multigrid on the Cell Processor

Master's thesis by Daniel Ritter: A Fast Multigrid Solver for Molecular Dynamics on the Cell Broadband Engine
• Performance limited by the available memory bandwidth
• Local Store too small (?) for blocking techniques

70


71

Part VI

Conclusions


72

What have we learned?

• The future is parallel, on multi-core CPUs
• Memory bandwidth per core will be a severe bottleneck ("inverse Moore's law")
• Programming current leading-edge multi-core architectures to exploit their performance potential requires expert knowledge of the architecture
  - better tool and system support is needed, given the complexity of the architectures


73

An HPC Tutorial! Getting Supercomputer Performance is Easy!

• If parallel efficiency is bad, choose a slower serial algorithm: it is probably easier to parallelize, and it will make your speedups look much more impressive
• Introduce the "CrunchMe" variable for getting high Flops rates; advanced method: disguise CrunchMe by using an inefficient (but compute-intensive) algorithm from the start
• Introduce the "HitMe" variable to get good cache hit rates; advanced version: disguise HitMe within "clever data structures" that introduce a lot of overhead
• Never cite "time-to-solution": who cares whether you solve a real-life problem anyway; it is the MachoFlops that interest the people who pay for your research
• Never waste your time trying to use a complicated algorithm in parallel (such as multigrid): the more primitive the algorithm, the easier it is to maximize your MachoFlops


74

Talk is Over

Questions?