Parallel Computing Laboratory
EECS (Electrical Engineering and Computer Sciences), Berkeley Par Lab
Tuning Stencils
Kaushik Datta
Microsoft Site Visit
April 29, 2008
Stencil Code Overview
For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself)
A stencil code updates every point in a regular grid with a constant weighted subset of its neighbors (“applying a stencil”)
[Figures: a 2D stencil and a 3D stencil]
Stencil Applications
- Stencils are critical to many scientific applications: diffusion, electromagnetics, computational fluid dynamics
- Both uniform and adaptive block-structured meshes
- Many types of stencils: 1D, 2D, 3D meshes; varying numbers of neighbors (5-point, 7-point, 9-point, 27-point, ...); Gauss-Seidel (update in place) vs. Jacobi iterations (2 meshes)
- Varying boundary conditions (constant vs. periodic)
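The Gauss-Seidel/Jacobi distinction above can be made concrete with a toy 1D, 3-point averaging stencil (an illustrative sketch, not code from the talk):

```c
/* Jacobi: all reads come from the old mesh, so two meshes are needed
 * and the update order does not matter. */
void jacobi_sweep(const double *old_u, double *new_u, int n) {
    for (int i = 1; i < n - 1; i++)
        new_u[i] = (old_u[i - 1] + old_u[i] + old_u[i + 1]) / 3.0;
}

/* Gauss-Seidel: the update happens in place, so later points already
 * see newly computed neighbor values. */
void gauss_seidel_sweep(double *u, int n) {
    for (int i = 1; i < n - 1; i++)
        u[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
}
```

Starting from the same data, the two sweeps diverge as soon as a point reads an already-updated neighbor, which is why Jacobi needs the second mesh.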
Naïve Stencil Code
void stencil3d(double A[], double B[], int nx, int ny, int nz) {
  // S0, S1: constant stencil weights (defined elsewhere)
  for (int i = 1; i < nx - 1; i++)          // x dimension
    for (int j = 1; j < ny - 1; j++)        // y dimension
      for (int k = 1; k < nz - 1; k++) {    // z dimension (unit stride)
        int c = (i * ny + j) * nz + k;      // index of the center point
        B[c] = S0 * A[c]
             + S1 * (A[c - ny*nz] + A[c + ny*nz]   // top, bottom
                   + A[c - nz]    + A[c + nz]      // left, right
                   + A[c - 1]     + A[c + 1]);     // front, back
      }
}
Our Stencil Code
- Executes a 3D, 7-point Jacobi iteration on a 256^3 grid
- Performs 8 flops (6 adds, 2 multiplies) per point
- Parallelization performed with pthreads
- Thread affinity: multithreading first, then multicore, then multisocket
- Flop:byte ratio: 0.33 (write-allocate architectures), 0.5 (ideal)
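The two ratios above follow from simple byte counting; the check below is a sketch of that arithmetic (8 flops per point, 8-byte doubles):

```c
#include <math.h>

/* Per point: 8 flops. Compulsory traffic: read A (8 B) + write B (8 B) = 16 B.
 * A write-allocate cache also reads B's line before the store, adding 8 B. */
double flop_byte_ratio(double flops, double bytes) {
    return flops / bytes;
}
```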
Cache-Based Architectures
[Architecture diagrams: Intel Clovertown, AMD Barcelona, Sun Victoria Falls. Legible diagram labels: Opteron cores with 512KB victim caches, 2MB shared quasi-victim cache (32-way), SRI/crossbar, 2x64b memory controllers, 667MHz DDR2 DIMMs, 10.66 GB/s]
Autotuning
- Provides a portable and effective method for tuning
- Limiting the search space:
  - Searching the entire space is intractable
  - Instead, we ordered the optimizations appropriately for a given platform
  - To find the best parameters for a given optimization, we performed an exhaustive search
  - Each optimization was applied on top of all previous optimizations
  - In general, heuristics/models can also be used to prune the search space
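The search strategy above can be sketched as a greedy, ordered loop; the Optimization struct, toy_benchmark objective, and parameter lists below are illustrative assumptions, not the authors' harness:

```c
/* Tune optimizations one at a time, in a fixed order, each searched
 * exhaustively on top of the winners already chosen (a greedy search,
 * not a full cross-product of the space). */
typedef struct {
    const char *name;
    const int *candidates;   /* parameter values to try exhaustively */
    int ncandidates;
} Optimization;

void tune_in_order(const Optimization *opts, int nopts,
                   double (*benchmark)(const int *chosen, int n),
                   int *chosen) {
    for (int i = 0; i < nopts; i++) {
        chosen[i] = opts[i].candidates[0];
        double best = benchmark(chosen, i + 1);
        int winner = chosen[i];
        for (int c = 1; c < opts[i].ncandidates; c++) {
            chosen[i] = opts[i].candidates[c];       /* candidate under test  */
            double perf = benchmark(chosen, i + 1);  /* earlier opts stay set */
            if (perf > best) { best = perf; winner = chosen[i]; }
        }
        chosen[i] = winner;                          /* freeze the winner     */
    }
}

/* Toy stand-in for a timed stencil run: pretends the best value of every
 * parameter is 4, just to exercise the search. */
double toy_benchmark(const int *chosen, int n) {
    double score = 0.0;
    for (int i = 0; i < n; i++)
        score -= (chosen[i] - 4) * (chosen[i] - 4);
    return score;
}
```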
Naive Code
- Naïve code is a simple, threaded stencil kernel
- Domain partitioning performed only in the least contiguous dimension
- No optimizations or tuning performed
[Figure: grid with axes x, y, z (unit-stride), partitioned among threads]
Naïve
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive]
NUMA-Aware
[Architecture diagrams for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls, repeated from the Cache-Based Architectures slide]
- Exploited the "first-touch" page mapping policy on NUMA architectures
- Due to our affinity policy, the benefit is only seen when both sockets are used
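A sketch of how first-touch is typically exploited (illustrative; the partitioning and names are assumptions): each thread initializes exactly the sub-grid it will later update, so the OS maps those pages to that thread's socket.

```c
#include <stddef.h>

/* Under the first-touch policy, a page is placed in the memory nearest the
 * thread that first writes it.  Running this initializer in parallel, with
 * each pthread given its own [begin, end) partition of the grid, leaves the
 * data local to the socket that will later compute on it. */
typedef struct {
    double *grid;
    size_t begin, end;   /* this thread's partition */
} FirstTouchArg;

void *first_touch_init(void *p) {
    FirstTouchArg *a = (FirstTouchArg *)p;
    for (size_t i = a->begin; i < a->end; i++)
        a->grid[i] = 0.0;            /* first write maps the page locally */
    return (void *)0;
}
```

The `void *(*)(void *)` signature lets the same routine be passed directly to `pthread_create`, one thread per partition.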
NUMA-Aware
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware]
Loop Unrolling/Reordering
- Allows better use of registers and functional units
- Best inner loop chosen by iterating many times over a grid size that fits into the L1 cache (x86 machines) or L2 cache (Victoria Falls); this should eliminate any effects from the memory subsystem
- This optimization is independent of the later memory optimizations
Loop Unrolling/Reordering
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering]
Padding
[Figure: grid with axes x, y, z (unit-stride), showing the padding amount added in the unit-stride dimension]
- Used to reduce conflict misses and DRAM bank conflicts
- Drawback: larger memory footprint
- Performed a search to determine the best padding amount
- Only padded in the unit-stride dimension
Padding
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding]
Thread/Cache Blocking
[Figure: grid with axes x, y, z (unit-stride), partitioned into thread blocks (4 in x, 2 in y, 2 in z), each cut into 2 cache blocks in y]
- Performed an exhaustive search over all possible power-of-two parameter values
- Every thread block is the same size and shape; preserves load balancing
- Did NOT cut the contiguous dimension on x86 machines; avoids interrupting hardware prefetchers
- Only performed cache blocking in one dimension; sufficient to fit three read planes and one write plane into cache
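The blocking scheme just described might look as follows on a toy 8^3 grid (TY and the weights are arbitrary example values): only the y loop is tiled, and the unit-stride z loop is left uncut.

```c
#define NX 8
#define NY 8
#define NZ 8
#define TY 4   /* cache-block size in y: a tunable, value arbitrary here */

#define IDX(i, j, k) (((i) * NY + (j)) * NZ + (k))   /* z is unit-stride */

void blocked_sweep(const double *A, double *B, double s0, double s1) {
    for (int jj = 1; jj < NY - 1; jj += TY) {        /* y cache blocks only */
        int jmax = (jj + TY < NY - 1) ? jj + TY : NY - 1;
        for (int i = 1; i < NX - 1; i++)
            for (int j = jj; j < jmax; j++)
                for (int k = 1; k < NZ - 1; k++)     /* contiguous dim uncut */
                    B[IDX(i, j, k)] = s0 * A[IDX(i, j, k)]
                        + s1 * (A[IDX(i - 1, j, k)] + A[IDX(i + 1, j, k)]
                              + A[IDX(i, j - 1, k)] + A[IDX(i, j + 1, k)]
                              + A[IDX(i, j, k - 1)] + A[IDX(i, j, k + 1)]);
    }
}
```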
Thread/Cache Blocking
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking]
Software Prefetching
- Allows us to hide memory latency
- Searched over varying prefetch distances and granularities (e.g., prefetch every register block, plane, or pencil)
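The idea can be sketched on a 1D stencil with GCC/Clang's __builtin_prefetch; the distance pf_dist is exactly the kind of parameter the search varies (the value used in the test below is arbitrary):

```c
/* Prefetch a fixed distance ahead of the read stream.  __builtin_prefetch is
 * a GCC/Clang builtin and only a hint; the address is clamped here so it
 * always stays within the array.  Arguments: address, rw=0 (read), locality. */
void sweep_with_prefetch(const double *A, double *B, int n,
                         double s0, double s1, int pf_dist) {
    for (int k = 1; k < n - 1; k++) {
        int ahead = (k + pf_dist < n) ? k + pf_dist : n - 1;
        __builtin_prefetch(&A[ahead], 0, 0);
        B[k] = s0 * A[k] + s1 * (A[k - 1] + A[k + 1]);
    }
}
```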
Software Prefetching
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching]
SIMDization
- Requires a complete code rewrite to utilize the 128-bit SSE registers
- A single instruction adds/multiplies two doubles
- Only possible on the x86 machines
- Padding performed to achieve proper data alignment (not to avoid conflicts)
- Searched over register block sizes and prefetch distances simultaneously
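A minimal illustration of the instruction-level idea (not the authors' rewritten kernel): SSE2 intrinsics operate on a pair of doubles per instruction. This assumes an x86 compiler and an even n.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* out[i] = a*x[i] + y[i], two doubles per add/multiply instruction.
 * Unaligned loads/stores are used here; the padding-for-alignment mentioned
 * above is what would permit the faster aligned forms (_mm_load_pd). */
void saxpy_pairs(const double *x, const double *y, double *out, int n, double a) {
    __m128d va = _mm_set1_pd(a);
    for (int i = 0; i < n; i += 2) {
        __m128d vx = _mm_loadu_pd(&x[i]);
        __m128d vy = _mm_loadu_pd(&y[i]);
        _mm_storeu_pd(&out[i], _mm_add_pd(_mm_mul_pd(va, vx), vy));
    }
}
```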
SIMDization
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching + SIMDization]
Cache Bypass
- Writes data directly to the write-back buffer; no data load on a write miss
- Changes the stencil kernel's flop:byte ratio from 1/3 to 1/2; reduces memory data traffic by 33%
- Still requires the SIMDized code from the previous optimization
- Searched over register block sizes and prefetch distances simultaneously
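The mechanism behind this optimization can be sketched with SSE2 non-temporal stores (an illustration of the technique, not the tuned kernel; x86 only, n even, dst 16-byte aligned):

```c
#include <emmintrin.h>

/* _mm_stream_pd writes straight through the write-combining buffers, so the
 * destination cache line is never read in on a write miss: this is the 33%
 * traffic reduction described above.  The sfence orders the streamed stores
 * so other threads observe the data before anything written afterwards. */
void stream_store(const double *src, double *dst, int n) {
    for (int i = 0; i < n; i += 2)
        _mm_stream_pd(&dst[i], _mm_loadu_pd(&src[i]));
    _mm_sfence();
}
```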
Cache Bypass
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching + SIMDization + Cache Bypass]
Collaborative Threading
[Figure: thread decomposition without vs. with collaboration; axes x, y, z (unit-stride). Without collaboration: thread blocks 4 in x, 2 in y, 2 in z, each with 2 cache blocks in y. With collaboration: large collaborative thread blocks (4 in y, 2 in z), each worked on jointly by threads t0 through t7, subdivided into small collaborative thread blocks (2 in y, 4 in z)]
- Requires another complete code rewrite
- Collaborative threading (CT) allows better L1 cache utilization when switching threads
- Only effective on Victoria Falls due to: a very small L1 cache (8 KB) shared by 8 hardware threads; the lack of hardware prefetchers (which allows us to cut the contiguous dimension)
- Drawback: the parameter space becomes very large
Collaborative Threading
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching + SIMDization + Cache Bypass + Collaborative Threading]
Autotuning Results
[Charts: GFlop/s vs. thread count with all optimizations stacked, Naive through Collaborative Threading. Best tuned vs. naive: Intel Clovertown 1.9x better, AMD Barcelona 5.4x better, Sun Victoria Falls 10.4x better]
Architecture Comparison
Peak stencil performance (total GFlop/s):

                    Single precision   Double precision
    Clovertown            4.5                2.5
    Barcelona              16                  8
    Victoria Falls         12                  7
    Cell                   34                 14
    G80                    53                  -
    G80/PCIe                3                  -

[Charts: power efficiency (MFlop/s/Watt), chip power efficiency and system power efficiency, for Clovertown, Barcelona, Victoria Falls, Cell, and G80 in single and double precision; individual bar values not legible]
Conclusions
- Compilers alone fail to fully utilize system resources; programmers may not even know that the system is being underutilized
- Autotuning provides a portable and effective solution; it produces up to a 10.4x improvement over the compiler alone
- To make autotuning tractable: choose the order of optimizations appropriately for the platform; prune the search space intelligently for large searches
- Power efficiency has become a valuable metric; local-store-based architectures (e.g., Cell and G80) are usually more efficient than cache-based machines
Acknowledgements
- Sam Williams, for writing the Cell stencil code and for guiding my work by autotuning SpMV and LBMHD
- Vasily Volkov, for writing the G80 CUDA code
- Kathy Yelick and Jim Demmel, for general advice and feedback