Parallel Computing Laboratory
EECS (Electrical Engineering and Computer Sciences), Berkeley Par Lab
Tuning Stencils
Kaushik Datta
Microsoft Site Visit
April 29, 2008
Stencil Code Overview
For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself)
A stencil code updates every point in a regular grid with a constant weighted subset of its neighbors (“applying a stencil”)
[Figures: a 2D stencil and a 3D stencil]
Stencil Applications
- Stencils are critical to many scientific applications: diffusion, electromagnetics, computational fluid dynamics
- Both uniform and adaptive block-structured meshes
- Many types of stencils: 1D, 2D, 3D meshes; varying numbers of neighbors (5-point, 7-point, 9-point, 27-point, ...); Gauss-Seidel (update in place) vs. Jacobi iterations (2 meshes)
- Varying boundary conditions (constant vs. periodic)
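The Gauss-Seidel/Jacobi distinction above can be made concrete with a toy 1D, 3-point averaging stencil (an illustrative sketch, not code from the talk):

```c
/* Jacobi: all reads come from the old mesh, so two meshes are needed
 * and the update order does not matter. */
void jacobi_sweep(const double *old_u, double *new_u, int n) {
    for (int i = 1; i < n - 1; i++)
        new_u[i] = (old_u[i - 1] + old_u[i] + old_u[i + 1]) / 3.0;
}

/* Gauss-Seidel: the update happens in place, so later points already
 * see newly computed neighbor values. */
void gauss_seidel_sweep(double *u, int n) {
    for (int i = 1; i < n - 1; i++)
        u[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;
}
```

Starting from the same data, the two sweeps diverge as soon as a point reads an already-updated neighbor, which is why Jacobi needs the second mesh.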
Naïve Stencil Code
void stencil3d(double A[], double B[], int nx, int ny, int nz) {
  // S0, S1: constant stencil weights (defined elsewhere)
  for (int i = 1; i < nx - 1; i++)          // x dimension
    for (int j = 1; j < ny - 1; j++)        // y dimension
      for (int k = 1; k < nz - 1; k++) {    // z dimension (unit stride)
        int c = (i * ny + j) * nz + k;      // index of the center point
        B[c] = S0 * A[c]
             + S1 * (A[c - ny*nz] + A[c + ny*nz]   // top, bottom
                   + A[c - nz]    + A[c + nz]      // left, right
                   + A[c - 1]     + A[c + 1]);     // front, back
      }
}
Our Stencil Code
- Executes a 3D, 7-point Jacobi iteration on a 256^3 grid
- Performs 8 flops (6 adds, 2 multiplies) per point
- Parallelization performed with pthreads
- Thread affinity: multithreading first, then multicore, then multisocket
- Flop:byte ratio: 0.33 (write-allocate architectures), 0.5 (ideal)
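The two ratios above follow from simple byte counting; the check below is a sketch of that arithmetic (8 flops per point, 8-byte doubles):

```c
#include <math.h>

/* Per point: 8 flops. Compulsory traffic: read A (8 B) + write B (8 B) = 16 B.
 * A write-allocate cache also reads B's line before the store, adding 8 B. */
double flop_byte_ratio(double flops, double bytes) {
    return flops / bytes;
}
```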
Cache-Based Architectures
[Architecture diagrams: Intel Clovertown, AMD Barcelona, Sun Victoria Falls. Legible diagram labels: Opteron cores with 512KB victim caches, 2MB shared quasi-victim cache (32-way), SRI/crossbar, 2x64b memory controllers, 667MHz DDR2 DIMMs, 10.66 GB/s]
Autotuning
- Provides a portable and effective method for tuning
- Limiting the search space:
  - Searching the entire space is intractable
  - Instead, we ordered the optimizations appropriately for a given platform
  - To find the best parameters for a given optimization, we performed an exhaustive search
  - Each optimization was applied on top of all previous optimizations
  - In general, heuristics/models can also be used to prune the search space
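The search strategy above can be sketched as a greedy, ordered loop; the Optimization struct, toy_benchmark objective, and parameter lists below are illustrative assumptions, not the authors' harness:

```c
/* Tune optimizations one at a time, in a fixed order, each searched
 * exhaustively on top of the winners already chosen (a greedy search,
 * not a full cross-product of the space). */
typedef struct {
    const char *name;
    const int *candidates;   /* parameter values to try exhaustively */
    int ncandidates;
} Optimization;

void tune_in_order(const Optimization *opts, int nopts,
                   double (*benchmark)(const int *chosen, int n),
                   int *chosen) {
    for (int i = 0; i < nopts; i++) {
        chosen[i] = opts[i].candidates[0];
        double best = benchmark(chosen, i + 1);
        int winner = chosen[i];
        for (int c = 1; c < opts[i].ncandidates; c++) {
            chosen[i] = opts[i].candidates[c];       /* candidate under test  */
            double perf = benchmark(chosen, i + 1);  /* earlier opts stay set */
            if (perf > best) { best = perf; winner = chosen[i]; }
        }
        chosen[i] = winner;                          /* freeze the winner     */
    }
}

/* Toy stand-in for a timed stencil run: pretends the best value of every
 * parameter is 4, just to exercise the search. */
double toy_benchmark(const int *chosen, int n) {
    double score = 0.0;
    for (int i = 0; i < n; i++)
        score -= (chosen[i] - 4) * (chosen[i] - 4);
    return score;
}
```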
Naive Code
- Naïve code is a simple, threaded stencil kernel
- Domain partitioning performed only in the least contiguous dimension
- No optimizations or tuning performed
[Figure: grid with axes x, y, z (unit-stride), partitioned among threads]
Naïve
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive]
NUMA-Aware
[Architecture diagrams for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls, repeated from the Cache-Based Architectures slide]
- Exploited the "first-touch" page mapping policy on NUMA architectures
- Due to our affinity policy, the benefit is only seen when both sockets are used
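A sketch of how first-touch is typically exploited (illustrative; the partitioning and names are assumptions): each thread initializes exactly the sub-grid it will later update, so the OS maps those pages to that thread's socket.

```c
#include <stddef.h>

/* Under the first-touch policy, a page is placed in the memory nearest the
 * thread that first writes it.  Running this initializer in parallel, with
 * each pthread given its own [begin, end) partition of the grid, leaves the
 * data local to the socket that will later compute on it. */
typedef struct {
    double *grid;
    size_t begin, end;   /* this thread's partition */
} FirstTouchArg;

void *first_touch_init(void *p) {
    FirstTouchArg *a = (FirstTouchArg *)p;
    for (size_t i = a->begin; i < a->end; i++)
        a->grid[i] = 0.0;            /* first write maps the page locally */
    return (void *)0;
}
```

The `void *(*)(void *)` signature lets the same routine be passed directly to `pthread_create`, one thread per partition.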
NUMA-Aware
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware]
Loop Unrolling/Reordering
- Allows better use of registers and functional units
- Best inner loop chosen by iterating many times over a grid size that fits into the L1 cache (x86 machines) or L2 cache (Victoria Falls); this should eliminate any effects from the memory subsystem
- This optimization is independent of the later memory optimizations
Loop Unrolling/Reordering
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering]
Padding
[Figure: grid with axes x, y, z (unit-stride), showing the padding amount added in the unit-stride dimension]
- Used to reduce conflict misses and DRAM bank conflicts
- Drawback: larger memory footprint
- Performed a search to determine the best padding amount
- Only padded in the unit-stride dimension
Padding
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding]
Thread/Cache Blocking
[Figure: grid with axes x, y, z (unit-stride), partitioned into thread blocks (4 in x, 2 in y, 2 in z), each cut into 2 cache blocks in y]
- Performed an exhaustive search over all possible power-of-two parameter values
- Every thread block is the same size and shape; preserves load balancing
- Did NOT cut the contiguous dimension on x86 machines; avoids interrupting hardware prefetchers
- Only performed cache blocking in one dimension; sufficient to fit three read planes and one write plane into cache
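The blocking scheme just described might look as follows on a toy 8^3 grid (TY and the weights are arbitrary example values): only the y loop is tiled, and the unit-stride z loop is left uncut.

```c
#define NX 8
#define NY 8
#define NZ 8
#define TY 4   /* cache-block size in y: a tunable, value arbitrary here */

#define IDX(i, j, k) (((i) * NY + (j)) * NZ + (k))   /* z is unit-stride */

void blocked_sweep(const double *A, double *B, double s0, double s1) {
    for (int jj = 1; jj < NY - 1; jj += TY) {        /* y cache blocks only */
        int jmax = (jj + TY < NY - 1) ? jj + TY : NY - 1;
        for (int i = 1; i < NX - 1; i++)
            for (int j = jj; j < jmax; j++)
                for (int k = 1; k < NZ - 1; k++)     /* contiguous dim uncut */
                    B[IDX(i, j, k)] = s0 * A[IDX(i, j, k)]
                        + s1 * (A[IDX(i - 1, j, k)] + A[IDX(i + 1, j, k)]
                              + A[IDX(i, j - 1, k)] + A[IDX(i, j + 1, k)]
                              + A[IDX(i, j, k - 1)] + A[IDX(i, j, k + 1)]);
    }
}
```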
Thread/Cache Blocking
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking]
Software Prefetching
- Allows us to hide memory latency
- Searched over varying prefetch distances and granularities (e.g., prefetch every register block, plane, or pencil)
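The idea can be sketched on a 1D stencil with GCC/Clang's __builtin_prefetch; the distance pf_dist is exactly the kind of parameter the search varies (the value used in the test below is arbitrary):

```c
/* Prefetch a fixed distance ahead of the read stream.  __builtin_prefetch is
 * a GCC/Clang builtin and only a hint; the address is clamped here so it
 * always stays within the array.  Arguments: address, rw=0 (read), locality. */
void sweep_with_prefetch(const double *A, double *B, int n,
                         double s0, double s1, int pf_dist) {
    for (int k = 1; k < n - 1; k++) {
        int ahead = (k + pf_dist < n) ? k + pf_dist : n - 1;
        __builtin_prefetch(&A[ahead], 0, 0);
        B[k] = s0 * A[k] + s1 * (A[k - 1] + A[k + 1]);
    }
}
```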
Software Prefetching
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching]
SIMDization
- Requires a complete code rewrite to utilize the 128-bit SSE registers
- A single instruction adds/multiplies two doubles
- Only possible on the x86 machines
- Padding performed to achieve proper data alignment (not to avoid conflicts)
- Searched over register block sizes and prefetch distances simultaneously
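A minimal illustration of the instruction-level idea (not the authors' rewritten kernel): SSE2 intrinsics operate on a pair of doubles per instruction. This assumes an x86 compiler and an even n.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* out[i] = a*x[i] + y[i], two doubles per add/multiply instruction.
 * Unaligned loads/stores are used here; the padding-for-alignment mentioned
 * above is what would permit the faster aligned forms (_mm_load_pd). */
void saxpy_pairs(const double *x, const double *y, double *out, int n, double a) {
    __m128d va = _mm_set1_pd(a);
    for (int i = 0; i < n; i += 2) {
        __m128d vx = _mm_loadu_pd(&x[i]);
        __m128d vy = _mm_loadu_pd(&y[i]);
        _mm_storeu_pd(&out[i], _mm_add_pd(_mm_mul_pd(va, vx), vy));
    }
}
```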
SIMDization
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching + SIMDization]
Cache Bypass
- Writes data directly to the write-back buffer; no data load on a write miss
- Changes the stencil kernel's flop:byte ratio from 1/3 to 1/2; reduces memory data traffic by 33%
- Still requires the SIMDized code from the previous optimization
- Searched over register block sizes and prefetch distances simultaneously
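The mechanism behind this optimization can be sketched with SSE2 non-temporal stores (an illustration of the technique, not the tuned kernel; x86 only, n even, dst 16-byte aligned):

```c
#include <emmintrin.h>

/* _mm_stream_pd writes straight through the write-combining buffers, so the
 * destination cache line is never read in on a write miss: this is the 33%
 * traffic reduction described above.  The sfence orders the streamed stores
 * so other threads observe the data before anything written afterwards. */
void stream_store(const double *src, double *dst, int n) {
    for (int i = 0; i < n; i += 2)
        _mm_stream_pd(&dst[i], _mm_loadu_pd(&src[i]));
    _mm_sfence();
}
```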
Cache Bypass
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching + SIMDization + Cache Bypass]
Collaborative Threading
[Figure: thread decomposition without vs. with collaboration; axes x, y, z (unit-stride). Without collaboration: thread blocks 4 in x, 2 in y, 2 in z, each with 2 cache blocks in y. With collaboration: large collaborative thread blocks (4 in y, 2 in z), each worked on jointly by threads t0 through t7, subdivided into small collaborative thread blocks (2 in y, 4 in z)]
- Requires another complete code rewrite
- Collaborative threading (CT) allows better L1 cache utilization when switching threads
- Only effective on Victoria Falls due to: a very small L1 cache (8 KB) shared by 8 hardware threads; the lack of hardware prefetchers (which allows us to cut the contiguous dimension)
- Drawback: the parameter space becomes very large
Collaborative Threading
[Charts: GFlop/s (0 to 9) vs. thread count (1, 2, 4, 8 on Intel Clovertown and AMD Barcelona; 1, 2, 4, 8, 16 on Sun Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching + SIMDization + Cache Bypass + Collaborative Threading]
Autotuning Results
[Charts: GFlop/s vs. thread count with all optimizations stacked, Naive through Collaborative Threading. Best tuned vs. naive: Intel Clovertown 1.9x better, AMD Barcelona 5.4x better, Sun Victoria Falls 10.4x better]
Architecture Comparison
Peak stencil performance (total GFlop/s):

                    Single precision   Double precision
    Clovertown            4.5                2.5
    Barcelona              16                  8
    Victoria Falls         12                  7
    Cell                   34                 14
    G80                    53                  -
    G80/PCIe                3                  -

[Charts: power efficiency (MFlop/s/Watt), chip power efficiency and system power efficiency, for Clovertown, Barcelona, Victoria Falls, Cell, and G80 in single and double precision; individual bar values not legible]
Conclusions
- Compilers alone fail to fully utilize system resources; programmers may not even know that the system is being underutilized
- Autotuning provides a portable and effective solution; it produces up to a 10.4x improvement over the compiler alone
- To make autotuning tractable: choose the order of optimizations appropriately for the platform; prune the search space intelligently for large searches
- Power efficiency has become a valuable metric; local-store-based architectures (e.g., Cell and G80) are usually more efficient than cache-based machines
Acknowledgements
- Sam Williams, for writing the Cell stencil code and for guiding my work by autotuning SpMV and LBMHD
- Vasily Volkov, for writing the G80 CUDA code
- Kathy Yelick and Jim Demmel, for general advice and feedback