

18-04-23

Challenge the future

Delft University of Technology

Evaluating Multi-Core Processors for Data-Intensive Kernels

Alexander van Amesfoort
Delft University of Technology
[email protected]

joint work with Ana Varbanescu, Rob van Nieuwpoort, and Henk Sips


Outline

• Data-intensive applications
• Gridding
• Platforms
• Implementation strategies
• Measurements
• Guidelines
• Conclusion


Data-Intensive Applications

• Low Arithmetic Intensity (comp : comm)
  - drastic application- and platform-specific effort

• Difficult platform choice

• Memory wall is still getting bigger
• Data-intensive study worthwhile

• Provide guidelines and insight into performance behavior and effort


Radio Astronomy Imaging

• Gridding places irregularly spaced samples on a regular grid

• (De)gridding consumes most of the time in imaging
• Use gridding as an HPC streaming kernel


Gridding (W Projection)

• Unpredictable, sparse access patterns

• Low AI (0.33)

    forall (i = 0..N_freq; j = 0..N_samples) {   // for all samples
        g_index = func1((u, v, w)[j], freq[i]);
        c_index = func2((u, v, w)[j], freq[i]);
        for (x = 0; x < SUPPORT; x++)            // sweep the convolution kernel
            G[g_index + x] += C[c_index + x] * V[i, j];
    }

• Parameterize these properties
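The loop above can be made concrete with a minimal Python sketch. The functions `grid_index` and `conv_index` stand in for `func1` and `func2`, which the slides do not define; their formulas, the 1D grid, and the value of `SUPPORT` are hypothetical. The helper `arithmetic_intensity` shows one plausible accounting that reproduces the quoted AI of 0.33, assuming single-precision complex values (8 bytes each), 8 flops per complex multiply-add, and three memory accesses per inner iteration (read C, read G, write G).

```python
SUPPORT = 4  # hypothetical width of the convolution kernel sweep


def grid_index(uvw, freq, grid_size):
    """Hypothetical stand-in for func1: map a sample to a grid offset."""
    u, v, _w = uvw
    return int(abs(u + v) * freq) % (grid_size - SUPPORT + 1)


def conv_index(uvw, freq, conv_size):
    """Hypothetical stand-in for func2: pick a convolution kernel offset."""
    _u, _v, w = uvw
    return int(abs(w) * freq) % (conv_size - SUPPORT + 1)


def grid_samples(uvw, freqs, V, C, grid_size):
    """Accumulate visibilities V[i][j] onto a 1D grid G, as in the slide loop."""
    G = [0j] * grid_size
    for i, freq in enumerate(freqs):          # for all frequencies
        for j, coord in enumerate(uvw):       # for all samples
            g = grid_index(coord, freq, grid_size)
            c = conv_index(coord, freq, len(C))
            for x in range(SUPPORT):          # sweep the convolution kernel
                G[g + x] += C[c + x] * V[i][j]
    return G


def arithmetic_intensity(flops=8, accesses=3, bytes_per_value=8):
    """Flops per byte moved for one inner-loop iteration (assumed accounting)."""
    return flops / (accesses * bytes_per_value)
```

Because `g` and `c` depend on the sample coordinates, consecutive iterations touch unrelated parts of `G` and `C`, which is the irregular access pattern the slide refers to.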


Platforms and Test Setup

• Platforms provide high Flop/Byte ratios

Platform           Cores    Clock (GHz)   Local Mem (kB)                 Compute (GFlop/s)   Flop/Byte Ratio
Dual Xeon 5320     2x4      1.86          32 + 32 L1, 2 x 4096 L2         59.5               2.8
Core i7 920        4 (HT)   2.66          32 + 32 L1, 256 L2, 8192 L3     85.2               2.7
PS3 Cell           1+6      3.20          256                            153.6               6.0
QS21 Cell          2+16     3.20          256                            409.6               8.0
Geforce 8800 GTX   16       1.35          16+8+8                         345.0               4.0
Geforce GTX 280    30       1.30          16+8+8                         936.0               6.6
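To relate the table to the gridding kernel's AI of 0.33, a rough bandwidth-bound estimate may help. The sketch below rests on two assumptions that the slide does not state: that the Flop/Byte column is peak compute divided by peak memory bandwidth, and that a roofline-style bound (attainable = min(peak, AI * bandwidth)) applies. The function name is illustrative.

```python
# (peak GFlop/s, Flop/Byte ratio) taken from the platform table
PLATFORMS = {
    "Dual Xeon 5320": (59.5, 2.8),
    "Core i7 920": (85.2, 2.7),
    "PS3 Cell": (153.6, 6.0),
    "QS21 Cell": (409.6, 8.0),
    "Geforce 8800 GTX": (345.0, 4.0),
    "Geforce GTX 280": (936.0, 6.6),
}

KERNEL_AI = 0.33  # arithmetic intensity of the gridding kernel (from the slides)


def attainable_gflops(peak_gflops, flop_byte_ratio, ai=KERNEL_AI):
    """Roofline-style upper bound, assuming ratio = peak compute / bandwidth."""
    bandwidth_gbs = peak_gflops / flop_byte_ratio
    return min(peak_gflops, ai * bandwidth_gbs)


for name, (peak, ratio) in PLATFORMS.items():
    bound = attainable_gflops(peak, ratio)
    print(f"{name:18s} {bound:6.1f} GFlop/s ({100 * bound / peak:4.1f}% of peak)")
```

Under these assumptions every listed platform is bandwidth-bound for this kernel, at roughly 4% to 12% of its peak compute rate.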


Implementation Strategies

• CPU (pthreads)
  - replicated grid, master-worker queues, SIMD

• Cell/B.E. (Cell SDK)
  - master-worker queues, SIMD, double buffering, PPE multi-threading, line reuse

• GPU (CUDA)
  - replicated grid, 1D texturing of the convolution matrix

• Similar at a high level
  - but different, non-portable code
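The shared high-level structure (a master filling a job queue, workers accumulating into replicated grids, and a final reduction) can be sketched as follows. This is an illustration of the pattern only, not the authors' pthreads, Cell SDK, or CUDA code; the function names, the `(grid_index, value)` sample format, and the default `n_workers` and `job_size` values are assumptions. Python threads will not show a real speedup here; the point is the structure.

```python
import queue
import threading


def worker(jobs, private_grid):
    """Drain the job queue, accumulating into this worker's own grid copy."""
    while True:
        job = jobs.get()
        if job is None:              # sentinel: no more work
            break
        for index, value in job:
            private_grid[index] += value


def grid_replicated(samples, grid_size, n_workers=4, job_size=8):
    """Master-worker gridding with one private (replicated) grid per worker."""
    jobs = queue.Queue()
    grids = [[0.0] * grid_size for _ in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(jobs, g)) for g in grids]
    for t in threads:
        t.start()
    # Master: cut the sample stream into fixed-size jobs.
    for start in range(0, len(samples), job_size):
        jobs.put(samples[start:start + job_size])
    for _ in threads:                # one sentinel per worker
        jobs.put(None)
    for t in threads:
        t.join()
    # Reduction: sum the private copies into the final grid.
    return [sum(column) for column in zip(*grids)]
```

Replication trades memory for the absence of write conflicts: no two workers ever update the same grid cell, at the cost of one grid copy per worker and a reduction step at the end.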


CPU Experiments

• Core i7 suffers less from irregular accesses
• Still 3x more locality needed
• Hyperthreading shows a lot of benefit


Cell/B.E. Experiments

• Achieves the highest performance
• Could perform much better with more work
• Some optimizations were applied to the computation


GPU Experiments

• Write conflicts in the grid are problematic
• Also requires much more work/locality
• Tesla C1040 results remain unexplained


Discussion

• Reached good speedups, but still way below peak
• A lot of effort

• Best performance on Cell/B.E.
  - depends on application requirements

• GPUs suitable for lots of data parallelism
  - can exploit 2D or 3D spatial locality

• Don’t underestimate standard CPUs
  - flexibility, availability, cost, and ease of programming


Guidelines

• Good performance requires:
  - regular data accesses
  - data reuse between independent samples

• Or else: suffer (redesign algorithm)
  - conceptually, resolve irregularity at a higher level
  - avoid write conflicts
  - stream jobs: overlap/multi-buffering in the hierarchy
  - parameterized job size
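One way to read "resolve irregularity at a higher level" together with "avoid write conflicts" and "parameterized job size" is to bin samples by the grid region they write to before handing them out as jobs. The sketch below is an assumption about how that could look, not the method from the talk: the names `bin_by_tile` and `make_jobs`, the `(grid_index, value)` sample format, and the tiling scheme are hypothetical, and it ignores convolution kernels that straddle a tile boundary.

```python
from collections import defaultdict


def bin_by_tile(samples, tile_size):
    """Group (grid_index, value) samples by the grid tile they write to."""
    tiles = defaultdict(list)
    for index, value in samples:
        tiles[index // tile_size].append((index, value))
    return tiles


def make_jobs(tiles, job_size):
    """Cut each tile's samples into jobs of at most job_size samples."""
    jobs = []
    for tile_id in sorted(tiles):
        bucket = tiles[tile_id]
        for start in range(0, len(bucket), job_size):
            jobs.append((tile_id, bucket[start:start + job_size]))
    return jobs
```

If all jobs of one tile go to the same worker, no two workers write to the same grid cells, and each job touches only a small, contiguous part of the grid. Tuning `tile_size` and `job_size` per platform is then a matter of matching the local memory sizes listed in the platform table.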


Conclusion

• Challenges:
  - platform choice
  - fitting the application onto the platform

• Similar strategies, different implementation

• Provided guidelines focusing on memory and data optimization
  - or change the algorithm