18-04-23
Challenge the future
Delft University of Technology
Evaluating Multi-Core Processors for Data-Intensive Kernels
Alexander van Amesfoort
Delft University of Technology
[email protected]
joint work with Ana Varbanescu, Rob van Nieuwpoort, and Henk Sips
Outline
• Data-intensive applications
• Gridding
• Platforms
• Implementation strategies
• Measurements
• Guidelines
• Conclusion
Data-Intensive Applications
• Low arithmetic intensity (computation : communication)
  - drastic application- and platform-specific effort
• Difficult platform choice
• Memory wall is still getting bigger
• Data-intensive study worthwhile
• Provide guidelines and insight into performance behavior and effort
Radio Astronomy Imaging
• Gridding places irregularly spaced samples on a regular grid
• (De)gridding consumes most of the time in imaging
• Use gridding as an HPC streaming kernel
Gridding (W Projection)
• Unpredictable, sparse access patterns
• Low AI (0.33)
forall (i = 0..N_freq; j = 0..N_samples)      // for all samples
    g_index = func1((u, v, w)[j], freq[i]);
    c_index = func2((u, v, w)[j], freq[i]);
    for (x = 0; x < SUPPORT; x++)             // sweep the convolution kernel
        G[g_index+x] += C[c_index+x] * V[i,j];
• Parameterize these properties
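The loop above can be sketched as runnable code. This is a minimal illustration, not the implementation behind the slides: `grid_index` and `conv_index` are hypothetical stand-ins for `func1` and `func2`, the grid is flattened to one dimension, and `SUPPORT = 4` is an arbitrary choice.

```python
# Minimal gridding sketch. All index functions are illustrative assumptions.
SUPPORT = 4  # number of kernel elements swept per sample (assumed value)


def grid_index(uvw, freq, grid_size):
    # Hypothetical stand-in for func1: scale u by frequency, wrap into the grid.
    u, _v, _w = uvw
    return int(abs(u * freq)) % (grid_size - SUPPORT + 1)


def conv_index(uvw, freq, conv_size):
    # Hypothetical stand-in for func2: pick a kernel offset from w.
    _u, _v, w = uvw
    return int(abs(w * freq)) % (conv_size - SUPPORT + 1)


def grid_samples(uvw, freqs, V, C, grid_size):
    """Accumulate visibilities V[i][j] onto a 1D grid G through kernel C."""
    G = [0j] * grid_size
    for i, freq in enumerate(freqs):           # for all frequencies
        for j, coords in enumerate(uvw):       # for all samples
            g = grid_index(coords, freq, grid_size)
            c = conv_index(coords, freq, len(C))
            for x in range(SUPPORT):           # sweep the convolution kernel
                G[g + x] += C[c + x] * V[i][j]
    return G
```

Because `g` and `c` depend on the sample coordinates, consecutive samples may touch unrelated parts of `G` and `C`, which is the unpredictable, sparse access pattern the slide refers to.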
Platforms and Test Setup
• High provided Flop/Byte ratios
Platform          Cores    Clock (GHz)  Local Mem (kB)                Compute (GFlop/s)  Flop/Byte Ratio
Dual Xeon 5320    2x4      1.86         32 + 32 L1, 2 x 4096 L2       59.5               2.8
Core i7 920       4 (HT)   2.66         32 + 32 L1, 256 L2, 8192 L3   85.2               2.7
PS3 Cell          1+6      3.20         256                           153.6              6.0
QS21 Cell         2+16     3.20         256                           409.6              8.0
Geforce 8800 GTX  16       1.35         16+8+8                        345.0              4.0
Geforce GTX 280   30       1.30         16+8+8                        936.0              6.6
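To relate these ratios to the kernel's arithmetic intensity of 0.33, a rough roofline-style estimate can help. The sketch below assumes that the Flop/Byte column is peak compute divided by peak memory bandwidth; the slides do not state this explicitly, so treat the resulting bounds as illustrative only.

```python
KERNEL_AI = 0.33  # flop/byte, taken from the gridding slide

# (peak GFlop/s, Flop/Byte ratio) copied from the platform table
PLATFORMS = {
    "Dual Xeon 5320": (59.5, 2.8),
    "Core i7 920": (85.2, 2.7),
    "PS3 Cell": (153.6, 6.0),
    "QS21 Cell": (409.6, 8.0),
    "Geforce 8800 GTX": (345.0, 4.0),
    "Geforce GTX 280": (936.0, 6.6),
}


def bandwidth_bound(peak_gflops, flop_byte_ratio, ai=KERNEL_AI):
    """Upper bound in GFlop/s for a kernel limited by memory bandwidth alone."""
    # Assumption: ratio = peak compute / peak bandwidth, so bandwidth is in GB/s.
    bandwidth = peak_gflops / flop_byte_ratio
    return min(peak_gflops, ai * bandwidth)


for name, (peak, ratio) in PLATFORMS.items():
    bound = bandwidth_bound(peak, ratio)
    print(f"{name:18s} {bound:6.1f} GFlop/s ({100 * bound / peak:4.1f}% of peak)")
```

Under this assumption the attainable fraction of peak is simply AI divided by the platform ratio, so a higher Flop/Byte ratio means a streaming kernel with no data reuse can use a smaller share of the machine.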
Implementation Strategies
• CPU (pthreads)
  - replicated grid, master-worker queues, SIMD
• Cell/B.E. (Cell SDK)
  - master-worker queues, SIMD, double buffering, PPE multi-threading, line reuse
• GPU (CUDA)
  - replicated grid, 1D texturing of the convolution matrix
• Similar at a high level
  - but different, non-portable code
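The replicated-grid strategy can be sketched as follows. This is a structural illustration under simplifying assumptions, not the pthreads or CUDA code from the study: each update is a hypothetical `(grid_index, value)` pair, the grid is one-dimensional, and Python threads are used only to show the pattern (they give no real speedup for this kind of loop).

```python
from concurrent.futures import ThreadPoolExecutor


def grid_chunk(updates, grid_size):
    """Accumulate one worker's updates into its own private grid copy."""
    private = [0.0] * grid_size
    for index, value in updates:
        private[index] += value
    return private


def replicated_gridding(updates, grid_size, workers=4):
    # Split the update stream into one contiguous chunk per worker.
    step = max(1, -(-len(updates) // workers))  # ceiling division
    chunks = [updates[k:k + step] for k in range(0, len(updates), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        replicas = list(pool.map(lambda c: grid_chunk(c, grid_size), chunks))
    # Reduction step: sum the private replicas into the final grid.
    grid = [0.0] * grid_size
    for replica in replicas:
        for k, value in enumerate(replica):
            grid[k] += value
    return grid
```

Workers never write to a shared cell, so no locking is needed during accumulation. The cost is one extra grid copy per worker plus the final reduction.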
CPU Experiments
• Core i7 suffers less from irregular accesses
• Still 3x more locality needed
• Hyperthreading shows a lot of benefit
Cell/B.E. Experiments
• Achieves the highest performance
• Could perform much better with more work
• Some optimizations were applied to the computation
GPU Experiments
• Write conflicts in the grid are problematic
• Also requires much more work/locality
• Tesla C1040 results unexplainable
Discussion
• Reached good speedups, but still way below peak
• A lot of effort
• Best performance on Cell/B.E.
  - depends on application requirements
• GPUs suitable for lots of data parallelism
  - can exploit 2D or 3D spatial locality
• Don’t underestimate standard CPUs
  - flexibility, availability, cost, and ease of programming
Guidelines
• Good performance requires:
  - regular data accesses
  - data reuse between independent samples
• Or else: suffer (redesign algorithm)
  - conceptually, resolve irregularity at a higher level
  - avoid write conflicts
  - stream jobs: overlap/multi-buffering in the hierarchy
  - parameterized job size
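One possible reading of "resolve irregularity at a higher level" with a parameterized job size is to bin samples by grid region before processing them. The sketch below is an assumption about how that could look, not the scheme used in the study: `make_jobs`, `tile_size`, and `job_size` are hypothetical names, and each sample is simplified to a single `(grid_index, value)` pair.

```python
from collections import defaultdict


def make_jobs(samples, tile_size, job_size):
    """Group (grid_index, value) samples by grid tile, then cut fixed-size jobs."""
    tiles = defaultdict(list)
    for index, value in samples:
        tiles[index // tile_size].append((index, value))
    jobs = []
    for tile in sorted(tiles):
        bucket = sorted(tiles[tile])  # ascending grid index within a tile
        for start in range(0, len(bucket), job_size):
            jobs.append((tile, bucket[start:start + job_size]))
    return jobs


# Usage: five scattered samples, tiles of 4 grid cells, at most 2 samples per job.
jobs = make_jobs([(9, 1.0), (1, 2.0), (8, 3.0), (0, 4.0), (2, 5.0)], 4, 2)
for tile, job in jobs:
    print(tile, job)
```

In this simplified model, jobs from different tiles touch disjoint grid ranges, so they can run in parallel without write conflicts, and each job reads its grid cells in ascending order. A real convolution sweep that crosses tile borders would need extra handling, which this sketch leaves out.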