
2013 ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP), Austin, TX, USA, June 2, 2013

Optimizations in GPU: Smart Compilers and Core-level Reconfiguration Deming Chen, University of Illinois at Urbana-Champaign

Graphics processing units (GPUs) are increasingly critical for general-purpose parallel processing performance. GPU hardware is composed of many streaming multiprocessors, allowing GPUs to execute tens of thousands of threads in parallel. However, due to the SIMD (single-instruction, multiple-data) execution style, resource utilization and thus overall performance can suffer significantly when computation threads must take diverging control paths. Meanwhile, tuning the performance of GPU applications is a complex and labor-intensive task. Software programmers employ a variety of optimization techniques to explore the tradeoff between thread-level parallelism and the performance of a single thread. Newer GPU architectures also allow concurrent kernel execution, which introduces interesting kernel scheduling problems. In the first part of the talk, we will introduce our recent studies on control-flow optimization, joint optimization of register allocation and thread structure, and concurrent kernel scheduling for GPU performance improvement.
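
As a concrete illustration (a minimal sketch, not code from the talk), the first kernel below branches on each thread's own index, so threads within a 32-thread warp take different paths and the SIMD hardware serializes the two paths with inactive lanes masked off; the second kernel assumes the data has been reorganized so the condition is uniform across a warp, letting each warp execute a single path. The kernel names, the branch condition, and the assumed data layout are hypothetical.

    __global__ void divergent_kernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Neighboring threads in the same warp take different branches, so the
        // warp executes both paths back to back with inactive lanes masked off.
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }

    __global__ void warp_uniform_kernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Assumes elements needing the same operation have been grouped so that
        // all 32 threads of a warp evaluate the condition identically; each warp
        // then executes only one of the two paths.
        int warp_id = i / 32;
        if (warp_id % 2 == 0)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }

The register-allocation side of the tradeoff is commonly explored with the __launch_bounds__ qualifier or the -maxrregcount compiler flag, which limit registers per thread so that more warps can remain resident on a streaming multiprocessor.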

Energy efficiency of GPUs for general-purpose computing is increasingly important as well. The integration of GPUs onto SoCs for mobile devices over the last five years has further intensified the need to reduce the energy footprint of GPUs. In the second part of the talk, we propose a novel GPU architecture that uses core-level reconfiguration to exploit instruction-level parallelism (ILP) together with dynamic voltage and frequency scaling (DVFS) to reduce power consumption without sacrificing computational throughput. We expect that applications with large amounts of ILP will see dramatic improvements in energy and power compared to nominal CUDA-based architectures. In addition, we foresee interesting challenges in thread scheduling and in the reorganization of CUDA warp structures and schedules. We also note that dynamic reconfiguration of cores within a SIMD unit (an SM in CUDA) affects the number of threads that can execute concurrently and thus changes the number of effective warps in flight, which may affect the ability to overlap execution with memory latency.
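
As a back-of-the-envelope illustration of the last point, the host-side sketch below estimates how many warps are needed to hide a given memory latency and how fusing cores into wider SIMD groups shrinks the number of warps in flight. The latency, issue interval, resident-thread count, and candidate warp widths are assumptions chosen for the example, not figures from the proposed architecture.

    #include <cstdio>

    int main() {
        const int mem_latency_cycles = 400;   // assumed DRAM round-trip latency
        const int issue_interval     = 4;     // assumed cycles between issues per warp
        const int resident_threads   = 1536;  // assumed threads resident on one SM

        // Roughly, enough other warps must be ready to issue while one warp
        // waits on memory: latency divided by the per-warp issue interval.
        int warps_needed = mem_latency_cycles / issue_interval;

        // If reconfiguration fuses cores into wider SIMD groups, the same pool
        // of resident threads yields fewer, wider warps in flight.
        for (int warp_width = 32; warp_width <= 128; warp_width *= 2) {
            int warps_in_flight = resident_threads / warp_width;
            printf("warp width %3d -> %2d warps in flight (need ~%d to hide latency)\n",
                   warp_width, warps_in_flight, warps_needed);
        }
        return 0;
    }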