
  • Cross-Platform Heterogeneous Runtime Environment

    A Dissertation Presented

    by

    Enqiang Sun

    to

    The Department of Electrical and Computer Engineering

    in partial fulfillment of the requirements

    for the degree of

    Doctor of Philosophy

    in

    Computer Engineering

    Northeastern University

    Boston, Massachusetts

    April 2016

  • To my family.


  • Contents

    List of Figures iv

    List of Tables vi

    Acknowledgments vii

    Abstract of the Dissertation viii

    1 Introduction 1
      1.1 A Brief History of Heterogeneous Computing 4
      1.2 Heterogeneous Computing with OpenCL 5
      1.3 Task-level Parallelism across Platforms with Multiple Computing Devices 5
      1.4 Scope and Contribution of This Thesis 7
      1.5 Organization of This Thesis 8

    2 Background and Related Work 9
      2.1 From Serial to Parallel 9
      2.2 Many-Core Architecture 10
      2.3 Programming Paradigms for Many Core Architecture 14
        2.3.1 Pthreads 14
        2.3.2 OpenMP 15
        2.3.3 MPI 15
        2.3.4 Hadoop MapReduce 15
      2.4 Computing with Graphic Processing Units 16
        2.4.1 The Emergence of Programmable Graphics Hardware 16
        2.4.2 General Purpose GPUs 17
      2.5 OpenCL 19
        2.5.1 An OpenCL Platform 20
        2.5.2 OpenCL Execution Model 20
        2.5.3 OpenCL Memory Model 27
      2.6 Heterogeneous Computing 28
        2.6.1 Discussion 36
      2.7 SURF in OpenCL 36
      2.8 Monte Carlo Extreme in OpenCL 37

    3 Cross-platform Heterogeneous Runtime Environment 39
      3.1 Limitations of the OpenCL Command-Queue Approach 39
        3.1.1 Working with Multiple Devices 39
      3.2 The Task Queuing Execution System 40
        3.2.1 Work Units 42
        3.2.2 Work Pools 42
        3.2.3 Common Runtime Layer 44
        3.2.4 Resource Manager 44
        3.2.5 Scheduler 45
        3.2.6 Task-Queuing API 49

    4 Experimental Results 50
      4.1 Experimental Environment 50
      4.2 Static Workload Balancing 51
        4.2.1 Performance Opportunities on a Single GPU Device 51
        4.2.2 Heterogeneous Platform with Multiple GPU Devices 52
        4.2.3 Heterogeneous Platform with CPU and GPU (APU) Device 56
      4.3 Design Space Exploration for Flexible Workload Balancing 59
        4.3.1 Synthetic Workload Generator 59
        4.3.2 Dynamic Workload Balancing 59
        4.3.3 Workload Balancing with Irregular Work Units 63
      4.4 Cross-Platform Heterogeneous Execution of clSURF and MCXCL 64
      4.5 Productivity 67

    5 Summary and Conclusions 69
      5.1 Portable Execution across Platforms 69
      5.2 Dynamic Workload Balancing 69
      5.3 APIs to Expose Both Task-level and Data-level Parallelism 70
      5.4 Future Research Directions 70
        5.4.1 Including Flexible Workload Balancing Schemes 70
        5.4.2 Running specific kernels on the best computing devices 71
        5.4.3 Prediction of data locality 71

    Bibliography 73

  • List of Figures

    1.1 Multi-core CPU, GPU, and Heterogeneous System-on-Chip CPU and GPU. At present, designers are able to make decisions among diverse architecture choices: homogeneous multi-core with cores of various sizes and complexity, or heterogeneous system-on-chip architectures. 3
    2.1 Intel Processors Introduction Trends 10
    2.2 Multi-core Processors with Shared Memory 11
    2.3 Intel's TeraFlops Architecture 12
    2.4 Intel's Xeon Phi Architecture 13
    2.5 IBM's Cell 14
    2.6 High Level Block Diagram of a GPU 18
    2.7 OpenCL Architecture 19
    2.8 An OpenCL Platform 20
    2.9 The OpenCL Execution Model 21
    2.10 OpenCL work-items mapping to GPU devices. 23
    2.11 OpenCL work-items mapping to CPU devices. 24
    2.12 The OpenCL Memory Hierarchy. 28
    2.13 Qilin Software Architecture 29
    2.14 The OpenCL environment with the IBM OpenCL common runtime. 30
    2.15 Maestro's Optimization Flow 34
    2.16 Symphony Overview 35
    2.17 The Program Flow of clSURF. 37
    2.18 Block Diagram of the Parallel Monte Carlo simulation for photon migration. 38
    3.1 Distributing work units from work pools to multiple devices. 41
    3.2 CPU and GPU execution 43
    3.3 An example execution of vector addition on multiple devices with different processing capabilities. 46
    4.1 The performance of our work pool implementation on a single device: One Work Pool. 53
    4.2 The performance of our work pool implementation on a single device: Two Work Pools. 54
    4.3 Load balancing on dual devices: V9800P and HD6970. 55
    4.4 Load balancing on dual devices: V9800P and GTX 285. 56
    4.5 Performance assuming different device fission configurations and load balancing schemes between CPU and Fused HD6550D GPU. 57
    4.6 The Load Balancing on dual devices: HD6550D and CPU. 58
    4.7 Performance of different workload balancing schemes on all 3 CPU and GPU devices, an A8-8350 CPU, a V7800 GPU and a HD6550D GPU, as compared to a V7800 GPU device alone. 62
    4.8 Performance of different workload balancing schemes on 1 CPU and 2 GPU devices, an NVS 5400M GPU, a Core i5-3360M CPU and an Intel HD Graphics 4000 GPU, as compared to the NVS 5400M GPU device alone. 63
    4.9 Performance Comparison of clSURF implemented with various workload balancing schemes on the platform with V7800 and HD6550D GPUs. 65
    4.10 Performance Comparison of MCXCL implemented with various workload balancing schemes. 66
    4.11 Number of lines of the source code using our runtime API versus a baseline OpenCL implementation. 68

  • List of Tables

    3.1 Typical memory bandwidth between different processing units for reads. 47
    3.2 Typical memory bandwidth between different processing units for writes. 47
    3.3 The Classes and Methods 47
    4.1 Input sets emphasizing different phases of the SURF algorithm. 52

  • Acknowledgments

    First, I would like to thank my advisor, Prof. David Kaeli, for his insightful and inspiring guidance during the course of my graduate study. I have always enjoyed talking with him on various research topics, and his computer architecture class is one of the best classes I have ever taken.

    The enlightening suggestions from my committee of Prof. Norman Rubin and Prof. Ningfang Mi have been a great help to this thesis. Norm was my mentor when I was doing a 6-month internship at AMD, and that's where this thesis essentially started.

    I would also like to thank Dr. Xinping Zhu, who gave me valuable guidance for my early graduate study. My fellow NUCAR colleagues, Dana Schaa, Byunghyun Jang, Perhaad Mistry, etc., also helped me so much through technical discussions and feedback. If life is a train ride, I cherish every moment and every scene outside of the window we share together.

    My deepest appreciation goes to my family, as it is always where I can put myself together with their endless love. I would like to thank my mom and dad for their consistent support and motivation, and my brother for his advice and encouragement. And finally, but most importantly, I would like to thank my wife and mother of my two kids, Liwei, for her understanding, patience, and faith in me. I couldn't have finished this thesis without her love.

  • Abstract of the Dissertation

    Cross-Platform Heterogeneous Runtime Environment

    by

    Enqiang Sun

    Doctor of Philosophy in Computer Engineering

    Northeastern University, April 2016

    Dr. David Kaeli, Adviser

    Heterogeneous platforms are becoming widely adopted thanks to the support from new programming languages and models. Among these languages/models, OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units on a system. OpenCL is one of the first standards that focuses on portability, allowing programs to be written once and run unmodified on multiple, heterogeneous devices, regardless of vendor.

    While OpenCL has been widely adopted, there still remains a lack of support for automatic workload balancing and data consistency when multiple devices are present in the system. To address this need, we have designed a cross-platform heterogeneous runtime environment which provides a high-level, unified execution model that is coupled with an intelligent resource management facility. The main motivation for developing this runtime environment is to provide OpenCL programmers with a convenient programming paradigm to fully utilize all possible devices in a system and incorporate flexible workload balancing schemes without compromising the user's ability to assign tasks according to data affinity. Our work removes much of the cumbersome initialization of the platform, and now devices and related OpenCL objects are hidden under the hood.

    Equipped with this new runtime environment and associated programming interface, the programmer can focus on designing the application and worry less about customization to the target platform. Further, the programmer can now take advantage of multiple devices using a dynamic workload balancing algorithm to reap the benefits of task-level parallelism.

    To demonstrate the value of this cross-platform heterogeneous runtime environment, we have evaluated it running both micro benchmarks and popular OpenCL benchmark applications. With minimal overhead for managing data objects across devices, we show that we can achieve scalable performance and application speedup as we increase the number of computing devices, without any changes to the program source code.


  • Chapter 1

    Introduction

    Moore's law describes technology advances that double transistor density on integrated circuits every 12 to 18 months [1]. However, with the size of transistors approaching the size of individual atoms, and as power density outpaces current cooling techniques, the end of Moore's law has appeared on the horizon. This has encouraged the research community to look at new solutions in system architecture, including heterogeneous computing architectures.

    Since 2003, the semiconductor industry has settled on three main trends for microprocessor design. The first trend is to continue improving sequential execution speed while increasing the number of cores [2]. Microprocessors of this kind are called multicore processors. An example of a multicore CPU is Intel's widely used Core 2 Duo processor. It has dual processor cores, each of which is an out-of-order, multiple-instruction-issue processor implementing the full x86 instruction set, supporting hyperthreading with two hardware threads, and designed to maximize the execution speed of sequential programs. The second trend focuses more on the execution of parallel applications with as many threads as possible. Processors of this kind are called many-thread processors. Most of the current popular GPUs adopt a many-thread architecture. For example, at full occupancy NVIDIA's GTX 970 can host 26,624 threads, executing in a large number of simple, in-order pipelines. The third trend combines both the multicore and the many-thread architectures. Processors of this kind are represented by most current desktop processors with integrated graphics processing units. For example, Intel's 6th-generation Core i7-6567U processor has a dual-core CPU with an integrated Iris Graphics 550 GPU, which has 72 execution units [3]. AMD's A8-3850 Fusion processor has four x86-64 CPU cores integrated together with a Radeon HD6550D GPU, which has 5 SIMD engines (16-wide) and a total of 400 streaming processors [4].

    With its long history of evolution, the design philosophy of the CPU is to minimize the execution latency of a single thread. Large on-chip caches are integrated to store frequently accessed data, converting some long-latency memory accesses into short-latency cache accesses. There is also prediction logic, such as branch prediction and data prefetching, designed to minimize the effective latency of operations at the cost of increased chip area and power. With all these hardware logic components, the CPU greatly reduces the execution latency of each individual thread. However, the large cache memory, low-latency arithmetic units, and sophisticated prediction logic consume chip area and power that could otherwise be used to provide more arithmetic execution units and memory access channels. This design style emphasizes minimizing latency, and is referred to as latency-oriented design.

    GPUs, either standalone or integrated, on the other hand, are designed as parallel, throughput-oriented computing engines. The application software is expected to be organized with much more data parallelism. The hardware takes advantage of the large number of arithmetic execution units, and pipelines the execution when some of them are waiting for long-latency memory accesses or arithmetic operations. Only a limited amount of cache memory is supplied, to help meet the memory bandwidth requirements of these applications and to facilitate data synchronization between multiple threads that access the same memory data. This design style strives to maximize the total execution throughput of a large amount of data parallelism, while allowing individual threads to take a potentially much longer time to execute.

    GPUs have been leading the race in floating-point performance since 2013. With enough data parallelism and proper memory arrangement, the performance gap can be more than ten times. These are not necessarily the application speeds, but only the raw speed the execution resources can potentially support. For applications that have only one or a few threads, CPUs can achieve much higher performance than GPUs. Therefore, heterogeneous architectures combining CPUs and GPUs are a natural choice for applications, which can execute their sequential parts on the CPU and their numerically intensive parallel parts on the GPU.

    Figure 1.1 is a high-level illustration of a multi-core CPU, a many-thread accelerator GPU, and a heterogeneous system-on-chip architecture with a CPU and GPU on the same die. High-performance computing might emphasize single-threaded latency, whereas commercial transaction processing might emphasize aggregate throughput. Designers began to put these devices with very different characteristics together, and expected a performance gain, leveraging proper workload distribution and balancing.

    [Figure 1.1 panels: a homogeneous multi-core CPU, a homogeneous multi-core GPU, and a heterogeneous System-on-Chip with CPU and GPU.]

    Figure 1.1: Multi-core CPU, GPU, and Heterogeneous System-on-Chip CPU and GPU. At present, designers are able to make decisions among diverse architecture choices: homogeneous multi-core with cores of various sizes and complexity, or heterogeneous system-on-chip architectures.

    Graphics processing units used to be very difficult to program, since programmers had to use the corresponding graphics application programming interfaces. OpenGL and Direct3D are the most widely used graphics API specifications. More precisely, a computation had to be mapped to a graphical function that programs a pixel processing engine so that it could be executed on the early GPUs. These APIs require extensive knowledge of graphics processing and also limited the kinds of applications that one could actually write for early general purpose GPU programming. To quench the increasing demands, new GPU programming paradigms became more and more popular, such as CUDA [5], OpenCL [6], OpenACC [7], and C++ AMP [8]. Many runtime and execution systems have also been designed to help developers manage heterogeneous platforms with multiple computing devices with dramatically different characteristics.

    In this thesis, we present a cross-platform heterogeneous runtime environment, providing a convenient programming interface to fully utilize all possible devices in a heterogeneous system. Our framework incorporates flexible workload balancing schemes without compromising the user's ability to assign tasks according to data affinity. Our framework provides significant enhancements to the state-of-the-art in OpenCL programming practice in terms of workload balancing and distribution. Furthermore, the details of programming the specific platform are hidden from the programmer, enabling the programmer to focus more on the high-level design of the algorithms.

    In this chapter, we present the reader with an introduction to some basic concepts of heterogeneous computing. This includes a very brief history of heterogeneous computing with CPUs and GPUs, the potential benefits that heterogeneous computing provides, and the ability of our runtime framework to adapt applications to heterogeneous computing platforms. Finally, we highlight the contributions of this thesis and outline the organization of the remainder of this thesis.

    1.1 A Brief History of Heterogeneous Computing

    Over the last decade, developers have witnessed the field of computer architecture transitioning from single-core compute devices to a wide range of parallel architectures. The change in architecture has also produced new challenges with the underlying parallel programming paradigms. Existing algorithms designed to scale with single-core systems had to be redesigned to reap the performance benefits of new parallel architectures. Multi-core is the path chosen by the industry to quench the thirst for performance while, at the same time, respecting thermal and power design limits.

    While multi-core processors have ushered in a new era of concurrency, there has also been work on exploiting existing parallel platforms such as GPUs. Since the early 1990s, software architects have explored how best to run general-purpose applications on computer graphics hardware (i.e., GPUs). GPUs were originally designed to execute a set of predefined functions as a graphics rendering pipeline. Even today, GPUs are mainly designed to calculate the color of pixels on the screen to support complex graphics processing functions. GPUs provide deterministic performance when rendering frames. In the beginning of this revolution, GPU programming was done using a graphics Application Programming Interface (API) such as OpenGL [9] or DirectX [10]. This model required general purpose application developers to have intimate knowledge of graphics hardware and graphics APIs. These restrictions severely impacted the implementation of many algorithms on GPUs.

    General purpose GPU (GPGPU) programming was not widely accepted until new GPU architectures unified vertex and pixel processors (first available in the R600 family from AMD and the G80 family from NVIDIA). New general purpose programming languages such as CUDA [5] and Brook+ [11] were introduced in 2006. The introduction of fully programmable hardware and new programming languages lifted many of the restrictions and greatly increased the interest in using GPUs for general purpose computing. Heterogeneous platforms that include GPUs as a powerful data-parallel co-processor have been adopted in many scientific and engineering environments [12] [13] [14] [15]. On current systems, discrete GPUs are connected to the rest of the system through a PCI Express bus. All data transfer between the CPU and GPU is limited by the speed of the PCI Express protocol.

    Recently, industry leaders have recognized that scalar processing on the CPU, combined with parallel processing on the GPU, could be a powerful model for application throughput. More recently, the Heterogeneous System Architecture (HSA) Foundation [16] was founded in 2012 by many vendors. HSA has provided industry with standards to further support heterogeneity across systems and devices. We have also seen that solutions with a CPU and a GPU on the same die, such as AMD's APU [4] series, Intel's Ivy Bridge [17] series, and Qualcomm's Snapdragon [18], have demonstrated potential power/performance savings. Current state-of-the-art supercomputers utilize a heterogeneous solution.

    Heterogeneous systems can be found in every domain of computing, ranging from high-performance computing servers to low-power embedded processors in mobile phones and tablets. Industry and academia are investing huge amounts of effort and budget to improve every aspect of heterogeneous computing [19] [20] [21] [15] [22] [23].

    1.2 Heterogeneous Computing with OpenCL

    The emerging software framework for programming heterogeneous devices is the Open Computing Language (OpenCL) [6]. OpenCL is an open industry standard managed by the non-profit technology consortium Khronos Group. Support for OpenCL has been increasing from major companies such as Qualcomm, AMD, Intel and Imagination.

    The aim of OpenCL is to serve as a universal language for programming heterogeneous platforms such as GPUs, CPUs, DSPs, and FPGAs. In order to support such a wide variety of heterogeneous devices, some elements of the OpenCL API are necessarily low-level. As with the CUDA/C language [5], OpenCL does not provide support for automatic workload balancing, nor does it guarantee global data consistency; it is up to the programmer to explicitly define tasks and enqueue them on devices, and to move data between devices as required. Furthermore, when different implementations of OpenCL produced by different vendors are used, OpenCL objects from vendor A's implementation may not run on vendor B's hardware. Given these limitations, there still remain barriers to achieving straightforward heterogeneous computing.
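    A minimal sketch, assuming a single OpenCL platform and using only standard OpenCL 1.x host API calls, of the per-device setup a programmer must perform before any work can be distributed; partitioning the workload and moving data between the resulting command-queues is left entirely to the application:

    /* Sketch only: enumerate the devices of one platform and give each its own
     * command-queue. Splitting work across these queues, and copying data
     * between devices, must still be coded by hand. */
    #include <CL/cl.h>
    #include <stdio.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id devices[16];
        cl_uint num_devices = 0;
        cl_int err;

        err = clGetPlatformIDs(1, &platform, NULL);
        err |= clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);
        if (num_devices > 16) num_devices = 16;

        /* One context per platform; devices from another vendor's platform
         * would require a second, incompatible context. */
        cl_context ctx = clCreateContext(NULL, num_devices, devices, NULL, NULL, &err);

        /* One command-queue per device: every kernel launch and data transfer
         * must be targeted at a specific queue by the programmer. */
        cl_command_queue queues[16];
        for (cl_uint i = 0; i < num_devices; ++i)
            queues[i] = clCreateCommandQueue(ctx, devices[i], 0, &err);

        printf("created %u command-queues on one platform\n", num_devices);

        for (cl_uint i = 0; i < num_devices; ++i)
            clReleaseCommandQueue(queues[i]);
        clReleaseContext(ctx);
        return 0;
    }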

    1.3 Task-level Parallelism across Platforms with Multiple Computing Devices

    Platform agnosticism is a quality that is taken for granted in many existing programming languages such as C/C++ and Java. Programmers rely on compilers or run-time systems to automatically generate executables for different processing units. Until recently, there did not exist a set of API functions that would enable the programmer to automatically exploit all computing resources when the characteristics of the underlying platform change (e.g., the number of processing units and accelerators).

    To help illustrate some of the challenges with heterogeneous computing, we consider the OpenCL open-source implementation of OpenSURF (Open source Speeded Up Robust Feature) [24] to demonstrate a typical use of the OpenCL programming model. In OpenSURF, the degree of data-parallelism in a single kernel can vary when executing on different computing devices. Execution dynamics are also dependent on the characteristics of the input images or video frames, such as the size and image complexity. Furthermore, without proper runtime management, when mapping to another platform with a different number of devices we usually have to re-design the kernel binding and associated data transfers. Without runtime workload balancing, the additional processing units available on the targeted accelerator may remain idle unless the application is redesigned. Even with the range of parallelism present in OpenSURF, an application has no inherent ability to exploit the extra computing resources, and is not able to improve performance if we upgrade our hardware platform.

    In this thesis we present a cross-platform heterogeneous runtime environment that helps ameliorate many of the burdens faced when performing heterogeneous programming. New programming models such as OpenCL and CUDA provide the ability to dynamically initialize the platforms and objects, and to acquire the processing capability of each device, such as the number of compute units, core frequency, etc. The presented runtime environment augments this ability, and incorporates a central task queuing/scheduling system. This central task queuing system is based on the concepts of work pools and work units, and cooperates with workload balancing algorithms to execute applications on heterogeneous hardware platforms. Using the runtime API, programmers can easily develop and tune flexible workload balancing schemes across different platforms.

    In the proposed runtime environment, data-parallel kernels in an application are wrapped with metadata into work units. These work units are then enqueued into a work pool and assigned to computing devices according to a selectable workload balancing policy. A resource management system is seamlessly integrated in the central task-queuing system to provide for migration of kernels between devices and platforms. We demonstrate the utility of this class of task queuing runtime system by implementing selected benchmark applications from OpenCL benchmark suites. We also benchmark the performance trade-offs by implementing real-world applications such as clSURF [25], an OpenCL open-source implementation of the OpenSURF (Open source Speeded Up Robust Feature) framework, and Monte Carlo Extreme in OpenCL [26], a Monte Carlo simulation for time-resolved photon transport in 3D turbid media.
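    As a hypothetical illustration of this flow, the sketch below wraps kernels into work units, enqueues them into a work pool, and lets a toy scheduler assign them to devices. The names used here (work_unit, work_pool_enqueue, schedule) are illustrative only and are not the runtime's actual API, which is described in Chapter 3:

    /* Hypothetical sketch of the work-unit / work-pool idea. */
    #include <stdio.h>
    #include <stddef.h>

    typedef struct {               /* a data-parallel kernel plus its metadata */
        const char *kernel_name;
        size_t      global_size;   /* ND-range of the kernel */
        int         preferred_dev; /* data-affinity hint, -1 = any device */
    } work_unit;

    #define POOL_CAP 64
    typedef struct {               /* work pool: a queue the scheduler drains */
        work_unit units[POOL_CAP];
        int       count;
    } work_pool;

    static void work_pool_enqueue(work_pool *p, work_unit wu) {
        if (p->count < POOL_CAP) p->units[p->count++] = wu;
    }

    /* Toy scheduler: assign units round-robin over the available devices,
     * honouring the affinity hint when one is given. */
    static void schedule(const work_pool *p, int num_devices) {
        for (int i = 0; i < p->count; ++i) {
            int dev = (p->units[i].preferred_dev >= 0)
                          ? p->units[i].preferred_dev
                          : i % num_devices;
            printf("work unit '%s' (%zu work-items) -> device %d\n",
                   p->units[i].kernel_name, p->units[i].global_size, dev);
        }
    }

    int main(void) {
        work_pool pool = { .count = 0 };
        work_pool_enqueue(&pool, (work_unit){ "vector_add", 1 << 20, -1 });
        work_pool_enqueue(&pool, (work_unit){ "build_det",  1 << 18,  1 });
        schedule(&pool, 2);   /* e.g. one CPU device and one GPU device */
        return 0;
    }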

    1.4 Scope and Contribution of This Thesis

    Unlike the ubiquity of the x86 architecture and the long life cycle of CPU designs, GPUs often have much shorter release cycles and ever-changing ISAs and hardware features. Platforms incorporating GPUs as accelerators can have very different configurations in terms of processing capabilities and number of devices. As such, the need has arisen for a programming interface and runtime execution system that allows a single program to be portable across different platforms, and can automatically use all devices supported by an associated workload balancing scheme.

    The key contribution of this thesis is the development of a cross-platform heterogeneous runtime environment, which enables flexible task-level workload balancing on heterogeneous platforms with multiple computing devices. Together with the application programming interface, this extension layer is designed in the form of a library. Different levels of this runtime environment are considered. We study the following aspects of our runtime environment:

    We enable portable execution of applications across platforms. Our runtime environment provides a unified abstraction for all processing units, including CPUs, GPUs and many existing OpenCL devices. With this unified abstraction, tasks are able to be distributed to all devices. An application is portable across different platforms with a variable number of processing units.

    We provide APIs to expose both task-level and data-level parallelism. The program designer is in the best position to identify all levels of parallelism present in his/her application. We provide API functions and a dependency description mechanism so that the programmer can expose task-level parallelism. When combined with the data-level parallelism present in OpenCL kernels, the run-time and/or the compiler can effectively adapt to any type of parallel machine without modification of the source code.

    We balance task execution dynamically based on static and run-time profiling information. The optimal static mapping of task execution onto the underlying platform requires a significant amount of analysis of all of the devices on the platform, and it is impossible for programmers to perform such analysis and remap whenever new hardware is used. A dynamic workload balancing scheme makes it possible for the same source code to obtain portable performance.

    We support the management of data locality at runtime. Due to the data transfer overhead and its impact on the overall performance of OpenCL applications, data locality is an important issue for the portable execution of tasks. In our OpenCL support, data management is tightly integrated with the workload migration decisions. The runtime layer ensures data availability and data coherency throughout the whole system.

    We simplify the initialization of platforms. Scientific programmers usually are not familiar with the best way to initialize platforms across different types of OpenCL devices. With a new API designed for our runtime environment, we shift this burden to the underlying execution system, so that the programmer can focus on the development of his/her algorithms.

    1.5 Organization of This Thesis

    The rest of this thesis is organized as follows. Chapter 2 provides the necessary background on heterogeneous computing and presents a summary of related work on previously proposed runtime environments targeting heterogeneous platforms. In Chapter 3, we describe the structure and components of our cross-platform heterogeneous runtime environment and discuss how it can facilitate more effective use of the resources present on heterogeneous platforms. In Chapter 4, we explore the design space by using our heterogeneous runtime environment equipped with different scheduling schemes when running synthetic workloads. We then demonstrate the true value of our proposed runtime environment by evaluating the performance of benchmark applications run on multiple cross-vendor heterogeneous platforms. We present a detailed analysis of the performance components and demonstrate the programming efficiency. In Chapter 5, we conclude this thesis, summarizing the major contributions embodied in this work, and describe potential directions for future work.

  • Chapter 2

    Background and Related Work

    2.1 From Serial to Parallel

    In July 2004, Intel released a statement that their 4GHz chip, originally targeted for the fourth quarter, would be delayed until the first quarter of the next year. Company spokesman Howard High said the delay would help ensure that the company could deliver high chip quantities when the product was launched. Later, in mid-October, in a surprising announcement, Intel officially abandoned its plans to release the 4GHz version of the processor, and moved its engineers onto other projects. This marked an abrupt end to 34 years of CPU-frequency scaling, during which CPU frequency grew exponentially over time.

    Figure 2.1 illustrates a brief history of Intel processors, plotting the number of transistors per chip and the associated clock speed [27]. As the total number of transistors continued to climb, the clock speed did not keep up. The reason behind this major change in CPU development is power and cooling challenges, more specifically power density. The power density in processors has already exceeded that of a hot plate. Continuing to increase the frequency would require either new cooling technologies or new materials to relax the physical limits of what a processor can withstand. Processor design has hit the power wall. Our ability to improve performance automatically by increasing the frequency of the processor is gone. To further improve application throughput, major silicon vendors elected to provide multi-core designs, providing higher performance within the constraints of thermal limits and power density thresholds.

    [Figure 2.1 plots Intel CPU trends from 1970 to 2010: transistor count (thousands), clock speed (MHz), power (W), and performance per clock (ILP), with data points including the 386, Pentium, Pentium 4, Dual-Core Itanium 2, and Quad-Core Ivy Bridge. Sources: Intel, Wikipedia, K. Olukotun.]

    Figure 2.1: Intel Processors Introduction Trends

    2.2 Many-Core Architecture

    In recent years, multi-core processors have become the norm. Figure 2.2 shows an example of a multi-core processor. A multi-core processor has two or more processing cores on a single chip, each core with its own level-1 cache. The common global memory is shared among the different processing cores, while multiple tasks are executed on the multi-core processor.

    Intel's TeraFlops architecture [28] was designed to demonstrate a prototype of a many-core processor, as shown in Figure 2.3. Developed by Intel Corporation's Tera-Scale Computing Research Program, this research processor contains 80 tiled cores, and can yield 1.8 teraflops at 5.6GHz.

    [Figure 2.2 block diagram: four cores (Core 0 through Core 3), each a CPU with private L1 and L2 caches, sharing an L3 cache and system memory.]

    Figure 2.2: Multi-core Processors with Shared Memory

    While data transfers can occur between any pair of cores, no cache coherency is enforced across cores,

    and all memory transfers are explicit. Therefore, the biggest hurdle to fully take advantage of the

    power of these 80 cores is parallel programming. As shown in Figure 2.3, another interesting point is

    that some dedicated hardware engines could be integrated with some of the cores for multimedia,

    networking, security, and other tasks.

    The Intel Xeon Phi coprocessor [29] inherited many design elements from the Larrabee project [30], another high performance co-processor based on the TeraFlops architecture. The Intel Xeon Phi coprocessor is primarily composed of processing cores, caches, PCIe client logic, and a very high bandwidth, bidirectional ring interconnect, as illustrated in Figure 2.4. Intel is using Xeon Phi as the primary element of its family of Many Integrated Core architectures. Intel revealed its second generation Many Integrated Core architecture in November 2013, with the codename Knights Landing [31]. Knights Landing contains up to 72 cores in 36 tiles manufactured in 14nm technology, with each core running 4 threads. The Knights Landing chip also has a 2MB coherent shared cache between the 2 cores in a tile, which indicates the effort to make this architecture as programmable as possible. Knights Landing is ISA compatible with the Intel Xeon processors, with support for Intel's Advanced Vector Extensions 512, and supports most of today's parallel optimizations. One interesting feature of Knights Landing is that it can serve either as the main processor on a compute node or as a coprocessor in a PCIe slot. Intel is exploring different heterogeneous computing organizations.

    [Figure 2.3 depicts an 80-tile mesh of processing engines (PE), each with a local cache, with dedicated accelerators (HD video, crypto, DSP, GPU, physics) attached to some of the tiles.]

    Figure 2.3: Intel's TeraFlops Architecture

    Another example of a heterogeneous many-core architecture is IBM's Cell processor [32]. It includes a general purpose PowerPC core with 8 very simple SIMD coprocessors, which are specially designed for accelerating vector or multimedia operations. An operating system runs on the main core, which is called the Power Processing Unit (PPU). It functions as a master device controlling the 8 coprocessors, which are called Synergistic Processing Elements (SPEs). Each SPE is a dual-issue, in-order processor composed of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC). The Element Interconnect Bus (EIB) is the internal communication bus connecting the various on-chip system elements.

    [Figure 2.4 block diagram: cores with private L2 caches and tag directories (TD) connected by a bidirectional ring interconnect, together with GDDR5 memory controllers and PCIe client logic.]

    Figure 2.4: Intel's Xeon Phi Architecture

    The Cell processor was used as the processor for Sony's PlayStation 3 game console and some high performance computing servers, such as the IBM Roadrunner supercomputer and Mercury System servers with Cell accelerator boards [33].

    By November 2009, IBM discontinued the development of the Cell processor. The Cell

    processor benefits from very high internal memory bandwidth, but all transfers must be explicitly

    programmed by using low-level asynchronous DMA transfers. It requires significant expertise to

    write efficient code for this architecture, especially with the limited size of the local storage on each

    SPU (256 KB). Load balancing is another challenging issue on the Cell. The application programmer

    is responsible for evenly mapping the different pieces of computation on the SPUs.

    Besides the novelty in the hardware design of each of these many-core processors, industry has realized that programmability cannot be overlooked anymore. When the hardware design of these processors reaches such unprecedented complexity, it is impossible for software designers to manage all the processing elements manually. Suitable programming models are desperately needed to exploit the computing power of these architectures.

    [Figure 2.5 block diagram: eight SPUs, each with a local store (LS), and a PPU with its cache, connected by the Element Interconnect Bus (EIB) to RAM.]

    Figure 2.5: IBM's Cell

    2.3 Programming Paradigms for Many Core Architecture

    Given the development of a number of many-core architectures, many parallel programming models have been created to facilitate the use of these architectures.

    2.3.1 Pthreads

    Pthreads, or Portable Operating System Interface (POSIX) Threads, is a set of C programming language types, functions and variables [34]. Pthreads is implemented as a header (pthread.h) and a library, which creates and manages multiple threads. When using Pthreads, the programmer has to explicitly create and destroy threads by making use of pthread API functions.

    The Pthreads library provides mechanisms to synchronize different threads, resolve race conditions, avoid deadlock conditions, and protect critical sections. However, the programmer has the responsibility to manage threads explicitly. Therefore, it is usually very challenging to design a scalable multithreaded application on modern many-core architectures, especially systems with hundreds of cores on a single machine.
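    A minimal Pthreads sketch of this explicit style of thread management, assuming a POSIX system: four worker threads each sum one slice of an array, and a mutex protects the shared total.

    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 4
    #define N_ITEMS   1024

    static int data[N_ITEMS];
    static long total = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long id = (long)arg;                    /* which slice this thread owns */
        long partial = 0;
        for (int i = id * (N_ITEMS / N_THREADS);
             i < (id + 1) * (N_ITEMS / N_THREADS); ++i)
            partial += data[i];
        pthread_mutex_lock(&lock);              /* critical section */
        total += partial;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t threads[N_THREADS];
        for (int i = 0; i < N_ITEMS; ++i) data[i] = 1;
        for (long t = 0; t < N_THREADS; ++t)    /* explicit thread creation */
            pthread_create(&threads[t], NULL, worker, (void *)t);
        for (int t = 0; t < N_THREADS; ++t)     /* explicit join */
            pthread_join(threads[t], NULL);
        printf("total = %ld\n", total);         /* prints 1024 */
        return 0;
    }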


    2.3.2 OpenMP

    OpenMP is an open specification for shared memory parallelism [35] [36]. It comprises compiler directives, callable runtime library routines and environment variables which extend FORTRAN, C and C++ programs. OpenMP is portable across shared memory architectures. The thread management is implicit, and the programmer has to use special directives to specify the sections of code that are to be run in parallel. The number of threads to be used is specified by environment variables. OpenMP has also been extended as a parallel programming model for clusters.

    OpenMP uses several constructs to support implicit synchronization, so that the programmer is relieved from worrying about the actual synchronization mechanism.

    As with Pthreads, scalability is still an issue for OpenMP, as it is a thread-based mechanism. Furthermore, since OpenMP uses implicit thread management, there is no fine-grained way to do thread-to-processor mapping.
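    A minimal OpenMP sketch of this implicit model: a single directive marks each loop as parallel and the runtime creates and manages the threads (compile with -fopenmp or the vendor's equivalent flag).

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        const int n = 1 << 20;
        static double a[1 << 20];
        double sum = 0.0;

        #pragma omp parallel for                  /* implicit thread management */
        for (int i = 0; i < n; ++i)
            a[i] = 0.5 * i;

        #pragma omp parallel for reduction(+:sum) /* implicit synchronization */
        for (int i = 0; i < n; ++i)
            sum += a[i];

        printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }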

    2.3.3 MPI

    The Message Passing Interface (MPI) [37] provides a virtual topology, synchronization,

    and communication functionality between nodes in clusters. It is a natural candidate for accelerating

    applications in distributed systems. MPI is currently the most widely used standard for developing

    High Performance Computing (HPC) applications for distributed memory architectures. It provides

    programming interfaces for C, C++, and FORTRAN. Some of the well-known MPI implementations

    include OpenMPI [38], MVAPICH [39], MPICH [40], GridMPI [41], and LAM/MPI [42].

    Similar to Pthreads, workload partitioning and task mapping have to be done by the programmer, but message passing is a convenient way to express data transfer between different processors. MPI barriers are used to specify that synchronization is needed. The barrier operation blocks each process from continuing its execution until all processes have entered the barrier. A typical usage of barriers is to ensure that the global data has been dispersed to the appropriate processes.
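    A minimal MPI sketch of this pattern: the root rank broadcasts global data and a barrier ensures every process has received it before continuing.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        int config[4] = {0, 0, 0, 0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) { config[0] = 42; }         /* root prepares global data */
        MPI_Bcast(config, 4, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);               /* no rank continues until all
                                                      have the broadcast data   */
        printf("rank %d of %d sees config[0] = %d\n", rank, size, config[0]);

        MPI_Finalize();
        return 0;
    }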

    2.3.4 Hadoop MapReduce

    Hadoop MapReduce is a software framework for easily developing parallel applications, and is especially well suited for processing vast amounts of data (e.g., multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware, in a reliable, fault-tolerant manner [43]. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

    Typically the compute nodes and the storage nodes are the same. For example, the

    MapReduce framework and the Hadoop Distributed File System are running on the same set of nodes.

    This configuration allows the framework to effectively schedule tasks on the nodes where data is

    already present, resulting in very high aggregate bandwidth across the cluster [44].

    2.4 Computing with Graphic Processing Units

    Although multicore architectures have made it possible for applications to overcome some of the physical limits encountered with purely sequential architectures, their degree of parallelization is not comparable to the parallelism available on graphics processing units (GPUs). Intrinsically, GPUs are designed for highly parallel problems. With more and more complex graphics problems, new architectures and APIs were created, and GPUs became more and more programmable. In this section, we briefly review the current state of GPU computing, and the GPU's transition from a hardware implementation of standard graphics APIs to a fully programmable general purpose processing unit.

    2.4.1 The Emergence of Programmable Graphics Hardware

    Interactive 3D graphics applications have very different characteristics compared to general-purpose applications. Specifically, interactive 3D applications require high throughput and exhibit substantial parallelism. Since the late 1990s, custom hardware has been built to take advantage of the native parallelism in these applications. Those early custom accelerators were designed in the form of fixed-function pipelines based on a hardware implementation of the OpenGL [9] standard and Microsoft's DirectX programming APIs. At each stage of the pipeline, a sequence of different operations was implemented in hardware units for specific tasks.

    Given that the GPU was originally designed to produce visual realism in rendered images, fixed-function pipeline graphics hardware has some limitations in performing efficiently. In the meantime, offline rendering systems such as Pixar's RenderMan [45] could be used to achieve impressive visual results by replacing the fixed-function pipeline with a more flexible programmable pipeline. In the programmable pipeline, fixed-function operations are replaced by user-provided pieces of code called shaders. Pixel shaders, vertex shaders and geometry shaders were introduced to enable flexible processing at each programmable pipeline stage.

    Initially, in early shader models, vertex and pixel shaders were implemented using very different instruction sets. But later, in 2006, OpenGL's Unified Shader Model and DirectX 10's Shader Model 4.0 provided consistent instruction sets across all shader types: geometry, vertex and pixel shaders. All three types of shaders have almost the same capabilities. For example, they can perform the same set of arithmetic instructions and read from texture or data buffers.

    Graphics hardware designers continued to explore the best ISA for the shader models. Before the unified shader model, ATI's Xenos graphics chip integrated in the Xbox 360 used a unified shader architecture. Most designs continued to build dedicated hardware units for each shader type, even though they supported a unified shader model. But eventually, all major GPU makers chose a Unified Shader Architecture, which allows a single type of processing unit to be used for all types of shaders. The Unified Shader Architecture decouples the type of shader from the processing unit, and allows a dynamic assignment of shaders to the different processing cores. This flexibility leads to better workload balance, allowing hardware resources to be allocated dynamically for different types of shaders, based on the needs of the workload.

    Figure 2.6 is a high-level block diagram of a modern GPU architecture.

    2.4.2 General Purpose GPUs

    With the emergence of programmable graphics hardware, new shader languages and programming APIs were created to facilitate the programming effort. Since DirectX 9, Microsoft has been using the High Level Shading Language (HLSL) [46], which supports shader construction with C-like syntax, types, expressions, statements and functions. Similarly, the OpenGL Shading Language (GLSL) [47] is the corresponding high level language targeting OpenGL shader programs. Nvidia's Cg [48] is a collaborative effort with Microsoft. The Cg compiler outputs both DirectX and OpenGL shader programs. Although these shader languages are very popular across the graphics community, mainstream programmers feel a lack of connection between the graphics primitives in these shader languages and the constructs in general purpose programming languages.

    With the introduction of unified shader architectures and unified shader models, a uniform ISA makes it easier to design high-level languages for this workload. Some examples of these higher level languages include Brook [49], Scott [50], Glift [51], Nvidia's CUDA [5] and the Khronos Group's OpenCL [6], which is an extension of Brook.

    [Figure 2.6: shader cores connected through an interconnection network to an L2 cache and global memory.]

    Figure 2.6: High Level Block Diagram of a GPU

    These high-level languages hide the graphics primitives behind programming constructs which are more familiar to general purpose programmers. The availability of CUDA and OpenCL, currently the two most popular languages, has dramatically increased the programmability of GPU hardware. As a result, GPUs have been widely adopted in many general purpose platforms for executing data-parallel, computationally-intensive workloads [52]. Many key applications possessing a high degree of data-level parallelism have been successfully accelerated using GPUs.

    GPUs have been included in the standard configuration for many desktop machines and servers. The availability of high-level languages has allowed industry to support both graphics and compute on the same GPU. According to the 42nd TOP500 list, GPUs are used in the No. 2 and No. 6 fastest supercomputers in the world [53]. Intel Xeon Phi processors are used in the No. 1 and No. 7 fastest supercomputers in the world. A total of fifty-three systems on the list use accelerator/co-processor technology. Thirty-eight of these systems use NVIDIA GPU chips, two use ATI Radeon, and there are now thirteen systems with Intel MIC technology (Xeon Phi).

    [Figure 2.7 layers, top to bottom: the application and its OpenCL kernels; the OpenCL framework (the OpenCL API and the OpenCL C language); the OpenCL runtime; the OpenCL driver; and the GPU hardware.]

    Figure 2.7: OpenCL Architecture

    2.5 OpenCL

    OpenCL (Open Computing Language) is an open standard for general purpose parallel programming on CPUs, GPUs and other processors, giving software developers portable and efficient access to the computing resources on these heterogeneous processing platforms [54]. OpenCL allows a heterogeneous platform to be viewed as a single platform with multiple computing devices. It is a mature framework that includes a language definition, a set of APIs, compiler libraries, and a runtime system to support software development. Figure 2.7 shows a high-level breakdown of the OpenCL architecture.

    [Figure 2.8: a host connected to compute devices, each composed of compute units, which in turn contain processing elements.]

    Figure 2.8: An OpenCL Platform

    2.5.1 An OpenCL Platform

    The OpenCL framework adopts the concept of a platform, which consists of a host interconnected with multiple OpenCL devices [55]. An OpenCL device can be a CPU, a GPU or any type of processing unit which supports the OpenCL standard. An OpenCL device can be divided into one or more compute units (CUs), and a CU can be further divided into one or more processing elements (PEs). Figure 2.8 shows how the OpenCL standard hierarchically describes a heterogeneous platform with multiple OpenCL devices, multiple CUs and multiple PEs.
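    A short sketch that walks this hierarchy using standard OpenCL queries; each platform exposes devices, and each device reports its number of compute units (the individual processing elements inside a CU are not enumerated through the API):

    #include <CL/cl.h>
    #include <stdio.h>

    int main(void) {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);

        for (cl_uint p = 0; p < num_platforms; ++p) {
            cl_device_id devices[16];
            cl_uint num_devices = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);

            for (cl_uint d = 0; d < num_devices && d < 16; ++d) {
                char name[256];
                cl_uint cus = 0;
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                                sizeof(cus), &cus, NULL);
                printf("platform %u, device %u: %s (%u compute units)\n",
                       p, d, name, cus);
            }
        }
        return 0;
    }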

    2.5.2 OpenCL Execution Model

    The execution model of OpenCL consists of two parts: a host program running on the host

    device, setting up data and scheduling execution on a compute device, and kernels executed on one

    or more OpenCL devices [56]. Figure 2.9 shows the OpenCL execution model.

    An OpenCL command queue is where the host interacts with an OpenCL device by queuing

    computation kernels. Each command-queue is associated with a single device. There are three types

    of commands in a command-queue:

Figure 2.9: The OpenCL Execution Model

Kernel-enqueue commands: Enqueue a kernel for execution on a device.

Memory commands: Transfer data between the host and device memory or between memory objects, or map and unmap memory objects from the host address space.

Synchronization commands: Explicit synchronization points that define ordering constraints between commands (a brief host-code sketch illustrating all three command types follows this list).
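The fragment below is such a sketch. It assumes that a command-queue, a device buffer, and a kernel object (here named queue, buf, and k purely for illustration) have already been created, and it enqueues one command of each type.

#include <CL/cl.h>

/* Illustrative fragment: one memory command, one kernel-enqueue command, and
   one synchronization command on a single command-queue. */
void enqueue_three_command_types(cl_command_queue queue, cl_mem buf, cl_kernel k,
                                 const float *host_data, size_t n) {
    /* Memory command: copy host data into the device buffer (non-blocking). */
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, n * sizeof(float),
                         host_data, 0, NULL, NULL);

    /* Kernel-enqueue command: launch kernel k over a one-dimensional NDRange. */
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, k, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Synchronization command: all commands enqueued before this barrier must
       complete before any command enqueued after it may launch. */
    clEnqueueBarrierWithWaitList(queue, 0, NULL, NULL);
}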

    Commands communicate their status through Event objects. Successful completion is

indicated by setting the event status to CL_COMPLETE. Unsuccessful completion results in abnormal termination of the command, which is indicated by setting the event status to a negative value. In this case, the command-queue associated with the abnormally terminated command and all other command-queues in the same context may no longer be available, and their behavior is implementation-defined.

    A command submitted to a device will not launch until prerequisites that constrain the

    order of commands have been resolved. These prerequisites have two sources. First, they may


    arise from commands submitted to a command-queue that constrain the order that commands are

    launched. For example, commands that follow a command queue barrier will not launch until all

    commands prior to the barrier are complete. The second source of prerequisites is dependencies

    between commands expressed through events. A command may include an optional list of events.

The command will wait and not launch until all the events in the list are in the CL_COMPLETE state.

    Using this mechanism, event objects define ordering constraints between commands and coordinate

    execution between the host and one or more devices [54]. In our cross-platform runtime system,

    we expand this mechanism to support dependencies between events across OpenCL devices from

    different vendors.
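The fragment below sketches this event mechanism between two command-queues in the same context: a kernel enqueued on one queue lists, in its event wait list, the event produced by a buffer write on another queue, and therefore does not launch until that write reaches CL_COMPLETE. The handles (queue_a, queue_b, buf, k) are placeholders assumed to have been created earlier. Within a single context this behavior is provided by the standard OpenCL API; extending it across devices managed by different vendor platforms is what our runtime adds.

#include <CL/cl.h>

void cross_queue_dependency(cl_command_queue queue_a, cl_command_queue queue_b,
                            cl_mem buf, cl_kernel k,
                            const float *host_data, size_t n) {
    cl_event write_done;

    /* Producer: non-blocking write on queue_a, signalling write_done on completion. */
    clEnqueueWriteBuffer(queue_a, buf, CL_FALSE, 0, n * sizeof(float),
                         host_data, 0, NULL, &write_done);

    /* Consumer: the kernel on queue_b lists write_done as a prerequisite, so it
       will not launch until the write has completed. */
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue_b, k, 1, NULL, &n, NULL, 1, &write_done, NULL);

    clReleaseEvent(write_done);
}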

    A command may be submitted to a device, and yet there may be no visible side effects

except to wait on and satisfy event dependencies. Examples include markers, kernels executed over ranges containing no work-items, and copy operations of zero size. Such commands may pass directly from the

    ready state to the ended state.

    Command execution can be blocking or non-blocking. Consider a sequence of OpenCL

    commands. For blocking commands, the OpenCL API functions that enqueue commands do not

    return until the command has completed. Alternatively, OpenCL functions that enqueue non-

blocking commands return immediately and require that the programmer define dependencies between

    enqueued commands to ensure that enqueued commands are not launched before needed resources

    are available. In both cases, the actual execution of the command may occur asynchronously with

    execution of the host program.
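The difference is visible in the blocking flag passed to the buffer-read call, as in the short sketch below (the queue and buffer handles are assumed to exist already).

#include <CL/cl.h>

void blocking_vs_nonblocking(cl_command_queue queue, cl_mem buf,
                             float *dst, size_t n) {
    /* Blocking read: the call does not return until dst holds the data. */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                        dst, 0, NULL, NULL);

    /* Non-blocking read: the call returns immediately; the host must wait on the
       returned event before it may safely read dst. */
    cl_event read_done;
    clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, n * sizeof(float),
                        dst, 0, NULL, &read_done);
    clWaitForEvents(1, &read_done);
    clReleaseEvent(read_done);
}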

    Multiple command-queues can be present within a single context. Multiple command-

    queues execute commands independently. Event objects visible to the host program can be used to

    define synchronization points between commands in multiple command queues. If such synchroniza-

    tion points are established between commands in multiple command-queues, an implementation must

assure that the command-queues progress concurrently and correctly account for the dependencies

    established by the synchronization points.

    The core of the OpenCL execution model is defined by how the kernels execute. When a

    kernel-enqueue command submits a kernel for execution, an index space is defined. The kernel, the

    argument values associated with the arguments to the kernel, and the parameters that define the index

    space define a kernel-instance. When a kernel-instance executes on a device, the kernel function

    executes for each point in the defined index space. Each of these executing kernel functions is called

    a work-item. The work-items associated with a given kernel-instance are managed by the device in

groups called work-groups. These work-groups define a coarse-grained decomposition of the index


    space. Work-groups are further divided into sub-groups, which provide an additional level of control

    over execution.

Figure 2.10: OpenCL work-items mapping to GPU devices.

    2.5.2.1 Mapping OpenCL Work-items

Each work-item's global ID is an N-dimensional tuple. The components of the global ID are values in the range from the global offset in that dimension to the global offset plus the number of elements in that dimension minus one.

    If a kernel is compiled as an OpenCL 2.0 kernel [20], the size of work-groups in an

    NDRange (the local size) need not be the same for all work-groups. In this case, any single

    dimension for which the global size is not divisible by the local size will be partitioned into two

    regions. One region will have work-groups that have the same number of work items as was specified

    for that dimension by the programmer (the local size). The other region will have work-groups

    with less than the number of work items specified by the local size parameter in that dimension (the

    remainder work-groups). Work-group sizes can be non-uniform in multiple dimensions, potentially

    producing work-groups of up to 4 different sizes in a 2D range and 8 different sizes in a 3D range.

    Each work-item is assigned to a work-group and is given a local ID to represent its position

within the work-group. A work-item's local ID is an N-dimensional tuple with components in the

    range from zero to the size of the work-group in that dimension minus one.


Figure 2.11: OpenCL work-items mapping to CPU devices.

    Work-groups are assigned IDs similarly. The number of work-groups in each dimension

    is not directly defined but is inferred from the local and global NDRanges provided when a kernel

instance is enqueued. A work-group's ID is an N-dimensional tuple with components in the range from zero to the number of work-groups in that dimension minus one, where the number of work-groups is the ceiling of the global size in that dimension divided by the local size in the same dimension. As a result, the combination of a work-group ID and the local ID within a work-group uniquely defines a work-item. Each work-item is identifiable in two ways: in terms of a global index, and in terms of a work-group index plus a local index within a work-group.
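For uniform work-group sizes, the two index schemes are related by a simple identity: in each dimension, the global ID equals the work-group ID times the work-group size, plus the local ID, plus the global offset. The illustrative kernel below checks this identity for dimension 0 using the standard OpenCL C work-item functions (get_global_offset requires OpenCL 1.1 or later).

__kernel void check_index_identity(__global int *ok) {
    size_t gid   = get_global_id(0);
    size_t lid   = get_local_id(0);
    size_t group = get_group_id(0);
    size_t lsize = get_local_size(0);   /* uniform work-group size assumed */
    size_t off   = get_global_offset(0);

    /* Record whether gid == group * lsize + lid + off for this work-item. */
    ok[gid - off] = (gid == group * lsize + lid + off) ? 1 : 0;
}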

    On a CPU device, work-items are mapped by a different mechanism. An example mapping

    of OpenCL execution on a CPU is shown in Figure 2.11. In this example, one worker thread is

    created per physical CPU core when executing a kernel. Then this worker-thread, which is usually a

CPU thread, takes a work-group from the NDRange and begins to execute its associated work-items one by one in sequence. If an OpenCL barrier is reached, the work-item's state is stored and execution of the following work-item begins. When all work-items in this work-group have reached the barrier, execution goes back to the first work-item that stopped at the barrier. It will resume


    execution until the next synchronization point. In the absence of barriers, the first work-item will

    run to the end of the kernel before switching to the next. In both cases, the CPU will continuously

    process all the work-items until the entire work-group is executed. During the whole process, idle

CPU threads will look for any remaining work-groups in the NDRange and begin to process them.

    2.5.2.2 Kernel Execution

    A kernel object is defined to include a function within the program object and a collection

    of arguments connecting the kernel to a set of argument values [57]. The host program enqueues a

    kernel object to the command queue, along with the NDRange and the work-group decomposition.

    These define a kernel instance. In addition, an optional set of events may be defined when the kernel

    is enqueued. The events associated with a particular kernel instance are used to constrain when the

    kernel instance is launched with respect to other commands in the queue or with respect to commands

    in other queues within the same context.

    A kernel instance is submitted to a device. For an in-order command queue, the kernel

    instances appear to launch and then execute in that same order.

    Once these conditions are met, the kernel instance is launched and the work-groups

associated with the kernel instance are placed into a pool of ready-to-execute work-groups. The

    device schedules work-groups from the pool for execution on the compute units of the device. The

    kernel-enqueue command is complete when all work-groups associated with the kernel instance

    end their execution, updates to global memory associated with a command are visible globally, and

    the device signals successful completion by setting the event associated with the kernel-enqueue

command to CL_COMPLETE.
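A kernel-instance is therefore fully described by the kernel object, its argument values, and the NDRange parameters supplied at enqueue time. The sketch below (with placeholder handles) enqueues one such instance and then inspects the associated event, which ends in the CL_COMPLETE state once all of its work-groups have finished and their global-memory updates are visible.

#include <CL/cl.h>

void run_kernel_instance(cl_command_queue queue, cl_kernel k, cl_mem buf,
                         size_t global_size, size_t local_size) {
    cl_event done;
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, k, 1, NULL, &global_size, &local_size,
                           0, NULL, &done);

    /* Block until the kernel-enqueue command completes ... */
    clWaitForEvents(1, &done);

    /* ... and read back its execution status explicitly. */
    cl_int status;
    clGetEventInfo(done, CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(status), &status, NULL);
    /* status is CL_COMPLETE on success, or a negative value on abnormal termination. */
    clReleaseEvent(done);
}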

    While a command-queue is associated with only one device, a single device may be

    associated with multiple command-queues. A device may also be associated with command queues

    associated with different contexts within the same platform. The device will pull work-groups

    from the pool and execute them on one or several compute units in any order; possibly interleaving

    execution of work-groups from multiple commands. A conforming implementation may choose

    to serialize the work-groups so a correct algorithm cannot assume that work-groups will execute

    in parallel. There is no safe and portable way to synchronize across the independent execution of

    work-groups since they can execute in any order.

    The work-items within a single sub-group execute concurrently, but not necessarily in

    parallel (i.e., they are not guaranteed to make independent forward progress). Therefore, only


    high-level synchronization constructs (e.g. sub-group functions such as barriers) that apply to all the

    work-items in a sub-group are well defined and included in OpenCL.

    Sub-groups execute concurrently within a given work-group and with appropriate device

    support may make independent forward progress with respect to each other, with respect to host

    threads and with respect to any entities external to the OpenCL system but running on an OpenCL

    device, even in the absence of work-group barrier operations. In this situation, sub-groups are able

    to internally synchronize using barrier operations without synchronizing with each other and may

    perform operations that rely on runtime dependencies on operations other sub-groups perform.

    The work-items within a single work-group execute concurrently, but are only guaranteed

    to make independent progress in the presence of sub-groups and device support. In the absence

    of this capability, only high-level synchronization constructs (e.g., work-group functions such as

    barriers), that apply to all the work-items in a work-group, are well defined and included in OpenCL

    for synchronization within a work-group.

    2.5.2.3 Synchronization

    Synchronization across all work-items within a single work-group is carried out using a

    work-group function [58]. These functions carry out collective operations across all the work-items

    in a work-group. Available collective operations are: barrier, reduction, broadcast, prefix sum, and

    evaluation of a predicate. A work-group function must occur within a converged control flow; i.e.,

    all work-items in the work-group must encounter precisely the same work-group function. For

    example, if a work-group function occurs within a loop, the work-items must encounter the same

    work-group function in the same loop iterations. All the work-items of a work-group must execute

    the work-group function and complete reads and writes to memory before any are allowed to continue

    execution beyond the work-group function. Work-group functions that apply between work-groups

    are not provided in OpenCL since OpenCL does not define forward progress or ordering relations

    between work-groups, hence collective synchronization operations are not well defined.
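As a concrete example of a work-group function, the illustrative kernel below computes one partial sum per work-group in local memory. Every work-item executes the same barrier calls, so control flow is converged as required; the kernel assumes the work-group size is a power of two.

__kernel void workgroup_sum(__global const float *in,
                            __global float *partial_sums,
                            __local float *scratch) {
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);              /* all writes to scratch are now visible */

    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);          /* reached by every work-item, every iteration */
    }

    if (lid == 0)
        partial_sums[get_group_id(0)] = scratch[0];
}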

    Synchronization across all work-items within a single sub-group is carried out using a

    sub-group function. These functions carry out collective operations across all the work-items in

    a sub-group. Available collective operations are: barrier, reduction, broadcast, prefix sum, and

    evaluation of a predicate. A sub-group function must occur within a converged control flow; i.e., all

    work-items in the sub-group must encounter precisely the same sub-group function. For example,

if a sub-group function occurs within a loop, the work-items must encounter the same sub-group


    function in the same loop iterations. All the work-items of a sub-group must execute the sub-group

    function and complete reads and writes to memory before any are allowed to continue execution

    beyond the sub-group function. Synchronization between sub-groups must either be performed using

work-group functions or through memory operations. Memory operations should be used carefully for sub-group synchronization, since forward progress of sub-groups relative to each other is only optionally supported by OpenCL implementations.

A synchronization point between a pair of commands (A and B) assures that the results of command A happen-before command B is launched. This requires that any updates to memory

    from command A complete and are made available to other commands before the synchronization

    point completes. Likewise, this requires that command B waits until after the synchronization point

    before loading values from global memory. The concept of a synchronization point works in a similar

    fashion for commands such as a barrier that apply to two sets of commands. All the commands prior

    to the barrier must complete and make their results available to following commands. Furthermore,

    any commands following the barrier must wait for the commands prior to the barrier before loading

    values and continuing their execution.

    2.5.3 OpenCL Memory Model

    The OpenCL memory model describes the structure, contents, and behavior of the memory

    exposed by an OpenCL platform as an OpenCL program runs [59]. The model allows a programmer

    to reason about values in memory as the host program and multiple kernel-instances execute.

    An OpenCL program defines a context that includes a host, one or more devices, command-

    queues, and memory exposed within the context. Consider the units of execution involved with such

    a program. The host program runs as one or more host threads managed by the operating system

    running on the host (the details of which are defined outside of OpenCL). There may be multiple

    devices in a single context which all have access to memory objects defined by OpenCL. On a

    single device, multiple work-groups may execute in parallel with potentially overlapping updates to

    memory. Finally, within a single work-group, multiple work-items concurrently execute, once again

    with potentially overlapping updates to memory.

    The memory regions, and their relationship to the OpenCL Platform model, are summarized

    in Figure 2.12. Local and private memories are always associated with a particular device. The

    global and constant memories, however, are shared between all devices within a given context. An

    OpenCL device may include a cache to support efficient access to these shared memories.

Figure 2.12: The OpenCL Memory Hierarchy.
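In OpenCL C these regions are selected with address space qualifiers. The illustrative kernel below touches all four named address spaces: a __global data buffer, a __constant coefficient table, a __local scratch array shared by the work-group, and private (default) per-work-item variables.

__kernel void scale_and_stage(__global float *data,        /* global memory   */
                              __constant float *coeffs,    /* constant memory */
                              __local float *scratch) {    /* local memory    */
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    float x = data[gid];             /* private memory, one copy per work-item */
    scratch[lid] = x * coeffs[0];    /* staged in local memory, shared by the work-group */
    barrier(CLK_LOCAL_MEM_FENCE);

    data[gid] = scratch[lid];
}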

    To understand memory in OpenCL, it is important to appreciate the relationship between

    these named address spaces. The four named address spaces available to a device are disjoint, which

    means that they do not overlap. This is their logical relationship, however, and an implementation

    may choose to let these disjoint named address spaces share physical memory.

    Programmers often need functions callable from kernels, where the pointers manipulated

    by those functions can point to multiple named address spaces. This saves a programmer from

    the error-prone and wasteful practice of creating multiple copies of functions, one for each named

    address space. Therefore, the global, local and private address spaces belong to a single generic

    address space.
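A sketch of what this looks like in OpenCL C (compiled with -cl-std=CL2.0): the helper function's unqualified pointer parameter resides in the generic address space, so the same function can be called with global, local, or private pointers.

/* An unqualified pointer parameter lives in the generic address space (OpenCL 2.0). */
float sum3(float *p) {
    return p[0] + p[1] + p[2];
}

__kernel void use_generic(__global float *g, __local float *l) {
    size_t gid = get_global_id(0);
    float priv[3] = {1.0f, 2.0f, 3.0f};

    l[get_local_id(0)] = g[gid];     /* populate local memory */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* One helper, three address spaces: global, local, and private pointers. */
    g[gid] = sum3(g) + sum3(l) + sum3(priv);
}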

    2.6 Heterogeneous Computing

    To take full advantage of the resources on a heterogeneous platform, the programmer has

to manage the allocation of these resources. In this section, we introduce several projects that were designed or extended to support heterogeneous computing platforms. All of these runtimes or libraries provide higher-level software layers with convenient abstractions, which relieve the programmer of the burden of managing resources on the targeted heterogeneous platform.


Figure 2.13: Qilin Software Architecture

    2.6.0.1 Qilin

    Qilin [60] is a programming system recently developed for heterogeneous multiprocessors.

    Figure 2.13 shows the software architecture of Qilin. At the application level, Qilin provides an API

to programmers for describing parallelizable operations. By explicitly expressing these computations

    through the API, the compiler does not have to extract any implicit parallelism from the serial code,

    and instead can focus on performance tuning. Similar to OpenMP, the Qilin API is built on top of

    C/C++ so that it can be easily adopted. But unlike standard OpenMP, where parallelization only

    happens on the CPU, Qilin can exploit the hardware parallelism available on both the CPU and the

    GPU.

    Beneath the API layer is the Qilin system layer, which consists of a dynamic compiler

    and its code cache, a number of libraries, a set of development tools, and a scheduler. The compiler

dynamically translates the API calls into native machine code. It also produces a near-optimal mapping from computations to processing elements using an adaptive algorithm. To reduce compilation

    overhead, translated code is stored in the code cache so that it can be reused without recompilation,

    whenever possible. Once native machine code is available, it can be scheduled to run on the CPU

    and/or the GPU by the scheduler. Libraries include commonly used functions such as BLAS and FFT.

    Finally, debugging, visualization, and profiling tools can be provided to facilitate the development of

    Qilin programs.

    Qilin uses off-line profiling to obtain information about each task on each computing

    device. This information is then used to partition tasks and create an appropriate performance model

    for the targeted heterogeneous platform. However, the overhead to carry out the initial profiling

    phase can be prohibitively high and results may be inaccurate if computation behavior is heavily

input-dependent.

Figure 2.14: The OpenCL environment with the IBM OpenCL common runtime.


    2.6.0.2 IBM OpenCL common runtime

IBM's OpenCL common runtime [61] improves the OpenCL programming experience by removing from the programmer the burden of managing multiple OpenCL platforms and duplicated resources, such as contexts and memory objects. In the conventional OpenCL programming

    environment, programmers are responsible for managing the movement of memory between two

or more contexts when multiple OpenCL devices are present on the platform. In this

case, the application is forced to use host-side synchronization in order to move its memory

    objects between coordinating contexts. Equipped with the common runtime, this movement and

    synchronization is done automatically.

    In addition, the common runtime also improves the OpenCL programming experience by

relieving the programmer of the need to manage cross-queue scheduling and event dependencies. By

    convention, OpenCL requires that command queue event dependencies must originate from the same

    context as that of the command queue. In a multiple context environment, this restriction forces

    programmers to manage their own cross-queue scheduling and dependencies. Again, this requires

additional host-side synchronization in the application. With the common runtime, cross-queue event dependencies and scheduling are handled for the programmer.

Finally, the common runtime improves application portability and resource usage, which

    reduces application complexity. In the conventional OpenCL environment, coordination of OpenCL

    resources is more than just an inconvenience. Managing resources comes with challenges of

    application portability, which becomes an issue when code is tuned for a particular underlying

    platform. Applications are forced to choose whether to support only one platform, potentially leaving

compute resources unused, or to add complexity to manage resources across a range of platforms.

    Using the unifying platform provided by the IBM OpenCL common runtime, applications are more

    portable and resources can be more easily exploited.

IBM's OpenCL common runtime is designed to improve the OpenCL programming experience by managing multiple OpenCL platforms and duplicated resources. It minimizes application

    complexity by presenting the programming environment as a single OpenCL platform. Shared

    OpenCL resources, such as data buffers, events, and kernel programs are transparently managed

    across the installed vendor implementations. The result is simpler programming in heterogeneous

    environments. However, even equipped with this commercially-developed common runtime, many

of the multi-context features, such as scheduling decisions and data synchronization, must still be performed manually by the programmer.


    2.6.0.3 StarPU

    StarPU [62] automatically schedules tasks across the different processing units of an

    accelerator-based machine. Applications using StarPU do not have to deal with low-level concerns

such as data transfers or efficient load balancing, which are target-system dependent. StarPU

    is a C library that provides an API to describe application data, and can asynchronously submit

    tasks that are dispatched and executed transparently over the entire machine in an efficient way.

    Providing a separation of concerns between writing efficient algorithms and mapping them on

    complex accelerator-based machines therefore makes it possible to achieve portable performance,

    tapping into the potential of both accelerators and multi-core architectures.

    An application first has to register data with StarPU. Once a piece of data has been

    registered, its state is fully described using an opaque data structure, called a handle. Programmers

    must then divide their applications into sets of possibly inter-dependent tasks. In order to obtain

    portable performance, programmers do not explicitly choose which processing units will process the

    different tasks.

    Each task is described by a structure that contains the list of handles of the data that the task

    will manipulate, the corresponding access modes (i.e. read, write, etc.), and a multi-versioned kernel

    called a codelet, which gathers the various kernel implementations available on the different types of

    processing units. The different tasks are submitted asynchronously to StarPU, which automatically

    decides where to execute them. Thanks to the data description stored in the handle data structure,

StarPU also ensures that coherent replicas of the different pieces of data accessed by a task are

    automatically transferred to the appropriate processing unit. If StarPU selects a CUDA device to

    execute a task, the CUDA implementation of the corresponding codelet will be provided with pointers

    to locally replicated data allocated in the memory on the GPU.

Programmers need not worry about where the tasks are executed, nor how data replicas are managed for these tasks. They simply need to register data, submit tasks with their implementations for the various processing units, and wait for their termination, or simply rely on task

    dependencies.
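As a rough illustration of this programming model, the sketch below registers a vector with StarPU, submits a task whose codelet carries a CPU implementation (a CUDA or OpenCL implementation would be listed alongside it), and waits for completion. It follows the public StarPU C API as documented; field and constant names such as STARPU_MAIN_RAM can differ slightly between StarPU releases, so it should be read as a schematic rather than a drop-in program.

#include <stdint.h>
#include <starpu.h>

#define NX 1024

/* CPU implementation of the codelet; device versions would share this structure. */
static void scale_cpu(void *buffers[], void *cl_arg) {
    struct starpu_vector_interface *v = buffers[0];
    float *x = (float *)STARPU_VECTOR_GET_PTR(v);
    unsigned n = STARPU_VECTOR_GET_NX(v);
    for (unsigned i = 0; i < n; i++) x[i] *= 2.0f;
    (void)cl_arg;
}

static struct starpu_codelet scale_cl = {
    .cpu_funcs = { scale_cpu },   /* .cuda_funcs / .opencl_funcs would list device kernels */
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void) {
    float vec[NX];
    for (int i = 0; i < NX; i++) vec[i] = (float)i;

    starpu_init(NULL);

    /* Register the data; StarPU manages replication and transfers through the handle. */
    starpu_data_handle_t handle;
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)vec, NX, sizeof(vec[0]));

    /* Submit a task; StarPU decides which processing unit executes it. */
    starpu_task_insert(&scale_cl, STARPU_RW, handle, 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}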

    StarPU is a simple tasking API that provides numerical kernel designers with a convenient

    way to execute parallel tasks on heterogeneous platforms, and incorporates a number of different

    scheduling policies. StarPU is based on the integration of a resource management facility with a

task execution engine. Several scientific kernels [63][64][65][66] have been deployed on StarPU to

    utilize the computing power of heterogeneous platforms. However, StarPU is implemented in C and


    the basic schedulable units (codelets) have to be implemented multiple times if they are targeting

    multiple devices. This limits the migration of the codelets across platforms, and increases the

programmer's burden. To overcome this limitation, StarPU has initiated a recent effort to incorporate

    OpenCL [67] as the front-end.

    2.6.0.4 Maestro

The Maestro model [68] unifies the disparate, device-specific queues into a single, high-level task queue. At runtime, Maestro queries OpenCL to obtain information about the available

    GPUs or other accelerators in a given system. Based on this information, Maestro can transfer data

    and divide work among the available devices automatically. This frees the programmer from having

    to synchronize multiple devices and keep track of device-specific information.

Since OpenCL can execute on devices that differ radically in architecture and computational capabilities, it is difficult to develop simple heuristics with strong performance guarantees.

Hence, Maestro's optimizations rely solely on empirical data, instead of any performance model or a priori knowledge. Maestro's general strategy for all optimizations can be summarized by the steps shown in Figure 2.15.

    This strategy is used to optimize a variety of parameters, including local work group

    size, data transfer size, and the division of work across multiple devices. However, these dynamic

    execution parameters are only one of the obstacles to true portability. Another obstacle is the choice

    of hardware-specific kernel optimizations. For instance, some kernel optimizations may result in

    excellent performance on a GPU, but reduce performance on a CPU. This remains an open problem.

    Since the solution will no doubt involve editing kernel source code, it is beyond the scope of Maestro.

    Maestro is an open source library for data orchestration on OpenCL devices. It provides

    automatic data transfer, task decomposition across multiple devices, and auto-tuning of dynamic

    execution parameters for selected problems. However, Maestro relies heavily on empirical data and

    benchmark profiling beforehand. This limits its ability to run on applications with data-dependent

    program flow and/or data dependencies.

    2.6.0.5 Symphony

Symphony [69], previously known as MARE (Multicore Asynchronous Runtime Environment) [70], seamlessly integrates heterogeneous execution into a concurrent task graph and removes

    the burden from the programmer of managing data transfers and explicit data copies between kernels


Figure 2.15: Maestro's Optimization Flow (estimate based on benchmarks; collect empirical data from execution; optimize based on the results; repeat while performance continues improving; then adopt the final performance strategy).

    executing on different devices. At a low level, Symphony provides state-of-the-art algorithms for

    work stealing and power optimizations that can hide hardware idiosyncrasies, allowing for portable

    application development. In addition, Symphony is designed to support dynamic mapping of kernels

    to heterogeneous execution units. Moreover, expert programmers can take charge of the execution

    through a carefully designed system of attributes and di